Some languages (French is one of them) use special signs to modify Latin alphabet letters:
L'HÔTEL
Such signs (diacritics) make it harder to compare words directly, because an accented letter has a different character code than the plain one.
The command
-strip_accents=true
strips the accents, so é becomes e and so on. It is recommended to use this command when indexing files, in order to reduce the number of keywords in the index database.
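To make the effect concrete, here is a minimal Python sketch of accent stripping. It is only an illustration of the idea and not the tool's own implementation; the function name strip_accents is hypothetical and simply mirrors the option name.

```python
import unicodedata


def strip_accents(text):
    """Return text with diacritical marks removed (illustrative sketch only)."""
    # NFD decomposition splits an accented letter into its base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("L'HÔTEL"))
```

With accents stripped this way, L'HÔTEL and L'HOTEL compare as equal, so both spellings would fall under a single keyword.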
HTML and XML formats store some information inside tags <...>. Usually this information is of no interest, so the contents of the tags are discarded when files are processed. You can change this behavior with the command:
-stripdecor=false
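As a rough illustration of what discarding tag internals means, the following Python sketch keeps only the text between tags. It uses the standard html.parser module and is an assumption about the general technique, not a description of how the tool processes files; the names TextOnly and strip_tags are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and ignores everything inside <...>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tag names and attributes
        # never reach this method.
        self.parts.append(data)


def strip_tags(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags('<a href="hotels.html">Hotel</a> list'))
```

In this sketch the href value is lost, which corresponds to the default behavior described above; keeping tag internals would mean indexing such attribute values as well.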
-cp NNN
sets the only permitted encoding (codepage) for documents. If a document declares a different encoding, it is ignored.
-prefer_cp NNN
sets the encoding to assume when a document carries no information about its encoding.
Extended syntax
-prefer_cp "MMM;NNN;KKK"
sets the list of codepages to be used by the codepage guesser.
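The sketch below shows one simple way a semicolon-separated preference list could be applied: each listed codepage is tried in order until one decodes the data without errors. This trial decoding is a stand-in chosen for illustration; how the actual codepage guesser works is not documented here. The function name decode_with_preferences and the Python codec names used in place of MMM;NNN;KKK are assumptions.

```python
def decode_with_preferences(raw, prefer_cp):
    """Try each codepage from a 'A;B;C' list in order (illustrative sketch)."""
    names = [name.strip() for name in prefer_cp.split(";") if name.strip()]
    for name in names:
        try:
            # The first codepage that decodes cleanly wins.
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Bytes invalid for this codepage, or codepage name unknown:
            # move on to the next candidate.
            continue
    raise ValueError("none of the listed codepages could decode the data")


data = "hôtel".encode("cp1252")
text, used = decode_with_preferences(data, "utf-8;cp1252")
print(text, used)
```

Order matters in such a scheme: single-byte codepages accept almost any byte sequence, so stricter encodings such as UTF-8 are best listed first.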
© Mental Computing 2009
changed 18-Apr-10