Text encodings, codepages and languages

Accent stripping

Some languages (French is one of them) use special signs to modify Latin alphabet letters:


Such signs (diacritics) make it harder to compare words directly, because they change the character code.



strips the accents, so é becomes e, and so on. Issuing this command when indexing files is recommended, because it reduces the number of keywords in the index database.
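
The effect of accent stripping can be illustrated with a minimal Python sketch. This is not the search engine's own code; the function name strip_accents is hypothetical, and the approach (Unicode decomposition followed by removal of combining marks) is only one common way to get the "é becomes e" behavior described above.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Fold accented letters to their base letters (hypothetical helper)."""
    # NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn) and keep everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into its canonical composed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("crème brûlée"))
```

After folding, "crème" and "creme" map to the same keyword, which is why the index ends up with fewer distinct entries.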

HTML tag stripping

HTML and XML formats store some information inside tags <...>. This information is usually of no interest for searching, so the tag internals are discarded when files are processed. You can change this behavior with the command:


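As an illustration of the default behavior (tag internals dropped, text kept), here is a small Python sketch built on the standard html.parser module. It is an assumption about what "eliminating tag internals" amounts to, not the engine's actual parser; the names TagStripper and strip_tags are hypothetical.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text between tags; tag names and attributes are dropped."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data: str) -> None:
        # Called for text outside of <...>; tags themselves never reach here.
        self.chunks.append(data)


def strip_tags(markup: str) -> str:
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags('<p class="intro">Hello <b>world</b></p>'))
```

In this sketch the attribute value "intro" never reaches the output, so it would not become a keyword.
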
Document character encodings


-cp NNN

sets the only permitted encoding for documents. If a document declares a different encoding, it is ignored.


-prefer_cp NNN

sets the encoding to assume when a document carries no information about its own encoding.
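
The difference between the two options can be sketched as a precedence rule. The Python below is only an illustration under stated assumptions: the function choose_encoding and its parameters are hypothetical, the "utf-8" default is an assumption, and the sketch reads "it is ignored" as meaning that the document's own declaration is ignored when -cp is given.

```python
from typing import Optional


def choose_encoding(
    declared: Optional[str],
    forced: Optional[str] = None,     # stands in for -cp NNN
    preferred: Optional[str] = None,  # stands in for -prefer_cp NNN
    default: str = "utf-8",           # assumption: not stated in the text
) -> str:
    """Pick the encoding used to decode one document (hypothetical helper)."""
    if forced is not None:
        # A forced codepage wins over whatever the document declares.
        return forced
    if declared is not None:
        # Without a forced codepage, the document's own declaration is used.
        return declared
    if preferred is not None:
        # The preferred codepage only applies when nothing is declared.
        return preferred
    return default


print(choose_encoding("cp1252", forced="cp1251"))
print(choose_encoding(None, preferred="cp866"))
```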

Extended syntax

-prefer_cp "MMM;NNN;KKK"

sets the list of codepages to be used by the codepage guesser.
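
How the guesser uses the list is not described here. As a rough illustration only, the Python sketch below parses a semicolon-separated list and returns the first candidate that decodes the bytes without error. The function names are hypothetical, the codec names are Python's rather than the engine's codepage identifiers, and a real guesser most likely relies on character statistics rather than this simple trial decoding.

```python
from typing import List, Optional


def parse_codepage_list(value: str) -> List[str]:
    """Split a "MMM;NNN;KKK" style value into individual names."""
    return [item.strip() for item in value.split(";") if item.strip()]


def guess_codepage(raw: bytes, candidates: List[str]) -> Optional[str]:
    """Return the first candidate that decodes `raw` strictly, or None."""
    for name in candidates:
        try:
            raw.decode(name)  # strict mode: invalid bytes raise an error
        except (UnicodeDecodeError, LookupError):
            # Either the bytes do not fit or Python does not know the codec.
            continue
        return name
    return None


candidates = parse_codepage_list("utf-8;cp1251;cp866")
sample = "Привет".encode("cp1251")
print(guess_codepage(sample, candidates))
```

Single-byte codepages accept nearly any byte sequence, so in a scheme like this the order of the list matters: stricter encodings such as UTF-8 should come first.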

Additional information

Embeddable search engine API

Search engine commands

Mental Computing 2009

changed 18-Apr-10