Text encodings, codepages and languages

Accent stripping

Some languages (French, for example) use diacritical marks to modify letters of the Latin alphabet:

L'HÔTEL

Diacritics complicate direct comparison of words, because an accented letter has a different character code from its plain counterpart.

Command

-strip_accents=true

strips the accents, so é becomes e, and so on. Issuing this command when indexing files is recommended, because it reduces the number of keywords in the index database.
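To illustrate the effect of accent stripping, here is a minimal Python sketch. It is not the search engine's own implementation, which is not described here; the function name strip_accents is a hypothetical helper, and the approach (Unicode decomposition followed by removal of combining marks) is just one common way to get the same result.

```python
import unicodedata

def strip_accents(text):
    # Hypothetical helper, not part of the search engine.
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("L'HÔTEL"))  # L'HOTEL
print(strip_accents("é"))        # e
```

After stripping, L'HÔTEL and L'HOTEL map to the same keyword, which is why the index holds fewer distinct keywords.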

HTML tags stripping

HTML and XML formats store some information inside tags <...>. This information is usually of no interest, so the contents of tags are discarded when files are processed. You can change this behavior with the command:

-stripdecor=false
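As a rough picture of what tag stripping does, the following Python sketch keeps the text between tags and drops the tags together with their attributes. It uses the standard html.parser module and assumes nothing about how the search engine itself parses markup; the names TextOnly and strip_tags are hypothetical.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects character data and ignores everything inside <...>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tag names and attributes never get here.
        self.parts.append(data)

def strip_tags(markup):
    # Hypothetical helper, not part of the search engine.
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

print(strip_tags('<p class="note">Hello <b>world</b></p>'))  # Hello world
```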

Document character encodings

Command

-cp NNN

sets the only legal encoding for documents. If a document declares a different encoding, it is ignored.

Command

-prefer_cp NNN

sets the encoding to use when a document carries no information about its own encoding.

Extended syntax

-prefer_cp "MMM;NNN;KKK"

sets the list of codepages to be used by the codepage guesser.
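The guessing algorithm itself is not documented here. As one possible illustration of how a list of candidate codepages can be used, the Python sketch below tries each candidate in order and accepts the first one that decodes the bytes without error. The function name guess_decode is hypothetical, and Python codec names stand in for the codepage identifiers accepted by -prefer_cp.

```python
def guess_decode(data, candidates):
    # Hypothetical helper: an assumed strategy, not the engine's actual guesser.
    for name in candidates:
        try:
            # The first candidate that decodes cleanly wins.
            return name, data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding for these bytes, or unknown codec name: try the next one.
            continue
    raise ValueError("none of the candidate encodings fits the data")

# A semicolon-separated list in the style of the extended syntax.
candidates = "utf-8;cp1252".split(";")

raw = "hôtel".encode("cp1252")
print(guess_decode(raw, candidates))  # ('cp1252', 'hôtel')
```

With this strategy the order of the list matters: a permissive single-byte encoding accepts almost any byte sequence, so it belongs after stricter ones such as UTF-8.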

Additional information

Embeddable search engine API

Search engine commands


Mental Computing 2009

changed 18-Apr-10