2text content metaplugin API

2text is a search engine plugin that calls external programs to extract text content from files. It launches a separate executable for each document, which brings both advantages and disadvantages:

Advantages:

1. External parsers run as separate processes, so a fatal bug in a parser does not affect the search engine. This means that even unstable programs can be used as text extractors.

2. Text extractors can be implemented in Java, Visual Basic, Perl, or scripting languages such as Python.

3. Debugging is easy.
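
To illustrate advantage 2, a text extractor could be as small as the Python sketch below. It assumes that the input document path and the result file path arrive as the first and second command-line arguments (the configuration section describes the {1} and {2} placeholders that control this). The script name extractor.py and the "keep printable characters" logic are hypothetical stand-ins for a real parser.

```python
import sys


def extract_text(src_path, dst_path):
    """Write the printable text found in src_path to dst_path as UTF-8."""
    with open(src_path, "rb") as src:
        raw = src.read()
    # Hypothetical "parsing" step: decode leniently, drop control characters.
    text = raw.decode("utf-8", errors="replace")
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(cleaned)


def main(argv):
    """Entry point: argv[1] is the source document, argv[2] the result file."""
    if len(argv) != 3:
        print("usage: extractor.py <input file> <output file>", file=sys.stderr)
        return 2
    extract_text(argv[1], argv[2])
    return 0

# When saved as a script, finish with: sys.exit(main(sys.argv))
```

Because the extractor is an ordinary program, it can be run and debugged from a command prompt without the search engine.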

Disadvantages:

1. It is slow, because a new process is started for every document.

2. The whole text content is extracted at once, whereas ordinary content plugins can extract text page by page (e.g. DjVu).

3. Text extractors cannot access the grammar and search engine services through the IGrammarEngine and ISearchEngine interfaces.

Location

This plugin typically resides in \plugins\formats under the search system installation directory. Subplugins can be placed in any other location.

Source code

The source code is included in the SDK installation package; see \demo\ai\solarix\search_engine\Filetype_plugin\2text.

The source code for a 2text subplugin, the DjVu text extractor, is also included in the SDK.

Configuration file

The rules for file type recognition are stored in the XML configuration file 2text.xml. Each extractor is described by an XML entry <filter>...</filter>.

Common configuration nodes

XML node | Description | Obligatory
type | extractor type: "external" for external executables, "internal" for built-in general text extractor | no
format | first part of MIME | yes
subformat | second part of MIME | yes
maxsize | max size of files to handle to prevent hang up (10 MB is default value) | no
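
The common nodes above could be read roughly as in the following Python sketch, which uses the standard xml.etree.ElementTree module. Several points are assumptions rather than documented behaviour: that a missing type means "external" (the external example entry in this document omits it), that maxsize is given in bytes, and that 10 MB means 10 * 1024 * 1024. The function name read_common_nodes is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: maxsize is expressed in bytes and "10 MB" is 10 * 1024 * 1024.
DEFAULT_MAXSIZE = 10 * 1024 * 1024


def read_common_nodes(filter_elem):
    """Return the common settings of one <filter> element as a dict."""
    def text_of(tag, default=None):
        node = filter_elem.find(tag)
        return node.text.strip() if node is not None and node.text else default

    fmt = text_of("format")
    subfmt = text_of("subformat")
    if fmt is None or subfmt is None:
        raise ValueError("<format> and <subformat> are obligatory")
    return {
        "type": text_of("type", "external"),  # assumed default
        "mime": f"{fmt}/{subfmt}",            # format/subformat form the MIME type
        "maxsize": int(text_of("maxsize", DEFAULT_MAXSIZE)),
    }


sample = ET.fromstring(
    "<filter><ext>dvi</ext><format>text</format><subformat>dvi</subformat></filter>"
)
settings = read_common_nodes(sample)
print(settings)
```

A caller could then compare the size of a candidate file with settings["maxsize"] before handing it to an extractor.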

Configuration nodes for external extractors

XML node | Description | Obligatory
ext | file name extension(s), delimited by ';' | yes
exe | filepath to extractor executable | yes
args | startup command line, {1} stands for input (source) document filepath, {2} stands for extraction result file | yes
format | first part of MIME | yes
encoding | result text encoding (for out_format=text), utf8 is allowed as well as many other codepage names; current session codepage is used by default | no
timeout | maximum elapsed time, millisec; external program is aborted if specified value is exceeded; default is 10 minutes (600000 msec) | no

Example entry:

  <filter>
    <ext>dvi</ext>
    <exe>dvi2tty.exe</exe>
    <args>-o{2} {1}</args>
    <format>text</format>
    <subformat>dvi</subformat>
  </filter>
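
As a rough model of how an entry like this might be executed, the Python sketch below expands the {1} and {2} placeholders, runs the executable with a timeout, and reads back the result file. It is not the plugin's actual implementation: the function names are hypothetical, the result file is assumed to be a temporary file chosen by the caller, and shlex.split follows POSIX quoting rules, which may differ from how the plugin parses the args string.

```python
import locale
import os
import re
import shlex
import subprocess
import tempfile

DEFAULT_TIMEOUT_MS = 600000  # 10 minutes, the documented default


def build_command(exe, args_template, src_path, dst_path):
    """Expand {1} and {2} in the <args> template and return an argv list."""
    paths = {"1": src_path, "2": dst_path}

    def expand(token):
        return re.sub(r"\{([12])\}", lambda m: paths[m.group(1)], token)

    # Split first, substitute second, so paths with spaces stay one argument.
    return [exe] + [expand(tok) for tok in shlex.split(args_template)]


def run_extractor(exe, args_template, src_path, encoding=None,
                  timeout_ms=DEFAULT_TIMEOUT_MS):
    """Run one extractor process and return the text it wrote."""
    fd, dst_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        cmd = build_command(exe, args_template, src_path, dst_path)
        # On timeout the child is killed and subprocess.TimeoutExpired is raised.
        subprocess.run(cmd, timeout=timeout_ms / 1000.0, check=True)
        # Assumption: the locale's preferred encoding stands in for the
        # "current session codepage" default.
        codec = encoding or locale.getpreferredencoding(False)
        with open(dst_path, encoding=codec, errors="replace") as result:
            return result.read()
    finally:
        os.remove(dst_path)
```

For the entry above, build_command("dvi2tty.exe", "-o{2} {1}", src, dst) yields the argument list ["dvi2tty.exe", "-o" + dst, src]. A crash or timeout in the child surfaces as an exception in the caller rather than taking the host process down, which is the isolation property listed among the advantages.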


Configuration nodes for built-in general text extractor

XML node | Description | Obligatory
startpos_type | "begin" (default) when start position is from beginning of file, "end" when start position is relative to file ending, "signature" to search position by bytes sequence | no
start_pos | start position in bytes (see startpos_type), by default is set to 0 | no
start_signature | signature bytes sequence, decimals or hexadecimals (e.g. 0xab), delimited by spaces or commas | no
block_len | length (in bytes) of text block | no
extract_encoding | text encoding, may be "utf8", "utf16le", "utf16be", or ASCII codepage name (used by default); "acp" means current session ASCII codepage | no

Example entry:

  <filter>
    <type>internal</type>
    <ext>fon</ext>
    <startpos_type>begin</startpos_type>
    <start_pos>0</start_pos>
    <block_len>512</block_len>
    <extract_encoding>acp</extract_encoding>
    <format>application</format>
    <subformat>font</subformat>
  </filter>
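
The behaviour of the built-in extractor can be approximated by the Python sketch below. The exact semantics are not spelled out in this document, so the following are guesses: for "end" the start position is counted backwards from the end of the file, for "signature" the block starts right after the first occurrence of the signature (plus start_pos), a missing block_len means "to the end of the file", and "acp" is mapped to the locale's preferred encoding. The function names are hypothetical.

```python
import locale

# Map the encoding names from the table to Python codec names.
CODECS = {"utf8": "utf-8", "utf16le": "utf-16-le", "utf16be": "utf-16-be"}


def parse_signature(spec):
    """Turn a string such as '0x48, 69 0x41' into a bytes object."""
    tokens = spec.replace(",", " ").split()
    return bytes(int(t, 16) if t.lower().startswith("0x") else int(t)
                 for t in tokens)


def extract_block(data, startpos_type="begin", start_pos=0,
                  start_signature="", block_len=None, extract_encoding="acp"):
    """Cut a block out of data (bytes) and decode it to text."""
    if startpos_type == "begin":
        start = start_pos
    elif startpos_type == "end":
        start = max(len(data) - start_pos, 0)      # assumed: offset from the end
    elif startpos_type == "signature":
        signature = parse_signature(start_signature)
        found = data.find(signature)
        if found < 0:
            return ""                              # signature not present
        start = found + len(signature) + start_pos  # assumed: text follows it
    else:
        raise ValueError(f"unknown startpos_type: {startpos_type}")
    end = len(data) if block_len is None else start + block_len
    if extract_encoding == "acp":
        codec = locale.getpreferredencoding(False)
    else:
        codec = CODECS.get(extract_encoding, extract_encoding)
    return data[start:end].decode(codec, errors="replace")
```

With these assumptions, the example entry above corresponds to extract_block(data, "begin", 0, block_len=512, extract_encoding="acp"), where data holds the bytes of the .fon file.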



© Mental Computing 2009