Grammar engine dictionary compilation

Search engine dictionary

The faind search engine uses natural language morphology and syntax rules to produce better results. The rules are stored in a number of binary files or SQL database tables, organized for fast access in order to limit the overhead of complex artificial intelligence algorithms. These files (or SQL database tables) are produced by the compiler program.

Reasons to recompile the dictionary

First, the text representation of the dictionary is convenient for the programmer but very inefficient for the search engine. The compiler translates the symbolic names of grammatical classes, coordinates, states and entries into integers that serve as indexes into internal tables.
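The translation of symbolic names into table indexes can be pictured with a minimal sketch. The class `SymbolTable` and the sample names `NOUN` and `VERB` are hypothetical illustrations, not the compiler's actual data structures.

```python
class SymbolTable:
    """Maps symbolic names to dense integer indexes (hypothetical illustration)."""

    def __init__(self):
        self._index_by_name = {}  # name -> integer index
        self._names = []          # integer index -> name

    def intern(self, name):
        """Return the index of name, assigning the next free index on first use."""
        index = self._index_by_name.get(name)
        if index is None:
            index = len(self._names)
            self._index_by_name[name] = index
            self._names.append(name)
        return index

    def name_of(self, index):
        """Reverse lookup, useful for diagnostics."""
        return self._names[index]


# Assumed sample names; a real dictionary declares its own classes.
classes = SymbolTable()
noun = classes.intern("NOUN")
verb = classes.intern("VERB")
noun_again = classes.intern("NOUN")  # same index as the first call
```

After this step the engine can compare and store small integers instead of strings, which is the efficiency gain the paragraph above refers to.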

Second, wordform lookup is very slow without optimization hints, given the total number of entries declared in the dictionary. The compiler creates auxiliary tables that accelerate the search. These tables are stored in the binary dictionary, so there is no need to rebuild them.
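The effect of such auxiliary tables can be illustrated with a small sketch that compares a linear scan with a precomputed index. The sample entries and the function names are assumptions made for illustration; the real dictionary format is not described here.

```python
from collections import defaultdict

# Hypothetical sample data: (entry name, wordforms of that entry).
ENTRIES = [
    ("go", ["go", "goes", "went", "gone", "going"]),
    ("saw", ["saw", "saws"]),
    ("see", ["see", "sees", "saw", "seen", "seeing"]),
]


def lookup_linear(wordform, entries=ENTRIES):
    """Scan every entry: the cost grows with the size of the dictionary."""
    return [i for i, (_, forms) in enumerate(entries) if wordform in forms]


def build_index(entries=ENTRIES):
    """Build the auxiliary table once: wordform -> list of entry indexes."""
    index = defaultdict(list)
    for i, (_, forms) in enumerate(entries):
        for form in forms:
            index[form].append(i)
    return dict(index)


INDEX = build_index()


def lookup_indexed(wordform, index=INDEX):
    """A single hash lookup replaces the scan over all entries."""
    return index.get(wordform, [])
```

Both functions return the same entry indexes; only the indexed variant avoids touching every entry on each query.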

Third, the compiler generates several .cpp files with C++ code that performs "heavy" operations. These files are compiled and linked into the other programs of the project. For example, the syntax analysis rules are used so intensively that even a small performance improvement can boost the overall efficiency of the grammar engine. Translated into C++ and compiled to native machine code, these rules execute much faster.

What is needed for compilation?

First, you need the compiler program: either download it or compile it yourself.

Second, you need the dictionary source files. They are included in the SDK installation package.

How much memory is required?

256 MB of memory is recommended. The maximum amount of memory allocated by compiler.exe for releases 0.75 and 0.80 is about 180 MB. If memory is short, we recommend modifying the dictionary content, for example by removing some parts of the lexicon (see the guide on dictionary modifications).

Dictionary types

There are two basic types of generated dictionary:

1. Based on local binary files

2. Based on an SQL backend server

The first dictionary type is intended for the local desktop search engine. All language information is stored in binary files, which are loaded into memory on search engine startup. It is possible to generate either the full version of the dictionary (see the section on optimized compilation below) or the simplified (light) version (see the section on the "light" dictionary version below).

The second dictionary type requires an SQL backend server to store the language morphology rules. It is designed specifically for the multithreaded server version of the search engine.

All dictionary types use the same source representation: a collection of plain text files. Moreover, every dictionary type is generated by the same compiler program. The target version of the dictionary is specified by command-line options.

Compiler usage

When launched without parameters, the compiler prints brief help:

compiler

[Screenshot: compiler help output on MS Windows XP]

The list of options can be printed with the following option:

compiler -h=1

There are two basic ways to compile the dictionary, and they yield completely different results.

First, it is possible to compile an optimized dictionary with syntax analysis rules. This dictionary is used in complex methods of knowledge discovery in search tools. The main disadvantage of this compilation scheme is that you have to recompile the end programs, because the compiler produces some C++ code (see the third reason to recompile the dictionary above).

Second, one can compile the dictionary without morphology data. This dictionary has no rules section and is therefore much smaller (we call it the lite version), but it can be used only for experimental purposes.

 

Optimized compilation

Optimized compilation produces the full version of the dictionary, suitable for complex methods of text search (with knowledge discovery). The dictionary source files are required for compilation. Unpack them into the folder containing compiler.exe and enter:

compiler -o -j=2 diction.mak

Option -o enables dictionary optimization; -j=2 switches on the basic trace of the parsing process.

The folder \scripts\dictionary contains ready-to-run scripts that build different versions of the dictionary.

Dictionary compilation begins. [Screenshots: compiler output on MS Windows XP, on MS Windows 98 (nearly the same), and on Linux]

Translation takes about a minute on P-IV 2.8 GHz (see the benchmarks).

On success, the program prints Translation completed.

[Screenshots: completion message on MS Windows and on Linux]

 

"Light" dictionary version

"Reduced" version of dictionary becomes available in 0.80 release. Command line must be as follows:

compiler -j=2 -nolinks diction-lite.mak

where diction-lite.mak contains the list of dictionary source files. Options -nolinks and -sg_lite eliminate some information from the dictionary, which makes it smaller but prevents the search engine from using syntax analysis when searching.

You can use english-lite.cmd, russian-lite.cmd and other similar scripts in \scripts\dictionary to create the lite version of the dictionary.

Other options

1. Option -j=3 switches on the extended trace of the parsing process. It can help to locate an error in the dictionary source files:

compiler -j=3 diction.mak

[Screenshot: extended trace output]

2. Option -s dumps the dictionary source code after preprocessing. The output (numbered source lines) is written to the Ygres log file (see -j). This option is useful for debugging macros because the dump contains the result of macro substitution.

3. Option -nolinks makes the compiler skip semantic net compilation. This results in a smaller binary dictionary file, but prevents the search engine from performing online translation and synonymization. This option is available in the 0.80 release.

4. Option -sounds tells the compiler to include sound records in the dictionary (they are skipped by default). This option is available in the 0.80 release.

5. Option -save_affixes produces the file affixes.bin, which contains the morphology information for the search engine stemmer.

6. Option -sg_lite modifies the internal format of the dictionary binary file for the search engine to eliminate all unnecessary information. It substantially decreases the size of the dictionary file.
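The options described above can be combined programmatically. The following is a minimal sketch of a helper that assembles the command line; the function `build_command`, its parameter names and the order in which options are emitted are assumptions made for illustration. Only the flags themselves come from this page.

```python
def build_command(makefile, optimize=False, trace_level=None, dump_source=False,
                  no_links=False, sg_lite=False, sounds=False,
                  save_affixes=False, executable="compiler"):
    """Assemble the argument list for one compiler run (hypothetical helper)."""
    command = [executable]
    if optimize:
        command.append("-o")             # dictionary optimization
    if trace_level is not None:
        command.append(f"-j={trace_level}")  # 2 = basic trace, 3 = extended trace
    if dump_source:
        command.append("-s")             # dump preprocessed source to the log
    if no_links:
        command.append("-nolinks")       # skip semantic net compilation
    if sg_lite:
        command.append("-sg_lite")       # reduced binary format
    if sounds:
        command.append("-sounds")        # include sound records
    if save_affixes:
        command.append("-save_affixes")  # write affixes.bin for the stemmer
    command.append(makefile)             # list of dictionary source files
    return command


full = build_command("diction.mak", optimize=True, trace_level=2)
lite = build_command("diction-lite.mak", trace_level=2, no_links=True)
```

The resulting lists reproduce the two command lines shown earlier on this page and can be passed to `subprocess.run` from the folder that holds the compiler and the dictionary sources.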

Results of compilation

After a successful compilation, the current folder should contain the following files (some of them are created only when the corresponding options are given on the command line):

dictionary.xml - list of dictionary modules, see detailed description.

diction.bin - compiled dictionary. This file is used by other programs. It can also be downloaded from the site (use the downloaded file if the dictionary source files are unchanged).

syntax.bin - syntax analyzer data file.

thesaurus.bin - thesaurus database file.

journal - text file with messages emitted during compilation. It is used only for debugging.

_sg_api.h - C++ header file containing constant declarations for grammar classes, coordinates, states and so on. These constants simplify the use of the Grammar Engine. This file must be copied to LEM\include\solarix.
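A build script can verify the outcome by checking for the files listed above. This is a minimal sketch; the helper `missing_outputs` is hypothetical, and since some files appear only with particular options, the expected list should be adjusted to the options actually used.

```python
from pathlib import Path

# File names taken from the list above. Some of them are created only
# when the corresponding command-line options are given.
EXPECTED_FILES = [
    "dictionary.xml",
    "diction.bin",
    "syntax.bin",
    "thesaurus.bin",
    "journal",
    "_sg_api.h",
]


def missing_outputs(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from folder."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]
```

An empty result means every expected file is present; otherwise the returned names show which outputs to look for in the journal file.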

Additional information

Compiling the grammatical dictionary (Russian version)

Grammatical dictionary configuration file

  
  © Mental Computing 2010

modified 04-Dec-10