Search engine dictionary compilation (YGRES program)

Search engine dictionary

The FAIND search engine uses natural language morphology and syntax rules to produce better results. The rules are stored in a number of binary files or SQL database tables. They are organized for fast access in order to limit the overhead of complex artificial intelligence algorithms. These files (or SQL database tables) are produced by the YGRES program - the search engine dictionary compiler.

Reason to recompile the dictionary

First, the text representation of the dictionary is convenient for the programmer but very inefficient for the search engine. Symbolic names of grammatical classes, coordinates, states and entries are translated into integer numbers - indices into internal tables.

Second, wordform lookup is very slow without optimization hints (the total number of declared entries in the dictionary can be seen here). The YGRES compiler creates auxiliary tables that accelerate the search. These tables are stored in the binary dictionary, so they do not have to be rebuilt each time the dictionary is loaded.

Third, the YGRES compiler generates several .cpp files with C++ code that performs "heavy" operations. These files are compiled and linked into the other programs of the project. For example, the rules for syntax analysis are used so intensively that even a small performance improvement can boost the overall efficiency of the grammar engine. Translated into C++ and compiled to native machine code, these rules execute much faster.

What is needed for compilation?

First of all, download (or compile) the YGRES program. Please note that ygres.exe is accompanied by ygres.ini - a text configuration file.

Second, download the dictionary source files. Unpack them into the folder with ygres.exe (it is also possible to use -dir folder_name).
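The two prerequisites above can be checked before starting a long compilation. The following is a minimal sketch, not part of the YGRES distribution: the function name ygres_preflight is hypothetical, and it assumes that ygres.ini and the makefile (diction.mak, used later on this page) sit in the same folder.

```shell
# Preflight check before running the compiler (illustrative sketch).
# ygres_preflight DIR [MAKEFILE]
#   DIR      - folder that holds the compiler and the unpacked sources
#   MAKEFILE - source list to compile (assumed default: diction.mak)
# Prints one line per missing file; returns non-zero if any is missing.
ygres_preflight() {
    dir=$1
    makefile=${2:-diction.mak}
    status=0
    for f in ygres.ini "$makefile"; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            status=1
        fi
    done
    return "$status"
}
```

For example, ygres_preflight . checks the current folder and ygres_preflight . diction-lite.mak checks for the light makefile instead.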

How much memory is required?

256 MB of memory is recommended. The maximum amount of memory allocated by ygres.exe for releases 0.75 and 0.80 is about 180 MB (see screenshot). If memory is short, we recommend reducing the dictionary content (for example, by removing some parts of the lexicon - read the guide on dictionary modifications).

Dictionary types

There are two basic types of generated dictionary:

1. Local binary files based

2. SQL backend server based

The first dictionary type is intended for the local desktop search engine. All language information is stored in binary files, which are loaded into memory on search engine startup. It is possible to generate either the full version (see "Optimized compilation" below) or the simplified (light) version (see "Light" dictionary version below) of the dictionary.

The second dictionary type requires an SQL backend server to store the language morphology rules. It is specially designed for the multithreaded server version of the search engine. See "SQL server dictionary generation" below for more information.

All dictionary types use the same source representation of the dictionary - a collection of plain text files. Moreover, every dictionary type is generated by the same YGRES compiler program. The target version of the dictionary is selected by command line options.

YGRES program usage

Launched without parameters, the program prints brief help:

ygres

[Screenshot: brief help printed by ygres on MS Windows XP]

The list of options can be printed with:

ygres -h=1

There are two basic ways to compile the dictionary, and they yield completely different results.

First, it is possible to compile an optimized dictionary with rules for syntax analysis. This dictionary is used in complex methods of knowledge discovery in search tools. The main disadvantage of this compilation scheme is that you have to recompile the end programs, because the compiler produces some C++ code (see "Results of compilation" below).

Second, one can compile the dictionary without optimization. This dictionary has no rules section, and for this reason it is much smaller (we call it the lite version), but it can be used only for experimental purposes.

 

Optimized compilation

Optimized compilation produces the full version of the dictionary, suitable for complex methods of text search (with knowledge discovery). The dictionary source files are required for compilation. Unpack them into the folder with ygres.exe and enter:

ygres -o -j=2 diction.mak

Option -o enables dictionary optimization; -j=2 switches on the basic trace of the parsing process.

[Screenshots: start of dictionary compilation on MS Windows XP, MS Windows 98 (nearly the same output) and Linux]

Translation takes about a minute on a Pentium 4 2.8 GHz (see the benchmarks).

On success the program prints Translation completed.
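In a build script the success message can be checked automatically. The sketch below is an illustration only: the function name compile_full_dictionary and the YGRES variable are hypothetical, and because this page does not say whether the message goes to standard output or standard error, or what exit code ygres returns, the sketch captures both streams and relies on the message alone.

```shell
# Run the optimized compilation and check for the success message.
# YGRES may hold the path of the compiler; it defaults to "ygres".
compile_full_dictionary() {
    # -o enables optimization, -j=2 enables the basic parsing trace.
    output=$("${YGRES:-ygres}" -o -j=2 diction.mak 2>&1)
    printf '%s\n' "$output"
    case $output in
        *"Translation completed"*)
            return 0 ;;
        *)
            echo "ygres did not report 'Translation completed'" >&2
            return 1 ;;
    esac
}
```

A script would call it as compile_full_dictionary || exit 1, so that later steps run only after a successful translation.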

[Screenshots: completion message on MS Windows and Linux]

 

"Light" dictionary version

The reduced version of the dictionary is available starting with release 0.80. The command line is as follows:

ygres -j=2 -nolinks -sg_lite diction-lite.mak

where diction-lite.mak contains the list of dictionary source files. Options -nolinks and -sg_lite leave some information out of the dictionary. This makes the dictionary smaller, but prevents the search engine from using syntax analysis when searching.
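The two command lines shown so far (full and light) can be kept in one place so that build scripts do not repeat them. This is a small sketch; the function name ygres_args and the variant names full and lite are hypothetical, while the options themselves are copied from the commands above.

```shell
# Print the ygres options for a dictionary variant (sketch).
# ygres_args full|lite
ygres_args() {
    case $1 in
        full) echo "-o -j=2 diction.mak" ;;
        lite) echo "-j=2 -nolinks -sg_lite diction-lite.mak" ;;
        *)    echo "usage: ygres_args full|lite" >&2
              return 2 ;;
    esac
}
```

It would be used as ygres $(ygres_args lite), with the substitution left unquoted so that the shell splits the options into separate arguments.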

 

SQL server dictionary generation

This feature is partially available in release 0.82.

Option -sql=xxx produces an SQL script that uploads the dictionary into an SQL database. Parameter xxx is the name of the target SQL server:

    unknown - generic backend SQL server (no table definition commands will be issued in script)

    oracle - Oracle SQL backend server

    mysql - MySQL backend server

The generated script contains the statements that create the necessary tables and constraints (primary indexes). For a particular RDBMS the upload can be performed with the sqlplus tool for Oracle or the mysql console client for MySQL.

The SQL version of the dictionary is used by the server version of the search engine.
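As an illustration of the upload step, the sketch below prints (without executing) a typical client invocation for each backend. The function name upload_command, the account name and the database name are hypothetical placeholders; the script name dictionary.sql is the one listed under "Results of compilation". Both clients prompt for the password in the form shown.

```shell
# Print the client command that would upload dictionary.sql (dry run).
# upload_command BACKEND USER DATABASE
#   BACKEND  - oracle or mysql, matching the value given to -sql=
#   USER     - hypothetical database account name
#   DATABASE - connect identifier (Oracle) or database name (MySQL)
upload_command() {
    backend=$1
    user=$2
    db=$3
    case $backend in
        oracle) echo "sqlplus $user@$db @dictionary.sql" ;;
        mysql)  echo "mysql -u $user -p $db < dictionary.sql" ;;
        *)      echo "no upload command known for backend '$backend'" >&2
                return 1 ;;
    esac
}
```

The function deliberately rejects the generic backend: a script produced with -sql=unknown contains no table definitions, so the tables would have to be created by other means first.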

 

Other options

1. Option -j=3 switches on the extended trace of the parsing process. It can help to locate errors in the dictionary source files:

ygres -j=3 diction.mak

[Screenshot: output of the extended trace]

2. Option -s dumps the dictionary source code after preprocessing. The result (numbered lines of source code) is written to the YGRES log file. This option is useful for debugging macros, because the dump contains the result of macro substitution.

3. Option -nolinks makes the compiler skip semantic net compilation. This results in a smaller binary dictionary file, but prevents the search engine from performing online translation and synonymization. This option is available in release 0.80.

4. Option -sounds tells the compiler to include sound records in the dictionary (they are skipped by default). This option is available in release 0.80.

5. Option -save_affixes produces the file affixes.bin, which contains the morphology information for the search engine.

6. Option -sg_lite modifies the internal format of the dictionary binary file for the FAIND search engine to eliminate all unnecessary information. It substantially decreases the size of the dictionary file.

Results of compilation

After compilation completes, the current folder should contain the following files (some of them are created only when the corresponding options are given on the command line):

diction.bin - compiled dictionary. This file is used by other programs. It can also be downloaded from the site (use it if the dictionary source files are unchanged).

journal - text file with messages emitted during compilation. It is used only for debugging.

_aa_rulz.cpp, _sx_tpu.cpp, _la_find.cpp, _aa_groupz.cpp - these files contain the result of translation from PRIISK to C++ (YGRES translates the dictionary source code written in the PRIISK language into C++ code). These files must be copied into \LEM\AI\Ygres; do not forget to do this after each dictionary compilation.

_sg_api.h - C++ header file containing the constants for grammar classes, coordinates, states and so on. These constants simplify the use of the Grammar Engine. This file must be copied to LEM\Include\Solarix.

dictionary.sql - script for uploading the dictionary into the SQL backend server.
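Because the generated C++ files have to be copied after every compilation, the step is easy to script. The sketch below is one possible way to do it on a Unix-like system; the function name install_generated is hypothetical, and it assumes the LEM source tree already contains the AI/Ygres and Include/Solarix folders named above.

```shell
# Copy the generated sources into the LEM source tree (sketch).
# install_generated BUILD_DIR LEM_ROOT
#   BUILD_DIR - folder where ygres was run
#   LEM_ROOT  - root of the LEM source tree (target folders must exist)
install_generated() {
    build=$1
    lem=$2
    # Rules translated from PRIISK to C++.
    for f in _aa_rulz.cpp _sx_tpu.cpp _la_find.cpp _aa_groupz.cpp; do
        cp "$build/$f" "$lem/AI/Ygres/" || return 1
    done
    # Header with constants for grammar classes, coordinates and states.
    cp "$build/_sg_api.h" "$lem/Include/Solarix/" || return 1
}
```

The function stops at the first file it cannot copy, so a missing generated file is noticed before the end programs are rebuilt against stale sources.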

changed 30-SEP-2005

© Mental Computing 2009