The faind search engine uses natural language morphology and syntax rules to produce better results. The rules are stored in a number of binary files or SQL database tables, organized for fast access in order to limit the overhead of complex artificial intelligence algorithms. These files (or SQL database tables) are produced by the YGRES program, the search engine dictionary compiler.
First, the text representation of the dictionary is convenient for a programmer but very inefficient for the search engine. Symbolic names of grammatical classes, coordinates, states and entries are translated into integers that serve as indices into internal tables.
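The translation of symbolic names into table indices can be pictured with a minimal sketch. It is not the YGRES implementation: the SymbolTable class and the names NOUN and VERB are hypothetical and only illustrate the idea of replacing each name with a dense integer.

```python
class SymbolTable:
    """Hypothetical helper: maps symbolic names to dense integer indices."""

    def __init__(self):
        self._index_by_name = {}  # name -> index
        self._names = []          # index -> name

    def intern(self, name):
        # Return the existing index, or assign the next free one.
        index = self._index_by_name.get(name)
        if index is None:
            index = len(self._names)
            self._index_by_name[name] = index
            self._names.append(name)
        return index

    def name_of(self, index):
        # Reverse lookup, useful for diagnostics and dumps.
        return self._names[index]


classes = SymbolTable()
noun = classes.intern("NOUN")  # assumed class name, for illustration only
verb = classes.intern("VERB")  # assumed class name, for illustration only
```

After this step the rest of the program can compare and store small integers instead of strings, which is the benefit the paragraph above describes.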
Second, wordform lookup is very slow without optimization hints (the total number of declared entries in the dictionary can be seen here). The YGRES compiler creates auxiliary tables that accelerate the search. These tables are stored in the binary dictionary, so there is no need to rebuild them.
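Why a precomputed table helps can be shown with a small sketch. The real format of the YGRES auxiliary tables is not described here, so the following assumes the simplest possible scheme: a hash index from wordform to entry numbers, built once and then reused. The sample entries are invented for illustration.

```python
from collections import defaultdict

# Invented sample lexicon: (entry name, list of wordforms).
entries = [
    ("cat", ["cat", "cats"]),
    ("run", ["run", "runs", "ran", "running"]),
    ("saw", ["saw", "saws"]),
    ("see", ["see", "sees", "saw", "seen"]),
]


def lookup_linear(wordform):
    # Unoptimized lookup: scans every entry on every query.
    return [i for i, (_, forms) in enumerate(entries) if wordform in forms]


def build_index(entries):
    # Auxiliary table, built once: wordform -> indices of matching entries.
    index = defaultdict(list)
    for i, (_, forms) in enumerate(entries):
        for form in forms:
            index[form].append(i)
    return dict(index)


index = build_index(entries)


def lookup_indexed(wordform):
    # Optimized lookup: a single dictionary access.
    return index.get(wordform, [])
```

Both functions return the same entry numbers; the indexed version simply avoids rescanning the whole lexicon for each query.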
Third, the YGRES compiler generates several .cpp files with C++ code that performs "heavy" operations. These files are compiled and linked into the other project programs. For example, the rules for syntax analysis are used so intensively that even a small performance improvement can boost the overall efficiency of the grammar engine. Once translated into C++ and compiled to native machine code, these rules execute much faster.
First of all, it is necessary to download (or compile) the Ygres program. Please note that ygres.exe is accompanied by ygres.ini, a text configuration file.
Second, it is necessary to download the dictionary source code. Unpack it into the folder containing ygres.exe (it is also possible to use the -dir folder_name option).
256 MB of memory is recommended. The maximum amount of memory allocated by ygres.exe for releases 0.75 and 0.80 is about 180 MB (see screenshot). In case of memory shortage we recommend modifying the dictionary content (for example, removing some parts of the lexicon; read the guide on dictionary modifications).
There are two basic types of generated dictionary:
1. Based on local binary files
2. Based on an SQL backend server
The first dictionary type is intended for the local desktop search engine. All language information is stored in binary files, which are loaded into memory when the search engine starts. It is possible to generate either the full version (read below) or the simplified (light) version (read this) of the dictionary.
The second dictionary type requires an SQL backend server to store the language morphology rules. It is specially designed for the multithreaded server version of the search engine. Read this paragraph for more information.
All types of dictionary use the same source representation: a collection of plain text files. Moreover, every type of dictionary is generated by the same YGRES compiler program. The target version of the dictionary is specified by command line options.
When launched without parameters, the compiler prints brief help:
ygres
The result is as follows (for MS Windows XP):

The list of options can be printed with the -h=1 option:
ygres -h=1
There are two basic ways to compile the dictionary, and they yield completely different results.
First, it is possible to compile an optimized dictionary with rules for syntax analysis. This dictionary is used in complex methods of knowledge discovery in search tools. The main disadvantage of this compilation scheme is that you have to recompile the end programs, because the compiler produces some C++ code (read this).
Second, one can compile the dictionary without optimization. This dictionary has no rules section and is therefore much smaller (we call it the lite version), but it can be used only for experimental purposes.
Compiling with optimization results in the full version of the dictionary, suitable for complex methods of text search (with knowledge discovery). The dictionary source code is necessary for compilation. Unpack it into the folder containing ygres.exe and enter:
ygres -o -j=2 diction.mak
Option -o enables dictionary optimization; -j=2 switches on the basic trace of the parsing process.
Dictionary compilation begins (MS Windows XP screenshot):

For MS Windows 98 the situation is nearly the same:

Screenshot for Linux:

Translation takes about a minute on a P-IV 2.8 GHz machine (see the benchmarks).
On success the program prints Translation completed.
For MS Windows:

For Linux:

The "reduced" version of the dictionary is available starting with release 0.80. The command line must be as follows:
ygres -j=2 -nolinks -sg_lite diction-lite.mak
where diction-lite.mak contains the list of dictionary source files. Options -nolinks and -sg_lite remove some information from the dictionary and make it smaller, but prevent the search engine from using syntax analysis when searching.
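For build scripts it can be convenient to assemble the two command lines shown above in one place. The following is a minimal sketch under stated assumptions: the function name ygres_command and the target labels "full" and "lite" are hypothetical, while the options themselves are exactly those used in the commands above.

```python
def ygres_command(target, makefile, trace_level=2):
    """Hypothetical helper: build the argument list for a ygres run."""
    args = ["ygres"]
    if target == "full":
        # Optimized dictionary with syntax analysis rules.
        args.append("-o")
        args.append(f"-j={trace_level}")
    elif target == "lite":
        # Reduced dictionary: no semantic net, compact binary format.
        args.append(f"-j={trace_level}")
        args += ["-nolinks", "-sg_lite"]
    else:
        raise ValueError(f"unknown target: {target!r}")
    args.append(makefile)
    return args


full_cmd = ygres_command("full", "diction.mak")
lite_cmd = ygres_command("lite", "diction-lite.mak")
```

The returned list can be passed to subprocess.run from the folder that contains ygres.exe and the dictionary sources.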
This feature is partially available in the 0.82 release.
Option -sql=xxx produces an SQL script that uploads the dictionary into an SQL database. Parameter xxx is the name of the target SQL server:
unknown - generic backend SQL server (no table definition commands will be issued in script)
oracle - Oracle SQL backend server
mysql - MySQL backend server
The generated script contains the statements that create the necessary tables and constraints (primary indexes). The script can be uploaded with the sqlplus tool for Oracle or with the mysql console client for MySQL.
The SQL version of the dictionary is used by the server version of the search engine.
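The upload step can be sketched as follows. This is an illustration, not part of the YGRES distribution: the helper names, the user name and the database name are assumptions, and only widely used client invocations are relied on (the mysql client reading the script from standard input, and sqlplus running a script given as @file).

```python
import subprocess


def upload_command(server, user, database, script="dictionary.sql"):
    """Hypothetical helper: return (argv, stdin_path) for the upload client."""
    if server == "mysql":
        # mysql reads the script from standard input and prompts for a password.
        return ["mysql", f"--user={user}", "--password", database], script
    if server == "oracle":
        # sqlplus runs the script passed as @file and prompts for a password.
        return ["sqlplus", f"{user}@{database}", f"@{script}"], None
    raise ValueError(f"no upload client known for server: {server!r}")


def upload(server, user, database, script="dictionary.sql"):
    # Runs the client; requires the client tool to be installed and on PATH.
    argv, stdin_path = upload_command(server, user, database, script)
    if stdin_path is None:
        subprocess.run(argv, check=True)
    else:
        with open(stdin_path, "rb") as script_file:
            subprocess.run(argv, stdin=script_file, check=True)


# Assumed account and database names, for illustration only.
mysql_argv, mysql_stdin = upload_command("mysql", "faind", "lexicon")
```

Keeping the command construction separate from its execution makes it possible to inspect the command before anything is sent to the server.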
1. Option -j=3 switches on the extended trace of the parsing process. It can help to locate errors in the dictionary source files:
ygres -j=3 diction.mak
results in:

2. Option -s dumps the dictionary source code after preprocessing. The output (numbered lines of source code) is written to the Ygres log file. This option is useful for debugging macros because the dump contains the result of macro substitution.
3. Option -nolinks makes the compiler skip semantic net compilation. This results in a smaller binary dictionary file, but prevents the search engine from performing online translation and synonymization. This option is available starting with release 0.80.
4. Option -sounds tells the compiler to include sound records in the dictionary (they are skipped by default). This option is available starting with release 0.80.
5. Option -save_affixes produces the file affixes.bin, which contains the morphology information for the search engine.
6. Option -sg_lite modifies the internal format of the dictionary binary file for the FAIND search engine to eliminate all unnecessary information. It substantially decreases the size of the dictionary file.
After compilation completes, the current folder should contain the following files (some of them are created only when the corresponding options are given on the command line):
diction.bin - the compiled dictionary. This file is used by other programs. It can also be downloaded from the site (use that copy if the dictionary source code is unchanged).
journal - text file with messages emitted during compilation. It is used only for debugging.
_aa_rulz.cpp, _sx_tpu.cpp, _la_find.cpp, _aa_groupz.cpp - these files contain the result of translation from PRIISK to C++ (YGRES translates the dictionary source code written in the PRIISK language into C++ code). These files must be copied into \LEM\AI\Ygres; do not forget to do this after each dictionary compilation.
_sg_api.h - C++ header file containing the constants for grammar classes, coordinates, states and so on. These constants simplify the use of the Grammar Engine. This file must be copied to LEM\Include\Solarix.
dictionary.sql - script for uploading the dictionary into the SQL backend server.
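Because the generated C++ files must be copied after every compilation, the step is easy to forget and worth scripting. The sketch below is one possible way to do it, not an official tool: the function name install_generated and its parameters are hypothetical, while the file names and target folders are the ones listed above.

```python
import shutil
from pathlib import Path

# Files produced by the PRIISK to C++ translation, as listed above.
GENERATED_CPP = ["_aa_rulz.cpp", "_sx_tpu.cpp", "_la_find.cpp", "_aa_groupz.cpp"]
GENERATED_HEADER = "_sg_api.h"


def install_generated(build_dir, lem_root):
    """Hypothetical helper: copy generated sources into the LEM source tree."""
    build_dir = Path(build_dir)
    lem_root = Path(lem_root)
    cpp_target = lem_root / "AI" / "Ygres"
    header_target = lem_root / "Include" / "Solarix"
    # Create the target folders if they are missing (a choice made for this sketch).
    cpp_target.mkdir(parents=True, exist_ok=True)
    header_target.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in GENERATED_CPP:
        # shutil.copy2 raises FileNotFoundError if a generated file is absent.
        copied.append(Path(shutil.copy2(build_dir / name, cpp_target / name)))
    copied.append(Path(shutil.copy2(build_dir / GENERATED_HEADER,
                                    header_target / GENERATED_HEADER)))
    return copied
```

Calling install_generated(".", lem_root) right after a successful compilation keeps the end programs in step with the dictionary; a missing generated file stops the copy with an error instead of leaving a stale mix of old and new sources unnoticed.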
changed 30-SEP-2005
© Mental Computing 2009