The dictionary compiler generates the dictionary.xml file and fills it with the names of the database files it creates. By default this file is written to ...\bin-windows on Windows or .../bin-linux on Linux. It can be used without modification by all SDK programs and components to load the grammatical dictionary.
The configuration file describes the following data files and modules. All data file paths are relative to the folder that contains dictionary.xml.
1. alphabet.bin - alphabet description: the list of symbols and their characteristics.
2. diction.bin - word entries for non-RDBMS version of dictionary.
3. Morphology datafile format.
4. lexicon.db - morphology database containing the definitions of parts of speech, grammatical attributes and other elements.
5. Thesaurus datafile format.
6. Thesaurus datafile name. It contains the relations between words: synonyms, antonyms, translations and so on.
7. Quick word search module datafile.
8. Lemmatizer data file. See additional information below.
9. Datafile for morphological analyser of non-dictionary words.
10. Datafile for instant word search module. It is used by Russian Grammatical Dictionary GUI program only.
11. Syntactical analyser and transformation rules.
12. Information for interactive debugger. It can be removed in release version.
13. Name of folder containing the Ngrams datafiles.
14. Tokenization module path. It specifies the name of the DLL/SO/DYLIB of the text segmentation engine.
15. Stemmer path. It is used by the search engine only.
16. The name of the text file containing the list of stopwords. It is used by the search engine only.
Grammatical dictionary modules can store their data and rules in relational databases. The connection to the database server is described by a set of parameters known as a connection string. The connection string usually contains the server address, the listener port number and the authentication data.
The configuration for a particular module consists of the RDBMS name and a connection string. The following example shows the thesaurus and N-grams connected to a MySQL database. The connection strings for both modules are the same, so the alias Local is used. The connection alias must be defined in the connections section:
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
 ...
 <thesaurus_provider>mysql</thesaurus_provider>
 <thesaurus_db>Local</thesaurus_db>
 <ngrams_provider>mysql</ngrams_provider>
 <ngrams_db>Local</ngrams_db>
 ...
 <connections>
  <connection name="Local">host=127.0.0.1;port=3306;login=root;db=solarix;pool_size=1</connection>
 </connections>
</dataroot>
The following configuration is equivalent to the previous one, with the connection string written out explicitly in each node instead of an alias:
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
 ...
 <thesaurus_provider>mysql</thesaurus_provider>
 <thesaurus_db>host=127.0.0.1;port=3306;login=root;db=solarix;pool_size=1</thesaurus_db>
 <ngrams_provider>mysql</ngrams_provider>
 <ngrams_db>host=127.0.0.1;port=3306;login=root;db=solarix;pool_size=1</ngrams_db>
 ...
</dataroot>
There is no limit on the number of connection string aliases in the connections section. You can define several strings in connections and refer to only one of them.
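As an illustration of this point, the sketch below defines two aliases while the thesaurus refers to only one of them. The alias Remote and its host name db.example.com are hypothetical and not taken from the SDK. The other nodes follow the examples shown earlier, and the nodes a real dictionary.xml would also contain are omitted.

```xml
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
  <thesaurus_provider>mysql</thesaurus_provider>
  <!-- Only the alias Local is referenced by a module -->
  <thesaurus_db>Local</thesaurus_db>
  <connections>
    <connection name="Local">host=127.0.0.1;port=3306;login=root;db=solarix;pool_size=1</connection>
    <!-- Hypothetical second alias: defined here but not referenced anywhere -->
    <connection name="Remote">host=db.example.com;port=3306;login=root;db=solarix;pool_size=1</connection>
  </connections>
</dataroot>
```

Switching the thesaurus to the other server would then presumably only require changing the alias name in thesaurus_db.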
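A MySQL connection string is a semicolon-separated list of name=value pairs, for example host=127.0.0.1;port=3306;login=root;db=solarix;pool_size=1. The parameters are as follows: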
host - server address, either an IP address or a DNS name
port - listener port number
login - user login name
psw - password
db - database (schema) name
pool_size - connection pool size; 1 is enough in most cases.
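A connection entry that uses all six parameters might look like the following sketch. The password value secret is a placeholder, and the position of psw within the string is an assumption, since the examples in this document do not include a password.

```xml
<connections>
  <!-- psw=secret is a placeholder; replace it with the real password -->
  <connection name="Local">host=127.0.0.1;port=3306;login=root;psw=secret;db=solarix;pool_size=1</connection>
</connections>
```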
The lemmatizer is the part of the grammatical dictionary that returns the lemmas of words. It is an optional module, so its configuration may be absent or disabled by an XML attribute.
There are two alternative implementations of the lemmatizer. The first implementation stores the lemmatization rules in a relational SQL database; the other loads them from a binary file. Each implementation requires a different set of XML nodes to describe its configuration.
The following example describes the SQL implementation of lemmatizer connecting to MySQL on localhost:
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
 ...
 <lemmatizer_provider enabled="true">mysql</lemmatizer_provider>
 <lemmatizer_db>Local</lemmatizer_db>
 ...
 <connections>
  <connection name="Local">host=127.0.0.1;port=3306;login=root;db=solarix;pool_size=1</connection>
 </connections>
</dataroot>
The enabled attribute can be set to false to disable the lemmatizer. The connection string in lemmatizer_db is a reference to the corresponding entry in the connections node.
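For instance, the SQL lemmatizer from the example above could be switched off as sketched below. It is assumed here, not confirmed by the SDK documentation, that the remaining lemmatizer nodes can stay in place while the module is disabled.

```xml
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
  <!-- enabled="false" disables the lemmatizer -->
  <lemmatizer_provider enabled="false">mysql</lemmatizer_provider>
  <!-- Assumption: the alias reference may remain while the module is disabled -->
  <lemmatizer_db>Local</lemmatizer_db>
  <connections>
    <connection name="Local">host=127.0.0.1;port=3306;login=root;db=solarix;pool_size=1</connection>
  </connections>
</dataroot>
```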
The other implementation of the lemmatizer requires the name of the binary file containing the lemmatization rules:
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
 ...
 <lemmatizer enabled="true" flags="default" absolute="false">lemmatizer.db</lemmatizer>
 ...
</dataroot>
This configuration is created by default as a result of dictionary compilation.
The lemmatizer node has several optional attributes.
The enabled attribute can be used to disable the lemmatizer entirely. The flags attribute sets the lemmatizer's speed: default is the slowest mode, and the other options are faster and fastest. The default mode requires the least memory, whereas fastest allocates a lot of RAM.
The path to the lemmatization rules data file is relative to the dictionary.xml directory by default. You can make the path absolute by setting absolute="true".
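The sketch below combines these attributes for a setup that trades memory for speed and keeps the rules file outside the dictionary folder. The path /opt/solarix/data/lemmatizer.db is hypothetical; the attribute names and the value fastest are the ones described above.

```xml
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
  <!-- flags="fastest": highest speed, largest RAM footprint -->
  <!-- absolute="true": the path is used as written, not relative to dictionary.xml -->
  <!-- The file location is a hypothetical example -->
  <lemmatizer enabled="true" flags="fastest" absolute="true">/opt/solarix/data/lemmatizer.db</lemmatizer>
</dataroot>
```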
Segmenters divide continuous text into words. They are necessary for languages such as Chinese or Japanese, which do not use spaces to mark word boundaries.
A language declaration can include the name of an external segmenter. By default this name is combined with the default dynamic library extension for the target platform and resolved relative to the dictionary.xml folder.
It is possible to set additional information for segmenter load and initialization:
<segmentation_engines>
 <segmenter>
  <name>chinese_segmenter</name>
  <module>chinese_segmenter-v5.dll</module>
  <libpath absolute="true">e:\mvoice\lem\bin-windows</libpath>
  <datapath absolute="true">e:\mvoice\lem\bin-windows</datapath>
  <params></params>
 </segmenter>
</segmentation_engines>
Each segmenter is described in its own segmenter subnode.
The name subnode holds the segmenter name, which serves as its identifier. The actual name of the dynamic library to load is given in the module node. The relative or absolute path to the folder with the DLL/SO is given in the libpath node. Some segmenters require data files; use the datapath node to define the relative or absolute path to the folder that holds them. The last subnode is params. It is an arbitrary string of parameters passed to the segmenter's initialization procedure, and the segmenter itself interprets them.
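A Linux variant with relative paths might look like the sketch below. Several details are assumptions rather than documented behavior: the module name chinese_segmenter.so and the folder names segmenters and segmenters/data are hypothetical, absolute="false" on libpath and datapath is used by analogy with the lemmatizer node, and relative paths are assumed to be resolved against the dictionary.xml folder.

```xml
<segmentation_engines>
  <segmenter>
    <!-- Identifier referred to by the language declaration -->
    <name>chinese_segmenter</name>
    <!-- Hypothetical shared library name for Linux -->
    <module>chinese_segmenter.so</module>
    <!-- Assumption: absolute="false" marks a path relative to the dictionary.xml folder -->
    <libpath absolute="false">segmenters</libpath>
    <datapath absolute="false">segmenters/data</datapath>
    <!-- No initialization parameters are passed -->
    <params></params>
  </segmenter>
</segmentation_engines>
```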
There are two variants of the N-grams database architecture. The first variant uses local data files and either the SQLite RDBMS or a proprietary NoSQL engine to store and access the data.
The folder path is specified in ngrams node:
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
 ...
 <ngrams>data_folder</ngrams>
 ...
</dataroot>
This path is relative to the dictionary.xml folder. Attribute absolute="true" can be used to make this path absolute:
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
 ...
 <ngrams absolute="true">c:\data_folder</ngrams>
 ...
</dataroot>
The second variant of N-grams storage is a client-server architecture. The RDBMS name is specified in ngrams_provider, e.g. mysql. The connection string can be specified explicitly or via an alias in the ngrams_db node:
<?xml version="1.0" encoding="utf-8"?>
<dataroot>
 ...
 <ngrams_provider>mysql</ngrams_provider>
 <ngrams_db>Local</ngrams_db>
 ...
</dataroot>
© Mental Computing 2009