Indexer usage examples
An index speeds up keyword searches in documents. The index database stores the results of preprocessing the documents. Without an index database (the search engine supports this mode for some specific cases), each search requires loading and testing every file in the domain, for example on a CD. When you search the same files again and again, the index database greatly improves performance. In some cases the list of matching files can be obtained in seconds.
The other side of the coin is the additional disk space occupied by the index database files. The FAIND indexer has several indexing schemes with different performance, one of which produces database files equal to 2-7% of the size of the indexed documents (depending on the language of the target documents: smaller for English or French, larger for Russian). This is a very good result: most search engines produce much larger databases, up to half of the original document size, and do not support different indexing schemes. The indexing scheme can be selected by search engine commands for each index independently.
The time needed to index the documents is roughly equal to the time needed to perform a search in those documents (see benchmarks).
Index creation takes two steps. First, a new index is declared (-index create_domain). Second, the documents in the search domain are indexed. After that, the created index can be used for searching.
There are several indexer engines. You can choose the appropriate engine when creating a new index.
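The difference between non-indexed and indexed search can be pictured with a minimal Python sketch. It is a generic inverted index, not FAIND's actual database format; the function names and the sample documents are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def scan_search(documents, keyword):
    """Non-indexed search: every document is tokenized again on each query."""
    keyword = keyword.lower()
    return sorted(name for name, text in documents.items()
                  if keyword in tokenize(text))

def build_index(documents):
    """Preprocess the documents once: map each keyword to the documents containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def indexed_search(index, keyword):
    """Indexed search: one dictionary lookup, no document is read again."""
    return sorted(index.get(keyword.lower(), set()))

# Hypothetical sample documents (file name -> text).
documents = {
    "a.txt": "The cat sat on the mat",
    "b.txt": "A dog chased the cat",
    "c.txt": "Birds sing in the morning",
}
index = build_index(documents)
print(indexed_search(index, "cat"))  # ['a.txt', 'b.txt']
```

Both functions return the same list of files; the indexed version pays the preprocessing cost once, which is why repeated searches over the same files benefit most.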
1. faind - fast built-in native indexer, best suited for big sets of unchangeable documents. It creates very compact databases.
2. clucene - fast indexer for changeable documents. The most important feature of this engine is instant reindexing of changed documents.
3. MySQL backend - slow but robust, fault-tolerant engine.
There are three basic indexing schemes, which differ in database size, indexing speed and search performance. They are as follows:
1. Fastest indexing. The words collected from documents are stored as is, without applying morphology rules. The index is built very quickly (see benchmarks).
2. Most compact database. The index database contains only the base forms of words, so the database file is the smallest (see benchmarks). This method is slightly slower than scheme 1. This scheme is enabled by the -index wordforms option.
3. Instant phrase search. If your application needs instant phrase search, this scheme is the most suitable. Indexing is rather slow and the index database is very big. The previous indexing schemes perform quick word search, but phrase search can take a lot of time with them. This scheme is enabled by the -index wordforms -index proximity options.
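Why storing base forms shrinks the keyword list (scheme 2 compared with scheme 1) can be shown with a toy Python sketch. The BASE_FORMS table is a hypothetical stand-in for real morphology rules, and nothing here reflects how FAIND implements them.

```python
# Hypothetical toy table mapping word forms to base forms; a real
# morphology module would cover the whole language.
BASE_FORMS = {
    "cats": "cat",
    "dogs": "dog",
    "runs": "run",
    "running": "run",
    "ran": "run",
}

def keywords_as_is(words):
    """Scheme 1 analogue: every distinct word form becomes a keyword."""
    return {word.lower() for word in words}

def keywords_base_forms(words):
    """Scheme 2 analogue: word forms are reduced to their base form first."""
    return {BASE_FORMS.get(word.lower(), word.lower()) for word in words}

words = "cat cats run runs running ran dog dogs".split()
print(len(keywords_as_is(words)))       # 8
print(len(keywords_base_forms(words)))  # 3
```

The extra lookup per word is the price of the smaller keyword list, which matches the trade-off described above: a more compact database at the cost of somewhat slower indexing.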
There are additional commands to store the following information in the index:
keyword frequencies (for ranking the documents): -index frequency;
results of document classification: -index topic.
A search domain is a set of files defined by the commands -dir, -file or -uri. Each domain is indexed and searched separately and independently of other domains.
There are two types of indexes: static and dynamic.
Documents in a static index must not change at all. Once created, such indexes are searched fast, and the static index database is the most compact. A good example of a static zone is the files on a compact disc. If some files in a static index are modified, you have to refresh the index: 1) delete the old index (-index domain=xxx -index purge), 2) build a new index (-index domain=xxx -index reindex). When searching in a static index you cannot use the -dir, -file or -uri commands, but filters (-iname, -modif, -creat, -size) are acceptable.
Dynamic indexes hold files that change rarely. The indexer creates a bigger index database than for static indexes, but the search is still fast. The command -index domain=xxx -index refresh must be used to refresh the dynamic index database. You can use the -dir, -file or -uri commands to define the documents to be scanned, but these commands make searching much slower. Filters (-iname, -modif, -creat, -size) are acceptable and have far less impact on search performance.
To print the statistics of the index database:
You'll see the path to the folder with indexer's data files.
This option can be combined with -index domain=xxx to get information about a named zone:
-index domain=xxx -index info
dumps detailed statistics, but it is intended for debugging purposes only.
It is very convenient to create a separate index for each CD, DVD or hard drive folder with a large number of documents. Each index is maintained as a separate database, which decreases the overhead of manipulating big lists of keywords.
A static index is created this way:
The native built-in index engine (a.k.a. faind) is used in this case.
When there is no guarantee that the files in the index will not change, you can create a dynamic index using the dynamic flag:
-index create_domain=zone_name clucene dynamic
The CLucene indexer engine is assigned to this index by the 'clucene' keyword.
Once the index is created, you can build the index database, for example:
-index domain=name -dir folders
-index import mysql host db login password
copies the index descriptions (not the index databases) from the given MySQL server.
dumps the list of created indexes.
Use the command:
in order to completely delete the database files and remove the declaration of index.
selects one working index. For example, the command -index domain="xxx" -index info prints the information about the index "xxx".
It is also possible to select multiple indexes with -index domain="xxx;yyy;zzz". This feature is useful when searching over several indexes.
Complete index cleanup (all database files are deleted):
This command must be used in combination with -index domain="xxx" to clean the index. The command only deletes the database files; the index declaration is not removed from the list of indexes. Use the -index delete_domain command to delete the index completely.
To turn the indexer on or off:
The indexer can be enabled or disabled by default through an option in the ini file.
Index database creation can be done by the command:
faind -index domain=domain_name <search_domain_definition>
where search_domain_definition is a definition of folders, files and other sources of text (see detailed description).
Some words are ignored when indexing texts. They are called stop words. Articles, prepositions, conjunctions and some other words are frequent and carry little meaning, so they are not considered keywords. The list of stop words is loaded from a simple text file whose name is defined in the ini file. It is possible to treat stop words as ordinary words with the command:
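The two steps (declare the index, then build its database) can be scripted. The Python sketch below only assembles argument lists from the options shown on this page; the helper names are hypothetical, and the way the arguments are split into separate tokens is an assumption, so check the result against the command reference before running it.

```python
def declare_index(name, engine=None, dynamic=False):
    """Arguments for step 1: declare a new index (static unless dynamic=True)."""
    args = ["-index", f"create_domain={name}"]
    if engine:
        args.append(engine)      # e.g. "clucene", as in the example on this page
    if dynamic:
        args.append("dynamic")
    return args

def build_index_database(name, folders):
    """Arguments for step 2: index the documents of the search domain."""
    return ["-index", f"domain={name}", "-dir", folders]

# Hypothetical index name and folder, used only for illustration.
step1 = ["faind"] + declare_index("library", engine="clucene", dynamic=True)
step2 = ["faind"] + build_index_database("library", "docs")
print(" ".join(step1))  # faind -index create_domain=library clucene dynamic
print(" ".join(step2))  # faind -index domain=library -dir docs
```

Keeping the arguments in lists makes it straightforward to hand them to subprocess.run later without shell quoting problems.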
The default behavior is to skip stopwords.
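A minimal Python sketch of stop word filtering follows. The one-word-per-line file format and the function names are assumptions made for the example; the actual format of the stop word file is defined by the search engine.

```python
import io

def load_stop_words(stream):
    """Read stop words from a text stream, assuming one word per line."""
    return {line.strip().lower() for line in stream if line.strip()}

def extract_keywords(text, stop_words, keep_stop_words=False):
    """Return the words to index; stop words are skipped by default."""
    words = text.lower().split()
    if keep_stop_words:
        return words
    return [word for word in words if word not in stop_words]

# An in-memory stand-in for the stop word file named in the ini file.
stop_file = io.StringIO("the\na\nof\nand\n")
stop_words = load_stop_words(stop_file)

print(extract_keywords("the history of the index and search", stop_words))
# ['history', 'index', 'search']
```

Passing keep_stop_words=True mirrors the option of treating stop words as ordinary words: every word is kept, at the cost of a larger keyword list.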
When indexing text files, it is possible to treat the different forms of the same word (the wordforms) as one keyword. This gives very good results: the number of keywords decreases drastically, although indexing slows down. By default the search engine does not use this method of index database optimization, but you can switch it on:
enables more complex morphology analysis and acts as a more accurate '-index wordforms' command. It slows down the indexer but results in more compact database files.
generates additional information in the database, which greatly increases the speed of phrase search. Building this additional information takes some time and increases the space occupied by the index databases.
The frequency of keywords in documents can be important information for document ranking. The command:
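One common way to make phrase search fast is to store word positions alongside the keywords. The Python sketch below shows that general technique; whether FAIND's additional information has this form is an assumption, and the sample documents are invented.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each word to {document name: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][name].append(position)
    return index

def phrase_search(index, phrase):
    """Return documents where the words of the phrase occur consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    # Only documents containing every word of the phrase can match.
    candidates = set(index[words[0]])
    for word in words[1:]:
        candidates &= set(index[word])
    matches = []
    for name in sorted(candidates):
        following = [set(index[word][name]) for word in words[1:]]
        # The phrase matches if, for some start position, each following
        # word sits exactly one position after the previous one.
        if any(all(start + offset + 1 in positions
                   for offset, positions in enumerate(following))
               for start in index[words[0]][name]):
            matches.append(name)
    return matches

documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "brown and quick fox",
}
index = build_positional_index(documents)
print(phrase_search(index, "quick brown"))  # ['a.txt']
```

The position lists are what makes such an index larger and slower to build, in line with the cost described above; without them, a phrase can only be confirmed by reading the candidate files again.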
accumulates the frequencies of the keywords and stores them for each processed document.
Later this information can be used to calculate the ranks of matching documents:
Please refer to the -sort freq_rank command for an example of ranking.
There is no doubt that this relevance estimation algorithm does not guarantee that the best matches appear first in the results listing, but it works fast and is commonly used by many search engines.
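Frequency-based ranking can be sketched in a few lines of Python. This is a plain "more occurrences first" ordering chosen for illustration; the exact formula behind -sort freq_rank is not described here, and the function names and sample documents are hypothetical.

```python
from collections import Counter

def index_frequencies(documents):
    """Store, for each document, how many times each keyword occurs in it."""
    return {name: Counter(text.lower().split())
            for name, text in documents.items()}

def rank_by_frequency(frequencies, keyword):
    """Return matching documents, most occurrences of the keyword first."""
    keyword = keyword.lower()
    hits = [(counts[keyword], name)
            for name, counts in frequencies.items()
            if counts[keyword] > 0]
    # Sort by descending count; ties are broken by file name.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [name for _count, name in hits]

documents = {
    "a.txt": "cat cat dog",
    "b.txt": "cat bird",
    "c.txt": "dog dog dog",
}
frequencies = index_frequencies(documents)
print(rank_by_frequency(frequencies, "cat"))  # ['a.txt', 'b.txt']
```

Because the counts are collected at indexing time, ranking at search time is only a lookup and a sort, which is why this kind of estimate is cheap.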
-index domain="xxx" -index refresh
A static index is purged before refreshing, so this command causes it to be rebuilt from scratch.
Refreshing a dynamic index includes scanning for changed files and reindexing them.
There is no special option to control the indexer when searching (except for -index off, which disables the indexer completely). If not disabled (by -index off or an ini-file flag), the indexer works automatically and selects the appropriate algorithm in each situation.
When the command line contains no search domain definition and -index domain=xxx defines the zone, the indexer works very fast: the list of matching files is shown instantly. If you also need the match contexts in the files, use the -index touchfiles option:
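The scan for changed files can be imagined as a comparison of two snapshots of the folder. The Python sketch below detects changes by modification time and size; this criterion is an assumption chosen for the example, not a description of how the dynamic index engine decides what to reindex.

```python
import os

def snapshot(folder):
    """Record (modification time, size) for every file under the folder."""
    state = {}
    for root, _dirs, files in os.walk(folder):
        for filename in files:
            path = os.path.join(root, filename)
            info = os.stat(path)
            state[path] = (info.st_mtime_ns, info.st_size)
    return state

def files_to_reindex(old_state, new_state):
    """Compare two snapshots: return (new or changed files, removed files)."""
    changed = sorted(path for path, signature in new_state.items()
                     if old_state.get(path) != signature)
    removed = sorted(path for path in old_state if path not in new_state)
    return changed, removed

# Hand-written snapshots so the example does not depend on real files.
old_state = {"a.txt": (1, 10), "b.txt": (1, 5)}
new_state = {"a.txt": (2, 12), "c.txt": (1, 1)}
print(files_to_reindex(old_state, new_state))
# (['a.txt', 'c.txt'], ['b.txt'])
```

Only the files reported as changed need to be processed again, which is what makes refreshing a dynamic index cheaper than rebuilding a static one.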
-index domain="library" -sample "cat OR dog" -index touchfiles
The indexer allows the use of regular expressions (the combination of the -rx and -sample options) and fuzzy search (the -soundex option). All these options slow down the search but are sometimes very useful (see the query patterns syntax).
is used to get the total statistics for all index databases.
Unlocking all indices:
This command unconditionally removes all locks for all index databases. It can lead to unpredictable results when applied to a multisession server. Single-user search utilities like Integra issue this command during startup to clear the index database status after a previous crash.
Unlocking an index:
-index domain=xxx -index unlock
clears the read and write locks for only one index database. It can be used to restore database access after a server crash.
All index database files are stored in the user's home directory (paths like c:\Documents and Settings\USERNAME\Application Data\Faind). This prevents other users from accessing personal data. If you do not want separate index databases for each user, you can change the database file storage policy by editing the ini file (do not forget about file write permissions). The indexer manages the database files automatically, and usually there is no need to do anything with them manually.
The implemented file storage technique has one important drawback. Reindexing changed documents leaves some portions of the database files unused by the indexer, yet they still occupy disk space. The only way to "squeeze" a database file is to completely clear the index database (-index domain=xxx -index purge) and index the zone again.
There are some cases when the use of an index database is not recommended.
First, binary files of unknown format (for example, exe files) should not be indexed. Switch the indexer off when searching in such files.
Second, when the files change constantly. In this case it is faster to use non-indexed search than to rebuild the index database each time a file changes.
Third, when you have to search in a big file and you know that this file will not be searched again.
© Mental Computing 2009