Search engine tutorial

Simple queries

UNICODE support

Common regular expressions

Worded regular expressions -rx -sample

Boolean search

Working with indexer

Search among the resources of local area network (LAN)

Searching on HTTP servers

Searching on FTP servers

Search with translation

 

Invoking 'faind'

Under MS Windows the installer usually places the program executable in the "c:\program files\integra" directory. Note that this path contains a space, so it must be enclosed in double quotes when typed on the command line:

e:\"c:\program files\faind\faind.exe" --version
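When the command is launched from a script rather than typed by hand, the quoting can be generated automatically. The following is a minimal Python sketch and not part of FAIND itself: it only builds the command line string, using the executable path from the example above, and does not run the program.

```python
import subprocess

# Path taken from the example above; it contains a space.
FAIND_EXE = r"c:\program files\faind\faind.exe"

def build_command_line(args):
    """Return a Windows-style command line with quotes added where needed."""
    # list2cmdline follows the MS C runtime quoting rules: any argument
    # that contains a space or a tab is wrapped in double quotes.
    return subprocess.list2cmdline([FAIND_EXE] + list(args))

command_line = build_command_line(["--version"])
print(command_line)
```

Arguments without spaces are passed through unchanged, so only the executable path is quoted here.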

 

Simple queries

The basic function of this program is to search files for text (usually lines or sentences) that matches certain patterns. The following example searches for the first occurrence of the single word "cat" in one file, a.txt, with morphology disabled:

faind a.txt -sample "cat"

The name of the text file to process is the first argument on the command line, so there is no need to issue the -file command. Morphology is disabled by default. Search results are displayed only on the console. The source document is divided into words (see the description of the -sample command).


The next example is nearly the same as the one above, but all files in the current folder are scanned:

faind . -sample "cat"

The dot '.' means the current folder, as usual. The folder name is the first argument on the command line, so the -dir command is not required. No file mask is set, so the program tries to read every file in the folder. If the format of a file is unknown, the program prints an appropriate message on the console.


File masks are used to filter files before they are matched against patterns. A common practice is to select files by their extension. More complex filtering by regular expressions is also available in the search engine. For example:

faind . -name "reading_*.txt" -sample "cat"

In this example the file mask contains so-called wildcards (the characters ? and * can be used). Each file name is matched against this mask, and only the files that satisfy it are loaded and processed. Files like reading_10.txt will be accepted and processed; files like writing_10.txt will be rejected.
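The accept/reject behaviour of a wildcard mask can be tried out with a short Python sketch. It uses Python's fnmatch module as a stand-in; FAIND's own mask rules (for example case sensitivity) may differ in detail, and the candidate file names are made up for illustration.

```python
from fnmatch import fnmatchcase

MASK = "reading_*.txt"  # the mask from the command above

def select_files(names, mask=MASK):
    """Keep only the names that satisfy the wildcard mask."""
    # fnmatchcase compares case-sensitively on every platform.
    return [name for name in names if fnmatchcase(name, mask)]

# Hypothetical folder listing.
candidates = ["reading_10.txt", "writing_10.txt", "reading_a.txt", "reading_10.doc"]
accepted = select_files(candidates)
print(accepted)  # ['reading_10.txt', 'reading_a.txt']
```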

The previous example can be rewritten to use a regular expression file name filter:

faind . -name:rx "reading_(.+)\.txt" -sample "cat"

In some cases regular expressions are the only way to filter the files. More information about regular expression syntax is available here.
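One such case is accepting only files whose variable part is a number, which a wildcard mask cannot express. The sketch below illustrates the idea with Python's re module. The stricter pattern and the file names are hypothetical, and it is an assumption that FAIND's regular expression dialect and its whole-name matching behave the same way.

```python
import re

# Hypothetical stricter filter: the part after "reading_" must be digits only.
NUMBERED = re.compile(r"reading_(\d+)\.txt")

def select_numbered(names):
    """Keep only names that match the pattern as a whole."""
    return [name for name in names if NUMBERED.fullmatch(name)]

names = ["reading_10.txt", "reading_a.txt", "reading_7.txt", "writing_10.txt"]
numbered = select_numbered(names)
print(numbered)  # ['reading_10.txt', 'reading_7.txt']
```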


Searching for the pattern "white cat" in the single file a.txt with the morphology engine enabled:

faind a.txt -wordforms -distance=s -minbound=1 -sample "white cat"

Here:

1. the option -wordforms lets the program also find plural forms like "white cats".

2. -distance=s requires all words of the pattern to be in one sentence.

3. -minbound=1 requires that all words of the query pattern be found. Otherwise, very frequent and meaningless words (like articles) can be omitted.

The option -ordered is not set, so contexts like "cat is white" can also be matched.
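The combined effect of these options can be imitated with a toy Python sketch. This is not FAIND's algorithm: the plural stripping is a crude stand-in for the morphology engine, the sentence splitter is naive, and the sample text is invented. It only shows the idea of unordered matching of all query words within one sentence.

```python
import re

def normalize(word):
    """Toy stand-in for morphology: lower-case and strip a plural 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def matching_sentences(text, query):
    """Return the sentences that contain every query word, in any order."""
    wanted = {normalize(w) for w in query.split()}
    hits = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = {normalize(w) for w in re.findall(r"\w+", sentence)}
        if wanted <= words:  # every query word is present in this sentence
            hits.append(sentence)
    return hits

text = "Two white cats sat here. The cat is white. A white dog barked. The cat slept."
hits = matching_sentences(text, "white cat")
print(hits)  # ['Two white cats sat here.', 'The cat is white.']
```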


UNICODE support

 

 


Common regular expressions

To collect all URLs in a web page:

faind -uri http://www.solarix.ru -stripdecor=false -href=false -regex "(http|ftp)://([0-9a-zA-Z/\.]+)\.(\w+)" -listfiles:xml

Here:

1. web page address is set by option -uri

2. option -stripdecor=false lets the search engine process HTML tags. Otherwise tags are stripped and only the text visible in the browser is processed.

3. option -href=false forbids the engine to follow hyperlinks.

4. option -regex "..." sets the query pattern as a regular expression.

5. search results are collected in the XML file res.xml due to the option -listfiles.
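To see what the regular expression itself picks up, it can be tried outside FAIND. The sketch below applies the same pattern with Python's re module to a small HTML fragment; the fragment and the addresses in it are invented for illustration, and FAIND's regular expression dialect may differ from Python's.

```python
import re

# The pattern from the command above, unchanged.
URL_PATTERN = re.compile(r"(http|ftp)://([0-9a-zA-Z/\.]+)\.(\w+)")

def collect_urls(html):
    """Return every substring of the raw HTML that matches the pattern."""
    return [m.group(0) for m in URL_PATTERN.finditer(html)]

# Hypothetical page content, with tags left in place.
page = ('<a href="http://www.solarix.ru/index.html">home</a> '
        '<a href="ftp://ftp.example.org/pub/readme.txt">file</a>')
urls = collect_urls(page)
print(urls)
```

Because the character class contains only letters, digits, slashes and dots, a URL with other characters (a hyphen or a query string, for instance) is cut off at the first such character.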


Worded regular expressions -rx -sample

There is another type of regular expression pattern available in the FAIND search engine. This type is a combination of the -rx and -sample commands (-rx must precede -sample):

faind -dir folder -rx -sample "'cat(.*)' 'sleep(.*)'"

Please pay attention to the apostrophes around every regular expression. They are needed to prevent incorrect parsing of the pattern.

Each term in this type of pattern is treated as a regular expression that is matched against the lexemes in documents. The search engine loads the document, breaks its content into words and then matches the words against the pattern's regular expressions.

Sometimes this type of pattern can emulate the morphology algorithms of the search engine. For example, the regular expression 'cat(.*)' matches the words 'cat' and 'cats'. However, it also matches 'catalogue', so it cannot really replace the morphology engine.
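The word-by-word matching described above can be sketched in a few lines of Python. The tokenizer, the lower-casing and the whole-word matching are assumptions made for this illustration, not a description of FAIND's internals, and the sample text is invented.

```python
import re

def words_matching(text, term_patterns):
    """Map each term pattern to the document words it matches as a whole."""
    words = re.findall(r"\w+", text.lower())  # break the text into words
    return {p: [w for w in words if re.fullmatch(p, w)] for p in term_patterns}

text = "The cat sleeps. Cats like sleeping. See the catalogue."
found = words_matching(text, ["cat(.*)", "sleep(.*)"])
print(found["cat(.*)"])    # ['cat', 'cats', 'catalogue']
print(found["sleep(.*)"])  # ['sleeps', 'sleeping']
```

The unwanted hit on 'catalogue' is exactly the limitation mentioned above.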

 


Boolean search (logical operators)

The search engine allows logical operators to be used in the query pattern. This is usually called boolean search. There are three basic logical operators: AND, OR, NOT.

For example, to find the documents containing either 'dog' or 'cat', issue the command:
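The meaning of the three operators can be pictured as set operations on the documents that contain each word. The following Python sketch uses three made-up documents and is only an illustration of the logic, not of how FAIND evaluates queries.

```python
import re

def words_of(text):
    """Return the set of lower-cased words in a text."""
    return set(re.findall(r"\w+", text.lower()))

# Hypothetical documents.
docs = {
    "a.txt": "The cat sleeps.",
    "b.txt": "The dog barks at the cat.",
    "c.txt": "Birds sing.",
}
index = {name: words_of(text) for name, text in docs.items()}

def containing(word):
    """Names of the documents that contain the word."""
    return {name for name, words in index.items() if word in words}

cat_or_dog = containing("cat") | containing("dog")   # OR: union
cat_and_dog = containing("cat") & containing("dog")  # AND: intersection
cat_not_dog = containing("cat") - containing("dog")  # NOT: difference
print(sorted(cat_or_dog), sorted(cat_and_dog), sorted(cat_not_dog))
```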

faind -dir folder -sample "cat OR dog"

The result of the search:

 


Working with indexer

First of all, we create a new static zone named 'eng':

faind -index create_domain=eng

Then display the list of declared zones:

faind -index domains

This screenshot shows the results:

Now we have only one named zone, 'eng'. It needs to be indexed. The index is created by the command:

faind -index domain=eng -dir folder_to_be_indexed

The beginning of the process:

And the ending:

The zone status can be viewed with the command:

faind -index domain=eng -index info

Performing the search in the named zone is simple:

faind -index domain=eng -sample "cat" -index touchfiles

The command -index touchfiles is used to find the matches (contexts) in the files:


Search with translation

This feature should become available in the 0.80 release (scheduled for September 2005).

For example, suppose you search for 'cat sleeps' in a document containing the equivalent Russian text:

Большая ленивая кошечка лежит, спит и видит сны о мышках.

Big lazy cat is sleeping and dreaming.

You don't have to worry about Russian-English translation of the source document, because the FAIND search engine performs this routine task:

faind -file cats.txt -index off -distance=s -onceperfile=false -soundex -semnet=1 -wordforms -sample "cat sleeps"

The key point is the -semnet option: it looks up word equivalents in the dictionary. The result:

As you can see, the grammar engine has performed all the necessary work with Russian and English morphology.
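The idea of matching through word equivalents can be illustrated with a toy Python sketch. The equivalents table below is hand-made and hypothetical; it stands in for both the dictionary lookup and the morphology, and says nothing about how FAIND's semantic network is actually organized.

```python
import re

# Hypothetical, hand-made table of word equivalents (not FAIND's dictionary).
EQUIVALENTS = {
    "cat": {"cat", "cats", "кошка", "кошечка"},
    "sleeps": {"sleeps", "sleeping", "спит"},
}

def matches(sentence, query):
    """True if every query term, or one of its equivalents, occurs in the sentence."""
    words = set(re.findall(r"\w+", sentence.lower()))
    return all(words & EQUIVALENTS.get(term, {term})
               for term in query.lower().split())

russian = "Большая ленивая кошечка лежит, спит и видит сны о мышках."
english = "Big lazy cat is sleeping and dreaming."
results = [matches(russian, "cat sleeps"), matches(english, "cat sleeps")]
print(results)  # [True, True]
```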

The Solarix Intellectronix project has been designed to be multilingual: there is no special internal support for the Russian or English language. This design feature lets us include some support for one more European language, French. The next example shows how FAIND handles three languages simultaneously: Russian, English and French:

The query string contains the pattern "cat".

Additional information

Embeddable search engine API

Search engine commands

Mental Computing 2009

changed 18-Apr-10