Text search engine: query pattern

 

Introduction

Patterns

    regular expression

    extended regular expression

    'all' quantor *

    grammatic quantors

    operator characters

Boolean logic operators

General options

    grammar options

    words arrangement options

    other options

 

Introduction

A pattern is a sample of the text to be found. Faind is a full-text search engine, so it searches for documents that contain the specified words (the query pattern). Under some conditions the pattern can also be matched against filenames (-target_filename=true) and against special tags (or comments) in multimedia files (MP3, Ogg Vorbis, JPEG etc.). There are two basic types of patterns - words and regular expressions (wildcards are also supported).

The simplest example of faind usage:

[Screenshot: search engine usage]

Moreover, the pattern usually includes some options that affect how the pattern is compared (matched) with the text, for example word order conditions, thesaurus usage etc. In most cases you use a simple pattern, i.e. just a sequence of words. In this case the engine finds the documents containing all the given words. This is the most natural and commonly accepted scheme in WWW search engines.

When this simple matching scheme is not sufficient, you can use the so-called boolean search - a pattern with logical operators in it. For example, when you need to find the documents containing the word cat or the word dog, the pattern -sample "cat OR dog" solves the problem. See the section "Boolean logic operators" for more information.

Another feature that can help you find documents is fuzzy search. With fuzzy search you can find documents that contain words only partially matching the words in the query pattern. The algorithms performing fuzzy search are complex and rather slow, but in some cases there is no substitute for them - see more about fuzzy search.

 

Patterns

Regular expression

This type of pattern is a powerful instrument for searching for different kinds of text samples. Regular expressions are very common in text search tools like grep in GNU/Linux (there is no comparable tool in MS Windows, where FINDSTR is very simple and limited).

Syntax of pattern:

-regex "reg_expression"

The pattern can also be read from a text file:

-regex @filename

During the search a document is treated as a single string that is not broken into words. It means that spaces and end-of-line characters must be handled explicitly in the pattern. For example, use (\s+) to skip an arbitrary number of spaces.
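The effect of treating a document as one unbroken string can be illustrated with Python's re module. It is used here only as a stand-in, since faind's regular expression dialect may differ; the sample text and patterns are made up for illustration.

```python
import re

# Sample document text: the two words are separated by spaces and a line break.
text = "big   \n dog barks"

# A literal single space does not bridge the run of whitespace...
literal_space = re.search(r"big dog", text)

# ...but \s+ matches one or more whitespace characters, including '\n'.
explicit_whitespace = re.search(r"big\s+dog", text)

print(literal_space)                        # None
print(repr(explicit_whitespace.group(0)))   # the matched span, whitespace included
```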

The regular expression syntax is described there.

The main drawback of this type of pattern is the absence of tools for working with language morphology. The only way to imitate morphology-aware search is to use regular expression quantifiers. For example, it is possible to find all occurrences of the words scan, scans, scanning, scanned with the pattern 'scan'. But English irregular verbs like 'find-found' cannot be processed this way.

There is an example of using an ordinary regular expression for information retrieval.
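A small sketch of this limitation, again using Python's re module rather than faind itself. The stem-plus-suffix patterns below are illustrative choices, not patterns taken from the faind documentation.

```python
import re

# A stem followed by "any word characters" covers the regular forms.
scan_forms = re.compile(r"\bscan\w*\b")
scan_hits = scan_forms.findall("scan scans scanning scanned")
print(scan_hits)   # ['scan', 'scans', 'scanning', 'scanned']

# The same trick fails for an irregular verb: 'found' does not start with 'find'.
find_forms = re.compile(r"\bfind\w*\b")
find_hits = find_forms.findall("find finds finding found")
print(find_hits)   # ['find', 'finds', 'finding']
```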

 

Extended regular expression

Another type of pattern is called an extended regular expression because it offers additional options for working with natural language grammar (morphology first of all). Extended regular expressions can be used in your own programs without the other parts of the search engine - read more about it here.

Syntax of this pattern type:

-sample "big dog"

Words in the pattern are separated by spaces. Enclose the pattern text in quotes - this helps avoid problems on the command line.

The pattern can be written into a text file (use UTF-8 or UTF-16 for Russian). Then it is used like this:

faind ... -sample @my_query

Here my_query is the name of the text file containing the pattern text.

'ALL' quantor *

The symbol *, when it is not part of an ordinary regular expression, matches any word in the text. That is why we call it the 'ALL' or 'ANY' quantor, or metaword.

Please pay special attention to the fact that this interpretation of the * quantor differs from ordinary regular expressions: * matches exactly ONE word, whatever it is. For example, the pattern "a * b" matches the text a c b, but the text a b does not match.
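The "exactly one word" rule can be sketched in a few lines of Python. This is not faind's matcher: the function name is hypothetical, and the sketch assumes the pattern words must appear as a contiguous run in pattern order.

```python
def matches_sequence(pattern: str, text: str) -> bool:
    """Return True if the pattern words occur as a contiguous run in the text.

    '*' stands for exactly one word (any word); other tokens must be equal,
    ignoring case. Contiguity and word order are assumptions of this sketch.
    """
    pat = pattern.lower().split()
    words = text.lower().split()
    for start in range(len(words) - len(pat) + 1):
        window = words[start:start + len(pat)]
        if all(p == "*" or p == w for p, w in zip(pat, window)):
            return True
    return False

print(matches_sequence("a * b", "a c b"))  # True
print(matches_sequence("a * b", "a b"))    # False
```

The second call fails because the text has only two words, so no window is long enough to give * a word of its own.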

Grammatic quantors

Instead of naming the word to be found, it is possible to state a requirement on the grammatical characteristics of the word (these characteristics are referred to as the lexical content of the word). Such a group of characteristics is called a grammatical quantor.

Syntax:

# CLASS:* { COORDINATE:STATE ... COORDINATE:STATE }

Here CLASS is the name of a grammatical class (declared in the dictionary). COORDINATE:STATE is called a coordinate pair.

The hash character # (diez) starts the quantor declaration.

For example:

"cat # ENG_VERB:* {}"

All classes, coordinates and their states are described here.

Modifiers

The tilde symbol ~ declares the next word in the pattern to be a regular expression. Quotes are used to group the characters of a regular expression into one word:

-sample "~'search(.*)' ~'text(.*)'"

In this example two regular expressions are declared. The pattern matches the texts "search text", "searching texts" and so on.

The plus symbol + placed before a word means that this word is obligatory and must be found in the text. Without such an explicit command some words are not searched for. For example, the articles 'a', 'the' and the prepositions 'from', 'after' are so-called stop words. They are skipped when scanning a text because they are too frequent and usually do not carry semantic information. Explicit control over skipping meaningless words is provided by the option -skipword (read here).

Example:

-sample "cat +and mouse"
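The interaction between stop words and the + modifier can be pictured with the following Python sketch. The stop word list and the function name are assumptions made for illustration; faind's real list and its handling of -skipword are not shown in this document.

```python
# Illustrative stop word list only; the real list used by faind is not given here.
STOP_WORDS = {"a", "the", "from", "after", "and"}

def effective_terms(pattern: str) -> list:
    """Return the pattern words that would actually be searched for."""
    terms = []
    for token in pattern.split():
        if token.startswith("+"):
            terms.append(token[1:])        # obligatory: always searched for
        elif token.lower() not in STOP_WORDS:
            terms.append(token)            # ordinary word: kept
        # otherwise: a stop word without '+', silently skipped
    return terms

print(effective_terms("cat and mouse"))    # ['cat', 'mouse']
print(effective_terms("cat +and mouse"))   # ['cat', 'and', 'mouse']
```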

The keyword AS is used to name a search result (the word that matches a pattern point):

-sample "cat * AS Verb"

In this sample the word that matches the quantor * (any word in the text) will be marked as Verb in the resulting dataset (e.g. an XML file).

 

Regular expressions (the -regex and -rx -sample commands) are one of the most powerful means of text processing, but sometimes their syntax is too complex. Simplified regular expressions with the wildcards * and ? can be used as a replacement in simple cases. The asterisk '*' stands for an arbitrary sequence of letters, the question mark '?' stands for any single letter. The command -wildcards modifies -rx and -regex so that they accept the simplified syntax instead of normal regular expressions.

For example: searching for 'guid*' pattern yields the result:

[Screenshot: regular expression in search query]
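One way to picture the simplified syntax is as a translation into an ordinary regular expression. The Python sketch below is an assumption about how such a translation could look, not faind's implementation; the function name is hypothetical, and \w is only an approximation of "letter", since it also accepts digits and the underscore.

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate a simplified wildcard pattern into a regular expression.

    '*' becomes 'any run of word characters' and '?' becomes 'one word
    character'; everything else is escaped and matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

rx = re.compile(wildcard_to_regex("guid*"), re.IGNORECASE)
words = ["guide", "GUID", "guidelines", "guard"]
hits = [w for w in words if rx.fullmatch(w)]
print(hits)   # ['guide', 'GUID', 'guidelines']
```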

 

Boolean logic operators

Searches can be made arbitrarily complex using the boolean operators AND, OR, NOT. By default a pattern like "black cat" requires that all words (both black and cat) are found. In other words, the logical condition AND is used. There is no limit on the number of terms in the expression. Each document in the result listing will contain all of the pattern words.

Such implicit use of AND is common practice and widely used in search engines.

There are a number of additional logic operators available in extended regular expressions.

 

The operator NOT word requires that word is not present:

-sample "cat NOT black"

This expression means that the search results must contain the documents with the noun cat but without the adjective black (so no black cat is found).

 

The operator OR has the following syntax: word1 OR word2. This expression requires either word1 or word2:

-sample "cat OR dog"

There is no restriction on mixing different languages in a logic expression, e.g.:

-sample "Nagasaki OR Нагасаки"

This pattern matches either the English or the Russian name of the city. It should be noted that the grammar engine is able to translate words between English, Russian and French (some other languages may be added to this list in the future).

 

Operator AND:

-sample "(cat OR dog) AND (miaows OR sleeps)"

 

The last example shows how to use parentheses to group subexpressions. Do not forget that the AND operator has higher priority than OR, so the expression

-sample "cat OR pussy AND miaows OR sleeps"

is interpreted this way

-sample "cat OR (pussy AND miaows) OR sleeps"

that may not be what you want.
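The priority rule can be checked with a small recursive-descent evaluator. This Python sketch is not faind's parser. It assumes uppercase operator keywords, treats adjacent terms as an implicit AND (so that "cat NOT black" works), reduces a document to a set of lowercase words, and has no error handling for malformed queries.

```python
import re

def evaluate(query: str, doc_words: set) -> bool:
    """Evaluate a boolean query against a set of lowercase document words."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():            # lowest priority: a OR b OR c
        value = parse_and()
        while peek() == "OR":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():           # higher priority: explicit or implicit AND
        value = parse_not()
        while peek() not in (None, "OR", ")"):
            if peek() == "AND":
                take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():           # NOT term, ( group ), or a plain word
        token = take()
        if token == "NOT":
            return not parse_not()
        if token == "(":
            value = parse_or()
            take()             # consume the closing ")"
            return value
        return token.lower() in doc_words

    return parse_or()

doc = {"cat"}  # a document that mentions only the word "cat"
print(evaluate("cat OR pussy AND miaows OR sleeps", doc))      # True
print(evaluate("(cat OR pussy) AND (miaows OR sleeps)", doc))  # False
```

The two queries differ only in grouping, yet the same document satisfies the first and fails the second, which is why explicit parentheses are worth writing.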

The complexity of a logic expression, including the number of parentheses and terms, is not limited. But do take into account that complex expressions take longer to match: the more complex the pattern, the slower the search.

 

General options

Please keep in mind that the options must precede the pattern! For example

-wordforms -sample "cat sleeps"

is ok, but:

-sample "cat sleeps" -wordforms

does not work as expected.

Grammar options

In general there are four groups of grammar options: morphology, syntax, thesaurus and others.

-wordforms - turns morphological analysis on: the dictionary is used to get the basic forms of words. This option also switches on some other grammatical capabilities. Its primary use is to help match different forms of the same word, like cat-cats or find-found. A dictionary is needed for this option - it can be downloaded here.

For example, the results of searching for 'tooth':

[Screenshot: morphology enabled searching]

 

-dynforms - this option enables a more complex morphological analysis algorithm. In general, it works like -wordforms, but more slowly and more intelligently.

 

-soundex - enables the built-in fuzzy matching algorithm, which can match words with spelling errors. It must be used together with -wordforms. For example, searching for 'Hirasima' with the pattern

-wordforms -soundex -sample "hirasima"

results in

[Screenshot: fuzzy search]

 

-case - match character case strictly. By default case is not taken into account, so 'cat' and 'Cat' are equivalent.

 

-correlate

 

-aa - perform syntax analysis of sentences. This option is still experimental and does not work in some cases, so we prefer not to describe it at the moment. This option requires -distance=s.

 

-semnet=N - enables the thesaurus. It sets the maximum distance between words on the semantic net. The details of the algorithm started by this option are not described here. In a few words, it can be illustrated by this example: -semnet=1 makes it possible to match the infinitive ИДТИ, the participle ШЕДШИЙ and the gerund ИДЯ in Russian. Synonyms are also treated as equal when the thesaurus is enabled.

 

-links=XXX - enables the specified links for thesaurus-based operations (see -semnet). You can use the following values for XXX:

@translate - English, Russian, French etc. translations of the words are taken into account when searching (see an example);

@grammar - grammar links, e.g. noun galaxy and adjective galactic.

@semantics - semantic links, e.g. Cat - Animal; synonyms are also enabled.

 

-language lang1;lang2;...langN - loads the morphology analyzers for the specified languages. This command affects memory usage, because by default the grammar engine loads all analyzers.

Example:

faind -language en -dir c:\docs -wordforms -sample server

Short names of the languages: en - English, ru - Russian, fr - French, es - Spanish. Besides, some special metanames are available: all or * - all languages, user - the current user's language.

A configuration file can also be used to set the list of languages.

 

Words arrangement options

-ordered - the order of words in the pattern is important and must be preserved in the text. By default the order of words does not matter. For example, the pattern "black cat" works in just the same way as "cat black". As a result, "white cat and black dog" matches the pattern "black cat". Use -ordered to prevent such matches.

-distance=N - proximity limitation: the maximum distance between words is N. By default the distance between words is not taken into account. This can give incorrect results - related words are usually 5-10 positions from each other.

In some cases it may be more reasonable to require that the words are enclosed in one sentence: use the option -distance=s

There is another type of proximity limitation: all the words must be in one line of text (plain text files with '\n' line breaks only!): -distance=l
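Word order and proximity constraints can be sketched in Python as a sliding window over the words of the text. The function name and the exact definition of distance (difference in word positions) are assumptions made for illustration; faind may count distance differently.

```python
def within_distance(pattern_words, text, max_distance, ordered=False):
    """Check whether all pattern words occur close together in the text.

    A window of max_distance + 1 consecutive words is slid over the text;
    the pattern matches if some window contains every pattern word
    (in pattern order when ordered=True).
    """
    words = text.lower().split()
    wanted = [w.lower() for w in pattern_words]
    for start in range(len(words)):
        window = words[start:start + max_distance + 1]
        if ordered:
            it = iter(window)
            # 'w in it' consumes the iterator, so this tests for a subsequence.
            if all(w in it for w in wanted):
                return True
        elif all(w in window for w in wanted):
            return True
    return False

text = "white cat and black dog"
print(within_distance(["black", "cat"], text, 2))                # True
print(within_distance(["black", "cat"], text, 2, ordered=True))  # False
print(within_distance(["black", "cat"], text, 1))                # False
```

In the sample text cat and black are two positions apart and appear in the opposite order to the pattern, so the match survives only when order is ignored and the allowed distance is at least 2.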

 

Other options

-rx - each word in the pattern is a regular expression. For example:

faind ... -rx  -sample  " 'dog(.*)'  'bark(.*)' "

This option declares all words in the pattern as regular expressions. Use the special symbol ~ to declare a single word in the pattern as a regular expression (see syntax).

It is worth saying that in most practical cases it is better to use the search engine's ability to handle language morphology than to use ordinary regular expressions. In the example shown above the metacharacters * are used to match all word forms of dog and to bark. It seems easy, but the irregular morphology of Russian and English (consider such cases as tooth-teeth or to find-found) makes it very hard to write a correct regular expression in most cases. Last but not least, regular expressions require considerable skill to use.

Do not confuse the -regex and -rx options. They are quite different. The first option declares the pattern (an ordinary regular expression). The second one just modifies the -sample option, telling the search engine to treat the words in the pattern as separate regular expressions. The input text is broken into lexemes and each lexeme is matched against its regular expression in the pattern.
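The difference between the two modes can be illustrated with Python's re module. This is a sketch of the idea, not faind's implementation: the function name is hypothetical, tokenization is a plain \w+ split, and word order and distance are ignored.

```python
import re

def match_per_word(word_patterns, text):
    """Match each regular expression against a whole word of the text.

    Returns True if every pattern fully matches at least one word;
    word order and distance are ignored in this sketch.
    """
    words = re.findall(r"\w+", text.lower())
    return all(
        any(re.fullmatch(p, w) for w in words)
        for p in word_patterns
    )

patterns = ["dog(.*)", "bark(.*)"]
print(match_per_word(patterns, "The dogs were barking."))  # True
print(match_per_word(patterns, "The dog sleeps."))         # False

# By contrast, a single regular expression sees the text as one string:
whole_text_hit = re.search(r"dog(.*)bark", "The dogs were barking.")
print(bool(whole_text_hit))                                # True
```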

 

-allow_partial=true - allows a match in which not all words of the query pattern are found. By default the engine tries to match each query keyword against the document content. The reliability value of a result depends on the unmatched words.

 

-minbound=n.nn - the lower bound of the reliability of a found context. When matching the pattern against text, the search engine estimates how precisely the pattern matches the context in the text. If this estimate falls below the given value, the context is skipped. For example, the pattern

faind ... -minbound=1.0  -sample  "big dog"

requires that the context is exactly big dog. Another pattern:

faind ... -minbound=0.2  -sample  "big dog"

lets the search engine omit one of the words from the pattern. The default value for minbound is set in the ini file.
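The thresholding idea can be pictured with a deliberately simple scoring rule. The formula faind uses to estimate reliability is not given in this document; the fraction of matched words below is purely a hypothetical stand-in, chosen only because it agrees with the two examples above.

```python
def reliability(pattern_words, text):
    """Hypothetical score: the fraction of pattern words found in the text."""
    words = set(text.lower().split())
    wanted = [w.lower() for w in pattern_words]
    found = sum(1 for w in wanted if w in words)
    return found / len(wanted)

def accept(pattern_words, text, minbound):
    """Keep a context only if its score is not below the threshold."""
    return reliability(pattern_words, text) >= minbound

context = "a big cat"
print(reliability(["big", "dog"], context))   # 0.5
print(accept(["big", "dog"], context, 1.0))   # False
print(accept(["big", "dog"], context, 0.2))   # True
```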

 

Pattern matcher target

The query pattern is searched for in the content of documents by default. The search engine can also match the query against filenames. There are two commands:

-target_content=false - do not match the pattern against the text content of the file.

-target_filename=true - do match the query pattern against the filename. This command is used for finding files by name.

For example, the commands -target_filename=true  -target_content=false  -regex  "\\r(.+)t"  find the files whose names match the regular expression \r(.+)t

[Screenshot: matching filenames with regular expressions]

Additional information

Embeddable search engine API

Search engine commands

   Mental Computing 2009

changed 18-Apr-10