Text search engine brief outline

Project Goals

The main goal of the project is to implement desktop search tools (a full-text search engine) for end users (see the download page) and a supplementary C++ library with a simple API for software developers. Like other text search tools, our software finds text patterns in files on a personal computer, on a local area network and on internet sites.

What makes these search tools special is the set of features listed below. For example, a few desktop search tools are available to users (small utilities are of no interest here), but none of them is able to handle Russian morphology and syntax. Moreover, there is no tool that can search for Russian text in a document written in English, or vice versa.

 

Features

Key features include:

1. Open source code (LGPL) for all tools and components. Other open source projects are also used extensively.

2. Sample programs for the search engine (Win32 and .NET platforms) are available as source code.

3. The search engine API is very simple (fewer than a dozen functions) but gives full access to the search engine features.

4. Archived and packed files (see the list of supported formats) are processed without external unpackers for the most popular formats. Search covers local disks, removable drives (CD, DVD), the local network and the internet (by crawling WWW hyperlinks).

5. A text extraction algorithm for files of unknown format (ASCII, UTF-8 and UTF-16 are supported). A language and codepage guesser module.

6. Search results can be output in several formats, including plain TXT, HTML, XML and SQL scripts.

7. Console tool (faind) and GUI tool (Integra) are available.

8. Queries can include logical operators (boolean search), regular expressions and extended regular expressions (natural language grammar operator).

9. Built-in support for natural language morphology and syntax (grammar engine). This feature is optional and can be disabled.

10. Fuzzy search (aka partial word matching) features.

11. A built-in translation module allows searching in mixed-language texts.

12. Knowledge discovery and information retrieval features, including a "natural language query answering machine".

13. The built-in indexer works without an external (back-end) RDBMS. The indexer supports several modes for indexing and searching.
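
As an illustration of feature 5, the simplest step of an encoding guesser can be sketched as a check for a byte order mark (BOM) at the start of a file. This is only a minimal sketch and not the project's actual algorithm: the names Encoding and guess_encoding_by_bom are hypothetical and are not part of the Solarix API. A real guesser would also need statistical analysis for files without a BOM, and would have to tell UTF-32 apart from UTF-16, which this sketch does not attempt.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration only; not part of the Solarix API.
enum class Encoding { Utf8, Utf16LE, Utf16BE, Unknown };

// Guesses the encoding from the byte order mark at the start of a buffer.
// Returns Encoding::Unknown when no UTF-8 or UTF-16 BOM is present.
Encoding guess_encoding_by_bom(const std::string& bytes) {
    // Read a byte as an unsigned value so comparisons with 0xEF etc. work.
    auto at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Encoding::Utf8;      // EF BB BF
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Encoding::Utf16LE;   // FF FE
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Encoding::Utf16BE;   // FE FF
    return Encoding::Unknown;
}
```

A caller would read the first few bytes of a file into a std::string and fall back to other heuristics when the result is Encoding::Unknown.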

 

Grammar aware search

Russian language support in a search tool means that the program must deal with Russian grammar, which is quite different from English grammar. This difference makes English-focused search tools useless for Russian. Very complex morphology (irregular word formation), free word order in sentences, subtle means of expressing qualities (adjectives) and very limited means of expressing actions (verbs) are all characteristic of Russian (and of some other languages, of course). Search algorithms must be designed from the start to cope with all these features.

Users know well that finding a text pattern in documents can be a non-trivial task. Consider the following example.

Suppose the pattern 'sleeping cats' is looked for. First of all, 'cats which sleep' and 'cats that sleep' match the pattern and should be found. Next, 'one black cat and one white cat that sleep' also fits. Last but not least, misspelling is a problem in English texts because of the overly complex spelling rules (especially for non-native speakers). So 'sleeping kats' must also be matched to the pattern.
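
The misspelling case can be illustrated with a classic edit distance check. The sketch below is an assumption made for illustration, not the algorithm used in the search engine: it computes the Levenshtein distance between two words and treats them as equal when they differ by at most a given number of edits. The names edit_distance and fuzzy_equal are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn `a` into `b`.
std::size_t edit_distance(const std::string& a, const std::string& b) {
    // prev holds the previous row of the dynamic programming table,
    // cur holds the row being filled in.
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical helper: a word matches the pattern word if it is within
// max_errors edits of it.
bool fuzzy_equal(const std::string& word, const std::string& pattern,
                 std::size_t max_errors = 1) {
    return edit_distance(word, pattern) <= max_errors;
}
```

With one allowed error, 'kats' is accepted for the pattern word 'cats'. The same threshold also accepts unrelated words such as 'bats', so edit distance alone cannot replace the morphological and syntactic analysis discussed in this section.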

Another hard problem for a search tool is the large amount of irrelevant information it finds. Take, for example, a search for 'black cat' in some documents. It is impossible to limit the distance between the words, because of cases like 'black and big cat'. But this leads to another problem with sentences like 'black dog and white cat'. Such a sentence is also matched to the pattern and produces junk results: the program finds useless information. The only way to solve this problem is to implement a full-scale syntax analyzer.

Generally speaking, search tools face two tasks that seem contradictory at first glance:
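
The 'black cat' problem can be reproduced with a naive proximity matcher. The sketch below is only an illustration of why word distance is not enough; it is not the engine's code, and the names split_words and near_match are hypothetical. It reports a match when the first word is followed by the second with at most max_gap words in between.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits text into words on whitespace (no punctuation handling).
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// Returns true if `first` is followed by `second` with at most
// max_gap other words between them.
bool near_match(const std::string& text, const std::string& first,
                const std::string& second, std::size_t max_gap) {
    const std::vector<std::string> words = split_words(text);
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i] != first) continue;
        // j - i - 1 is the number of words between positions i and j.
        for (std::size_t j = i + 1;
             j < words.size() && j - i - 1 <= max_gap; ++j) {
            if (words[j] == second) return true;
        }
    }
    return false;
}
```

With max_gap set to 0 the matcher misses 'black and big cat'. With max_gap set to 3 it accepts 'black dog and white cat', where 'black' does not describe the cat at all. The matcher has no way to know which noun an adjective belongs to, and that is exactly the information a syntax analyzer provides.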

1) to find more pertinent information (comprehensive search)

2) to find less unnecessary information (no redundant results)

Ideally, the tools must find exactly the information that is required, neither less nor more.

The task described above is both topical and unsolved. We are developing special programs for the analysis of verbal information (both written and spoken) and hope that our know-how and original ideas will meet the challenge of smart search.
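
These two goals correspond to what information retrieval usually calls recall (how much of the required information was found) and precision (how much of what was found is actually required). These terms are not used elsewhere in this outline and are introduced here only for illustration. A minimal sketch of how both could be measured for one query, assuming the set of truly relevant documents is known; the names Quality and evaluate are hypothetical:

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical result type for illustration.
struct Quality {
    double precision;  // share of found documents that are relevant
    double recall;     // share of relevant documents that were found
};

// Compares the documents returned by a search with the documents
// known to be relevant for the query.
Quality evaluate(const std::set<std::string>& found,
                 const std::set<std::string>& relevant) {
    std::size_t hits = 0;
    for (const std::string& doc : found) {
        if (relevant.count(doc) != 0) ++hits;
    }
    Quality q;
    // An empty set is treated as a perfect score to avoid division by zero.
    q.precision = found.empty()
        ? 1.0 : static_cast<double>(hits) / found.size();
    q.recall = relevant.empty()
        ? 1.0 : static_cast<double>(hits) / relevant.size();
    return q;
}
```

A search that returns four documents, two of which are the only two relevant ones, has a recall of 1 and a precision of 0.5. The ideal result has both values equal to 1.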

 

Natural languages support

Both Russian and English languages are currently supported. French language support will also be available.

Native support for the Russian language was the central idea of the project. Russian is quite different from English, so current techniques for handling texts with English grammar are of no use for it. It has a very complex and surprisingly irregular morphology, a way of combining words into sentences that is unfamiliar to English speakers (mostly free word order), and other distinctive features.

All these peculiarities make it necessary to implement several algorithms that are unnecessary for English.

 

Is it freeware?

Well, it is not freeware. Open source does not mean free of charge - read more about it.

Some components are unconditionally free: FAIND, the console version of the personal search system, and the pack of sample programs with source code for developers (API demos).

Resources for software developers

The program package ‘Solaris Intellectronics’ (hereinafter Solarix) is a large set of C++ classes and routines implementing state-of-the-art algorithms for verbal information processing. There is also a component for the MS Windows .NET framework, which implements a complete (!) search engine with the easiest API you can imagine.

The search engine API is open and simple. We supply some sample programs (in C++ for different platforms and C# for Windows .NET) with open source code. These samples demonstrate how a third-party program can implement full-text search features and linguistic analysis. You can use them free of charge, provided that the license (LGPL) is accepted.

 

Implementation

The source code is designed to be portable; at least MS Windows (both Win32 and .NET) and GNU/Linux are supported. Almost all of the source code is written in pure standard C++, which makes it simple enough to port the libraries and end-user code to other platforms.

All tools are designed only for local file search (desktop search). There are some plans for internet meta-search capabilities. Put simply, the search tool will use well-known internet search systems to collect an initial portion of data, namely a list of references to web pages, and then look through these pages more thoroughly.

 

text revised 10.08.2007

© Mental Computing 2009