FAIND is an advanced desktop search engine (personal search system) designed especially for natural-language text. It searches for text patterns (words or regular expressions) in files in different locations: the personal computer, the local area network and web sites.
This utility lets you find files, in one or more directory trees and inside archives, that:
have names that contain certain text or match a certain pattern (regular expression or wildcards);
were created, modified or used during a certain period of time;
are within a certain size range;
contain text that matches a certain pattern (regular expression or wildcards) taking into account the morphology and syntax;
or some combination of the above.
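How such criteria combine can be illustrated with a minimal Python sketch. It is not FAIND's implementation and does not use FAIND's command set; the function name find_files and all of its parameters are hypothetical, and it covers only plain files (no archives, morphology or syntax).

```python
import fnmatch
import os
import re
import time


def find_files(root, name_glob="*", min_size=0, max_size=None,
               modified_within_days=None, text_regex=None):
    """Yield paths under root that satisfy every criterion that was supplied."""
    pattern = re.compile(text_regex, re.IGNORECASE) if text_regex else None
    cutoff = None
    if modified_within_days is not None:
        cutoff = time.time() - modified_within_days * 86400

    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            # Criterion 1: file name matches a wildcard pattern.
            if not fnmatch.fnmatch(name, name_glob):
                continue
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable
            # Criterion 2: size within the requested range.
            if info.st_size < min_size:
                continue
            if max_size is not None and info.st_size > max_size:
                continue
            # Criterion 3: modified recently enough.
            if cutoff is not None and info.st_mtime < cutoff:
                continue
            # Criterion 4: content matches a regular expression.
            if pattern is not None:
                try:
                    with open(path, encoding="utf-8", errors="ignore") as handle:
                        if not pattern.search(handle.read()):
                            continue
                except OSError:
                    continue
            yield path
```

The cheap checks (name, size, date) run before the file is opened, so the expensive content match is attempted only on files that pass them.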
Many options control text file processing (encodings, special tags, etc.) and pattern matching (morphology, syntax, thesaurus). I try to keep the search engine's command set compatible with the GNU find and grep programs, although some features are missing. Moreover, some commands have different semantics in find and faind. For example, in faind -regex marks the query pattern as a regular expression, while in find it acts as a filter on file names.
Once you have found the files you're looking for, you can list their names on the console (with some additional information, e.g. the fragments of matched text) or process the files in many ways with other system commands.
The built-in indexer/cataloguer is an important part of the tool: it processes the text information from files and stores it in a database (a built-in engine is used). Searching files by keywords is much faster with the index database (instant in some cases).
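Why an index makes keyword search fast can be shown with a small inverted index in Python. This is only a sketch of the general technique, assuming an in-memory dictionary in place of a database; the names build_index and lookup are hypothetical, and FAIND's own index format is not described here.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def lookup(index, *keywords):
    """Return the names of documents that contain all of the keywords."""
    sets = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*sets) if sets else set()
```

The files are read once, when the index is built. A later keyword query touches only the index, so its cost does not depend on the total size of the indexed text.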
There are several incarnations of this search engine:
faind - command prompt utility for MS Windows 9x/NT
faind.net - command prompt utility for MS Windows .NET (currently it is MS Windows 9x/NT with .NET Framework installed)
faind.net.dll - .NET component to be used by software developers in their projects
faind.win32.dll - MS Windows DLL to be used by software developers
Integra - desktop search tool with GUI (MS Windows .NET).
They all share the same code and have almost the same functionality, with small variations. The faind utility was initially designed as a test tool for the search engine, but it can also be used as a powerful replacement for the system find/findstr programs. If you aren't familiar with the MS Windows command prompt, you'd better try Integra, another search tool with a graphical user interface. The FAIND utility is more suitable for advanced users, system administrators and programmers.
About the name
The name of the utility, fAInd (pronounced the same way as find), combines the verb 'to find' with AI, the acronym for Artificial Intelligence.
The reason is that this utility does the same work as the standard UNIX find tools and, on the other hand, uses algorithms and approaches associated with machine reasoning (a.k.a. artificial intelligence).
There is another reason, less serious at first glance. Try searching the internet for 'faind'. You may be surprised how many web pages contain the word 'faind' instead of the correct word 'find'. So many people make mistakes when writing English (because of its very complex spelling rules) that it is quite reasonable to implement fuzzy search algorithms in a search engine.
We have been working in this field of computer science for a long time (since 1995; some of the source files carry date stamps from that year). The original goal of the SOLARIX Intellectronix project was a speech recognition system (verbal analysis and synthesis, to be more accurate) based on some know-how and new approaches. Several algorithms for natural language handling were developed at that time. This solid scientific basis lets us implement new-generation search tools with features missing from currently available utilities.
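One common way to tolerate such misspellings is to accept words within a small edit distance of the query. The Python sketch below uses the classic Levenshtein distance; it is an illustration of fuzzy matching in general, not necessarily the algorithm FAIND uses, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def fuzzy_matches(query, words, max_distance=1):
    """Return the words within max_distance edits of query, ignoring case."""
    return [word for word in words
            if edit_distance(query.lower(), word.lower()) <= max_distance]
```

With this definition 'faind' is one edit away from 'find' (the extra 'a' is deleted), so a query for either spelling can match the other.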
This program works on the command line (MS Windows and GNU/Linux versions). For this reason it is intended for skilled users who know how to work at the command prompt.
There are some features which make the command line search tool unique.
First of all, it works fine in batch files (command interpreter scripts). This means that you can implement fairly complex algorithms for automatic information processing, especially on GNU/Linux family OSes.
Second, this program is quite compact compared with Integra, since it has no GUI, windows or similar parts.
Third, it does not require installation (although an installer is available for the full version). All configuration parameters are defined in an ini file and can be changed with any text editor.
Fourth, being a full-scale local text search engine (plus local area network and internet site search), it lets any user control every nuance of the search process through a wide range of options.
It can be said that FAIND combines two standard *NIX utilities, find and grep. At first glance this breaks the *NIX paradigm of small programs each performing one function. But an artificial division between file scanning/enumeration (find) and text search (grep) would make FAIND extremely inefficient. The reason is natural language processing: to handle it, the program needs a dictionary file to be loaded, and loading diction.bin takes several seconds. A separate grep analogue that loads the dictionary for every file it processes would be a very time-consuming strategy.
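The cost argument can be made concrete with a toy Python sketch. ToyDictionary is a hypothetical stand-in for a real morphological dictionary such as diction.bin: it holds a handful of hand-written word forms and counts how many times it is constructed, so the number of loads is visible.

```python
class ToyDictionary:
    """Stand-in for a large morphological dictionary; counts its own loads."""
    loads = 0

    def __init__(self):
        ToyDictionary.loads += 1  # a real dictionary would take seconds here
        self.lemma_of = {
            "searches": "search",
            "searched": "search",
            "searching": "search",
            "files": "file",
        }

    def lemma(self, word):
        """Reduce a word form to its base form; unknown words pass through."""
        word = word.lower()
        return self.lemma_of.get(word, word)


def count_hits(texts, keyword, dictionary):
    """Count the texts containing any word form of keyword,
    reusing one already-loaded dictionary for all of them."""
    target = dictionary.lemma(keyword)
    return sum(any(dictionary.lemma(word) == target for word in text.split())
               for text in texts)
```

A single process pays the load cost once and reuses the dictionary for every text. A find-style driver that started a separate grep-like process per file would construct the dictionary once per file instead.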
Internally FAIND consists of these parts:
1. Search domain scanner (also known as a crawler or spider in internet search systems). It recurses into subdirectories, filters files by different criteria, and unpacks archives and compressed files.
2. Lexer. It parses files of different formats and retrieves the text from them. The current version of the search engine contains a universal text retriever, which extracts text from files of arbitrary format (executables or database files, for example).
3. Text pattern search algorithm. An extended regular expression engine performs this task. This part of the search engine does the most complex job: it performs word rooting, regular expression matching, syntactical analysis and some other things.
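The three parts above can be sketched as a small Python pipeline. This is an assumption-laden illustration, not FAIND's code: the text retriever here simply keeps runs of printable ASCII (much like the UNIX strings utility), the matcher is a plain regular expression without word rooting or syntax analysis, and all function names are hypothetical.

```python
import os
import re

# Runs of at least four printable ASCII bytes are treated as text.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def scan(root):
    """Part 1, scanner: walk the directory tree and yield file paths."""
    for dirpath, _dirs, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def extract_text(path):
    """Part 2, lexer: pull printable text out of a file of any format."""
    with open(path, "rb") as handle:
        data = handle.read()
    return "\n".join(run.decode("ascii") for run in PRINTABLE_RUN.findall(data))


def search(root, pattern):
    """Part 3, matcher: yield (path, matched text) for every regex hit."""
    compiled = re.compile(pattern, re.IGNORECASE)
    for path in scan(root):
        try:
            text = extract_text(path)
        except OSError:
            continue  # unreadable file: skip it
        for match in compiled.finditer(text):
            yield path, match.group(0)
```

Because each stage is a generator or a plain function, the matcher starts working on the first file while the scanner is still walking the tree.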
There are many options for constructing the search domain and text pattern, for controlling how patterns are matched to text, for formatting results and so on. All of them are described in this manual, but you can also get the full list of options from the FAIND tool by starting it without any arguments.
The search system loads a dictionary only if the morphology analyzer is activated (see the -wordforms and -index wordforms commands). You can see the version and statistics of the installed dictionary by issuing the -help=6 command. If no dictionary is installed, you can download and install one of the free dictionaries.
© Mental Computing 2009