Text search utility user's guide

Introduction

About the name

Brief outline
     Dictionaries

Embedded help

Installation and use

Configuring ini file

Search domain

Query pattern

Search progress

Results

Downloaded files cache

Indexer

Codepages and languages

Examples

Utility compilation

List of all options

 

Another Brick Through the Window, Part 2

We don't need no pull-down-menus
We don't need no rescaled fonts
No dark icons in the corner
Hackers, leave those Macs alone.
Hey! Hackers! Leave them Macs alone!
All in all its just another WIMP up for sale
All in all you're just another WIMP up for the sale.

We don't need no fancy windows
We don't need no title bars
No MultiFinder in the startup
Hackers leave them Macs alone
Hey! Hackers! Leave them Macs alone!
All in all its just another WIMP up for sale
All in all you're just another WIMP up for the sale.

Another Brick Through the Window, Part 3

I don't need no mice around me
And I don't need no fonts to calm me.
I have seen the writing on the wall.
Don't think I need any WIMP at all.
No! Don't think I need any WIMP at all.
No! Don't think I'll need any WIMP at all.
All in all it was all just bricks through the window.
All in all you were all just bricks through the window.


- Nathan Torkington

 

Introduction

FAIND is an advanced desktop search engine (a personal search system) designed especially for texts in natural languages. It searches for text patterns (words or regular expressions) in files at different locations: the personal computer, the local area network and web sites.

This utility enables you to find the files in one or more directory trees and inside the archives that:

There are many options that control text file processing (encodings, special tags, etc.) and pattern matching (morphology, syntax, thesaurus). I have tried to make the search engine's command set compatible with the GNU find and grep programs, although some features are missing. Moreover, some commands have different semantics in find and faind. For example, -regex marks the query pattern as a regular expression in faind, while in find it works as a filter for file names.
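The difference in -regex semantics can be illustrated with a short Python sketch. This is not code from either tool: the function names find_style_regex and faind_style_regex and the in-memory "file system" are made up for illustration. It assumes that GNU find matches the expression against the whole file path, and that faind, as described above, treats the expression as the query pattern for the file text.

```python
import re


def find_style_regex(pattern, paths):
    """GNU find semantics: keep paths that the pattern matches as a whole."""
    return [p for p in paths if re.fullmatch(pattern, p)]


def faind_style_regex(pattern, files):
    """faind semantics as described in this guide: search the file text."""
    return [path for path, text in files.items() if re.search(pattern, text)]


# Hypothetical in-memory "file system": path -> text content.
FILES = {
    "./notes/cats.txt": "The dog barked.",
    "./notes/dogs.txt": "Two cats slept.",
}

if __name__ == "__main__":
    # Same word, two meanings of -regex: one selects by name, one by content.
    print(find_style_regex(r".*cats.*", list(FILES)))  # ['./notes/cats.txt']
    print(faind_style_regex(r"cats", FILES))           # ['./notes/dogs.txt']
```

The same expression selects different files depending on whether it is applied to names or to content, which is why a command line written for find may behave differently when passed to faind.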

Once you have found the files you are looking for, you can list their names on the console (with some additional information, e.g. fragments of the matched text) or process the files in many ways with other system commands.

The built-in indexer/cataloguer is an important part of the tool: it processes the text from files and stores it in a database (a built-in engine is used). Searching files by keywords can be much faster with the index database (instant in some cases).

There are several incarnations of this search engine:

They all share the same code and have almost the same functionality, with minor variations. The faind utility was initially designed as a test tool for the search engine, but it can also be used as a powerful replacement for the system find/findstr programs. If you aren't familiar with the MS Windows command prompt, you had better try Integra, another search tool with a graphical user interface. The FAIND utility is more suitable for advanced users, system administrators and programmers.

 

About the name

The name of the utility, fAInd (pronounced the same way as find), is a combination of the verb 'to find' and the acronym for Artificial Intelligence, AI.

The reason is that this utility does the same work as the standard UNIX find tools, and on the other hand it uses algorithms and approaches associated with machine reasoning (a.k.a. artificial intelligence).

There is another reason, less serious at first glance. Try searching for 'faind' on the internet. You will be surprised how many web pages contain the word 'faind' instead of the correct word 'find'. So many people make mistakes when writing English texts (because of the very complex spelling rules) that it is quite reasonable to implement fuzzy search algorithms in a search engine.

 

Overview

We have been working in this field of computer science for a long time (since 1995; some of the source files bear date stamps from that year). The original goal of the SOLARIX Intellectronix project was a speech recognition system (verbal analysis and synthesis, to be more accurate) based on some know-how and new approaches. Several algorithms for natural language handling were developed at that time. This solid scientific basis lets us implement new-generation search tools with features that are missing from currently available utilities.

This program works on the command line (MS Windows and GNU/Linux versions). For this reason it is intended for skilled users who know how to work at the command prompt.

There are some features which make the command line search tool unique.

It can be said that FAIND combines two standard *NIX utilities, find and grep. At first glance this breaks the *NIX paradigm of small programs that each perform one function. But an artificial division into file scanning/enumeration (find) and text search (grep) would make FAIND extremely inefficient. The reason is natural language processing: to handle it, the program needs to load a dictionary file, and loading diction.bin takes some seconds. A separate grep analogue that loads the dictionary for every file it processes would be a very time-consuming strategy.
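The cost argument above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions, not FAIND's code: the DictionaryLoader class stands in for loading diction.bin and simply counts how often it is called, and the three-entry word-form table and the sample texts are invented.

```python
class DictionaryLoader:
    """Stand-in for the expensive dictionary load; counts calls instead of taking seconds."""

    def __init__(self):
        self.loads = 0

    def load(self):
        self.loads += 1
        # Toy morphology table (an assumption): word form -> base form.
        return {"find": "find", "finds": "find", "found": "find"}


def matches(dictionary, text, lemma):
    """True if any word of the text is a known form of the given base form."""
    return any(dictionary.get(token) == lemma for token in text.split())


def search_per_file(loader, texts, lemma):
    """Separate grep-like tool started once per file: the dictionary is reloaded each time."""
    hits = []
    for name, text in texts.items():
        dictionary = loader.load()
        if matches(dictionary, text, lemma):
            hits.append(name)
    return hits


def search_combined(loader, texts, lemma):
    """Combined scanner and matcher: the dictionary is loaded once for the whole run."""
    dictionary = loader.load()
    return [name for name, text in texts.items() if matches(dictionary, text, lemma)]


# Invented sample data: file name -> text.
TEXTS = {"a.txt": "we found it", "b.txt": "nothing here", "c.txt": "she finds bugs"}

if __name__ == "__main__":
    per_file, combined = DictionaryLoader(), DictionaryLoader()
    print(search_per_file(per_file, TEXTS, "find"), per_file.loads)   # ['a.txt', 'c.txt'] 3
    print(search_combined(combined, TEXTS, "find"), combined.loads)   # ['a.txt', 'c.txt'] 1
```

Both strategies return the same files, but the number of dictionary loads grows with the number of files in the first case and stays at one in the second.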

Internally FAIND consists of these parts:

1. Search domain scanner (also known as a crawler or spider in internet search systems). It recurses into subdirectories, filters the files by different criteria, and unpacks archives and compressed files.

2. Lexer. It parses files of different formats and retrieves the text from them. The current version of the search engine contains a universal text retriever, which extracts text from files of arbitrary format (executables or database files, for example).

3. Text pattern search algorithm. An extended regular expression engine is used for this task. This part of the search engine does the most complex job: it performs word rooting (stemming), regular expression matching, syntactic analysis and some other things.
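The three parts listed above can be sketched as a tiny pipeline in Python. This is a rough illustration, not FAIND's implementation, and all function names are hypothetical. The scanner only walks directories (no archives), the "universal text retriever" is approximated by extracting runs of printable ASCII characters, and the matcher is a plain regular expression without morphology or syntax.

```python
import os
import re
import tempfile

# Runs of at least four printable ASCII characters (an assumed heuristic).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def scan(root):
    """Stage 1 (scanner): walk the directory tree and yield every file path."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def extract_text(path):
    """Stage 2 (lexer): pull readable text out of a file of any format."""
    with open(path, "rb") as handle:
        data = handle.read()
    return " ".join(run.decode("ascii") for run in PRINTABLE_RUN.findall(data))


def search(root, pattern):
    """Stage 3 (matcher): return the files whose extracted text matches the pattern."""
    regex = re.compile(pattern)
    return [path for path in scan(root) if regex.search(extract_text(path))]


def demo():
    """Build a small temporary tree and search it for the word 'text'."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "sub"))
        with open(os.path.join(root, "a.txt"), "w") as handle:
            handle.write("searching for text")
        with open(os.path.join(root, "sub", "b.bin"), "wb") as handle:
            handle.write(b"\x00\x01hidden text pattern\x00\xff")
        with open(os.path.join(root, "sub", "c.txt"), "w") as handle:
            handle.write("nothing")
        return sorted(os.path.relpath(p, root) for p in search(root, r"text"))


if __name__ == "__main__":
    print(demo())
```

The binary file is found because the lexer stage recovers the readable fragment between the non-text bytes, which is the idea behind extracting text from arbitrary file formats.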

There are many options for constructing the search domain and the text pattern, for controlling how patterns are matched to text, for formatting results and so on. All of them are described in this manual, but you can also get the full list of options from the FAIND tool itself: start it without any arguments.

Morphology analyzer dictionaries

The search system loads a dictionary only if the morphology analyzer is activated (see the -wordforms and -index wordforms commands). You can see the version and statistics of the installed dictionary by issuing the -help=6 command. If no dictionary is installed, you can download and install one of the free dictionaries.

Additional information

Embeddable search engine API

Search engine commands

Mental Computing 2009

changed 18-Apr-10