Archives and compressed files
Selecting files by size
Selecting files by modification time
URL (web sites)
Meta options - 'my documents', 'my computer', 'network neighborhood'
Searching the indexed files
The search engine is designed to work with files, although API functions can also search within text strings. The files to be processed can be located in different places:
1. local file system and connected network disks (local spider or crawler);
2. MS Windows local area network (taking into account authorization);
3. web sites (including hyper-referenced ones - the so-called www crawler);
4. archives and compressed files (most of them are processed without an external tool);
5. named pipes and sockets.
Search domain (or search area) also includes file filters - masks for file names, conditions on file size, modification and creation dates. Folders are scanned (or crawled) recursively. Archives and compressed files are processed on-the-fly.
File masks are of two types: with wildcards (e.g. *.txt or *.htm?) and regular expressions (see syntax).
Web sites are located by URL. Additionally, a URL mask can be used to filter the documents to download.
The contents of files can be in a wide variety of formats, including plain ASCII text, UTF-8 and UTF-16 text for national characters, HTML, XML, Acrobat PDF, Microsoft Word and some other formats (the full list of supported formats can be viewed here).
One of the most useful features of the search engine is searching indexes. It makes it possible to search indexed sets of files on removable media (CD/DVD) without physical access to the files.
The folder to be scanned can be given as the first argument on the command line:
faind c:\ ...
The folder to be processed can also be declared with the -dir command:
faind -dir c:\
This command may appear anywhere in the query string, not only in the first position.
By default, subfolders are also crawled (the ini file explains how to change the default behavior). Use the -recurse option to control this:
-recurse=on - subfolders are scanned recursively
-recurse=off - subfolders ARE NOT scanned
-r - equivalent to -recurse=on
It may be necessary to process several folders. In this case you can list all of them in a single -dir option, separating the folder names with a semicolon ";".
A list of folders can also be stored in a text file and then used on the command line by prefixing the file name with the @ character.
CD/DVD drives can be selected with the -cdrom command.
The name of file to process can be declared as the very first argument in the command line:
faind e:\docs\cats.txt ...
You can issue the -file command anywhere on the command line:
faind ... -file e:\docs\cat.txt ...
Several files can be declared in one query: either repeat the -file command or issue one -file command followed by a list of files:
faind ... -file "aaa;bbb;ccc" ...
The filenames are separated by a semicolon. The length of the list is limited only by the OS shell.
The filenames can also be stored in a text file, which is then passed as the argument of the -file command:
faind ... -file @eee ...
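The two list conventions described above (a semicolon-separated list and a text file referenced with @) can be illustrated with a minimal Python sketch. It is not FAIND's own code: the function name expand_list_argument is hypothetical, and the assumption that a list file holds one entry per line is a guess suggested by the dir /b and ls examples in this chapter.

```python
from pathlib import Path


def expand_list_argument(arg: str) -> list[str]:
    """Expand a list-style argument into individual entries.

    Assumed conventions (not confirmed by the FAIND sources):
    - "@name" means: read the entries from the text file `name`, one per line;
    - anything else is treated as a semicolon-separated list.
    """
    if arg.startswith("@"):
        text = Path(arg[1:]).read_text(encoding="utf-8")
        # Skip blank lines so a trailing newline does not yield an empty entry.
        return [line.strip() for line in text.splitlines() if line.strip()]
    return [item.strip() for item in arg.split(";") if item.strip()]
```

With this sketch, expand_list_argument("aaa;bbb;ccc") yields the three names from the example above, and expand_list_argument("@eee") reads them from the file eee instead.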
The compression formats supported by the search engine include ARJ, GZIP, TAR, BZIP, RAR and 7ZIP (see the actual list). The full list of supported formats can also be obtained from the search engine itself by a command.
When the file scanner finds an archive in a supported format, it automatically unpacks it in memory.
-unpack=on - scan archive contents. To skip archives, use -unpack=off. The default value of this option is defined in the ini file.
Archives and compressed files from web sites are also automatically downloaded, unpacked and scanned.
Note that archive support significantly increases the total size of the search engine code. In some cases this support is redundant, for example in an embedded search engine. The developer can then exclude the unneeded features by recompiling the search engine.
If the first argument on the command line is the name of a file, this file is processed.
For example, the command:
faind CAT.RAR -sample "dog"
searches for the pattern dog in the file CAT.RAR (a compressed file).
Sometimes the files to process are listed in a text file. For example, the list may be the output of another program:
dir /b *.txt > list
for MS Windows or
ls *.txt > list
The file containing the list can then be passed to the search engine with the @ prefix, as described for the -file command.
Files can also be enumerated directly on the command line.
The -flist option loads the list of files from an XML file. This XML file has a simple format (described here). It can be generated by a previous run of the FAIND search tool (see option -listfiles:xml). For example, in the first round
faind c:\ -name *.txt -listfiles my_files ...
the search engine collects the files matching the broad conditions and writes them to an XML file. In the second round the command
faind -flist my_files ...
starts scanning the files selected at first step.
File masks help to select files by their name (by file name extension, usually).
-name xxx - ordinary file mask with wildcards * and ? (or a list of masks separated by semicolons ';').
-name:rx xxx - regular expression (or a list of expressions separated by semicolons ';').
-iname xxx - same as -name, but case-insensitive.
faind c:\ -name *.txt
faind \home -iname "*.txt;*.htm*"
faind e:\docs -name:rx "cat(\w*).(.*)"
MS Windows ignores case in file names, so -iname and -name are equivalent there, whereas GNU/Linux takes case into account.
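The difference between case-sensitive and case-insensitive wildcard masks can be shown with a short Python sketch. It only illustrates the idea using the standard fnmatch module. The function name matches_masks is hypothetical and the real FAIND matcher may behave differently in corner cases.

```python
import fnmatch
import re


def matches_masks(filename: str, masks: str, ignore_case: bool = False) -> bool:
    """Return True if filename matches any wildcard mask in a ';'-separated list."""
    flags = re.IGNORECASE if ignore_case else 0
    for mask in masks.split(";"):
        # fnmatch.translate converts a wildcard mask (* and ?) into an
        # anchored regular expression for the whole name.
        if re.match(fnmatch.translate(mask), filename, flags):
            return True
    return False
```

Here matches_masks("NOTES.TXT", "*.txt") is false, while the same call with ignore_case=True is true, which mirrors the -name and -iname behavior described for GNU/Linux.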
A list of masks can be stored in a text file and later referenced this way:
faind c:\ -name @text_files
Use regular expression masks only if you really understand regular expression syntax, which is neither easy nor obvious. For example, a regular expression mask can select the text files (extension *.txt) whose names begin with the letter 'a'. The head part of such an expression, (.+)/, skips the absolute path to the file that is passed to the filter.
To select files of a certain size, use the -size option, followed by the condition and the file size to match.
The general syntax is:
-size "CCCSSS"
where CCC is a condition and SSS is a size.
The condition is one or two characters:
+ or > - greater than the given size
>= - greater than or equal to the given size
- or < - less than the given size
<= - less than or equal to the given size
= or == - equal to the given size
!= - not equal to the given size
The size may be given in three scales:
1. in bytes - by default
2. in kilobytes - when the size is followed by K
3. in megabytes - when the size is followed by M
-size "<=100K" search the files whose size is less than or equal to 100 Kb
-size "+1M" search the files whose size is greater than 1 Mb
-size "!=10000" search the files whose size is not equal to 10000 bytes
Two -size options can be used together to limit the range:
-size ">1K" -size "<100K" search the files whose size is between 1 and 100 Kb
-empty selects the empty files (it is implemented for compatibility with GNU find).
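How a size condition such as "<=100K" could be turned into a test on a file size is sketched below in Python. This is only an illustration, not FAIND's implementation. The function name parse_size_condition is hypothetical, and the multipliers (K = 1024 bytes, M = 1024*1024 bytes) are an assumption, since the text does not say whether K means 1000 or 1024.

```python
import operator
import re

# Condition signs taken from the -size examples in this chapter.
_OPS = {
    "+": operator.gt, ">": operator.gt,
    ">=": operator.ge,
    "-": operator.lt, "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq, "==": operator.eq,
    "!=": operator.ne,
}

# Assumption: binary multipliers; the original text does not specify them.
_UNITS = {"": 1, "K": 1024, "M": 1024 * 1024}


def parse_size_condition(text: str):
    """Turn a string like '<=100K' into a predicate over sizes in bytes."""
    # Two-character signs are listed first so that '<=' is not read as '<'.
    m = re.fullmatch(r"(>=|<=|==|!=|[+\-<>=])(\d+)([KM]?)", text)
    if m is None:
        raise ValueError(f"bad size condition: {text!r}")
    sign, number, unit = m.groups()
    limit = int(number) * _UNITS[unit]
    compare = _OPS[sign]
    return lambda size: compare(size, limit)


# A range is expressed by requiring both predicates to hold.
at_least = parse_size_condition(">1K")
at_most = parse_size_condition("<100K")
in_range = at_least(50 * 1024) and at_most(50 * 1024)
```

Combining two predicates with a logical AND corresponds to passing two -size options on the command line.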
The following command filters files by their last modification time:
-modif "CCCdate time" "date_format time_format"
-modif "CCCdate" "date_format"
CCC is a condition sign:
+ or > - greater or equal
- or < - less or equal
= or == - equal
The date and time format string contains the following control characters:
DD - day number
MM - month number for date or minutes for time
YYYY - 4-digit year number
MMM - 3-letter month name (JAN-FEB-...DEC)
HH - hour
SS - seconds
-modif ">=12-01-2003" "dd-mm-yyyy"
-modif "==12.01.2003 15:00:00" "dd.mm.yyyy hh:mm:ss"
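The format strings above use MM both for the month (in the date part) and for minutes (in the time part). The Python sketch below shows one way to resolve that ambiguity by translating the date part and the time part separately. It is an illustration only: the function name parse_modif_value is hypothetical, the value is expected without the leading condition sign, and the assumption that date and time formats are separated by a single space is taken from the two examples.

```python
from datetime import datetime

# Control characters from the list above, mapped to strptime directives.
# Longer codes come first so that MMM is not consumed as MM.
_DATE_CODES = [("YYYY", "%Y"), ("MMM", "%b"), ("MM", "%m"), ("DD", "%d")]
_TIME_CODES = [("HH", "%H"), ("MM", "%M"), ("SS", "%S")]


def _translate(part: str, codes) -> str:
    part = part.upper()
    for code, directive in codes:
        part = part.replace(code, directive)
    return part


def parse_modif_value(value: str, fmt: str) -> datetime:
    """Parse a date (and optional time) according to a FAIND-style format string."""
    # Assumption: the date format and the time format are separated by one space.
    date_fmt, _, time_fmt = fmt.partition(" ")
    py_fmt = _translate(date_fmt, _DATE_CODES)
    if time_fmt:
        py_fmt += " " + _translate(time_fmt, _TIME_CODES)
    return datetime.strptime(value, py_fmt)
```

For instance, parse_modif_value("12.01.2003 15:00:00", "dd.mm.yyyy hh:mm:ss") reads the first mm as the month and the second as minutes.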
Another syntax of the command is possible, in which the argument stands for files modified today, files modified yesterday, and so on.
The search engine processes only text documents by default. Use the -store_all_files=true command to process all files in the search domain. This command is used, for example, by the Integra cataloguer to store the whole list of files on a CD/DVD.
-allowraw=true activates a heuristic algorithm that extracts text from binary files of unknown format. The command must be used in combination with -raw_ext "aaa;bbb;ccc", which sets the file extensions to be processed by the text extraction algorithm.
-allow_audio=true enables the extraction of tags from some audio files (mp3, for example). This option became available in version 0.91.
-allow_gfx=true enables the extraction of text comments from picture files (JPEG, for example).
-allow_video=true enables the extraction of text comments from video files.
-allow_exec=true enables the extraction of the version number and developer name from executables.
The -uri command scans a web site. The command argument is a URL - the address of a web site or of a web page, for example -uri=http://www.solarix.ru.
The proxy server (if access to the internet requires one) is configured in the ini file - variable proxy in section internet:
proxy = "http://184.108.40.206:3129"
The search engine follows hyper references if the option -href=true is used. Hyper references are ignored by default, because following them can have a surprising effect: the search engine would start scanning more and more web sites. There are three ways to limit uncontrolled surfing.
First, the search engine can be forbidden to go beyond the original site. The corresponding option tells the search algorithm to follow only those hyper references that lead to the same site.
Second, the -maxdepth option limits the depth of the web search, that is, the number of jumps from one reference to another.
Consider the case -maxdepth=2. The search engine starts from -uri=http://www.solarix.ru. The scanner accesses the title page of the site (the index.shtml file). There it finds a hyper reference that points to www.solarix.ru/for_users/dowsload_them/faind/faind.shtml. Following it is the first jump. The search engine loads that page and analyses it. This page also contains hyper references; all of them are processed, and following any of them is the second jump. That is all - no deeper jumps are made.
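The depth limit can be pictured as a breadth-first walk that stops expanding pages once the allowed number of jumps is reached. The Python sketch below works on an in-memory table of links instead of real web pages. The function name crawl and the page names in the table are hypothetical, and the sketch is not FAIND's crawler.

```python
from collections import deque


def crawl(start: str, links: dict, max_depth: int) -> list:
    """Return the pages reached from start using at most max_depth jumps."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            # This page was reached by the last allowed jump:
            # it is processed, but its references are not followed.
            continue
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site structure: each page maps to the pages it refers to.
site = {
    "index.shtml": ["faind.shtml"],
    "faind.shtml": ["a.shtml", "b.shtml"],
    "a.shtml": ["deep.shtml"],
}
pages = crawl("index.shtml", site, max_depth=2)
```

With max_depth=2 the walk reaches faind.shtml (first jump) and a.shtml and b.shtml (second jump), but never deep.shtml, which would require a third jump.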
Third, masks can be applied to web addresses with the -urimask option.
A URL mask is a regular expression. Each hyper reference is checked against the masks, and it is followed if any of the masks matches. A set of masks can be declared in one -urimask.
-urinotmask can be used to prevent the crawler from following hyper references.
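The interplay of allowing and forbidding masks can be sketched in Python as follows. Several details are assumptions, because the text does not state them: that a mask may match anywhere in the address rather than the whole address, that a forbidding mask takes precedence over an allowing one, and that every address is allowed when no allowing masks are given. The function name url_allowed and the example.com address are hypothetical.

```python
import re


def url_allowed(url: str, masks=(), not_masks=()) -> bool:
    """Decide whether a hyper reference should be followed."""
    # Assumption: a forbidding mask always wins over an allowing mask.
    if any(re.search(mask, url) for mask in not_masks):
        return False
    # Assumption: with no allowing masks, every address is accepted.
    if masks and not any(re.search(mask, url) for mask in masks):
        return False
    return True


follow = url_allowed(
    "http://www.solarix.ru/for_users/faind.shtml",
    masks=[r"solarix\.ru"],
    not_masks=[r"\.zip$"],
)
```

The address in the usage example passes because it matches the allowing mask and none of the forbidding ones.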
All of the options listed above can be used in arbitrary combination.
Additional features are as follows.
A list of masks can be stored in a text file and later passed to the mask options.
A list of web addresses can be stored in a text file (as usual for FAIND!) and then used at any time.
A traffic limit option makes it possible to limit the internet traffic when scanning web sites. The limit value can be given in bytes, kilobytes (suffix K) or megabytes (suffix M).
Downloaded documents can be stored in files on the local host. This makes it possible to browse them offline without downloading them again. Download mode is switched on by an option.
Default value for this parameter is defined in ini file.
Downloaded documents are saved in special folder defined in ini file - variable download_dir in section internet.
Note that there is no simple correspondence between the URL of an original document and the name of the saved file. There are two ways to solve this problem.
First, every downloaded file has a companion file with the same name and the extension 'uri'. This file contains a description of the document source.
Second, the XML file with search results stores both the original name of the document (plus its source) and the name of the file in the download directory.
The HTML result file (generated by the -listfiles:html option) does all this work - it contains clickable references to the downloaded documents.
Metaoptions - 'my documents', 'my computer', 'network neighborhood'
These options simplify searching in the home directory (documents folder) of the currently logged-in user, on all disks and in the local area network. We call them 'metaoptions' because the search engine transforms these options into absolute folder names before searching. Note that when searching the network neighborhood, MS Windows can take significant time to prepare the list of available network resources.
The 'my documents' metaoption searches the files in the 'MyDocuments' folder (each user has a personal folder for documents on most modern OSes, including MS Windows and Linux).
The 'my computer' metaoption searches the files in every directory of all hard disks - you don't have to enumerate all drives by hand. Pay attention to file access rights: when the search engine is denied access to a file, it prints a diagnostic warning on the screen (in the console version of the search tools) and continues searching.
The 'network neighborhood' metaoption starts crawling the local area network. Every host (shared resource, to be precise) is opened (if possible) and scanned for files. You can see an example of LAN search.
A further option enables the OCR system for text extraction from some types of documents.
Another way to define a search domain is an index:
-index domain "CD science fiction" -sample "Stanislaw lem"
More information about the zone is in the "Indexer" chapter.
© Mental Computing 2009