Search domain definition commands

      Introduction

      Folders

      Archives and compressed files

      File lists

      File masks

      Selecting files by size

      Selecting files by modification time

      File categories

      URL (web sites)

      Meta options - 'my documents', 'my computer', 'network neighborhood'

      Searching the indexed files

 

Introduction

The search engine is designed to work with files, although it can also search in text strings via API functions. The files to be processed can be located in different places:

 1. the local file system and connected network disks (local spider or crawler);

 2. the MS Windows local area network (taking authorization into account);

 3. web sites (including hyperlinked pages - the so-called www crawler);

 4. archives and compressed files (most of them are processed without any external tool);

 5. named pipes and sockets.

The search domain (or search area) also includes file filters - masks for file names and conditions on file size, modification and creation dates. Folders are scanned (or crawled) recursively. Archives and compressed files are processed on-the-fly.

File masks are of two types: with wildcards (e.g. *.txt or *.htm?) and regular expressions (see syntax).

Web sites are located by URL. Additionally, a URL mask can be used to filter the documents to download.

The contents of files can be in a wide variety of formats, including plain ASCII text, UTF-8 and UTF-16 text for national characters, HTML, XML, Acrobat PDF, Microsoft Word and some other formats (the full list of supported formats can be viewed here).

One of the most useful features of the search engine is searching indexes. It allows searching indexed sets of files on removable media (CD/DVD) without physical access to the files.

Folders

The folder to be scanned can be given as the first argument on the command line:

faind c:\ ...

There is another way to declare the folder to be processed - the -dir command:

faind -dir c:\

This command is not restricted to the first position in the query string.

By default the subfolders are also crawled (see the ini file for instructions on how to change the default behavior). Use the -recurse option:

-recurse=on - subfolders are scanned recursively

-recurse=off - subfolders ARE NOT scanned

-r - equivalent to -recurse=on

It may be necessary to define several folders to process. In this case you can list all of them in a single -dir option:

-dir "folder1;folder2;..."

Folder names are separated by semicolon ";".

A list of folders can also be stored in a text file and then used in the command line (note the @ character):

-dir @list_file

 

CD/DVD drives can be selected with the -cdrom command.
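
For example, a combined query along these lines (the folder names and the search word are only illustrative) scans two folders recursively for text files:

faind -dir "c:\docs;d:\projects" -recurse=on -name *.txt -sample "cats"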

 

Files

The name of the file to process can be given as the very first argument on the command line:

faind e:\docs\cats.txt ...

You can issue the -file command at any place on the command line:

faind ... -file e:\docs\cat.txt ...

It is possible to process several files in one query. In this case you can either put several -file commands or issue one -file command followed by a list of files:

faind ... -file "aaa;bbb;ccc" ...

The filenames are separated by a semicolon. The length of the list is limited only by the OS shell.

The filenames can also be stored in a text file, and this text file is then used as the argument of the -file command:

faind ... -file @eee ...

Archives and compressed files

Compression formats supported by the search engine are ARJ, GZIP, TAR, BZIP, RAR and 7ZIP (see the actual list). The full list of supported formats can be obtained with the command:

faind -help=5

When the search engine's file scanner finds an archive in a supported format, it automatically unpacks it in memory.

-unpack=on - scan archive contents. To skip archives, use -unpack=off. The default value for this option is defined in the ini file.

Archives and compressed files found on web sites are also automatically downloaded, unpacked and scanned.

Note that archive support significantly increases the total size of the search engine code. In some cases this support is redundant or useless, for example in an embedded search engine. In this case the developer can exclude the unneeded features by recompiling the search engine.
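
As an illustration (the folder, masks and search word are placeholders), a query that unpacks archives while scanning could look like this:

faind c:\backups -unpack=on -name "*.txt;*.doc" -sample "invoice"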

 

File lists

If the first argument of the command line is a file name, then this file will be processed.

For example, the command:

faind CAT.RAR -sample "dog"

searches for the pattern dog in the file CAT.RAR (a compressed file).

There are situations when files are listed in a text file, for example when the list of files is produced by another program:

dir /b *.txt > list

for MS Windows or

ls *.txt > list

for Linux.

Such a file list can then be processed with the option:

-file @list

Files can also be enumerated right in the command line:

-file "filename1;filename2;..."

The -flist option loads a list of files from an XML file. This XML file has a simple format (described here). It can be generated by a previous execution of the search tool FAIND (see the -listfiles:xml option). For example, in the first round

faind c:\ -name *.txt -listfiles my_files ...

the search engine accumulates the files matching these broad conditions and writes them to an XML file. In the second round the command

faind -flist my_files ...

scans only the files selected in the first step.

File masks

File masks help select files by their name (usually by the file name extension).

-name xxx - an ordinary file mask with the wildcards * and ? (or a list of masks separated by the semicolon ';').

-name:rx xxx - a regular expression (or a list of expressions separated by the semicolon ';').

-iname xxx - differs from -name by its case-insensitive behavior.

Examples:

faind c:\ -name *.txt

faind \home -iname "*.txt;*.htm*"

faind e:\docs -name:rx "cat(\w*).(.*)"

MS Windows ignores case in file names, so -iname and -name are equivalent there, whereas GNU/Linux takes the case into account.

A list of masks can be stored in a text file and referred to this way:

faind c:\ -name @text_files

Use regular expression masks only if you really understand the syntax of regular expressions, which is neither easy nor obvious. For example, the option

-iname:rx "(.+)/a(.+)\.txt"

selects the text files (extension *.txt) whose names begin with the letter 'a'. The head part of the regular expression, (.+)/, is used to skip the absolute path of the file that is passed to the filter.

 

Selecting files by size

To select files of a certain size, use the -size option, following it with a condition and the file size to match.

General syntax is:

-size "CCCSSS"

where CCC is a condition, SSS is a size.

The condition is one or two characters:

+ or >     greater than the given size

>=         greater than or equal to the given size

- or <     less than the given size

<=         less than or equal to the given size

= or ==    equal to the given size

!=         not equal to the given size

Size may be given in three scales:

1. as bytes - by default

2. as kilobytes - when the size is followed by K

3. as megabytes - when the size is followed by M

Examples:

-size "<=100K"    search the files whose size is less than or equal to 100 Kb

-size "+1M"          search the files whose size is greater than 1 Mb

-size "!=10000"   search the files whose size is not equal to 10000 bytes
 

The search engine allows two filter options to be used to limit the range:

-size ">1K" -size "<100K"  search the files whose size is between 1 and 100 Kb

 

-empty selects the empty files (it is implemented for compatibility with GNU find).
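
As an illustration (the folder and mask are placeholders), the size filters can be combined with other search domain commands:

faind c:\logs -name *.log -size ">1K" -size "<100K"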

 

Selecting files by modification time

 The following command filters files by their last modification time:

-modif "CCCdate time" "date_format time_format"

-modif "CCCdate" "date_format"

CCC is a condition sign:

+ or >     greater than

>=         greater than or equal

- or <     less than

<=         less than or equal

= or ==    equal

!=         not equal

The date and time format string contains the following control characters:

DD - day number

MM - month number (in the date part) or minutes (in the time part)

YYYY - 4-digit year number

MMM - 3-letter month name (JAN, FEB, ... DEC)

HH - hours

SS - seconds

Examples:

-modif ">=12-01-2003" "dd-mm-yyyy"

-modif "==12.01.2003 15:00:00" "dd.mm.yyyy hh:mm:ss"

Another syntax of the command is possible:

-modif 0

stands for files modified today,

-modif 1

stands for files modified yesterday, and so on.
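
For instance (the folder and mask are placeholders), a complete query for text files modified today could be written as:

faind c:\docs -name *.txt -modif 0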

File categories

The search engine processes only text documents by default. Use the -store_all_files=true command to process all files of the search domain. This command is used by the Integra cataloguer, for example, to store the whole list of files on a CD/DVD.

-allowraw=true activates a heuristic algorithm which extracts text from binary files of unknown format. The command must be used in combination with -raw_ext "aaa;bbb;ccc", which sets the file extensions to be processed by the text extraction algorithm.

 
-allow_audio=true enables the extraction of tags from some audio files (mp3, for example). This option became available in version 0.91.

 
-allow_gfx=true enables the extraction of text comments from picture files (JPEG, for example).

 
-allow_video=true enables the extraction of text comments from video files.

 
-allow_exec=true enables the extraction of the version number and developer name from executables.
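
As an illustration (the folder, extensions and search word are placeholders), raw text extraction could be requested like this:

faind c:\data -allowraw=true -raw_ext "dat;bin" -sample "serial"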

  

URL (web sites)

Use option

-uri address

to scan a web site. The command argument is a URL - the address of a web site or of a single web page. For example:

-uri http://www.somedomain.ru

or

-uri http://127.0.0.1:8080/default.shtml

Configuration of a proxy server (if internet access requires a proxy) is done in the ini file - the proxy variable in the internet section:

[internet]
proxy = "http://172.168.1.222:3129"

The search engine follows hyperlinks if the -href=true option is used. Hyperlinks are ignored by default, because they can have a surprising effect - the search engine would start scanning more and more web sites. There are three ways to limit such uncontrolled surfing.

First, the search engine can be forbidden to go beyond the original site:

-same_domain=true

This option tells the search algorithm to follow only those hyperlinks which point to the same site.

Second, it is possible to limit the depth of the web search, that is, the number of jumps from one link to another:

-maxdepth=NN

Let us consider the case when -maxdepth=2. The search engine starts from -uri=http://www.solarix.ru. The scanner accesses the title page of the site (the index.shtml file). After that the scanner finds a hyperlink which points to www.solarix.ru/for_users/dowsload_them/faind/faind.shtml. This is the first jump. The search engine loads that page and analyses it. This page also has hyperlinks, and each of them causes the scanner to make a second jump. All hyperlinks on .../faind.shtml are processed, each causing a second jump in depth. That is all - no deeper jumps will be made.

Third, it is possible to apply masks to web addresses:

-urimask "(.+)\.gov"

A URL mask is a regular expression. Each hyperlink is checked against the masks. If any mask matches, the hyperlink is followed. A set of masks can be declared in one -urimask option.

-urinotmask can be used to prevent the crawler from following certain hyperlinks:

-urinotmask "(.+)banner(.+);(.+)\.xxx"

 

All of the options listed above can be used in any combination.
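
For instance, a crawl limited to the original site, two jumps deep and 500 Kb of traffic (the site address is taken from the example above; the search word is only a placeholder) could be written as:

faind -uri http://www.solarix.ru -href=true -same_domain=true -maxdepth=2 -maxtraffic=500K -sample "faind"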

Additional features are as follows.

A list of masks can be stored in a text file and used like this:

-urimask @urls_masks

A list of web addresses can be stored in a text file (as usual for FAIND!) and then used at any moment:

-uri @urls_file

The command:

-maxtraffic=XXX

makes it possible to limit the internet traffic when scanning web sites. The limit value can be given in bytes, kilobytes (suffix K) or megabytes (suffix M), e.g.:

-maxtraffic=500K

Downloaded documents can be stored in files on the local host. This allows browsing them offline without downloading them again. Download mode can be switched on with the option:

-store_download=true

The default value for this parameter is defined in the ini file.

Downloaded documents are saved in a special folder defined in the ini file - the download_dir variable in the internet section.
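
For example, the corresponding entry could look like this (the folder path here is only illustrative):

[internet]
download_dir = "c:\faind_downloads"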

Note that there is no simple correspondence between the URL of the original document and the name of the saved file. There are two ways to resolve this.

First, every downloaded file has a paired file with the same name and the extension 'uri'. This file contains a description of the document source.

Second, the XML file with search results stores both the original name of the document (plus its source) and the name of the file in the download directory.

The HTML result file (generated by the -listfiles:html option) does all the work - it contains clickable links to the downloaded documents.

 

Metaoptions - 'my documents', 'my computer', 'network neighborhood'

These options simplify searching in the home catalog (documents folder) of the currently logged-in user, on all disks and in the local area network. We call them 'metaoptions' because the search engine transforms these options into absolute folder names before searching. Note that searching the network neighborhood can take significant time while MS Windows prepares the list of available network resources.

Option

-mydocs

searches the files in the 'MyDocuments' folder (each user has a personal documents folder on most modern OSes, including MS Windows and Linux).

Option

-mycomp

searches the files in every directory on all hard disks - you don't have to enumerate the drives by hand. Pay attention to file access issues: when the search engine encounters an access-denied error, it prints a diagnostic warning on the screen (in the console version of the search tools) and continues searching.

Option

-lan

starts crawling the local area network. Every host (shared resource, to be precise) is opened (if possible) and scanned for files. You can see an example of a LAN search.
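
For instance (the mask and search word are placeholders), a search through the current user's documents could be written as:

faind -mydocs -iname "*.doc;*.txt" -sample "report"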

OCR

Command

-ocr use

enables the OCR system for text extraction from some types of documents.
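
As an illustration (the folder and search word are placeholders, and the document types that OCR actually covers are described elsewhere), the option might be combined with an ordinary folder scan:

faind c:\scans -ocr use -sample "contract"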
 

Using indexes

Another way to define the search domain is an index:

-index domain "CD science fiction" -sample "Stanislaw lem"

More information about index zones can be found in the "Indexer" chapter.

Additional information

Embeddable search engine API

Search engine commands

