Searching HTTP resources

See also: LAN searching

The web crawler is the part of the search engine that scans web pages connected by hyperlinks. The starting web page is defined by the -url option:

faind -url "http://solarix.ru" -sample "search engine"

The proxy server settings must be configured correctly before using the web crawler.

By default the search engine does not follow the hyperlinks found on web pages. This prevents unexpected traffic caused by links leading from the original address to other web sites. For this reason only a single web page is processed in the first example.

There are three options that allow the crawler to follow links. First, -href=true enables scanning web pages for hyperlinks. Second, -maxdepth=N defines the depth of the link recursion (the third option, -same_domain, is described below). The next example shows how to make the crawler process all pages of a web server:

faind -url "http://solarix.ru" -href=true -maxdepth=100 -sample "search engine"


There is another default limitation: the crawler does not visit hyperlinks outside the original domain. The -same_domain=false option removes this limitation. It is highly recommended to use URL filters (see -urimask and -urinotmask) to limit the range of visited web pages:

faind -url "http://solarix.ru" -href=true -maxdepth=100 -same_domain=false -urimask ".+\.ru"  -urimask ".+adv.+" -sample "search engine"

This example shows how to filter web pages by their URLs: 1) only pages in the .ru domain are allowed, 2) links whose URL matches the advertising pattern are excluded.
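The following Python sketch illustrates the intended allow/deny semantics of the URL masks as described above: a link is visited only if it matches at least one allow mask (-urimask) and no deny mask (-urinotmask). Whether faind applies the expressions to the whole URL or to a substring is an assumption here, and the candidate URLs are made-up examples.

# Sketch of allow/deny URL filtering (illustration only, not faind's code).
import re

allow_masks = [re.compile(r".+\.ru")]     # at least one allow mask must match
deny_masks  = [re.compile(r".+adv.+")]    # no deny mask may match

def url_allowed(url):
    if allow_masks and not any(m.search(url) for m in allow_masks):
        return False
    if any(m.search(url) for m in deny_masks):
        return False
    return True

# Hypothetical candidate links found on a page.
candidates = [
    "http://solarix.ru/for_developers/docs/ru/faind.shtml",
    "http://example.com/page.html",
    "http://banners.example.ru/adv/banner.html",
]
for url in candidates:
    print(url, "->", "visit" if url_allowed(url) else "skip")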


There is no default traffic limit for a search session. Use the -maxtraffic option for this purpose:

faind -url "http://solarix.ru" -href=true -maxdepth=100 -same_domain=false -maxtraffic=1M -sample "search engine"
