edu.unm.cs.cs351.tdrl.f09.p1
Class RobotHandler

java.lang.Object
  extended by edu.unm.cs.cs351.tdrl.f09.p1.RobotHandler

public class RobotHandler
extends Object

Object responsible for validating URLs against robots.txt files. This class implements the robot exclusion logic specified in the robots exclusion standard:

A Standard for Robot Exclusion

This class caches robot info as it runs, so each robots.txt file should be checked only a single time per web site (technically, per authority, via the URL.getAuthority() method), regardless of how many individual pages are accessed on that site. However, no attempt is made to cache information durably across sessions, so robot parsing will have to start from scratch on every invocation of Crawler.

The key method in this class (and the only non-trivial public method) is isAllowed(URL). See its documentation for usage.
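A minimal usage sketch follows (the page URL is a placeholder, and the actual fetching of the page is elided); it simply consults the handler before touching a page:

    import java.net.URL;
    import edu.unm.cs.cs351.tdrl.f09.p1.RobotHandler;

    public class CrawlExample {
        public static void main(String[] args) throws Exception {
            RobotHandler robots = new RobotHandler();            // empty cache; allows everything until rules are seen
            URL page = new URL("http://example.com/index.html"); // placeholder URL
            if (robots.isAllowed(page)) {
                // ... fetch and process the page ...
            }
        }
    }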

Version:
1.0
Author:
terran

Constructor Summary
RobotHandler()
          Constructor.
 
Method Summary
 int getDownloadCount()
          Gets the total number of page accesses performed by this module while retrieving robots.txt files.
 boolean isAllowed(URL u)
          Test whether a URL is allowed by the /robots.txt file for a site.
 String toString()
          Dumps a roughly readable version of the complete robot cache as a single string.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RobotHandler

public RobotHandler()
Constructor. Takes no arguments -- just initializes the RobotHandler to an empty state, wherein it has seen no rules yet and accepts everything by default.

Method Detail

isAllowed

public boolean isAllowed(URL u)
Test whether a URL is allowed by the /robots.txt file for a site. This method validates a given URL against the exclusion rules specified in the site's /robots.txt file. It automatically looks up the site's robots file, caches the patterns found there, and validates the URL against the pattern list. By caching the exclusion list, it avoids re-fetching the robots file on every access to the site -- after the initial download, a call should complete in O(lg n) time, where n is the number of prefixes given in the robots file for the site (plus, on the first access to a site, the time needed to retrieve the robots file itself).

Parameters:
u - URL to test
Returns:
true if the crawler is allowed to access the specified URL, otherwise false.
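As an illustrative sketch (the URLs below are placeholders, java.net.URL is assumed to be imported, and a surrounding method is assumed to handle MalformedURLException), two queries against the same authority should trigger at most one robots.txt download:

    RobotHandler robots = new RobotHandler();
    URL first  = new URL("http://example.com/index.html");
    URL second = new URL("http://example.com/docs/page.html");

    boolean okFirst  = robots.isAllowed(first);   // first query for example.com: fetches and caches /robots.txt
    boolean okSecond = robots.isAllowed(second);  // same authority: answered from the cached pattern list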

toString

public String toString()
Dumps a roughly readable version of the complete robot cache as a single string. This could potentially be quite a long representation, depending on how many robot files have been scanned.

Overrides:
toString in class Object
See Also:
Object.toString()

getDownloadCount

public int getDownloadCount()
Gets the total number of page accesses performed by this module in the process of retrieving robots.txt files. This counts only successful accesses, in the sense that a robots.txt file existed and was opened and downloaded. A file that exists but is empty is still counted.

Returns:
Total number of robots page accesses.
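A small end-of-crawl reporting sketch (assuming robots is the RobotHandler instance used during the crawl) might combine this method with toString():

    // Report how many robots.txt files were actually downloaded,
    // then dump the cached exclusion rules for inspection.
    System.out.println("robots.txt files downloaded: " + robots.getDownloadCount());
    System.out.println(robots);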