java.lang.Object
  edu.unm.cs.cs351.tdrl.f09.p1.RobotHandler
public class RobotHandler
Object responsible for validating robots.txt files.
This object implements the robots exclusion logic described in the
robots exclusion specification:
A Standard for Robot Exclusion
This class caches robot info as it runs, so each robots.txt file should be
checked only a single time per web site (technically, per authority, via
the URL.getAuthority() method), regardless of how many individual
pages are accessed on that site. However, no attempt is made to cache
information durably across sessions, so robot parsing will have to start
from scratch on every invocation of Crawler.
The key method in this class (and the only non-trivial public method)
is isAllowed(URL). See its documentation for usage.
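A minimal usage sketch of the intended call pattern. The crawler fragment below is hypothetical (the frontier array and example URLs are made up); only RobotHandler, its constructor, and isAllowed(URL) come from this class.

```java
import java.net.URL;

import edu.unm.cs.cs351.tdrl.f09.p1.RobotHandler;

// Hypothetical crawler fragment showing how a crawl loop might consult
// RobotHandler before fetching each page.
public class PolitenessDemo {
    public static void main(String[] args) throws Exception {
        RobotHandler robots = new RobotHandler();

        // Made-up frontier of candidate pages on one site.
        URL[] frontier = {
            new URL("http://example.com/index.html"),
            new URL("http://example.com/private/report.html")
        };

        for (URL u : frontier) {
            // The first call for an authority fetches and caches its robots.txt;
            // later calls against the same authority reuse the cached rules.
            if (robots.isAllowed(u)) {
                System.out.println("OK to fetch: " + u);
            } else {
                System.out.println("Excluded by robots.txt: " + u);
            }
        }
    }
}
```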
| Constructor Summary | |
|---|---|
| RobotHandler() | Constructor. |
| Method Summary | |
|---|---|
| int | getDownloadCount() Gets the count of the total number of page accesses performed by this module in the process of retrieving robots.txt files. |
| boolean | isAllowed(URL u) Tests whether a URL is allowed by the /robots.txt file for a site. |
| String | toString() Dumps a roughly readable version of the complete robot cache as a single string. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public RobotHandler()
Constructor. Initializes the RobotHandler to an empty state, wherein it has
seen no rules yet and accepts everything by default.
| Method Detail |
|---|
public boolean isAllowed(URL u)
Tests whether a URL is allowed by the /robots.txt file for a site.
This method validates a given URL against the exclusion rules specified
in a site's /robots.txt file. It automatically looks up the
site's robots file, caches the patterns found there, and validates the
URL against the pattern list. By caching the exclusion list, it avoids
accessing the robots file every time the site is accessed; this check should
run in O(lg n) time, where n is the number of prefixes
given in the robots file for the site (plus, possibly, the extra amount of
time necessary to access the robots file the first time).
Parameters:
u - URL to test

Returns:
true if the crawler is allowed to access the specified URL, otherwise false.
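The description above promises a per-authority cache with an O(lg n) prefix lookup. The following is a minimal sketch, under stated assumptions, of one way such a cache could be organized (a TreeSet probed with floor()); it is not RobotHandler's actual implementation, and the robots.txt download/parse step is stubbed out.

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Illustrative sketch only -- NOT RobotHandler's actual implementation.
class RobotCacheSketch {

    // Maps authority (from URL.getAuthority()) to its disallowed path
    // prefixes. Assumption: each set is kept prefix-free (if one stored
    // prefix starts with another, only the shorter one is kept), which is
    // what makes the single floor() probe below sufficient.
    private final Map<String, TreeSet<String>> cache = new HashMap<>();

    boolean isAllowed(URL u) {
        TreeSet<String> prefixes = cache.computeIfAbsent(
                u.getAuthority(), this::fetchAndParseRobots);

        String path = u.getPath();
        // floor() is the O(lg n) step: the greatest stored prefix that is
        // lexicographically <= path. For a prefix-free set, any stored
        // prefix of path must be exactly this element.
        String candidate = prefixes.floor(path);
        return candidate == null || !path.startsWith(candidate);
    }

    private TreeSet<String> fetchAndParseRobots(String authority) {
        // Placeholder: a real handler would download and parse
        // http://<authority>/robots.txt exactly once per authority here.
        return new TreeSet<>();
    }
}
```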
public String toString()

Dumps a roughly readable version of the complete robot cache as a single string.

Overrides:
toString in class Object

See Also:
Object.toString()

public int getDownloadCount()

Gets the count of the total number of page accesses performed by this module
in the process of retrieving robots.txt files. This counts
only successful accesses, in the sense that a robots.txt
file existed and was opened and downloaded. An existing but empty
file is still counted.
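To make the counting rule concrete, here is a hypothetical check (example.com and the page paths are stand-ins): after many isAllowed calls against a single authority, the count should reflect at most one robots.txt download, and only if that file actually existed and was retrieved.

```java
import java.net.URL;

import edu.unm.cs.cs351.tdrl.f09.p1.RobotHandler;

// Hypothetical illustration of the download-count semantics described above.
public class DownloadCountDemo {
    public static void main(String[] args) throws Exception {
        RobotHandler robots = new RobotHandler();

        // Many pages, one authority: caching should keep this to a single
        // robots.txt access.
        for (int i = 0; i < 100; i++) {
            robots.isAllowed(new URL("http://example.com/page" + i + ".html"));
        }

        // Expected to be 1 if http://example.com/robots.txt exists (even if
        // empty) and was downloaded successfully; 0 if it does not exist.
        System.out.println("robots.txt downloads: " + robots.getDownloadCount());
    }
}
```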