public interface CrawlState
Interface to the current state of a web crawl. Objects of this type maintain both the current web graph that the crawler has located to date and the "to do" queue (OPEN LIST) of pages that still need to be downloaded. These objects are essentially the "bookkeepers" of the web crawler -- they keep track of what has already been downloaded and what remains to be downloaded so that the crawler doesn't have to do that itself. They also make sure that each page is only downloaded once so that the crawler doesn't end up going in a loop.
These objects are not responsible for actually connecting to the web and downloading anything. That's the crawler's job.
Objects of this type MUST be serializable.
Objects of this type MUST ensure that each unique URL is returned only once by popNextURL().
Note that, unlike the Graph interface, this interface is not generic --
it is concrete and works only with String objects. Thus, it provides a translation
layer between Graph and the full Crawler.java program so that the Crawler
doesn't have to worry about instantiating Graph correctly every time.
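To make the division of labor concrete, here is a minimal sketch of how a crawler might drive an object of this kind. Everything named here is a hypothetical stand-in invented for illustration: MiniCrawlState is not a real implementation of this interface, a plain adjacency map replaces the real Graph<String>, seeding the queue through the constructor is an assumption, and FAKE_WEB is made-up data standing in for actual downloads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical, simplified stand-in for a CrawlState implementation.
// An adjacency map replaces the real Graph<String> to keep the sketch self-contained.
class MiniCrawlState {
    private final Map<String, Set<String>> links = new HashMap<>();
    private final Queue<String> open = new ArrayDeque<>();   // the OPEN LIST
    private final Set<String> seen = new HashSet<>();        // every URL ever queued

    // Assumption: the crawl is seeded through the constructor.
    MiniCrawlState(String seedURL) {
        seen.add(seedURL);
        open.add(seedURL);
        links.put(seedURL, new HashSet<>());
    }

    boolean hasNextURL() { return !open.isEmpty(); }

    String popNextURL() { return open.poll(); }   // null when the queue is empty

    void addHref(String currURL, String hrefURL) {
        links.computeIfAbsent(currURL, k -> new HashSet<>()).add(hrefURL);
        links.computeIfAbsent(hrefURL, k -> new HashSet<>());
        if (seen.add(hrefURL)) {   // add() is true only the first time a URL is seen
            open.add(hrefURL);
        }
    }
}

class CrawlLoopDemo {
    // Pretend web: page URL -> URLs it links to (made-up data, no network access).
    static final Map<String, List<String>> FAKE_WEB = new HashMap<>();
    static {
        FAKE_WEB.put("http://a.example/", Arrays.asList("http://b.example/", "http://c.example/"));
        FAKE_WEB.put("http://b.example/", Arrays.asList("http://a.example/", "http://c.example/"));
        FAKE_WEB.put("http://c.example/", Arrays.asList("http://a.example/"));
    }

    // The crawler only "downloads" pages; all bookkeeping lives in the state object.
    static List<String> crawl(String seed) {
        MiniCrawlState state = new MiniCrawlState(seed);
        List<String> downloaded = new ArrayList<>();
        while (state.hasNextURL()) {
            String url = state.popNextURL();
            downloaded.add(url);
            for (String href : FAKE_WEB.getOrDefault(url, Collections.<String>emptyList())) {
                state.addHref(url, href);
            }
        }
        return downloaded;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://a.example/"));
    }
}
```

Although the three fake pages link to each other in a cycle, the loop terminates and each page is visited exactly once, because the state object refuses to queue a URL it has already seen.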
| Method Summary | |
|---|---|
| void | addHref(String currURL, String hrefURL): Add a link between the current page and some page that it links to. |
| Graph<String> | getGraph(): Returns a reference to the underlying graph. |
| Queue<String> | getQueue(): Returns a reference to the to-do queue (OPEN LIST). |
| boolean | hasNextURL(): Tests whether there are any more new URLs to examine. |
| String | popNextURL(): Pop the next element from the "to-do" queue (OPEN LIST) and return it. |
| int | queueLength(): Return the number of outstanding URLs on the to-do queue (OPEN LIST). |
| void | saveYourself(String fname): Save the complete contents of a CrawlState object to disk. |
| Method Detail |
|---|
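String popNextURL()
Pop the next element from the "to-do" queue (OPEN LIST) and return it. If the queue is empty, return null. (Note: this MUST NOT generate an error or exception if it is called multiple times on an empty queue.)
This method MUST guarantee to return each unique URL only once.
This method MUST run in O(1) time.
Returns:
The next URL on the to-do queue, or null if the queue is empty.
boolean hasNextURL()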
Tests whether there are any more new URLs to examine. This is analogous to the Iterator.hasNext() method and is used in a similar way. It tells the caller that it is OK to call popNextURL() and continue the search (i.e., that the OPEN LIST is not empty). When this returns false, it is time to stop the search because there are no more unique URLs to examine.
This MUST run in O(1) time.
Returns:
true iff the to-do queue is non-empty.
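The contracts of popNextURL() and hasNextURL() can be sketched with a standard java.util.ArrayDeque, whose poll() returns null on an empty queue rather than throwing. This is only one possible approach under that assumption. The class OpenListSketch and its enqueue() helper are hypothetical and exist purely so the example can be run on its own.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical helper showing only the OPEN LIST part of a CrawlState.
class OpenListSketch {
    private final Queue<String> open = new ArrayDeque<>();

    // Illustration only; in the real interface URLs arrive through addHref().
    void enqueue(String url) { open.add(url); }

    // Constant time: just checks whether the deque holds any element.
    boolean hasNextURL() { return !open.isEmpty(); }

    // Constant time: poll() removes the head, or returns null if the queue is
    // empty, so repeated calls on an empty queue never throw.
    String popNextURL() { return open.poll(); }

    public static void main(String[] args) {
        OpenListSketch s = new OpenListSketch();
        s.enqueue("http://a.example/");
        while (s.hasNextURL()) {             // same idiom as Iterator.hasNext()
            System.out.println(s.popNextURL());
        }
        System.out.println(s.popNextURL());  // null
        System.out.println(s.popNextURL());  // still null, no exception
    }
}
```

Using remove() or element() instead of poll() would throw NoSuchElementException on an empty queue, which the contract above forbids.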
void addHref(String currURL,
String hrefURL)
throws GraphStructureException
Add a link between the current page and some page that it links to. This method adds an edge to the Graph data structure representing the HREF link between the "current" page (the page being crawled right now) and the "destination" page (represented by the HREF). It is also responsible for adding new URLs to the to-do queue (OPEN LIST). This method must determine whether a given URL is novel (has never been seen before) and should therefore be added to the OPEN LIST.
NOTE: This method is solely responsible for ensuring the "each page only downloaded once" requirement. This method MUST put each unique URL onto the OPEN LIST only once.
This method MUST run in amortized O(1) time and require only amortized O(1) space. (Note that these bounds are the same as those of the underlying Graph.addEdge(Object, Object) method.)
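One common way to meet both the "only once" requirement and the amortized O(1) bound is to keep a hash set of every URL ever queued next to the queue itself. The sketch below assumes that approach. AddHrefSketch is a hypothetical class, the adjacency map is a stand-in for the real Graph<String>, and nodes are created on demand here, so no GraphStructureException is modeled.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch; an adjacency map stands in for the real Graph<String>.
class AddHrefSketch {
    private final Map<String, Set<String>> edges = new HashMap<>();
    private final Set<String> seen = new HashSet<>();       // every URL ever queued or crawled
    private final Queue<String> open = new ArrayDeque<>();  // the OPEN LIST

    void addHref(String currURL, String hrefURL) {
        // Record the edge currURL -> hrefURL, creating nodes on demand.
        edges.computeIfAbsent(currURL, k -> new HashSet<>()).add(hrefURL);
        edges.computeIfAbsent(hrefURL, k -> new HashSet<>());

        // Assumption: the page being crawled right now counts as already seen,
        // so a link back to it is never queued.
        seen.add(currURL);

        // HashSet.add() returns true only if the element was not present,
        // which makes the novelty test and the bookkeeping a single expected-O(1) step.
        if (seen.add(hrefURL)) {
            open.add(hrefURL);
        }
    }

    int queueLength() { return open.size(); }

    String popNextURL() { return open.poll(); }

    public static void main(String[] args) {
        AddHrefSketch s = new AddHrefSketch();
        s.addHref("http://a.example/", "http://b.example/");
        s.addHref("http://a.example/", "http://b.example/"); // duplicate link, not re-queued
        s.addHref("http://b.example/", "http://a.example/"); // back link, not queued
        s.addHref("http://b.example/", "http://c.example/"); // novel URL, queued
        System.out.println(s.queueLength());
    }
}
```

Note that the seen set must outlive the queue entries: a URL that has already been popped must still be remembered, otherwise a later link to it would put it back on the OPEN LIST.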
Parameters:
currURL - URL of the page currently being crawled.
hrefURL - URL of a cross-reference from the current page to some destination page.
Throws:
GraphStructureException - Thrown if this attempt to add an edge violates graph structure assumptions (e.g., currURL or hrefURL doesn't exist in the graph). This should be impossible.
Graph<String> getGraph()
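Returns a reference to the underlying graph. Note that this returns a reference to the actual Graph object; it does not make a copy of that object.
This method MUST require O(1) time and 0 additional space allocation. It MUST NOT make a copy of, or call clone() on, the underlying graph data.
Returns:
A reference to the underlying Graph.
int queueLength()
Return the number of outstanding URLs on the to-do queue (OPEN LIST).
This method MUST run in O(1) time and require 0 additional space allocation.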
Queue<String> getQueue()
Returns a reference to the to-do queue (OPEN LIST). Like getGraph(), this MUST NOT copy the underlying data.
If the OPEN LIST is currently empty, this still returns a Queue, but that queue is empty (getQueue().size()==0).
This method MUST run in O(1) time. The returned Queue view MUST use O(1) space.
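The no-copy rule shared by getGraph() and getQueue() can be met most simply by handing back the backing collection itself. The sketch below assumes that choice and shows it for the queue only. QueueViewSketch and its enqueue() helper are hypothetical names used for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the "return a reference, not a copy" rule.
class QueueViewSketch {
    private final Queue<String> open = new ArrayDeque<>();

    // Illustration only; in the real interface URLs arrive through addHref().
    void enqueue(String url) { open.add(url); }

    // Returns the live queue object: O(1) time and no new allocation.
    // Even when empty, this is a real Queue with size() == 0, never null.
    Queue<String> getQueue() { return open; }

    // ArrayDeque tracks its size, so this is O(1) with no allocation.
    int queueLength() { return open.size(); }

    public static void main(String[] args) {
        QueueViewSketch s = new QueueViewSketch();
        Queue<String> q = s.getQueue();
        System.out.println(q.size());           // queue exists even though it is empty
        s.enqueue("http://a.example/");
        System.out.println(q.size());           // the earlier reference sees the new element
        System.out.println(q == s.getQueue());  // same object on every call
    }
}
```

The trade-off of this design is that callers hold the real queue and could modify it behind the state object's back, so a caller that only wants to inspect the OPEN LIST should treat the returned reference as read-only.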
void saveYourself(String fname)
throws IOException
Save the complete contents of a CrawlState object to disk. This serializes the CrawlState object into a file of the specified name. It MAY also compress the saved state (e.g., via the GZIPOutputStream class). It MAY use the Java serialization mechanism (Serializable). (In fact, it is very strongly suggested that it use the serialization mechanism.) This method MUST save enough of this object's state that it can be retrieved from disk and used to restart a web crawl wherever it left off.
Any class that implements this method MUST also implement a method with the following signature:
public static CrawlState loadYourself(String fname) throws IOException
which is the inverse of this method. The loadYourself() method MUST retrieve a disk-image CrawlState, previously saved by this method, and reconstitute it into a working object representing the same state that was saved.
Both methods throw an IOException if there is a difficulty saving/loading the target file.
If the implementer uses some approach other than Java serialization, it MUST require only O(V+E+Q) time and space on disk, where V and E are the sizes of the node set and edge set, respectively, and Q is the length of the to-do queue.
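A save/load pair built on Java serialization with optional GZIP compression might look like the following sketch. SavableStateSketch is a hypothetical class holding only a queue and a seen set. A real implementation would also store its Graph<String> and would declare loadYourself() with a CrawlState return type as required above.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: only the OPEN LIST and the seen set are persisted here.
class SavableStateSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // ArrayDeque and HashSet are themselves Serializable.
    final ArrayDeque<String> open = new ArrayDeque<>();
    final HashSet<String> seen = new HashSet<>();

    // Serialize this object into a GZIP-compressed file.
    void saveYourself(String fname) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new FileOutputStream(fname)))) {
            out.writeObject(this);
        }
    }

    // Inverse of saveYourself(): read the compressed file back into an object.
    static SavableStateSketch loadYourself(String fname) throws IOException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new FileInputStream(fname)))) {
            return (SavableStateSketch) in.readObject();
        } catch (ClassNotFoundException e) {
            // Keep the documented signature by wrapping the checked exception.
            throw new IOException("Saved state refers to an unknown class", e);
        }
    }
}
```

Because the seen set is saved along with the queue, a crawl resumed from the loaded object still refuses to re-queue pages that were downloaded before the save.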
Parameters:
fname - Name of the target file into which the serialized disk representation should be stored.
Throws:
IOException - If there is an error creating the file or writing the object state to it.
See Also:
Serializable, Sun Documentation on Object Serialization, "Horstmann & Cornell, Core Java 2: Volume 1 -- Fundamentals"