edu.unm.cs.cs351.tdrl.f09.p1
Interface CrawlState


public interface CrawlState

Interface to the current state of a web crawl. Objects of this type maintain both the current web graph that the crawler has located to date and the "to do" queue (OPEN LIST) of pages that still need to be downloaded. These objects are essentially the "bookkeepers" of the web crawler -- they keep track of what has already been downloaded and what remains to be downloaded so that the crawler doesn't have to do that itself. They also make sure that each page is only downloaded once so that the crawler doesn't end up going in a loop.

These objects are not responsible for actually connecting to the web and downloading anything. That's the crawler's job.

Objects of this type MUST be serializable.

Objects of this type MUST ensure that each unique URL is placed on the to-do queue (OPEN LIST), and therefore downloaded, at most once.

Note that, unlike the Graph interface, this interface is not generic -- it works only with String objects. It thus provides a translation layer between Graph and the full Crawler.java program, so the Crawler doesn't have to worry about instantiating Graph correctly every time.

Version:
1.0
Author:
terran

Method Summary
 void addHref(String currURL, String hrefURL)
          Add a link between the current page and some page that it links to.
 Graph<String> getGraph()
          Returns a reference to the underlying graph.
 Queue<String> getQueue()
          Returns a reference to the to-do queue (OPEN LIST).
 boolean hasNextURL()
          Tests whether there are any more new URLs to examine.
 String popNextURL()
          Pop the next element from the "to-do" queue (OPEN LIST) and return it.
 int queueLength()
          Return the number of outstanding URLs on the to-do queue (OPEN LIST).
 void saveYourself(String fname)
          Save the complete contents of a CrawlState object to disk.
 

Method Detail

popNextURL

String popNextURL()
Pop the next element from the "to-do" queue (OPEN LIST) and return it. If the queue is empty, this returns null. (Note: this MUST NOT generate an error or exception if it is called multiple times on an empty queue.)

This method MUST guarantee to return each unique URL only once.

This method MUST run in O(1) time.

Returns:
Next URL from the to-do queue, or null if the queue is empty.
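
For illustration, a minimal sketch of one possible implementation, assuming the implementing class keeps its OPEN LIST in an internal java.util.ArrayDeque<String> field named openList (that field name is an assumption, not part of this interface). ArrayDeque.poll() already returns null on an empty queue, so repeated calls never throw:

    // assumes: import java.util.ArrayDeque;
    private final ArrayDeque<String> openList = new ArrayDeque<>();

    @Override
    public String popNextURL() {
        return openList.poll();   // O(1); null when the OPEN LIST is empty
    }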

hasNextURL

boolean hasNextURL()
Tests whether there are any more new URLs to examine. This method resembles the Iterator.hasNext() method and is used in a similar way. It tells the caller that it is ok to call popNextURL() and continue the search. (I.e., that the OPEN LIST is not empty.) When this returns false, it is time to stop the search because there are no more unique URLs to examine.

This MUST run in O(1) time.

Returns:
true iff the to-do queue is non-empty.
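
Taken together, hasNextURL() and popNextURL() support the usual crawl loop. A sketch of such a caller (the downloadAndParse() helper and the state variable are hypothetical names, not part of this interface):

    CrawlState state = /* some implementation of this interface */;
    while (state.hasNextURL()) {
        String url = state.popNextURL();   // non-null here, since hasNextURL() was true
        // downloadAndParse() would fetch the page and call
        // state.addHref(url, href) for every HREF found on it.
        downloadAndParse(url, state);
    }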

addHref

void addHref(String currURL,
             String hrefURL)
             throws GraphStructureException
Add a link between the current page and some page that it links to. This creates an EDGE in the underlying Graph data structure representing the HREF link between the "current" page (the page being crawled right now) and the "destination" page (represented by the HREF). This method is also responsible for adding new URLs to the to-do queue (OPEN LIST): it must determine whether a given URL is novel -- never seen before -- and, if so, add it to the OPEN LIST.

NOTE This method is solely responsible for ensuring the "each page only downloaded once" requirement. This method MUST only put each unique URL onto the OPEN LIST once.

This method MUST run in amortized O(1) time and require only amortized O(1) space. (Note that these bounds are the same as the underlying Graph.addEdge(Object, Object) method.)

Parameters:
currURL - URL of the page currently being crawled.
hrefURL - URL of a cross-reference from the current page to some destination page.
Throws:
GraphStructureException - Thrown if this attempt to add an edge violates graph structure assumptions (e.g., currURL or hrefURL doesn't exist in the graph). This should be impossible.
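
A minimal sketch of the novelty test, assuming internal fields graph (the underlying Graph<String>), openList (the OPEN LIST), and seen (a java.util.HashSet<String> of every URL ever encountered). None of these fields are dictated by this interface -- they are illustrative assumptions. HashSet.add() returns false on duplicates, which provides the amortized O(1) bound:

    @Override
    public void addHref(String currURL, String hrefURL) throws GraphStructureException {
        // Record the HREF as an edge in the underlying graph.  (This sketch assumes
        // addEdge() creates any node it has not seen before; adjust to your Graph.)
        graph.addEdge(currURL, hrefURL);

        // Enqueue hrefURL for download only the first time it is ever seen.
        // HashSet.add() returns false for duplicates, so each URL is enqueued at most once.
        if (seen.add(hrefURL)) {
            openList.add(hrefURL);
        }
    }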

getGraph

Graph<String> getGraph()
Returns a reference to the underlying graph. Note: This returns a reference (essentially, a pointer) to the Graph object; it does not make a copy of that object.

This method MUST require O(1) time and 0 additional space allocation. It MUST NOT make a copy of, or call clone() on, the underlying graph data.

Returns:
Reference to the Graph
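
Because copying is forbidden, an implementation can do no more than hand back its internal field (here hypothetically named graph):

    @Override
    public Graph<String> getGraph() {
        return graph;   // O(1): a reference, never a copy or clone
    }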

queueLength

int queueLength()
Return the number of outstanding URLs on the to-do queue (OPEN LIST). This fetches the length of the to-do queue. If the queue is empty, it returns 0. Otherwise, it returns a positive integer count.

This method MUST run in O(1) time and 0 space allocation.

Returns:
Number of elements in the to-do queue (OPEN LIST).
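
With the OPEN LIST stored in any standard java.util collection, this is a one-liner (openList is again a hypothetical internal field):

    @Override
    public int queueLength() {
        return openList.size();   // 0 when empty, otherwise a positive count; O(1)
    }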

getQueue

Queue<String> getQueue()
Returns a reference to the to-do queue (OPEN LIST). Like getGraph(), this MUST NOT copy the underlying data.

If the OPEN LIST is currently empty, this still returns a Queue, but that queue is empty (getQueue().size()==0).

This method MUST run in O(1) time. The returned Queue view MUST use only O(1) additional space.

Returns:
Immutable queue view of the to-do queue (OPEN LIST).
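
Since java.util.Collections offers no unmodifiableQueue() factory, one way to provide a read-only view in O(1) time and space is a thin forwarding wrapper. A sketch, with openList as the hypothetical internal field again (note that this simple wrapper does not also guard the iterator's remove() method):

    // assumes: import java.util.AbstractQueue; import java.util.Iterator; import java.util.Queue;
    @Override
    public Queue<String> getQueue() {
        final Queue<String> backing = openList;
        return new AbstractQueue<String>() {
            @Override public Iterator<String> iterator() { return backing.iterator(); }
            @Override public int size()                  { return backing.size(); }
            @Override public String peek()               { return backing.peek(); }
            @Override public String poll()               { throw new UnsupportedOperationException(); }
            @Override public boolean offer(String s)     { throw new UnsupportedOperationException(); }
        };
    }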

saveYourself

void saveYourself(String fname)
                  throws IOException
Save the complete contents of a CrawlState object to disk. This serializes the CrawlState object into a file of the specified name. It MAY also compress the serialized data (e.g., via the GZIPOutputStream class). It MAY use the Java serialization mechanism (Serializable). (In fact, it is very strongly suggested that it use the serialization mechanism.) This method MUST save enough of this object's state that it can be retrieved from disk and used to re-start a web crawl wherever it left off.

Any class that implements this interface MUST also implement a method with the following signature: public static CrawlState loadYourself(String fname) throws IOException, which is the inverse of this method. The loadYourself() method MUST retrieve a disk-image CrawlState formerly saved by this method and reconstitute it into a working object representing the same state that was saved.

Both methods throw an IOException if there is difficulty saving or loading the target file.

If the implementer uses some approach other than Java serialization, it MUST require only O(V+E+Q) time and space on disk, where V,E are the sizes of the node set and edge set, respectively, and Q is the length of the to-do queue.

Parameters:
fname - Name of the target file into which the serialized disk representation should be stored
Throws:
IOException - If there is an error creating the file or writing the object state to it.
See Also:
Serializable, Sun Documentation on Object Serialization, "Horstmann & Cornell, Core Java 2: Volume 1 -- Fundamentals"
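
A minimal sketch of the suggested serialization approach, with gzip compression, assuming the implementing class declares implements CrawlState, Serializable and that all of its fields are themselves serializable:

    // assumes: import java.io.*;
    //          import java.util.zip.GZIPInputStream;
    //          import java.util.zip.GZIPOutputStream;

    @Override
    public void saveYourself(String fname) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new FileOutputStream(fname)))) {
            out.writeObject(this);   // one call captures the whole bookkeeping state (graph + OPEN LIST)
        }
    }

    public static CrawlState loadYourself(String fname) throws IOException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new FileInputStream(fname)))) {
            return (CrawlState) in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException("cannot reconstitute CrawlState from " + fname, e);
        }
    }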