Definitions

The following definitions will be used in this document:

BASE-NAME
The initial, stem part of a file name, before the extension. Usually specified by the user. Note that a BASE-NAME MAY contain one or more periods, if so specified by the user.
CLOSED LIST
The set of PAGEs that have been retrieved (downloaded) by MSpider. Used for cycle detection. NOTE: Although this is known colloquially as a ``list'', it need not be implemented literally as a list data structure. A different data structure is likely to be more effective for the goals of the CLOSED LIST.
CRAWL-MAX
The maximum number of PAGEs that the spider engine can download in a single run (including restarts). The spider MUST NOT crawl more PAGEs than allowed by CRAWL-MAX.
MAY
A requirement that the product can choose to implement if desired. Can also indicate a choice among acceptable alternatives (e.g., ``The program MAY do x, y, or z.'' indicates that the choice of behavior x, y, or z is up to the designer.)
MUST
A requirement that the product must implement for full credit.
MUST NOT
A behavior or assumption that must not be violated. Violating a MUST NOT restriction will result in a penalty on the assignment.
OPEN LIST
The set of URLs that have been parsed out of previously retrieved PAGEs, but that have not yet been retrieved or examined themselves. These URLs are candidates for future crawling. NOTE: Although this is known colloquially as a ``list'', it need not be implemented literally as a list data structure. A different data structure is likely to be more effective for the goals of the OPEN LIST.
PAGE
The content document pointed to by a URL. This project MUST support ``text/html'' documents, but the system MAY support additional document types, at the designer's discretion.
RECOVERABLE ERROR
An error condition that the software can ignore, correct, or otherwise recover from. The program MUST produce a warning message and then cleanly continue with no corruption or loss of valid data.
REVERSE INDEX
A mapping that records which PAGEs each WORD occurs in.
SHOULD
A requirement that is recommended, but not required. The designer may violate a SHOULD requirement, but should be prepared to explain why.
SPIDER
The module responsible for crawling the web to locate and retrieve PAGEs.
PUNCTUATION
Punctuation characters. For the purposes of this project, the punctuation characters are considered to be any characters other than WHITESPACE, letters, digits, or parentheses.
QUERY
A user's request for relevant PAGEs, expressed either as a sequence of WORDs that are implicitly joined by conjunction (Section 5.3.2) or (optionally) in an AND/OR query language (Appendix C).
TF/IDF
A scoring function that attempts to assess the relevance of a PAGE with respect to a QUERY. See Appendix A for details.
UNRECOVERABLE ERROR
An error condition from which recovery is impossible. The program MUST produce an error message describing the condition and then cleanly halt.
URL
A pointer to a PAGE, represented in HTML with a fully qualified uniform resource locator. Because each URL maps one-to-one onto a single PAGE, this specification will often use the two terms interchangeably. Note that to ensure the one-to-one mapping, URLs will have to be canonicalized.
WEB DATABASE
The complete database of information necessary for the Moogle client to do its job. This includes the REVERSE INDEX, but will also include additional information to, for example, support crawl restarts and TF/IDF result sorting.
WHITESPACE
Non-printable characters including (but not limited to) space, horizontal and vertical tabs, newlines, and carriage returns. See the Java JDK API call Character.isWhitespace().
WORD
The smallest unit of parsing (for the MSpider engine) or querying (for the Moogle client). In this project, WORDs are considered to be sequences of letters and digits. WORDs SHOULD be treated as case-insensitive, but MAY be treated as case-sensitive. Either way, case (in)sensitivity MUST be documented in the user manual.
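
As an illustration only, the WORD, WHITESPACE, PUNCTUATION, and REVERSE INDEX definitions above could fit together roughly as in the following Java sketch. The class and method names (WordIndexSketch, tokenize, indexPage) are hypothetical and not part of this specification. The sketch assumes the case-insensitive option for WORDs, and it treats parentheses as separators like any other non-WORD character, which is an assumption rather than a requirement.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: names and structure are illustrative assumptions.
class WordIndexSketch {

    // Splits text into WORDs: maximal runs of letters and digits.
    // Everything else (WHITESPACE, PUNCTUATION, parentheses) ends a WORD.
    // WORDs are lowercased, i.e. the case-insensitive option is assumed.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    // Records, for each WORD in the text, that it occurs in the PAGE at url.
    // The map from WORD to set of URLs plays the role of the REVERSE INDEX.
    static void indexPage(Map<String, Set<String>> index, String url, String text) {
        for (String word : tokenize(text)) {
            index.computeIfAbsent(word, k -> new TreeSet<>()).add(url);
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new HashMap<>();
        indexPage(index, "http://example.com/a", "Red fish, blue fish.");
        indexPage(index, "http://example.com/b", "One FISH (two fish)");
        System.out.println(index.get("fish"));
    }
}
```

A set of URLs per WORD is enough to answer which PAGEs contain a WORD. Supporting TF/IDF scoring would presumably require storing occurrence counts as well, which this sketch omits.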

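The OPEN LIST, CLOSED LIST, CRAWL-MAX, and URL definitions above interact during a crawl, and one possible arrangement is sketched below in Java. This is a minimal sketch under stated assumptions, not a required design: the class FrontierSketch and its methods are hypothetical names, the OPEN LIST is assumed to be a FIFO queue backed by a hash set for duplicate detection, and canonicalization is limited to lowercasing the scheme and host, resolving ``.'' and ``..'' path segments, and dropping the fragment. It assumes hierarchical URLs such as http URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: names and policies are illustrative assumptions.
class FrontierSketch {
    private final Deque<String> open = new ArrayDeque<>(); // OPEN LIST, FIFO order
    private final Set<String> seen = new HashSet<>();      // every URL ever enqueued
    private final Set<String> closed = new HashSet<>();    // CLOSED LIST
    private final int crawlMax;                             // CRAWL-MAX

    FrontierSketch(int crawlMax) {
        this.crawlMax = crawlMax;
    }

    // Simplified canonical form: lowercase scheme and host, normalized path,
    // empty path replaced by "/", fragment removed.
    static String canonicalize(String url) {
        try {
            URI u = new URI(url).normalize();
            String scheme = u.getScheme() == null ? null : u.getScheme().toLowerCase(Locale.ROOT);
            String host = u.getHost() == null ? null : u.getHost().toLowerCase(Locale.ROOT);
            String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
            return new URI(scheme, u.getUserInfo(), host, u.getPort(), path, u.getQuery(), null)
                    .toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed URL: " + url, e);
        }
    }

    // Adds a URL parsed out of a PAGE to the OPEN LIST.
    // Returns false if its canonical form was already seen (cycle detection).
    boolean offer(String url) {
        String canonical = canonicalize(url);
        if (seen.add(canonical)) {
            open.addLast(canonical);
            return true;
        }
        return false;
    }

    // Returns the next URL to retrieve and moves it to the CLOSED LIST,
    // or null if the OPEN LIST is empty or CRAWL-MAX has been reached.
    // A real crawler would likely close the URL only after the download.
    String next() {
        if (closed.size() >= crawlMax) {
            return null;
        }
        String url = open.pollFirst();
        if (url != null) {
            closed.add(url);
        }
        return url;
    }

    public static void main(String[] args) {
        FrontierSketch frontier = new FrontierSketch(2);
        frontier.offer("http://example.com/a");
        frontier.offer("HTTP://Example.COM/a#top"); // same PAGE after canonicalization
        frontier.offer("http://example.com/b");
        frontier.offer("http://example.com/c");
        for (String url = frontier.next(); url != null; url = frontier.next()) {
            System.out.println(url);
        }
    }
}
```

Hash sets give constant-time membership tests on average, which is one reason a literal list may be a poor fit for either the OPEN LIST or the CLOSED LIST. Supporting crawl restarts would additionally require persisting both collections, which is outside the scope of this sketch.
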
Terran Lane 2005-09-21