Definitions

The following definitions will be used in this document:

AND/OR
The QUERY language for the Moogle client. Allows a sequence of WORDs joined by the conjunctions AND and OR, as well as parenthetical grouping. See Section 4.3.2 for details.
CRAWL-MAX
The maximum number of PAGEs that the spider engine can download in a single run (including restarts). The spider MUST NOT crawl more PAGEs than allowed by CRAWL-MAX.
MAY
A requirement that the product can choose to implement if desired. Can also indicate a choice among acceptable alternatives (e.g., "The program MAY do x, y, or z" indicates that the choice of behavior x, y, or z is up to the designer).
MUST
A requirement that the product must implement for full credit.
MUST NOT
A behavior or assumption that must not be violated. Violating a MUST NOT restriction will result in a penalty on the assignment.
PAGE
The content document pointed to by a URL. This project MUST support "text/html" documents, but the system MAY support additional document types, at the designer's discretion.
RECOVERABLE ERROR
An error condition that the software can ignore, correct, or otherwise recover from. The program MUST produce a warning message and then cleanly continue with no corruption or loss of valid data.
REVERSE INDEX
A mapping that records which PAGEs each WORD occurs in.
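
As an illustration of the REVERSE INDEX definition above, here is a minimal in-memory sketch. The class name ReverseIndexSketch and its methods are hypothetical; this specification does not prescribe a data structure or a storage format, and a real WEB DATABASE would also need to be saved to disk.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical class; the specification does not mandate this design.
public class ReverseIndexSketch {
    // Maps each WORD to the set of URLs of the PAGEs that contain it.
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Record that the given WORD occurs in the PAGE identified by url.
    public void add(String word, String url) {
        postings.computeIfAbsent(word, k -> new TreeSet<>()).add(url);
    }

    // Return the PAGEs containing the WORD (an empty set if the WORD is unknown).
    public Set<String> pagesFor(String word) {
        Set<String> pages = postings.getOrDefault(word, Collections.emptySet());
        return Collections.unmodifiableSet(pages);
    }

    public static void main(String[] args) {
        ReverseIndexSketch index = new ReverseIndexSketch();
        index.add("moogle", "http://example.com/a.html");
        index.add("moogle", "http://example.com/b.html");
        index.add("spider", "http://example.com/b.html");
        System.out.println(index.pagesFor("moogle"));
        System.out.println(index.pagesFor("missing"));
    }
}
```

Using a set per WORD means that a PAGE is recorded only once, no matter how often the WORD occurs in it. Occurrence counts, which a relevance score would likely need, are deliberately left out of this sketch.
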
SHOULD
A requirement that is recommended, but not required. The designer may violate a SHOULD requirement, but should be prepared to explain why.
PUNCTUATION
Punctuation characters. For the purposes of this project, the punctuation characters are considered to be any characters other than WHITESPACE, letters, digits, or parentheses.
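
The PUNCTUATION definition above can be read as a simple character test. The sketch below is one possible reading, not a required implementation: the class and method names are hypothetical, and it assumes that "letters" and "digits" mean whatever Character.isLetterOrDigit() accepts, which this specification does not state.

```java
// Hypothetical helper; names and the Unicode interpretation are assumptions.
public class CharClassSketch {
    // PUNCTUATION: anything that is not WHITESPACE, a letter, a digit,
    // or a parenthesis.
    public static boolean isPunctuation(char c) {
        return !Character.isWhitespace(c)
            && !Character.isLetterOrDigit(c)
            && c != '(' && c != ')';
    }

    public static void main(String[] args) {
        for (char c : new char[] {',', '(', 'a', '7', ' ', '-'}) {
            System.out.println("'" + c + "' punctuation? " + isPunctuation(c));
        }
    }
}
```
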
QUERY
A user's request for relevant PAGEs, expressed as a sequence of WORDs in the AND/OR query language. See Section 4.3.2 for details.
TF/IDF
A scoring function that attempts to assess the relevance of a PAGE with respect to a QUERY. See Appendix A for details.
UNRECOVERABLE ERROR
An error condition from which recovery is impossible. The program MUST produce an error message describing the condition and then cleanly halt.
URL
A pointer to a PAGE, represented in HTML with a fully qualified uniform resource locator. Because each URL maps one-to-one onto a single PAGE, this specification will often use the two terms interchangeably. Note that to ensure the one-to-one mapping, URLs will have to be canonicalized.
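
The URL definition above does not say which canonicalization rules to apply. The following sketch shows one plausible set, all of which are assumptions: lowercase the scheme and host, resolve "." and ".." path segments, use "/" for an empty path, and drop the fragment. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper; the canonicalization rules below are assumptions,
// not rules taken from this specification.
public class UrlCanonicalizerSketch {
    public static String canonicalize(String raw) {
        try {
            // normalize() resolves "." and ".." segments in the path.
            URI uri = new URI(raw.trim()).normalize();
            String scheme = uri.getScheme() == null
                ? null : uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost() == null
                ? null : uri.getHost().toLowerCase(Locale.ROOT);
            String path = (uri.getPath() == null || uri.getPath().isEmpty())
                ? "/" : uri.getPath();
            // Rebuild without the fragment, which only selects a position
            // inside a PAGE. Note that this constructor re-encodes the
            // decoded path, so unusual percent-escapes may change form.
            return new URI(scheme, uri.getUserInfo(), host, uri.getPort(),
                           path, uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed URL: " + raw, e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/b.html
        System.out.println(canonicalize("HTTP://Example.COM/a/../b.html#top"));
        // prints http://example.com/
        System.out.println(canonicalize("http://example.com"));
    }
}
```

Two URLs that canonicalize to the same string can then be treated as the same PAGE, for example when deciding whether a link has already been crawled.
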
WEB DATABASE
The complete database of information necessary for the Moogle client to do its job. This includes the REVERSE INDEX, but will also include additional information to, for example, support crawl restarts and TF/IDF result sorting.
WHITESPACE
Non-printable characters including (but not limited to) space, horizontal and vertical tabs, newlines, and carriage returns. See the Java JDK API call Character.isWhitespace().
WORD
The smallest unit of parsing (for the MSpider engine) or querying (for the Moogle client). In this project, WORDs are considered to be sequences of letters and digits, excluding the two reserved words AND and OR. WORDs SHOULD be treated as case-insensitive (with the exception of AND and OR), but MAY be treated as case-sensitive. Either way, case (in)sensitivity MUST be documented in the user manual.
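
To make the WORD definition above concrete, here is a minimal tokenizer sketch. It takes the recommended case-insensitive option by lowercasing every WORD, and it assumes that the reserved words are recognized only in upper case, so that "and" is an ordinary WORD while "AND" is not. The class and method names are hypothetical, and letters and digits are taken to be whatever Character.isLetterOrDigit() accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; one possible reading of the WORD definition.
public class WordTokenizerSketch {
    public static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            // Treat the end of input as a separator so the last WORD is flushed.
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                String token = current.toString();
                current.setLength(0);
                // AND and OR are reserved; matching them case-sensitively
                // is an assumption of this sketch.
                if (!token.equals("AND") && !token.equals("OR")) {
                    result.add(token.toLowerCase(Locale.ROOT));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [moogle, finds, pages, and, more]
        System.out.println(words("Moogle finds pages, AND and more!"));
    }
}
```

Whichever case policy is chosen, the same tokenizer should be applied to PAGE text and to QUERY text, so that a WORD typed by the user matches the form stored in the REVERSE INDEX.
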

Terran Lane 2005-02-14