Definitions
The following definitions will be used in this document:
- AND/OR
- The QUERY language for the Moogle client.
It allows a sequence of WORDs joined by the AND and OR
connectives, with parentheses for grouping. See
Section 4.3.2 for details.
- CRAWL-MAX
- The maximum number of PAGEs that the spider engine
can download in a single run (including restarts). The spider MUST
NOT crawl more PAGEs than allowed by CRAWL-MAX.
- MAY
- A requirement that the product can choose to implement if
desired. Can also indicate a choice among acceptable alternatives
(e.g., "The program MAY do x, y, or z" indicates that the choice of
behavior x, y, or z is up to the designer).
- MUST
- A requirement that the product must implement for full
credit.
- MUST NOT
- A behavior or assumption that must not be violated.
Violating a MUST NOT restriction will result in a penalty on the
assignment.
- PAGE
- The content document pointed to by a URL. This project
MUST support "text/html" documents, but the system MAY support
additional document types, at the designer's discretion.
- RECOVERABLE ERROR
- An error condition that the software can
ignore, correct, or otherwise recover from. The program MUST produce
a warning message and then cleanly continue with no corruption or loss
of valid data.
- REVERSE INDEX
- A mapping that records which PAGEs each WORD
occurs in.
- SHOULD
- A requirement that is recommended, but not required.
The designer may violate a SHOULD requirement, but should be
prepared to explain why.
- PUNCTUATION
- Punctuation characters. For the purposes of this
project, the punctuation characters are considered to be any
characters other than WHITESPACE, letters, digits, or parentheses.
- QUERY
- A user's request for relevant PAGEs, expressed as a
sequence of WORDs joined by an AND/OR query language. See
Section 4.3.2 for details.
- TF/IDF
- A scoring function that attempts to assess the
relevance of a PAGE with respect to a QUERY. See Appendix A for
details.
- UNRECOVERABLE ERROR
- An error condition from which recovery is
impossible. The program MUST produce an error message describing the
condition and then cleanly halt.
- URL
- A pointer to a PAGE, represented in HTML with a fully
qualified uniform resource locator. Because each URL maps
one-to-one onto a single PAGE, this specification will often use the
two interchangeably. Note that to ensure the one-to-one mapping, URLs
will have to be canonicalized.
- WEB DATABASE
- The complete database of information necessary
for the Moogle client to do its job. This includes the
REVERSE INDEX, but will also include additional information to, for
example, support crawl restarts and TF/IDF result sorting.
- WHITESPACE
- Non-printable characters including (but not limited
to) space, horizontal and vertical tabs, newlines, and carriage
returns. See the Java JDK API call Character.isWhitespace().
- WORD
- The smallest unit of parsing (for the MSpider
engine) or querying (for the Moogle client). In this
project, WORDs are considered to be sequences of letters and digits,
excluding the two reserved words AND and OR. WORDs
SHOULD be treated as case-insensitive (with the exception of
AND and OR), but MAY be treated as case-sensitive.
Either way, case (in)sensitivity MUST be documented in the user
manual.
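
As an informal illustration of the PUNCTUATION, WHITESPACE, WORD, and
REVERSE INDEX definitions above, the following Java sketch classifies
characters, splits text into WORDs, and records which PAGEs each WORD
occurs in. The names WordSketch, isPunctuation, words, and index are
hypothetical and are not part of this specification. The sketch assumes
the case-insensitive option for WORDs, and assumes that only the exact
upper-case tokens AND and OR are reserved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class; names are illustrative only.
final class WordSketch {

    // PUNCTUATION: any character other than WHITESPACE, letters,
    // digits, or parentheses.
    static boolean isPunctuation(char c) {
        return !Character.isWhitespace(c)
                && !Character.isLetterOrDigit(c)
                && c != '(' && c != ')';
    }

    // Splits text into WORDs: maximal runs of letters and digits,
    // lower-cased (assumed case-insensitive option), with the
    // reserved words AND and OR excluded.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        // Run one step past the end so the final token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            boolean inWord = i < text.length()
                    && Character.isLetterOrDigit(text.charAt(i));
            if (inWord) {
                cur.append(text.charAt(i));
            } else if (cur.length() > 0) {
                String token = cur.toString();
                cur.setLength(0);
                if (!token.equals("AND") && !token.equals("OR")) {
                    out.add(token.toLowerCase(Locale.ROOT));
                }
            }
        }
        return out;
    }

    // REVERSE INDEX: maps each WORD to the set of URLs of the PAGEs
    // in which it occurs.
    static void index(Map<String, Set<String>> reverseIndex,
                      String url, String text) {
        for (String w : words(text)) {
            reverseIndex.computeIfAbsent(w, k -> new TreeSet<>()).add(url);
        }
    }

    public static void main(String[] args) {
        // prints [cats, dogs, cats, birds]
        System.out.println(words("Cats AND dogs, cats (OR birds)!"));

        Map<String, Set<String>> reverseIndex = new TreeMap<>();
        index(reverseIndex, "http://a.example/", "Cats AND dogs");
        index(reverseIndex, "http://b.example/", "cats");
        System.out.println(reverseIndex);
    }
}
```

A real WEB DATABASE would hold more than this map (for example, the
data needed for crawl restarts and TF/IDF sorting); the sketch covers
only the REVERSE INDEX portion.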
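
The URL definition above notes that URLs will have to be canonicalized
to preserve the one-to-one mapping onto PAGEs. This specification does
not prescribe the canonicalization rules, so the following Java sketch
shows only one plausible set: lower-casing the scheme and host,
resolving "." and ".." path segments, dropping default ports and
fragments, and using "/" for an empty path. The class name
UrlCanonicalizer and the method canonicalize are hypothetical, and the
choice of rules is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper class; the rule set below is an assumption,
// not a requirement of this specification.
final class UrlCanonicalizer {

    static String canonicalize(String raw) {
        try {
            // normalize() resolves "." and ".." path segments.
            URI u = new URI(raw.trim()).normalize();
            if (u.getScheme() == null || u.getHost() == null) {
                throw new IllegalArgumentException(
                        "not a fully qualified URL: " + raw);
            }
            String scheme = u.getScheme().toLowerCase(Locale.ROOT);
            String host = u.getHost().toLowerCase(Locale.ROOT);

            // Drop the port when it is the default for the scheme.
            int port = u.getPort();
            if ((scheme.equals("http") && port == 80)
                    || (scheme.equals("https") && port == 443)) {
                port = -1;
            }

            // Treat an empty path as the root path.
            String path = u.getPath();
            if (path == null || path.isEmpty()) {
                path = "/";
            }

            // Rebuild without user info or fragment; keep the query.
            return new URI(scheme, null, host, port, path,
                    u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("malformed URL: " + raw, e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/a/c.html
        System.out.println(
                canonicalize("HTTP://Example.COM:80/a/./b/../c.html#top"));
    }
}
```

Because the sketch rebuilds the URL from decoded components,
percent-encoded characters in the path or query may be re-encoded
differently from the input; a production canonicalizer would need to
handle that case deliberately.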
Terran Lane
2005-01-26