REVERSE INDEX Report

MoogAlyzer MUST create a file that displays the mapping from WORDs to PAGEs, along with the number of occurrences of each WORD in each PAGE and the corresponding TF/IDF score for that WORD/PAGE combination.

Specifically, the REVERSE INDEX report consists of a series of newline-separated WORD entries, sorted in increasing alphabetic order by WORD. Each WORD entry begins with a line consisting of a single WORD, followed by a single tab character and the number of distinct PAGEs in which that WORD was found (PageCount), followed by a single newline. That line is then followed by a sequence of newline-separated URL entries for that WORD. Each URL entry consists of a single tab, followed by the canonical-form URL (see Section 5.4.4), a tab, the WORD count for that WORD/URL pair (WordURLCount), a tab, and finally the TF/IDF score for that WORD/URL pair represented with two significant digits (i.e., via the java.util.Formatter "%5.2g" specifier, or equivalent).

More formally, the format of the REVERSE INDEX report is given by the following BNF grammar:

        Report := WordEntry*
        WordEntry := WORD "\t" PageCount "\n" URLEntry+
        URLEntry := "\t" CanonicalURL "\t" WordURLCount "\t" TFIDF "\n"
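As an illustration only, the grammar above could be emitted by something like the following Java sketch. The class name ReverseIndexSketch, the Posting holder, and the nested-map representation of the index are hypothetical and are not part of this specification. The sketch also assumes that natural String ordering is an acceptable reading of "increasing alphabetic order", and it sorts URL entries by URL only to make the output deterministic, since the ordering of URL entries is not stated above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a REVERSE INDEX report writer; names are illustrative.
final class ReverseIndexSketch {

    // Hypothetical holder for one WORD/URL pair: WordURLCount and TF/IDF score.
    static final class Posting {
        final int wordUrlCount;
        final double tfidf;

        Posting(int wordUrlCount, double tfidf) {
            this.wordUrlCount = wordUrlCount;
            this.tfidf = tfidf;
        }
    }

    // Renders: Report := WordEntry*
    // The index maps WORD -> (canonical URL -> Posting).
    static String render(Map<String, Map<String, Posting>> index) {
        StringBuilder out = new StringBuilder();
        // Copying into a TreeMap yields WORDs in increasing natural String order
        // (assumed here to satisfy "increasing alphabetic order").
        for (Map.Entry<String, Map<String, Posting>> word : new TreeMap<>(index).entrySet()) {
            Map<String, Posting> pages = new TreeMap<>(word.getValue());
            // WordEntry := WORD "\t" PageCount "\n" URLEntry+
            out.append(word.getKey()).append('\t').append(pages.size()).append('\n');
            for (Map.Entry<String, Posting> page : pages.entrySet()) {
                Posting p = page.getValue();
                // URLEntry := "\t" CanonicalURL "\t" WordURLCount "\t" TFIDF "\n"
                // String.format delegates to java.util.Formatter; Locale.ROOT keeps
                // the decimal separator a '.' regardless of the default locale.
                out.append(String.format(Locale.ROOT, "\t%s\t%d\t%5.2g\n",
                        page.getKey(), p.wordUrlCount, p.tfidf));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data, purely for demonstration.
        Map<String, Map<String, Posting>> index = new HashMap<>();
        Map<String, Posting> pages = new HashMap<>();
        pages.put("http://a.example/", new Posting(3, 0.5));
        index.put("banana", pages);
        System.out.print(render(index));
    }
}
```

The explicit "\n" (rather than the platform-dependent "%n") is used because the grammar names "\n" literally. The URLs passed in are assumed to be in canonical form already (Section 5.4.4).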

The name of the REVERSE INDEX report file MUST be

        BASE-NAME ".words" ".rpt"

Terran Lane 2005-08-23