It is relatively straightforward to make a search engine that retrieves documents in random order, but it's much harder to make it return documents in a relevant order. The real difference between Google and all of the search engines that lost is that Google had a much more effective sorting criterion. In this project, you're not going to implement Google's sort criterion (the ``PageRank'' algorithm), but you will implement a reasonably effective scoring function, the ``term frequency/inverse document frequency'' (TF/IDF) function.
Let a query be a sequence of $k$
words, $q = \langle q_1, \ldots, q_k \rangle$,
and let the set of all possible queries
be $\mathcal{Q}$. Note that, for the purposes of scoring, the AND/OR query
language and the parse tree (Appendix C) are irrelevant: the set of
documents matching the entire AND/OR query has already been selected,
so we can regard the terms of the AND/OR query as a simple sequence of
words. Now, given a set of $N$
documents, $D = \{d_1, \ldots, d_N\}$, a scoring function $S$
assigns a real-valued score $S(d_j, q)$
to each document/query pair:
$S : D \times \mathcal{Q} \to \mathbb{R}$. Given such a
score, it is straightforward to sort the document set into decreasing
order by score. If the scoring function is good, then the documents
that the user finds most relevant will be returned at the top of the
list.
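The sorting step can be sketched in a few lines of Java. This is only an illustration, not the project's required design: the class name `RankByScore`, the method `rank`, and the page names and score values in `main` are all made up for the example. The query is assumed to be fixed, so the scoring function is passed in as a function of the document alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical helper for illustration; not part of the project code.
class RankByScore {

    // Returns a new list with the documents in decreasing order by score.
    // The input list is left unmodified.
    static <D> List<D> rank(List<D> docs, ToDoubleFunction<D> score) {
        List<D> ranked = new ArrayList<>(docs);
        ranked.sort(Comparator.<D>comparingDouble(score).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up scores standing in for S(d, q) with the query held fixed.
        Map<String, Double> scores = new HashMap<>();
        scores.put("page-a", 0.5);
        scores.put("page-b", 2.0);
        scores.put("page-c", 1.0);

        List<String> docs = Arrays.asList("page-a", "page-b", "page-c");
        List<String> ranked = rank(docs, d -> scores.get(d));
        System.out.println(ranked); // highest-scoring page comes first
    }
}
```

Because `List.sort` is stable, documents with equal scores keep their original relative order.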
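Suppose that there are a total of $M$
unique words,
$w_1, \ldots, w_M$, found across the whole document set. Let the number of
occurrences of word $w_i$
in document $d_j$
be $n_{ij}$. Further, let the
number of documents in which word $w_i$
is found be $N_i$. Then the
TF/IDF score assigned to document $d_j$
with respect to a query $q$
is defined to be:
\begin{equation}
S(d_j, q) = \sum_{w_i \in q} n_{ij} \log\left(\frac{N}{N_i}\right)
\end{equation}
where $N$ is the total number of documents.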
Essentially, Equation 1 says that a document is scored
more highly if it contains many occurrences of a query term (the
$n_{ij}$ factor), while it is penalized slightly if word $w_i$
is a common term
(the log fraction factor). The intuition is that very common terms
(e.g., ``the'' or ``computer'') give you very little indication of
the true relevance of the document, while a match on a rare word is a
much stronger indication that the document is important. Note that, in
principle, you could skip
the query search phase and simply apply TF/IDF directly to all
PAGEs in the WEB DATABASE and pick off the top candidates. This
would, however, be very expensive and is well beyond the scope of
this project.
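To make the scoring rule concrete, here is a minimal Java sketch. It assumes the score is the sum, over the query words, of the word's count in the document times the log of (total documents / documents containing the word), and it assumes the natural logarithm. The class `TfIdfSketch`, its method names, and the representation of a document as a list of already-tokenized words are all hypothetical choices made for this example; they are not the project's PAGE or WEB DATABASE classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration; not part of the project code.
class TfIdfSketch {

    // For each word, counts the number of documents that contain it (N_i).
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // The set removes repeats so each document counts at most once per word.
            for (String word : new HashSet<>(doc)) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    // Sums n_ij * log(N / N_i) over the words of the query.
    // Assumes the natural logarithm; the base only rescales every score.
    static double score(List<String> doc, List<String> query,
                        Map<String, Integer> df, int totalDocs) {
        double total = 0.0;
        for (String word : query) {
            int count = Collections.frequency(doc, word); // n_ij
            Integer docsWithWord = df.get(word);          // N_i
            if (count == 0 || docsWithWord == null) {
                continue; // the word contributes nothing to this document
            }
            total += count * Math.log((double) totalDocs / docsWithWord);
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up three-document collection.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "cat", "sat"),
            Arrays.asList("the", "dog"),
            Arrays.asList("the", "cat", "cat"));
        Map<String, Integer> df = documentFrequencies(docs);
        List<String> query = Arrays.asList("the", "cat");
        for (List<String> doc : docs) {
            System.out.println(doc + " -> " + score(doc, query, df, docs.size()));
        }
    }
}
```

In this toy collection ``the'' appears in every document, so its log factor is log(1) = 0 and it contributes nothing to any score; only ``cat'' separates the documents.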
Now you have enough mathematical background to implement the search engine. The rest is Java...
Terran Lane 2005-09-21