Appendix A: The TF/IDF Scoring Function

It is relatively straightforward to make a search engine that retrieves documents in random order, but it's much harder to make it return documents in a relevant order. The real difference between Google and all of the search engines that lost is that Google had a much more effective sorting criterion. In this project, you're not going to implement Google's sort criterion (the ``PageRank'' algorithm), but you will implement a reasonably effective scoring function, the ``term frequency/inverse document frequency'' (TF/IDF) function.

Let a query be a sequence of $ m$ words, $ \mathbf{q}=\LR{q_{1},q_{2},\dotsc ,q_{m}}$, and let the set of all possible queries be $ Q$. Note that, with respect to scoring, the AND/OR query language and the parse tree (Appendix C) are irrelevant: the set of documents matching the entire AND/OR query has already been selected, so we can regard the terms of the AND/OR query as a simple sequence of words. Now, given a set of $ N$ documents, $ D=\{d_{1}, d_{2}, \dotsc , d_{N}\}$, a scoring function $ s()$ assigns a real-valued score to each document/query pair: $ s:D\Cross Q \Onto \Reals$. Given such a score, it is straightforward to sort the document set into decreasing order by score. If the scoring function is good, then the documents that the user finds most relevant will be returned at the top of the list.
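
As a rough illustration of the sorting step, the sketch below ranks a list of documents by decreasing score for an arbitrary scoring function. The names RankingSketch and rank, and the choice to pass $ s()$ as a ToDoubleBiFunction, are assumptions made for this example only; they are not part of the project's required design.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper class; D and Q are placeholder document and query types.
class RankingSketch {
    /**
     * Returns a new list holding the given documents in decreasing order
     * of s(document, query). The input list is left unmodified.
     */
    static <D, Q> List<D> rank(List<D> docs, Q query, ToDoubleBiFunction<D, Q> s) {
        List<D> ranked = new ArrayList<>(docs);
        // comparingDouble orders ascending, so reverse it to put high scores first.
        // The score is recomputed on every comparison, which is acceptable for a
        // sketch but worth caching if s() is expensive.
        ranked.sort(Comparator.comparingDouble((D d) -> s.applyAsDouble(d, query)).reversed());
        return ranked;
    }
}
```

Any scoring function with the shape $ s:D\Cross Q \Onto \Reals$ can be plugged in here, including the TF/IDF function defined in this appendix.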

Suppose that a total of $ k$ distinct words, $ t_{1},\dotsc ,t_{k}$, occur across all documents. Let $ w_{ij}$ be the number of occurrences of word $ i$ in document $ d_{j}$, and let $ c_{i}$ be the number of documents in which word $ i$ is found. Then the TF/IDF score assigned to document $ d_{j}$ with respect to a query $ \mathbf{q}=\LR{q_{1},\dotsc ,q_{m}}$ is defined as follows, where the index $ i$ runs over the query terms, so that $ w_{ij}$ and $ c_{i}$ are the counts for query word $ q_{i}$:

$\displaystyle s_{\mathrm{TF/IDF}}(d_{j},\mathbf{q})=\sum_{i=1}^{m} w_{ij}\ln\left(\frac{N}{c_{i}}\right)$ (1)

Essentially, Equation 1 says that a document is scored more highly if it contains many occurrences of a query term (the $ w_{ij}$ factor), while the contribution of word $ i$ is discounted if it is a common term (the log fraction factor). In the extreme case of a word that appears in every document, $ c_{i}=N$, the log is zero, and the word contributes nothing to the score. The intuition is that very common terms (e.g., ``the'' or ``computer'') give you very little indication of the true relevance of the document, while rare words are much stronger indicators of relevance. Note that, in principle, you could skip the query search phase and simply apply TF/IDF directly to all PAGEs in the WEB DATABASE and pick off the top candidates. This would, however, be very expensive and is well beyond the scope of this project.
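
Equation 1 can be sketched in Java roughly as follows. This is a minimal illustration, not the required project design: the class name TfIdfSketch, the method score, and the use of plain maps for the per-document counts $ w_{ij}$ and the document frequencies $ c_{i}$ are all assumptions made for the example. Treating a query word that appears in no document as contributing zero is also an assumption, chosen to avoid dividing by zero.

```java
import java.util.List;
import java.util.Map;

// Hypothetical class illustrating Equation 1; all names are for this example only.
class TfIdfSketch {
    /**
     * Computes the TF/IDF score of one document for one query.
     *
     * queryTerms  the query words q_1..q_m
     * termCounts  word -> number of occurrences in this document (w_ij)
     * docFreq     word -> number of documents containing the word (c_i)
     * numDocs     total number of documents (N)
     */
    static double score(List<String> queryTerms,
                        Map<String, Integer> termCounts,
                        Map<String, Integer> docFreq,
                        int numDocs) {
        double total = 0.0;
        for (String term : queryTerms) {
            int w = termCounts.getOrDefault(term, 0);
            int c = docFreq.getOrDefault(term, 0);
            // A term missing from this document adds nothing (w = 0). A term
            // missing from every document (c = 0) is skipped by assumption,
            // since N / c would otherwise divide by zero.
            if (w == 0 || c == 0) {
                continue;
            }
            total += w * Math.log((double) numDocs / c);
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up counts: 4 documents, "the" occurs in all of them, "lambda" in one.
        Map<String, Integer> termCounts = Map.of("the", 10, "lambda", 3);
        Map<String, Integer> docFreq = Map.of("the", 4, "lambda", 1);
        // 10 * ln(4/4) + 3 * ln(4/1) = 3 * ln(4), roughly 4.16
        System.out.println(score(List.of("the", "lambda"), termCounts, docFreq, 4));
    }
}
```

In the made-up example, the ten occurrences of ``the'' contribute nothing because the word appears in every document, while the three occurrences of the rarer word account for the entire score.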

Now you have enough mathematical background to implement the search engine. The rest is Java...

Terran Lane 2005-08-23