CS 361: Semester Project
The goal of the project is to learn about experimental studies
of the performance of data structures. To this end, you
need to propose a project, collect software to support it, design
experiments, implement drivers to run these experiments and collect
data, analyze the data, and write a report presenting your results
and explaining (inasmuch as possible) your observations.
What Form Can A Project Take?
The simplest form for this project is a comparison between two to three
different data structures for the same ADT, in terms of various
performance measures. The experiments may be entirely based on
random generation of data, but they can also be driven by an existing
application you already have or use data sets collected from the web
(e.g., you can test dictionary lookup by inserting/looking up
every word in the standard Unix spell dictionary, or in one of the many
full-text classic novels available on the web).
More specialized projects can probe the details of functioning of a single
data structure: for instance, you can investigate the expected number of
rotations in an insertion or deletion in an AVL tree, the expected number
of levels through which an element is sifted in a binary heap during insertion
or deletion, etc. Another possibility is to study the cache behavior of
various implementations of the same data structure -- modify the code to
allocate storage differently in a tree, or compare linear probing (next
location in the table) with double hashing or quadratic probing in terms
of both running time and number of probes, testing to see if there is a
crossover value. Another type of specialized project takes a single
data structure and compares several (say five or so) variants in terms of
their performance; for instance, you could compare five different probing
schemes for collision resolution in a hash table.
You should plan to choose a project by the end of March. Here are some possible
choices for the simplest type (compare various data structures for the same
ADT):
-
ADT dictionary, highly dynamic context (almost as many insertions and deletions
as there are lookups): compare plain binary search trees with two of AVL trees,
red-black trees, splay trees, tries, skip lists, and randomized binary search trees (treaps).
-
ADT dictionary, minimally dynamic context (far more lookups than insertions or
deletions): compare three of plain binary search trees, AVL trees,
red-black trees, splay trees, tries, skip lists, randomized binary search trees (treaps),
and various hash tables (with resolution inside or outside the table).
-
ADT dictionary, minimally dynamic context (far more lookups than insertions or
deletions): compare various hash tables (with resolution inside or outside the
table) among themselves.
-
ADT ordered list: compare lists with move-to-front heuristics with
plain binary search trees and one of splay trees, AVL trees, red-black trees,
skip lists, or randomized binary search trees (treaps).
-
ADT priority queue: compare binary heaps with two of leftist trees,
binomial heaps, skew heaps, Fibonacci heaps, pairing heaps, and relaxed heaps.
You can easily code hash tables (of any kind) yourself; ditto for lists with
move-to-front heuristics, plain binary search trees, randomized binary
search trees (treaps), binary heaps,
binomial heaps, skew heaps, trivial pairing heaps, and leftist trees.
I recommend you use software off the web for AVL trees, red-black trees, splay
trees, tries, skip lists, Fibonacci heaps, sophisticated pairing heaps,
and relaxed heaps.
Designing A Project
Once you have selected a project, start collecting software for it while
deciding what you want to measure. Running time is an obvious measure
(you can measure it with the time command for the whole program or,
better, with the getrusage OS call for whichever selected portion of
the code you want, more details on the timing page);
since the running time will be very small, be sure
to measure it on large datasets for many repetition, so as to obtain reasonable
precision. Separate from running time, you can instrument your
code to count various types of low-level actions, such as pointer
dereferencings or memory accesses, or various types of high-level algorithmic
steps, such as the number of rotations, probes, key comparisons, etc.
(Naturally, the instrumented version should not be used for timing!)
You need to decide what conditions you will be using for measurement:
purely random data? stable conditions (e.g., a large dictionary with
equal probability of insertion or deletion) or transient ones (long
chain of insertions followed by a long chain of deletions)? real data
with real frequencies (e.g., look up words in a novel or in large email
folders)? the relative importance of the operations in the ADT (in
a dictionary, lookup versus insertion/deletion; in a priority queue,
priority changes and queue meldings versus insert and deletemin)?
You want to plan a course of action that will not require excessive time,
yet will give you sufficient information to write a real report.
Conducting The Experiments
Then code what you have to code; this may include scripts to analyze, filter,
refine, and plot the data. Early runs will probably prompt you to collect
data you had not planned on collecting. Once you are fairly confident
that you have all the pieces in place, start collecting data in earnest --
run fair numbers of experiments on a fair sample of different size so
as to be able to plot curves and also so as to have some estimate of
mean and standard deviation at each plot point.
Analyzing The Results
Finally reason about the data: why have the numbers come out the way they did?
how well were they predicted by theory? are there surprises? are some
of the effects explainable by cache performance? how does the data affect
the performance (especially if you use real data)?
Writing It All Up
Last, but not least, write a report (your report is the only thing I will
look at). The report should state what you set out to investigate,
what software you used and from where, the measurements you took, the
type of experiments you ran, and present and discuss the results. The results
must be presented in graphical form (plots, bar or pie charts, scatter plots,
etc.), not as tables of numbers! Your discussion should present
the reasoning mentioned above: how do your observations correlate with
theory? what is the influence of specific data? are there unexplainable
anomalies for which you might want to hazard a guess? are your data conclusive
(i.e., is one choice better than the others in the context you chose)?
I expect that a typical report will have 6-12 pages of text, plus several
pages of figures.