CS 361: Semester Project
The goal of the project is to learn about experimental studies of the
performance of data structures and/or algorithms. To this end, you
need to propose a project, write and collect software to support it,
design experiments, implement drivers to run these experiments and
collect data, analyze the data, and write a report presenting your
results and explaining (inasmuch as possible) your observations.
What Form Can A Project Take?
The simplest form for this project is a comparison of performance
measures for two to five different data structures for the
same ADT. For groups composed of a large number of students, I will
expect a comparison of a larger number of data structures, whereas for
a group composed of just one or two students, two or three data
structures will suffice. The experiments may be entirely based on
random generation of data, but they can also be driven by an existing
application you already have or use data sets collected from the web
(e.g., you can test dictionary lookup by inserting/looking up every
word in the standard Unix spell dictionary, or in one of the many
full-text classic novels available on the web).
More specialized projects can probe the details of functioning of a
single data structure: for instance, you can investigate the expected
number of rotations in an insertion or deletion in an AVL tree, the
expected number of levels through which an element is sifted in a
binary heap during insertion or deletion, etc. Another possibility is
to study the cache behavior of various implementations of the same
data structure -- write code to allocate storage differently in a
tree, or compare linear probing (next location in the table) with
double hashing or quadratic probing in terms of both running time and
number of probes, testing to see if there is a crossover value.
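As an illustration of the probing comparison mentioned above, here is a minimal Python sketch of an instrumented open-addressing table that counts probes per insertion. The function names, the table size, the load factor, and the triangular-number form of quadratic probing are all assumptions made for this example; your own project may use a different language and different schemes.

```python
import random

def probe_sequence(home, i, size, scheme):
    """Return the table index tried on the i-th probe (i = 0, 1, 2, ...)."""
    if scheme == "linear":
        return (home + i) % size
    if scheme == "quadratic":
        # Triangular-number offsets visit every slot when size is a power of two.
        return (home + i * (i + 1) // 2) % size
    raise ValueError(f"unknown scheme: {scheme}")

def insert_counting_probes(table, key, scheme):
    """Insert key into an open-addressing table; return the number of probes used."""
    size = len(table)
    home = hash(key) % size
    for i in range(size):
        slot = probe_sequence(home, i, size, scheme)
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return i + 1
    raise RuntimeError("table is full")

def average_probes(scheme, size=1 << 12, load=0.9, seed=1):
    """Average probes per insertion while filling a table to the given load factor."""
    rng = random.Random(seed)          # fixed seed so the run can be repeated
    table = [None] * size              # size is assumed to be a power of two
    n = int(size * load)
    keys = rng.sample(range(10 ** 9), n)
    total = sum(insert_counting_probes(table, k, scheme) for k in keys)
    return total / n
```

Calling average_probes("linear") and average_probes("quadratic") over a range of load factors gives the probe counts to plot; running times would be measured separately on an uninstrumented version.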
Another type of specialized project takes a single data structure and
compares several (say five or so) variants in terms of their
performance; for instance, you could compare five or more different
probing schemes for collision resolution in a hash table.
You should plan to choose a project by the end of March. Here are
some possible choices for the simplest type (compare various data
structures for the same ADT):
You can easily code hash tables (of any kind) yourself; ditto for
lists with move-to-front heuristics, plain binary search trees,
randomized binary search trees (treaps), binary heaps, binomial heaps,
skew heaps, trivial pairing heaps, skip lists, and leftist trees. I
prefer that, if possible, you code up these data structures yourself,
and include your source code in the report. However, for more
complicated data structures such as red-black trees, splay trees,
tries, Fibonacci heaps, sophisticated pairing heaps, and relaxed
heaps, I recommend you use software off the web.
ADT dictionary, highly dynamic context (almost as many insertions and
deletions as there are lookups): compare plain binary search trees
with at least two of: AVL trees, red-black trees, splay trees, tries,
skip lists, and randomized binary search trees (treaps).
ADT dictionary, minimally dynamic context (far more lookups than
insertions or deletions): compare at least three of plain binary
search trees, AVL trees, red-black trees, splay trees, tries, skip
lists, randomized binary search trees (treaps), and various hash
tables (with resolution inside or outside the table).
ADT dictionary, minimally dynamic context (far more lookups than
insertions or deletions): compare various hash tables (with resolution
inside or outside the table) among themselves.
ADT ordered list: compare lists with move-to-front heuristics with
plain binary search trees and at least one of splay trees, AVL trees,
red-black trees, skip lists, and randomized binary search trees
(treaps).
ADT priority queue: compare binary heaps with at least two of leftist
trees, binomial heaps, skew heaps, Fibonacci heaps, pairing heaps, and
Designing A Project
Once you have selected a project, start writing and collecting
software for it while deciding what you want to measure. Running time
is an obvious measure: you can measure it with the time command for
the whole program or, better, with the getrusage OS call for whichever
portion of the code you select (see the timing page for more details).
Since the running time of individual operations will be very small, be
sure to measure it on large datasets over many repetitions, so as to
obtain reasonable precision.
Separate from running time, you can instrument your code to
count various types of low-level actions, such as pointer
dereferencings or memory accesses, or various types of high-level
algorithmic steps, such as the number of rotations, probes, key
comparisons, etc. (Naturally, the instrumented version should not be
used for timing!) You need to decide what conditions you will be
using for measurement: purely random data? stable conditions (e.g., a
large dictionary with equal probability of insertion or deletion) or
transient ones (long chain of insertions followed by a long chain of
deletions)? real data with real frequencies (e.g., look up words in a
novel or in large email folders)? the relative importance of the
operations in the ADT (in a dictionary, lookup versus
insertion/deletion; in a priority queue, priority changes and queue
meldings versus insert and deletemin)? You want to plan a course of
action that will not require excessive time, yet will give you
sufficient information to write a real report.
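The two kinds of measurement described above can be sketched as follows, in Python and under stated assumptions. time.process_time() is used here as a portable stand-in for CPU time (on Unix, resource.getrusage is the direct counterpart of the OS call mentioned above), and CountingKey is a hypothetical wrapper invented for this example that counts key comparisons.

```python
import time

def cpu_time_per_call(operation, repetitions=1000):
    """Return the average CPU seconds per call of operation() over many repetitions."""
    start = time.process_time()
    for _ in range(repetitions):
        operation()
    elapsed = time.process_time() - start
    return elapsed / repetitions

class CountingKey:
    """Wraps a key and counts every comparison made on it (one class-wide counter)."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        CountingKey.comparisons += 1
        return self.value < other.value

    def __eq__(self, other):
        CountingKey.comparisons += 1
        return self.value == other.value
```

Reset CountingKey.comparisons to zero before each experiment and read it afterwards. Keep the two measurements in separate runs: time the structure with plain keys, and count comparisons with wrapped keys.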
Conducting The Experiments
After you have designed the experiments, code what you have to code;
this may include scripts to analyze, filter, refine, and plot the
data. Early runs will probably prompt you to collect data you had not
planned on collecting. Once you are fairly confident that you have
all the pieces in place, start collecting data in earnest -- run a fair
number of experiments on a fair sample of different sizes, so that you
can plot curves and also estimate the mean and standard deviation at
each plot point.
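A minimal sketch of the per-point summary, assuming your driver has already collected one measurement per trial for each data size. The function name and the dictionary layout are choices made for this example only.

```python
import statistics

def summarize(samples_by_size):
    """Map each size n to (mean, sample standard deviation, number of trials).

    samples_by_size: {n: [measurement from trial 1, trial 2, ...]}
    Each size needs at least two trials for the standard deviation to exist.
    """
    summary = {}
    for n in sorted(samples_by_size):
        samples = samples_by_size[n]
        summary[n] = (statistics.mean(samples),
                      statistics.stdev(samples),
                      len(samples))
    return summary
```

The means become the plotted points, and the standard deviations can be drawn as error bars.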
Analyzing The Results
Finally, reason about the data: why have the numbers come out the way
they did? how well were they predicted by theory? are there
surprises? are some of the effects explainable by cache performance?
how does the data affect the performance (especially if you use real
data)?
Writing It All Up
Last, but not least, write a report (your report is the only thing I
will look at). The report should state what you set out to
investigate, what software you wrote yourself, what software you used
from the web and where you got it, the measurements you took, the type
of experiments you ran, and present and discuss the results. The
results must be presented in graphical form (plots, bar or pie charts,
scatter plots, etc.), not as tables of numbers! Your
discussion should present the reasoning mentioned above: how do your
observations correlate with theory? what is the influence of specific
data? are there unexplainable anomalies for which you might want to
hazard a guess? are your data conclusive (i.e., is one choice better
than the others in the context you chose)? I expect that a typical
report will have 6-12 pages of text, plus several pages of figures.
For groups composed of a larger number of students, a longer and more
comprehensive report will be expected, whereas for groups composed of
one or two students, a shorter report will suffice.
This report on "Cache Performance of
Indexing Data Structures" by Joshua MacDonald and Ben Zhao is a good
example of what I'm looking for in terms of the scope and quality of
your projects. This is a very nice paper on the effect of caching on
index structures which was turned in as the final report for a class
similar to ours at UC Irvine.
Top n Project Mistakes
Your report should contain enough information to convince a "skeptical
reader" that your hypothesis is correct. As much as possible, try to
objectively evaluate your report and experiments, to decide if they
will convince a skeptical reader.
Generating inappropriate test suites: Make sure that your test suite
is sufficiently challenging. For example, if you're testing a binary
tree, do not simply insert a bunch of data, and then delete it in the
same order. This will make the delete operation look much too good.
Instead, try several different types of tests, one of which should be
randomly interleaving inserts with deletes.
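One way to generate such a randomly interleaved test is sketched below in Python. The function name, the default insertion probability, and the key range are assumptions for this example; the key range must be much larger than the number of operations, or the generator will struggle to find fresh keys.

```python
import random

def interleaved_workload(n_ops, insert_prob=0.5, key_range=10 ** 6, seed=42):
    """Return a list of ("insert", key) / ("delete", key) operations.

    Inserts always add a key not currently present; deletes always remove
    a key chosen uniformly at random among those currently present.
    """
    rng = random.Random(seed)   # fixed seed so the workload can be regenerated
    present = []                # keys currently in the structure
    position = {}               # key -> index in present, for O(1) removal
    ops = []
    while len(ops) < n_ops:
        if not present or rng.random() < insert_prob:
            key = rng.randrange(key_range)
            if key in position:
                continue        # skip duplicates and draw again
            position[key] = len(present)
            present.append(key)
            ops.append(("insert", key))
        else:
            i = rng.randrange(len(present))
            key = present[i]
            last = present.pop()
            if i < len(present):        # fill the hole with the last key
                present[i] = last
                position[last] = i
            del position[key]
            ops.append(("delete", key))
    return ops
```

Feeding the same list of operations to every structure under test keeps the comparison fair, and changing insert_prob moves the workload between growing, stable, and shrinking conditions.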
Insufficient data: You need to take enough data to convince a
skeptical reader that the behavior really is e.g. Theta(n log n), and
get a reasonable estimate of the constant. If n is too small, you'll
get a "time" of zero, so you need to take enough data at larger values
of n --- ranging over at least a factor of 100. You should also, when
possible, take the average over many independent trials for each value
of n (at least 20). This will prevent the data from fluctuating
wildly. If your data is so noisy that the n log n behavior isn't
convincing, you haven't passed the "skeptical reader" test.
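A simple check of the claimed growth rate, sketched in Python under the assumption that you already have a mean total time for each value of n: divide each time by n log n and see whether the quotient settles to a constant. The function names are made up for this example.

```python
import math

def nlogn_ratios(timings):
    """timings: {n: mean total seconds}. Return {n: T(n) / (n * log2(n))}."""
    return {n: t / (n * math.log2(n)) for n, t in sorted(timings.items())}

def ratio_spread(ratios):
    """Largest ratio divided by smallest; a value near 1 suggests a single constant."""
    values = list(ratios.values())
    return max(values) / min(values)
```

If the ratios are roughly flat over values of n spanning a factor of 100 or more, their common value is an estimate of the constant. A steady drift upward or downward suggests a different growth rate.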
Irreproducible Results: You need to explain enough about how the
experiment was performed so that another person could read your report
and, from the report *alone*, reproduce your experiments and results.
For instance, you need to state how many trials you ran for each value
of n, how the test data are collected or generated,
what operations are used in the test, how many, in what sequence and
mix, what kind of data (timings, structural values) are collected and
how, etc. In dictionaries, you need to distinguish successful from
unsuccessful search; in all cases, you need to distinguish insertion
from deletion and from search.
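Keeping successful and unsuccessful searches apart is easiest if the two query sets are generated separately. The Python sketch below is one possible way to do it; the function name and parameters are assumptions for this example, and key_range must be larger than the number of stored keys so that absent keys exist.

```python
import random

def make_queries(stored_keys, n_queries, key_range, seed=7):
    """Return (hits, misses): queries known to succeed and queries known to fail."""
    rng = random.Random(seed)           # report the seed so others can regenerate
    stored = set(stored_keys)
    pool = sorted(stored)               # fixed order keeps the draw reproducible
    hits = [rng.choice(pool) for _ in range(n_queries)]
    misses = []
    while len(misses) < n_queries:
        key = rng.randrange(key_range)
        if key not in stored:           # keep only keys absent from the structure
            misses.append(key)
    return hits, misses
```

Timing the two lists separately gives one curve for successful search and one for unsuccessful search.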
Improper plotting of data: the data should be normalized and the plots scaled so
as best to show the behavior of the structure and thus be able to
compare it to the analytical prediction. For instance, you should
plot search/insertion/deletion times per operation
rather than cumulatively and you should use a log scale on the data
size, so that the theoretically predicted curve would be a straight
line -- it's easy to detect a deviation from a straight line, but hard
to look at a curve and say "this is logarithmic" (it could be some
type of root and look much the same).
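The straight-line test can also be done numerically. The sketch below, in Python with a hypothetical function name, fits per-operation cost against log2(n) by ordinary least squares and reports the largest deviation from the fitted line.

```python
import math

def fit_line_vs_log_n(per_op_costs):
    """Least-squares fit of cost = a + b * log2(n).

    per_op_costs: {n: mean cost per operation}, with at least two sizes.
    Returns (a, b, largest absolute residual).
    """
    sizes = sorted(per_op_costs)
    xs = [math.log2(n) for n in sizes]
    ys = [per_op_costs[n] for n in sizes]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    worst = max(abs(y - (a + b * x)) for x, y in zip(xs, ys))
    return a, b, worst
```

A residual that is small relative to the measured costs supports logarithmic behavior. Residuals that are positive at both ends and negative in the middle are the numerical sign of a curve, such as a root, that bends away from the line.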
Bad plots: Every single plot you include in your report should have
labels for both the x and y axes. Every one! The axis labels should
be located on the plots themselves. In addition, each plot should
have a title, a figure number, and a caption below it describing the
plot, and each plot should be referred to explicitly in the text of the
report!
Insufficient Evaluation of data: You should use both timing measures
and structural measures to evaluate performance. Structural measures
include counts of pointer dereferences, memory accesses, etc. Timing
alone may simply reflect errors in the code or errors in data
generation or unusual data patterns generated by your test driver.
The same may go for structural measures alone. But both together
generally tell a better story and may help you identify bugs or
explain apparently strange behavior. Also, don't be too trusting:
be skeptical of the data you collect at first, and before you trust it,
at least check that the data makes sense. If it does not, your code
probably has some kind of bug.
Using unreasonable test settings: many students use values of n too
small to show much or use a single test pattern that puts one or two of the
structures at a huge disadvantage. (One example is testing splay
trees against other trees in a setting where each item is searched for
exactly once; not a bad test in itself, but it cannot be used alone
for a sweeping conclusion, since it clearly is the worst possible
situation for a splay tree.)
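To avoid relying on a single access pattern, you can generate several and run every structure on each. The Python sketch below produces the "each item exactly once" pattern and, as a contrast, a skewed pattern in which a few keys are requested far more often than the rest. The function names and the 1/rank weighting are assumptions chosen for this example.

```python
import random

def each_once(keys, seed=0):
    """Every key searched for exactly once, in random order."""
    rng = random.Random(seed)
    order = list(keys)
    rng.shuffle(order)
    return order

def skewed(keys, n_accesses, seed=0):
    """Skewed accesses: the key at position r is drawn with weight 1 / (r + 1)."""
    rng = random.Random(seed)
    keys = list(keys)
    weights = [1.0 / (r + 1) for r in range(len(keys))]
    return rng.choices(keys, weights=weights, k=n_accesses)
```

Reporting results for both patterns side by side shows how much each structure depends on locality in the access sequence.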
Staying slave to bad code: not all code on the web works; of the code
that works, much still has small bugs; and even the bug-free code may be
very inefficient. Having to curtail an experiment because the code
you downloaded seg-faults (or similar trouble) is not acceptable --
you can debug the code, write your own code, download other code, or
drop that structure and use another.