CS 361: Semester Project
The goal of the project is to learn about experimental studies of the
performance of data structures and/or algorithms. To this end, you
need to propose a project, write and collect software to support it,
design experiments, implement drivers to run these experiments and
collect data, analyze the data, and write a report presenting your
results and explaining (inasmuch as possible) your observations.
What Form Can A Project Take?
The simplest form for this project is a comparison of performance
measures for two to five different algorithms that address
the same problem (e.g., sorting), or a comparison of performance measures for
two to five data structures for the same abstract data type. For
groups composed of a large number of students, I will expect a
comparison of a larger number of algorithms or data structures,
whereas for a group composed of just one or two students, two or three
algorithms or data structures will suffice. The experiments may be
entirely based on random generation of data, but they can also be
driven by an existing application you already have or use data sets
collected from the web (e.g., you can test dictionary lookup by
inserting/looking up every word in the standard Unix spell dictionary,
or in one of the many full-text classic novels available on the web).
More specialized projects can probe the details of functioning of a
single algorithm or data structure: for instance, you can investigate
the expected number of rotations in an insertion or deletion in an AVL
tree, the expected number of levels through which an element is sifted
in a binary heap during insertion or deletion, efficiency of quicksort
under many different rules for choosing the pivot, etc. Another
possibility is to study the cache behavior of various implementations
of the same data structure or algorithm: write code to allocate
storage differently in a tree, or compare linear probing (next
location in the table) with double hashing or quadratic probing in
terms of both running time and number of probes, testing to see if
there is a crossover value. Another type of specialized project takes
a single data structure and compares several (say five or so) variants
in terms of their performance; for instance, you could compare five or
more different probing schemes for collision resolution in a hash
table.
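As a rough illustration of the probing comparison mentioned above, here is a minimal sketch that counts probes per insertion under linear probing and double hashing. The function names, the table size, and the second hash function (1 + key mod (m - 1)) are assumptions chosen for the example, not a prescribed design; the table size is assumed to be prime so that every step size visits every slot.

```python
import random

def probes_to_insert(table, key, step):
    """Insert key into an open-addressing table; return the number of probes used."""
    m = len(table)
    i = key % m
    probes = 1
    while table[i] is not None:
        i = (i + step) % m        # move to the next slot in the probe sequence
        probes += 1
    table[i] = key
    return probes

def average_probes(m, load, scheme, seed=0):
    """Fill a table of prime size m up to the given load factor (< 1);
    return the mean number of probes per insertion."""
    rng = random.Random(seed)
    table = [None] * m
    n = int(load * m)
    keys = rng.sample(range(10 * m), n)   # distinct random integer keys
    total = 0
    for key in keys:
        # Linear probing always steps by 1; double hashing takes its step
        # from a second hash of the key (assumed form: 1 + key mod (m - 1)).
        step = 1 if scheme == "linear" else 1 + key % (m - 1)
        total += probes_to_insert(table, key, step)
    return total / n

m = 10007  # a prime table size, chosen arbitrarily for the sketch
for load in (0.5, 0.7, 0.9):
    print(load, average_probes(m, load, "linear"), average_probes(m, load, "double"))
```

Probe counts are only half of the comparison: running time would have to be measured separately, on an uninstrumented version, to see whether and where a crossover occurs.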
Here are some possible choices for a simple type of project (compare
various data structures for the same ADT):

ADT dictionary, highly dynamic context (almost as many insertions and
deletions as there are lookups): compare plain binary search trees
with at least two of: AVL trees, red-black trees, splay trees, tries,
skip lists, and randomized binary search trees (treaps).

ADT dictionary, minimally dynamic context (far more lookups than
insertions or deletions): compare at least three of plain binary
search trees, AVL trees, red-black trees, splay trees, tries, skip
lists, randomized binary search trees (treaps), and various hash
tables (with resolution inside or outside the table).

ADT dictionary, minimally dynamic context (far more lookups than
insertions or deletions): compare various hash tables (with resolution
inside or outside the table) among themselves.

ADT ordered list: compare lists with move-to-front heuristics with
plain binary search trees and at least one of splay trees, AVL trees,
red-black trees, skip lists, and randomized binary search trees
(treaps).

ADT priority queue: compare binary heaps with at least two of leftist
trees, binomial heaps, skew heaps, Fibonacci heaps, pairing heaps, and
relaxed heaps.
You can easily code hash tables (of any kind) yourself; ditto for
lists with move-to-front heuristics, plain binary search trees,
randomized binary search trees (treaps), binary heaps, binomial heaps,
skew heaps, trivial pairing heaps, skip lists, and leftist trees. I
prefer that, if possible, you code up these data structures yourself,
and include your source code in the report. However, for more
complicated data structures such as red-black trees, splay trees,
tries, Fibonacci heaps, sophisticated pairing heaps, and relaxed
heaps, I recommend you use software off the web.
Designing A Project
Once you have selected a project, start writing and collecting
software for it while deciding what you want to measure. Running time
is an obvious measure (you can measure it with the time
command for the whole program or, better, with the getrusage
OS call for whichever portion of the code you select; more
details are on the timing page). Since the
running time will be very small, be sure to measure it on large
datasets for many repetitions, so as to obtain reasonable precision.
Separate from running time, you should instrument your code
to collect structural measures. For example, you may want to
count various types of low-level actions, such as pointer
dereferences or memory accesses, or various types of high-level
algorithmic steps, such as the number of rotations, probes, key
comparisons, etc. (Naturally, the instrumented version should not be
used for timing!) You need to decide what conditions you will be
using for measurement: purely random data? stable conditions (e.g., a
large dictionary with equal probability of insertion or deletion) or
transient ones (long chain of insertions followed by a long chain of
deletions)? real data with real frequencies (e.g., look up words in a
novel or in large email folders)? the relative importance of the
operations in the ADT (in a dictionary, lookup versus
insertion/deletion; in a priority queue, priority changes and queue
meldings versus insert and delete-min)? You want to plan a course of
action that will not require excessive time, yet will give you
sufficient information to write a real report.
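The timing advice above can be sketched as follows, assuming a Unix system where Python's standard resource module exposes the getrusage call. The helper names cpu_seconds and time_per_call are made up for this example, and only user-mode CPU time is counted; whether to add system time is a choice you should make and report.

```python
import random
import resource  # Unix-only standard-library wrapper around getrusage(2)

def cpu_seconds():
    """User-mode CPU time consumed so far by this process, in seconds."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_utime

def time_per_call(operation, repetitions):
    """Run operation() `repetitions` times; return mean CPU seconds per call."""
    start = cpu_seconds()
    for _ in range(repetitions):
        operation()
    return (cpu_seconds() - start) / repetitions

# Time only the selected portion (the sort), not the data generation.
data = [random.random() for _ in range(100_000)]
print(time_per_call(lambda: sorted(data), repetitions=50))
```

Repeating the operation many times and dividing is what makes a very small running time measurable at all: a single call may finish in less than the clock's resolution.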
Conducting The Experiments
After you have designed the experiments, code what you have to code;
this may include scripts to analyze, filter, refine, and plot the
data. Early runs will probably prompt you to collect data you had not
planned on collecting. Once you are fairly confident that you have
all the pieces in place, start collecting data in earnest: run a fair
number of experiments on a fair sample of different sizes so as to be
able to plot curves and also so as to have some estimate of mean and
standard deviation at each plot point.
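One possible shape for such a data-collection driver is sketched below. The names summarize and comparisons_in_linear_search are hypothetical, and the measure shown is a stand-in for whatever your own experiment records (a timing, a probe count, a rotation count, and so on).

```python
import random
import statistics

def summarize(measure, sizes, trials, seed=0):
    """Return {n: (mean, stdev)} of measure(n, rng) over independent trials.
    Needs trials >= 2, since a sample standard deviation is undefined for one value."""
    rng = random.Random(seed)
    summary = {}
    for n in sizes:
        samples = [measure(n, rng) for _ in range(trials)]
        summary[n] = (statistics.mean(samples), statistics.stdev(samples))
    return summary

def comparisons_in_linear_search(n, rng):
    """Stand-in measure: comparisons needed to find a random key in a list of n keys."""
    return rng.randrange(n) + 1

for n, (mean, stdev) in summarize(comparisons_in_linear_search,
                                  sizes=[100, 1000, 10000], trials=30).items():
    print(n, mean, stdev)
```

Each (mean, stdev) pair corresponds to one plot point with its error bar; fixing the seed keeps the run reproducible.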
Analyzing The Results
Finally, reason about the data: why have the numbers come out the way
they did? how well were they predicted by theory? are there
surprises? are some of the effects explainable by cache performance?
how does the data affect the performance (especially if you use real
data)?
Writing It All Up
Last, but not least, write a report (your report is the only thing I
will look at). The report should state what you set out to
investigate, what software you wrote yourself, what software you used
from the web and where you got it, the measurements you took, the type
of experiments you ran, and present and discuss the results. The
results must be presented in graphical form (plots, bar or pie charts,
scatter plots, etc.), not as tables of numbers! Your
discussion should present the reasoning mentioned above: how do your
observations correlate with theory? what is the influence of specific
data? are there unexplainable anomalies for which you might want to
hazard a guess? are your data conclusive (i.e., is one choice better
than the others in the context you chose)? I expect that a typical
report will have 6-12 pages of text, plus several pages of figures.
For groups composed of a larger number of students, a longer and more
comprehensive report will be expected, whereas for groups composed of
one or two students, a shorter report will suffice.
Example Project
This report on "Cache Performance of
Indexing Data Structures" by Joshua MacDonald and Ben Zhao is a good
example of what I'm looking for in terms of the scope and quality of
your projects. This is a very nice paper on the effect of caching on
index structures which was turned in as the final report for a class
similar to ours at UC Irvine.
Top n Project Mistakes
(Note: much of this material is taken from Bernard Moret's CS361 web
page). Your report should contain enough information to convince a
"skeptical reader" that your hypothesis is correct. As much as
possible, try to objectively evaluate your report and experiments, to
decide if they will convince a skeptical reader.

Generating inappropriate test suites: Make sure that your test suite
is sufficiently challenging. For example, if you're testing a binary
tree, do not simply insert a bunch of data, and then delete it in the
same order. This will make the delete operation look much too good.
Instead, try several different types of tests, one of which should be
randomly interleaving inserts with deletes.
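A randomly interleaved test of this kind might be generated along the following lines. This is only a sketch: the function name, the 32-bit random keys, and the insert_prob parameter are assumptions for illustration.

```python
import random

def interleaved_workload(n_ops, insert_prob=0.5, seed=0):
    """Return a list of ("insert", key) and ("delete", key) operations in random order.
    Every delete targets a key that is present at that point in the sequence."""
    rng = random.Random(seed)
    present = []       # keys currently in the structure, for uniform random choice
    members = set()    # same keys, for fast duplicate checks
    ops = []
    for _ in range(n_ops):
        if not present or rng.random() < insert_prob:
            key = rng.getrandbits(32)
            while key in members:            # keep keys distinct
                key = rng.getrandbits(32)
            present.append(key)
            members.add(key)
            ops.append(("insert", key))
        else:
            # Delete a uniformly random present key (swap with last, then pop).
            i = rng.randrange(len(present))
            present[i], present[-1] = present[-1], present[i]
            key = present.pop()
            members.discard(key)
            ops.append(("delete", key))
    return ops

print(interleaved_workload(10))
```

Generating the sequence ahead of time keeps the random-number calls out of the timed region, and the same sequence can be replayed against every structure being compared.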

Insufficient data: You need to take enough data to convince a
skeptical reader that the behavior really is e.g. Theta(n log n), and
get a reasonable estimate of the constant. If n is too small, you'll
get a "time" of zero, so you need to take enough data at larger values
of n, ranging over at least a factor of 100. You should also, when
possible, take the average over many independent trials for each value
of n (at least 20). This will prevent the data from fluctuating
wildly. If your data is so noisy that the n log n behavior isn't
convincing, you haven't passed the "skeptical reader" test.

Irreproducible Results: You need to explain enough about how the
experiment was performed so that another person could reproduce your
experiments and results from the report *alone*. For instance, you
need to report how many trials you ran for each value of n. You also
need to state how the test data are collected or generated,
what operations are used in the test, how many, in what sequence and
mix, what kind of data (timings, structural values) are collected and
how, etc. In dictionaries, you need to distinguish successful from
unsuccessful search; in all cases, you need to distinguish insertion
from deletion and from search.

Improper plotting of data: the data should be normalized and the plots
scaled so as best to show the behavior of the structure and thus be
able to compare it to the analytical prediction. For instance, you
should plot search/insertion/deletion times per
operation rather than cumulatively. Further, if your
predicted running time is logarithmic, then you should use a log scale on
the data size, so that the theoretically predicted curve is a straight
line; it's easy to detect a deviation from a straight line, but hard
to look at a curve and say "this is logarithmic" (it could be some
type of root and look much the same).
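The straight-line check can also be done numerically before plotting. The sketch below fits per-operation cost against log2(n) by ordinary least squares and reports the largest deviation from the fitted line; the function name and the synthetic data are assumptions made for the example, not measurements.

```python
import math

def fit_against_log(sizes, per_op_costs):
    """Least-squares fit of cost ~ a * log2(n) + b.
    Returns (a, b, max_residual), where max_residual is the largest
    absolute deviation of a data point from the fitted straight line."""
    xs = [math.log2(n) for n in sizes]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(per_op_costs) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, per_op_costs))
    a = sxy / sxx
    b = mean_y - a * mean_x
    max_residual = max(abs(y - (a * x + b)) for x, y in zip(xs, per_op_costs))
    return a, b, max_residual

sizes = [2 ** k for k in range(4, 12)]
# Synthetic per-operation costs: one truly logarithmic, one a square root.
print(fit_against_log(sizes, [3 * math.log2(n) + 1 for n in sizes]))
print(fit_against_log(sizes, [math.sqrt(n) for n in sizes]))
```

On the log scale the logarithmic data lie on the line, while the square-root data bend away from it, which is the deviation a reader would look for in the plot.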

Bad plots: Every single plot you include in your report should have
labels for both the x and y axes. Every one! The axis labels should
be located on the plots themselves. In addition, each plot should
have a title, a figure number, and a caption below it describing the
plot, and it should be referred to explicitly in the text of the
report!

Insufficient Evaluation of data: You should use both timing
measures and structural measures to evaluate performance. Structural
measures include counts of pointer dereferences, memory accesses, etc.
Timing alone may simply reflect errors in the code or errors in data
generation or unusual data patterns generated by your test driver.
The same may go for structural measures alone. But both together
generally tell a better story and may help you identify bugs or
explain apparently strange behavior. Also, don't be too trusting:
be skeptical of the data you collect at first, and before you trust it,
at least try to make sure that it makes sense. If
not, your code probably has some kind of bug.
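As an illustration of a structural measure and of a sanity check on it, here is a minimal sketch of a plain binary search tree instrumented to count key comparisons. The class name and the node layout are assumptions, and each node visited is counted as one (three-way) comparison, which is a convention you would need to state in your report.

```python
class CountingBST:
    """Plain (unbalanced) binary search tree that counts key comparisons."""

    def __init__(self):
        self.root = None          # each node is a list: [key, left, right]
        self.comparisons = 0      # one count per node visited

    def insert(self, key):
        if self.root is None:
            self.root = [key, None, None]
            return
        node = self.root
        while True:
            self.comparisons += 1
            if key == node[0]:
                return                        # key already present
            side = 1 if key < node[0] else 2  # index of left or right child
            if node[side] is None:
                node[side] = [key, None, None]
                return
            node = node[side]

    def contains(self, key):
        node = self.root
        while node is not None:
            self.comparisons += 1
            if key == node[0]:
                return True
            node = node[1] if key < node[0] else node[2]
        return False

# Sanity check against a value known from theory: inserting n sorted keys
# builds a path, so the total is 0 + 1 + ... + (n - 1) = n(n - 1)/2.
tree = CountingBST()
for key in range(1, 6):
    tree.insert(key)
print(tree.comparisons)  # 10 for n = 5
```

If a count like this disagrees with a case you can work out by hand, the instrumentation or the structure itself has a bug, and the rest of the data should not be trusted yet.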

Using unreasonable test settings: many students use values too small
to show much or use a single test pattern that puts one or two of the
structures at a huge disadvantage. (One example is testing splay
trees against other trees in a setting where each item is searched for
exactly once; not a bad test in itself, but it cannot be used alone
for a sweeping conclusion, since it clearly is the worst possible
situation for a splay tree.)

Staying slave to bad code: not all programs on the web work;
of those that work, many still have small bugs; and even the bug-free
code may be very inefficient. Having to curtail an experiment because
the code you downloaded segfaults (or similar trouble) is not
acceptable: you can debug the code, write your own code, download
other code, or drop that structure and use another.