CS 361: Semester Project

The goal of the project is to learn about experimental studies of the performance of data structures. To this end, you need to propose a project, collect software to support it, design experiments, implement drivers to run these experiments and collect data, analyze the data, and write a report presenting your results and explaining (inasmuch as possible) your observations.

What Form Can A Project Take?

The simplest form for this project is a comparison between two to three different data structures for the same ADT, in terms of various performance measures. The experiments may be entirely based on random generation of data, but they can also be driven by an existing application you already have or use data sets collected from the web (e.g., you can test dictionary lookup by inserting/looking up every word in the standard Unix spell dictionary, or in one of the many full-text classic novels available on the web).

More specialized projects can probe the details of functioning of a single data structure: for instance, you can investigate the expected number of rotations in an insertion or deletion in an AVL tree, the expected number of levels through which an element is sifted in a binary heap during insertion or deletion, etc. Another possibility is to study the cache behavior of various implementations of the same data structure -- modify the code to allocate storage differently in a tree, or compare linear probing (next location in the table) with double hashing or quadratic probing in terms of both running time and number of probes, testing to see if there is a crossover value. Another type of specialized project takes a single data structure and compares several (say five or so) variants in terms of their performance; for instance, you could compare five different probing schemes for collision resolution in a hash table.

You should plan to choose a project by the end of March. Here are some possible choices for the simplest type (compare various data structures for the same ADT):

You can easily code hash tables (of any kind) yourself; ditto for lists with move-to-front heuristics, plain binary search trees, randomized binary search trees (treaps), binary heaps, binomial heaps, skew heaps, trivial pairing heaps, and leftist trees. I recommend you use software off the web for AVL trees, red-black trees, splay trees, tries, skip lists, Fibonacci heaps, sophisticated pairing heaps, and relaxed heaps.

Designing A Project

Once you have selected a project, start collecting software for it while deciding what you want to measure. Running time is an obvious measure (you can measure it with the time command for the whole program or, better, with the getrusage OS call for whichever selected portion of the code you want, more details on the timing page); since the running time will be very small, be sure to measure it on large datasets for many repetition, so as to obtain reasonable precision. Separate from running time, you can instrument your code to count various types of low-level actions, such as pointer dereferencings or memory accesses, or various types of high-level algorithmic steps, such as the number of rotations, probes, key comparisons, etc. (Naturally, the instrumented version should not be used for timing!) You need to decide what conditions you will be using for measurement: purely random data? stable conditions (e.g., a large dictionary with equal probability of insertion or deletion) or transient ones (long chain of insertions followed by a long chain of deletions)? real data with real frequencies (e.g., look up words in a novel or in large email folders)? the relative importance of the operations in the ADT (in a dictionary, lookup versus insertion/deletion; in a priority queue, priority changes and queue meldings versus insert and deletemin)? You want to plan a course of action that will not require excessive time, yet will give you sufficient information to write a real report.

Conducting The Experiments

Then code what you have to code; this may include scripts to analyze, filter, refine, and plot the data. Early runs will probably prompt you to collect data you had not planned on collecting. Once you are fairly confident that you have all the pieces in place, start collecting data in earnest -- run fair numbers of experiments on a fair sample of different size so as to be able to plot curves and also so as to have some estimate of mean and standard deviation at each plot point.

Analyzing The Results

Finally reason about the data: why have the numbers come out the way they did? how well were they predicted by theory? are there surprises? are some of the effects explainable by cache performance? how does the data affect the performance (especially if you use real data)?

Writing It All Up

Last, but not least, write a report (your report is the only thing I will look at). The report should state what you set out to investigate, what software you used and from where, the measurements you took, the type of experiments you ran, and present and discuss the results. The results must be presented in graphical form (plots, bar or pie charts, scatter plots, etc.), not as tables of numbers! Your discussion should present the reasoning mentioned above: how do your observations correlate with theory? what is the influence of specific data? are there unexplainable anomalies for which you might want to hazard a guess? are your data conclusive (i.e., is one choice better than the others in the context you chose)? I expect that a typical report will have 6-12 pages of text, plus several pages of figures.