CS 361: Semester Project

The goal of the project is to learn about experimental studies of the performance of data structures and/or algorithms. To this end, you need to propose a project, write and collect software to support it, design experiments, implement drivers to run these experiments and collect data, analyze the data, and write a report presenting your results and explaining (inasmuch as possible) your observations.

What Form Can A Project Take?

The simplest form for this project is a comparison of performance measures for two to five different algorithms that address the same problem (e.g., sorting), or for two to five data structures implementing the same abstract data type. For groups with many members, I will expect a comparison of a larger number of algorithms or data structures, whereas for a group of just one or two students, two or three algorithms or data structures will suffice. The experiments may be based entirely on randomly generated data, but they can also be driven by an existing application you already have, or use data sets collected from the web (e.g., you can test dictionary lookup by inserting/looking up every word in the standard Unix spell dictionary, or in one of the many full-text classic novels available on the web).
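
As a concrete illustration, here is a minimal sketch, in C, of a file-driven test harness; the path /usr/share/dict/words is where the Unix spell dictionary usually lives on many systems, and dict_insert is a hypothetical placeholder for whatever structure you are testing.

    #include <stdio.h>
    #include <string.h>

    /* Placeholder for the structure under test -- substitute your own code. */
    static void dict_insert(const char *word) { (void)word; /* ... */ }

    int main(void)
    {
        /* The Unix spell dictionary is often found here; adjust if needed. */
        FILE *f = fopen("/usr/share/dict/words", "r");
        if (!f) { perror("fopen"); return 1; }

        char word[128];
        long count = 0;
        while (fgets(word, sizeof word, f)) {
            word[strcspn(word, "\n")] = '\0';   /* strip the trailing newline */
            dict_insert(word);
            count++;
        }
        fclose(f);
        printf("inserted %ld words\n", count);
        return 0;
    }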

More specialized projects can probe the inner workings of a single algorithm or data structure: for instance, you can investigate the expected number of rotations in an insertion or deletion in an AVL tree, the expected number of levels through which an element is sifted in a binary heap during insertion or deletion, the efficiency of quicksort under different rules for choosing the pivot, etc. Another possibility is to study the cache behavior of various implementations of the same data structure or algorithm -- write code to allocate storage differently in a tree, or compare linear probing (next location in the table) with double hashing or quadratic probing in terms of both running time and number of probes, testing to see if there is a crossover value. Yet another type of specialized project takes a single data structure and compares several (say five or so) variants in terms of their performance; for instance, you could compare five or more different probing schemes for collision resolution in a hash table.
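
To make the probing comparison concrete, here is a minimal sketch of an instrumented open-addressing insert that counts probes; the table size, hash functions, and load factor are illustrative choices of mine, not prescriptions.

    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_SIZE 1048576u          /* a power of two, chosen for illustration */

    static unsigned hash1(unsigned k) { return k * 2654435761u; }   /* Knuth's multiplicative hash */
    static unsigned hash2(unsigned k) { return (k * 40503u) | 1u; } /* odd step for double hashing */

    static long probes;                  /* the structural measure being collected */

    /* Insert a nonzero key using linear probing or double hashing. */
    static void insert(unsigned *table, unsigned key, int use_double)
    {
        unsigned step = use_double ? (hash2(key) % TABLE_SIZE) | 1u : 1u;
        unsigned h = hash1(key) % TABLE_SIZE;
        while (table[h] != 0) {          /* 0 marks an empty slot */
            probes++;
            h = (h + step) % TABLE_SIZE; /* odd step is coprime to the table size */
        }
        probes++;                        /* count the final, successful probe */
        table[h] = key;
    }

    int main(void)
    {
        unsigned *table = calloc(TABLE_SIZE, sizeof *table);
        long n = (long)(TABLE_SIZE * 0.7);   /* fill to load factor 0.7 */
        srand(1);                            /* fixed seed for reproducibility */
        for (long i = 0; i < n; i++)
            insert(table, (unsigned)rand() + 1u, 1);
        printf("average probes per insertion: %.3f\n", (double)probes / n);
        free(table);
        return 0;
    }

Running the same driver with use_double set to 0 gives the linear-probing counts, so the two schemes can be compared point for point at each load factor.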

Here are some possible choices for a simple type of project (compare various data structures for the same ADT):

You can easily code hash tables (of any kind) yourself; ditto for lists with move-to-front heuristics, plain binary search trees, randomized binary search trees (treaps), binary heaps, binomial heaps, skew heaps, trivial pairing heaps, skip lists, and leftist trees. I prefer that, if possible, you code up these data structures yourself and include your source code in the report. However, for more complicated data structures such as red-black trees, splay trees, tries, Fibonacci heaps, sophisticated pairing heaps, and relaxed heaps, I recommend you use software from the web.
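
To give a sense of the scale involved, here is a minimal sketch of the move-to-front list mentioned above, instrumented with a comparison counter; the names are mine, and strdup assumes a POSIX system.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct node { char *key; struct node *next; };

    static long comparisons;             /* structural measure: key comparisons */

    /* Insert a new key at the front of the list. */
    static void mtf_insert(struct node **head, const char *key)
    {
        struct node *n = malloc(sizeof *n);
        n->key = strdup(key);            /* strdup is POSIX */
        n->next = *head;
        *head = n;
    }

    /* Search for key; on a hit, unlink the node and splice it to the front. */
    static struct node *mtf_search(struct node **head, const char *key)
    {
        for (struct node **link = head; *link; link = &(*link)->next) {
            comparisons++;
            if (strcmp((*link)->key, key) == 0) {
                struct node *hit = *link;
                *link = hit->next;       /* unlink */
                hit->next = *head;       /* move to front */
                *head = hit;
                return hit;
            }
        }
        return NULL;                     /* unsuccessful search */
    }

    int main(void)
    {
        struct node *head = NULL;
        mtf_insert(&head, "apple");
        mtf_insert(&head, "banana");
        mtf_search(&head, "apple");      /* "apple" moves to the front */
        printf("front: %s, comparisons: %ld\n", head->key, comparisons);
        return 0;
    }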

Designing A Project

Once you have selected a project, start writing and collecting software for it while deciding what you want to measure. Running time is an obvious measure: you can measure it with the time command for the whole program or, better, with the getrusage OS call for whichever portion of the code you select (more details on the timing page). Since the running time will be very small, be sure to measure it on large datasets over many repetitions, so as to obtain reasonable precision.

Separate from running time, you should instrument your code to collect structural measures. For example, you may want to count various types of low-level actions, such as pointer dereferences or memory accesses, or various types of high-level algorithmic steps, such as the number of rotations, probes, key comparisons, etc. (Naturally, the instrumented version should not be used for timing!)

You also need to decide what conditions you will use for measurement: purely random data? stable conditions (e.g., a large dictionary with equal probability of insertion or deletion) or transient ones (a long chain of insertions followed by a long chain of deletions)? real data with real frequencies (e.g., looking up words in a novel or in large email folders)? what relative importance to give the operations in the ADT (in a dictionary, lookup versus insertion/deletion; in a priority queue, priority changes and queue melding versus insert and deletemin)? You want to plan a course of action that will not require excessive time, yet will give you sufficient information to write a real report.
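
A minimal timing harness along these lines might look like the following sketch; the repetition count is an illustrative choice, and the loop body stands in for whatever operation you are measuring.

    #include <stdio.h>
    #include <sys/resource.h>

    /* Return the process's user CPU time in seconds, via getrusage. */
    static double user_seconds(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    }

    int main(void)
    {
        long reps = 1000;                /* repeat so the total time is measurable */
        double start = user_seconds();
        for (long r = 0; r < reps; r++) {
            /* ... the operation being measured goes here ... */
        }
        double total = user_seconds() - start;
        printf("%.9f seconds per repetition\n", total / reps);
        return 0;
    }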

Conducting The Experiments

After you have designed the experiments, code what you have to code; this may include scripts to analyze, filter, refine, and plot the data. Early runs will probably prompt you to collect data you had not planned on collecting. Once you are fairly confident that you have all the pieces in place, start collecting data in earnest -- run a fair number of experiments on a fair sample of different sizes, so as to be able to plot curves and also to have some estimate of the mean and standard deviation at each plot point.
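
For instance, each plot point might be summarized as in this sketch; the values in times[] are stand-ins for measurements your own driver would collect.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Stand-in measurements; your driver would fill this array instead. */
        double times[] = { 1.02, 0.97, 1.10, 1.05, 0.99 };
        int k = sizeof times / sizeof times[0];

        double sum = 0.0, sumsq = 0.0;
        for (int i = 0; i < k; i++) {
            sum += times[i];
            sumsq += times[i] * times[i];
        }
        double mean = sum / k;
        double sd = sqrt((sumsq - k * mean * mean) / (k - 1)); /* sample std. dev. */
        printf("mean %.4f, standard deviation %.4f over %d trials\n", mean, sd, k);
        return 0;
    }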

Analyzing The Results

Finally, reason about the data: why have the numbers come out the way they did? how well were they predicted by theory? are there surprises? are some of the effects explainable by cache performance? how do the data affect the performance (especially if you use real data)?

Writing It All Up

Last, but not least, write a report (your report is the only thing I will look at). The report should state what you set out to investigate, what software you wrote yourself, what software you used from the web and where you got it, what measurements you took, and what types of experiments you ran; it should then present and discuss the results. The results must be presented in graphical form (plots, bar or pie charts, scatter plots, etc.), not as tables of numbers! Your discussion should present the reasoning mentioned above: how do your observations correlate with theory? what is the influence of specific data? are there unexplainable anomalies for which you might want to hazard a guess? are your data conclusive (i.e., is one choice better than the others in the context you chose)? I expect that a typical report will have 6-12 pages of text, plus several pages of figures. For larger groups, a longer and more comprehensive report will be expected, whereas for groups of one or two students, a shorter report will suffice.

Example Project

This report on "Cache Performance of Indexing Data Structures" by Joshua MacDonald and Ben Zhao is a good example of what I'm looking for in terms of the scope and quality of your projects. It is a very nice paper on the effect of caching on index structures, turned in as the final report for a class similar to ours at UC Irvine.

Top n Project Mistakes

(Note: much of this material is taken from Bernard Moret's CS361 web page). Your report should contain enough information to convince a "skeptical reader" that your hypothesis is correct. As much as possible, try to objectively evaluate your report and experiments, to decide if they will convince a skeptical reader.
  1. Generating inappropriate test suites: Make sure that your test suite is sufficiently challenging. For example, if you're testing a binary tree, do not simply insert a bunch of data and then delete it in the same order. This will make the delete operation look much too good. Instead, try several different types of tests, one of which should randomly interleave inserts with deletes (see the sketch after this list).
  2. Insufficient data: You need to take enough data to convince a skeptical reader that the behavior really is, e.g., Theta(n log n), and to get a reasonable estimate of the constant. If n is too small, you'll get a "time" of zero, so you need to take enough data at larger values of n -- ranging over at least a factor of 100. You should also, when possible, take the average over many independent trials for each value of n (at least 20). This will keep the data from fluctuating wildly. If your data is so noisy that the n log n behavior isn't convincing, you haven't passed the "skeptical reader" test.
  3. Irreproducible results: You need to explain enough about how the experiment was performed that another person could read your report and reproduce your results -- that is, from the report *alone*, they could reproduce your experiments and results. For instance, you need to include information like how many trials you ran for each value of n. You need to state how the test data are collected or generated, what operations are used in the test, how many, in what sequence and mix, what kind of data (timings, structural values) are collected and how, etc. In dictionaries, you need to distinguish successful from unsuccessful search; in all cases, you need to distinguish insertion from deletion and from search.
  4. Improper plotting of data: the data should be normalized and the plots scaled so as best to show the behavior of the structure, making it possible to compare it to the analytical prediction. For instance, you should plot search/insertion/deletion times per operation rather than cumulatively. Further, if your predicted running time is logarithmic, then you should use a log scale on the data size, so that the theoretically predicted curve is a straight line -- it's easy to detect a deviation from a straight line, but hard to look at a curve and say "this is logarithmic" (it could be some type of root and look much the same).
  5. Bad plots: Every single plot you include in your report should have labels for both the x and y axes -- every one! -- and the axis labels should be located on the plots themselves. In addition, each plot should have a figure number, a title, and a caption below it describing the plot, and every plot must be referred to explicitly in the text of the report.
  6. Insufficient evaluation of data: You should use both timing measures and structural measures to evaluate performance. Structural measures include counting pointer dereferences, memory accesses, etc. Timing alone may simply reflect errors in the code, errors in data generation, or unusual data patterns generated by your test driver; the same goes for structural measures alone. But both together generally tell a better story and may help you identify bugs or explain apparently strange behavior. Also, don't be too trusting: be skeptical of the data you collect at first, and before you trust it, at least try to make sure that it makes sense. If it doesn't, your code probably has some kind of bug.
  7. Using unreasonable test settings: many students use values too small to show much or use a single test pattern that puts one or two of the structures at a huge disadvantage. (One example is testing splay trees against other trees in a setting where each item is searched for exactly once; not a bad test in itself, but it cannot be used alone for a sweeping conclusion, since it clearly is the worst possible situation for a splay tree.)
  8. Staying slave to bad code: not all programs on the web work; of those that work, many still have small bugs; and even the bug-free code may be very inefficient. Having to curtail an experiment because the code you downloaded seg-faults (or causes similar trouble) is not acceptable -- you can debug the code, write your own code, download other code, or drop that structure and use another.
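
Regarding mistake #1 above, a randomly interleaved workload can be as simple as the following sketch; do_insert and do_delete are placeholders for the operations of the structure under test, and the sizes are illustrative.

    #include <stdlib.h>

    /* Placeholders for the operations of the structure under test. */
    static void do_insert(int key) { (void)key; /* ... */ }
    static void do_delete(int key) { (void)key; /* ... */ }

    int main(void)
    {
        srand(42);                        /* fixed seed, so the run is reproducible */
        long n = 100000;

        for (long i = 0; i < n; i++)      /* build up to a steady-state size first */
            do_insert(rand());

        for (long i = 0; i < 10 * n; i++) /* then interleave inserts and deletes */
            if (rand() & 1)
                do_insert(rand());
            else
                do_delete(rand());
        return 0;
    }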