Semester Project in Experimental Studies
Schedule:
-
end of March: everyone emails a project proposal (a couple of paragraphs)
to Prof. Moret
-
end of 1st week of April: everyone emails a tentative project design (what
measures, what experimental conditions, what goals, what expectations) to Prof. Moret; everyone makes sure he/she can produce plots, graphs, etc., from
data.
-
month of April: set up the software and data (or data generators) needed to
run the experiments; run initial sets of experiments for calibration and to
check is anything crucial has been overlooked; when all ready, run all
experiments one after the other over a few days
-
First week of May: analyze the results and write a first draft of the
project report
-
May 10: all reports due
Remarks on Projects
I graded the projects on a scale of 0 to 100, but using the full scale
-- I have grades from 20 to 95 --, so that grades of 45 or above are actually
passing grades for the project. Here is the distribution of 37 of the 39
projects (2 reports are due in):
90- : 4
80-89: 4
70-79: 1
60-69: 7
50-59: 4
40-49: 3
30-39: 9
20-29: 5
A rough letter grade correspondence would be:
A: above 89 (4)
B+: 80 to 89 (3)
B-: 70 to 79 (1)
C+: 60 to 69 (6)
C-: 45 to 59 (5)
D+: 35 to 44 (3)
D: 25 to 34 (8)
D-: below 25 (5)
Common problems:
-
Plotting data: the data should be normalized and the plots scaled
so as best to show the behavior of the structure and thus be able to
compare it to the analytical prediction. For instance, you should plot
search/insertion/deletion times per operation rather
than cumulatively (many of you even thought that the cumulative values
should be logarithmic and made comments about being surprised at observing
an apparently linear behavior, when in fact you should have expected
n log n behavior in the cumulative values) and you should
use a log scale on the data size, so that the theoretically
predicted curve would be a straight line -- it's easy to detect a deviation
from a straight line, but hard to look at a curve and say "this is logarithmic"
(it could be some type of root and look much the same).
-
Evaluating data: timing alone may simply reflect errors in the code or
errors in data generation or unusual data patterns generated by your
test driver (many of you inserted, then deleted in the same order in
search trees, thereby making delete look much too good). The same
may go for structural measures alone. But both together generally tell
a better story and may help you identify bugs or explain apparently
strange behavior. Many of you collected only one kind of data.
Many of you were also much too trusting and believed that
the data you collected was somehow right, when in fact your code clearly
had some kind of bug.
-
Using reasonable test settings: many of you used values too small to
show much or used a single test pattern that also put one or two of
the structures at a huge disadvantage. (One example is testing
splay trees against other trees in a setting where each item is searched
for exactly once; not a bad test in itself, but it cannot be used alone
for a sweeping conclusion, since it clearly is the worst possible situation
for a splay tree.)
-
Staying slave to bad code: not all codes on the web work; of those that
work, many still have small bugs; and even the bug-free codes may be
very inefficient. Having to curtail an experiment because the code you
downloaded seg-faults (or similar trouble) is not acceptable -- you can
debug the code, download another code, or drop that structure and use
another.
-
Not saying enough in the report: a very common problem was saying next to
nothing about the nature of the tests run. You need to state how the
test data were collected or generated, what operations were used in the test,
how many, in what sequence and mix, what kind of data (timings, structural
values) were collected and how, etc. In dictionaries, you need to distinguish
successful from unsuccessful search; in all cases, you need to distinguish
insertion from deletion and from search. The plots need to have figure numbers
and need to be referred to explicitly in the text.
All in all, not a bad effort -- and I hope that all of you learned something
about obtaining code, re-using it, and running experiments.