Data Formats

There are five different data files, of three different formats, that we'll use in this project:

The 3-D facial image data are stored in VRML files. There is one .wrl file for each subject/image in each data set. These files contain a lot of data, much of which is not relevant to us. The minimal useful part of each file is the IndexedFaceSet:Coordinate:point list. This is a linear list of comma-separated 3-tuples, each giving the spatial location of one surface point from a face scan.
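As a rough illustration of pulling out just that point list, here is a minimal Python sketch. It assumes the usual VRML 2.0 layout, in which the list appears as "Coordinate { point [ ... ] }", and it uses regular expressions rather than a real VRML parser. The function name parse_vrml_points is hypothetical, and the sketch may need adjusting if your files are laid out differently.

```python
import re

# Assumption: the points appear as "Coordinate { point [ ... ] }".
# The lookbehind skips nodes such as TextureCoordinate.
_POINT_LIST = re.compile(
    r"(?<![A-Za-z])Coordinate\s*\{\s*point\s*\[(.*?)\]", re.DOTALL
)
# Plain decimal or scientific-notation numbers.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_vrml_points(text):
    """Return the first Coordinate point list as a list of (x, y, z) tuples."""
    match = _POINT_LIST.search(text)
    if match is None:
        raise ValueError("no Coordinate point list found")
    values = [float(tok) for tok in _NUMBER.findall(match.group(1))]
    if len(values) % 3 != 0:
        raise ValueError("point list length is not a multiple of 3")
    # Group the flat list of numbers into consecutive 3-tuples.
    return [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
```

Because the number of points differs from file to file, the result is returned as a variable-length list rather than a fixed-size array.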

The problem is that the order of these points is not guaranteed, nor is there a fixed number of surface points per file. (This happens because the files have been pre-trimmed to excise background points, leaving only "face" points, and every face has a slightly different surface area.) To make the scans comparable, we therefore need landmarks to assist with registration.

The landmarks are stored in tab-delimited text files. Each file contains:

The labels file is a tab-delimited text file containing four columns and 21 rows (20 subjects and one header row). The columns are the subject ID followed by three labels:

ID#
The subject ID, corresponding to one of the subjects in the feature detection data/landmarks.
Race
One or more self-defined race identifiers. This is from a discrete (nominal/categorical), but unbounded set.
Ethnicity
One or more self-defined ethnicity/ancestry labels. Again, these are discrete labels, but are from a potentially unbounded set.
Sex
One of {F,M}

As with the landmarks data, some values are unobserved (i.e., not reported by the subjects). These are represented by a ? in the corresponding field of the file. You need to report predictions only for the subjects who have corresponding labels, but you may choose to treat the task as semi-supervised and use both labeled and unlabeled examples in your learner.
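One way to handle the ? markers when loading the labels is sketched below in Python. It assumes the header row uses the column names listed above (ID#, Race, Ethnicity, Sex). The function names read_labels and split_by_label are hypothetical, as is the file name in the comment.

```python
import csv

MISSING = "?"  # marker for an unobserved value, as described above


def read_labels(lines):
    """Read tab-delimited label rows, mapping '?' fields to None.

    `lines` is any iterable of text lines, e.g. an open file:
        with open("labels.txt", newline="") as f:  # hypothetical file name
            rows = read_labels(f)
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [
        {key: (None if value == MISSING else value) for key, value in rec.items()}
        for rec in reader
    ]


def split_by_label(rows, column):
    """Split rows into (labeled, unlabeled) for one label column."""
    labeled = [r for r in rows if r[column] is not None]
    unlabeled = [r for r in rows if r[column] is None]
    return labeled, unlabeled
```

The split is done per column because a subject may report one label but not another. The unlabeled rows are the ones a semi-supervised learner could still use as inputs.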

Each of the three label columns yields a distinct prediction task. You should investigate each of them independently. (I.e., ask about what features are most predictive for each and what classification rate you can get for each.) In addition, you may choose to attempt some form of joint analysis, in which you try to predict all labels simultaneously (taking advantage of the statistical dependencies that may be present among the labels).

One interesting property of the labels is that many subjects chose multiple labels for the race column, the ethnicity column, or both. For example, one subject reports Race=White/Asian and Ethnicity=European/Indian. From a data perspective, you can treat such values as distinct, opaque labels (i.e., European/Indian is an atomic label, distinct from both European and Indian), as elements of the Cartesian product space (i.e., European/Indian is a member of the product space that contains both European and Indian individually), or as a statistical mixture (e.g., European/Indian might be interpreted as 50% European and 50% Indian). These three interpretations lead to different learning strategies. The choice is up to you.
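To make the three interpretations concrete, here is a small Python sketch of one possible encoding for each. It assumes that "/" separates the component labels, as in the example above, and that mixture weights are split equally. The product-space view is represented here as one binary indicator per base label, which is only one of several reasonable choices. All function names are hypothetical.

```python
def as_atomic(label):
    """Opaque view: the whole string is a single class name."""
    return label


def components(label):
    """Split a composite label on '/' into its sorted, distinct parts."""
    return sorted({part.strip() for part in label.split("/")})


def as_indicators(label, vocabulary):
    """Product-space view: one 0/1 indicator per base label in `vocabulary`."""
    present = set(components(label))
    return [1 if base in present else 0 for base in vocabulary]


def as_mixture(label):
    """Mixture view: equal weight on each component (an assumption)."""
    parts = components(label)
    return {part: 1.0 / len(parts) for part in parts}
```

Under the opaque view, European/Indian and European are unrelated classes. Under the other two views they share the European component, which lets a learner pool evidence across subjects at the cost of a more complicated prediction target.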