Note: References to code or actual data look like this.
A1: Don't panic! As the README included with the data indicates, this release is a preliminary data set, so it may not match the spec in all ways. The slightly longer story is that Siemens is preparing a more extensive version of the training data. It turned out to take longer to produce that data than they had expected, and we judged that it was more important to go forward with the competition than to wait for the final data, so we released the data that is currently available. The spec is based on Siemens's projections of what the final data will look like, which is why it is out of sync with the preliminary data that we released. The "final" data may or may not become available in time to be of use during the competition; if it does, I will release it as soon as I can and correct the spec to reflect the final state of the training data.
A2: The 'label' field actually gives the index of the PE with which the candidate
is associated. That is, label=0 means 'this candidate not associated with
a PE', while label=X means 'this candidate associated with PE number X' (for
X!=0).
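As an illustration of the label convention in A2, here is a minimal Python sketch. It assumes the labels have already been read into a plain list; the function names and the sample values are made up for this example and are not part of the released data or tools. It also ignores patient boundaries: if PE numbers restart for each patient, you would want to group on the (patient, label) pair instead.

```python
from collections import defaultdict

def binary_targets(labels):
    """Collapse PE indices to 0/1: 1 if the candidate is associated with any PE."""
    return [1 if label != 0 else 0 for label in labels]

def candidates_by_pe(labels):
    """Group candidate row positions by the PE number they are associated with."""
    groups = defaultdict(list)
    for row, label in enumerate(labels):
        if label != 0:  # label 0 means "not associated with a PE"
            groups[label].append(row)
    return dict(groups)

# Made-up label values, one per candidate.
labels = [0, 3, 0, 1, 3, 0]
targets = binary_targets(labels)
groups = candidates_by_pe(labels)
```

Here candidates 1 and 4 both carry label 3, so they are treated as two candidates for the same PE rather than as two separate positives.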
Patient ID) than the 'KDDPETrain.txt' file fields (117)?
Tissue Feature and is missing from the original
feature names file.
A4: Yes, that's a mistake in the documentation. The features are normalized into a unit range with roughly zero mean. But feel free to re-normalize the data in any way that you like.
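If you do want to re-normalize, one common choice is a per-feature z-score. The sketch below is only an example of that choice, not a recommended or official preprocessing step; the function name and the toy feature values are hypothetical, and the rows are assumed to hold numeric feature columns only.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale each feature column to zero mean and unit variance (z-score)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy feature rows (made-up values), one row per candidate.
features = [[0.1, -0.5], [0.3, 0.0], [0.5, 0.5]]
scaled = standardize_columns(features)
```

When scaling test data, reuse the means and standard deviations computed on the training data rather than recomputing them.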
A5: I'm working on sorting out the submission process now. I'll let you all know more when I do.
A6: Yes. Once the competition is complete, the intention is to release the entire data set.
A7: Siemens is unwilling to provide this information. They are hoping for "abstract" approaches that aren't engineered to specific feature sets. (E.g., so that they're easily extensible to different medical problems.)
A7': That's a good question. It's part of the challenge for you to determine that. No prior domain knowledge is available on this point.
A8: There are indeed strong correlations within patient and even within patient groups from a single hospital. Part of the challenge is to find useful ways to exploit that information. You will be given patient ID in the final test data, but you will not be given a hospital ID and you may not assume that all of the patients are drawn from the same hospital. The test data patient set will be disjoint from the training data patient set.
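Because the test patients will be disjoint from the training patients, it may be useful to validate the same way, holding out whole patients rather than individual candidates. The following is a rough Python sketch of such a split; the function name, the holdout fraction, and the sample patient IDs are assumptions made for this example.

```python
import random

def split_by_patient(patient_ids, holdout_fraction=0.25, seed=0):
    """Split candidate rows so that no patient appears on both sides."""
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(patients)
    n_holdout = max(1, round(len(patients) * holdout_fraction))
    holdout = set(patients[:n_holdout])
    train_rows = [i for i, p in enumerate(patient_ids) if p not in holdout]
    valid_rows = [i for i, p in enumerate(patient_ids) if p in holdout]
    return train_rows, valid_rows

# Made-up patient IDs, one per candidate row.
patient_ids = [101, 101, 102, 103, 103, 103, 104, 104]
train_rows, valid_rows = split_by_patient(patient_ids)
```

A candidate-level random split would put candidates from the same patient on both sides, and the within-patient correlations would then make validation scores look better than they are likely to be on the test data.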
A9: I'll say more about the answer format in the future, but the short answer is: no. You must commit to an answer for each candidate. In practice, if your algorithm were embedded in a piece of medical imaging equipment, you would be required to return an absolute yes/no answer to the physician; there is no room for "I don't know".
A10: Answer from Siemens: "We have high confidence that the labels given in the training and test data are correct. They have been rigorously examined individually by domain experts. But there is always an element of human error — there may be small errors in either direction, but we believe that such errors are minimal, at best. There exists the possibility that there are PEs in the original data to which no candidates were assigned at all, but this is beyond the scope of this competition. (Such PEs will have no candidates in the test data, so you won't be scored on them one way or the other.)"
Last Modified: June 19, 2006