Homework 1

Due: Thurs, Feb 2, 2012, start of class.

All students:

  1. Write a decision tree learning program (in a language of your choice). Your program must support the following:
    • Arbitrary number of features (i.e., limited only by system memory or the size of your underlying arrays -- no hard-coded limits in your code).
    • Arbitrary number of instances
    • Binary features (1/0)
    • Binary class labels (1/0)
    • Entropy gain objective function (a sketch of the computation follows this list)
    • Early stopping (see below)

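    Referring back to the entropy-gain requirement above, here is a minimal Python sketch of the computation for binary features and binary labels. The function names and the list-of-lists data layout are illustrative assumptions, not a required interface:

    import math

    def entropy(labels):
        """Entropy (in bits) of a list of 0/1 class labels."""
        n = len(labels)
        if n == 0:
            return 0.0
        p1 = sum(labels) / n
        if p1 == 0.0 or p1 == 1.0:
            return 0.0
        return -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

    def entropy_gain(features, labels, j):
        """Entropy gain of splitting on binary feature j.
        features: list of instances (each a list of 0/1 values); labels: list of 0/1."""
        left  = [y for x, y in zip(features, labels) if x[j] == 0]
        right = [y for x, y in zip(features, labels) if x[j] == 1]
        n = len(labels)
        remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(labels) - remainder
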
    Apply your code to the following data sets, providing the requested output for each:

    1. Synth0 X and Synth0 Y
      • Report the entropy gain for each possible top level split (i.e., of the six features, what is the entropy gain that would result from choosing each one as the root split of a decision tree?)
      • Report the full, unpruned, decision tree that your learner constructs on this data. For each leaf node report the class ratios at that node. For each decision node report the information gain relative to its parent node.
      • Report the reduced (pruned) tree that your learner constructs. Include the same decision and leaf node statistics as above.
    2. Synth1 X and Synth1 Y (both gzip compressed)
      • Report the total number of leaves and decision nodes in the full, unpruned tree that your learner constructs.
      • Report the entire pruned tree that your learner constructs, including the same decision and leaf node statistics as in the previous question.

    Early stopping rule: The simplest stopping rule for growing your tree is "is the impurity change smaller than some constant?" For this assignment, you can use the following rule to decide when to stop growing your tree:

    a = pickBestSplitAttribute()
    if (gain(a) < lambda) {
        // stop growing the tree at this node
        return new leafNode(...)
    }
    // otherwise, continue growing the tree
    ...
    where lambda is a "hyperparameter" -- i.e., a parameter that governs how the learning algorithm works. lambda is your model complexity control -- a larger value of lambda will yield smaller trees. In the framework of lecture, a larger lambda means a smaller hypothesis space. You will have to pick lambda manually -- I encourage you to play with it some and examine how it influences tree size.
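
    To make the pseudocode above concrete, here is a minimal Python sketch of the greedy grower with this stopping rule. The names (Leaf, Split, grow, LAMBDA), the class-count bookkeeping, and the data layout are illustrative assumptions that reuse the entropy_gain sketch from earlier, not a required design:

    LAMBDA = 0.01   # hyperparameter: larger values yield smaller trees

    class Leaf:
        def __init__(self, labels):
            # class counts, so class ratios can be reported at each leaf
            self.n_pos = sum(labels)
            self.n_neg = len(labels) - self.n_pos

    class Split:
        def __init__(self, feature, gain, labels, left, right):
            self.feature, self.gain = feature, gain
            self.n_pos = sum(labels)                # training class counts, kept for
            self.n_neg = len(labels) - self.n_pos   # reporting and (optional) pruning
            self.left, self.right = left, right     # children for feature == 0 / 1

    def grow(features, labels, lam=LAMBDA):         # 'lam' because lambda is a Python keyword
        n_feats = len(features[0])
        gains = [entropy_gain(features, labels, j) for j in range(n_feats)]
        j = gains.index(max(gains))                 # best split attribute
        if gains[j] < lam:                          # early stopping rule
            return Leaf(labels)
        left  = [(x, y) for x, y in zip(features, labels) if x[j] == 0]
        right = [(x, y) for x, y in zip(features, labels) if x[j] == 1]
        if not left or not right:                   # degenerate split: stop
            return Leaf(labels)
        return Split(j, gains[j], labels,
                     grow([x for x, _ in left],  [y for _, y in left],  lam),
                     grow([x for x, _ in right], [y for _, y in right], lam))

    Re-running grow with different values of lam and counting the nodes in the returned tree is an easy way to see how lambda controls tree size.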

    Note: Early stopping is not the best form of complexity control for decision trees. While it will yield small trees, empirically they do not typically generalize as well as trees that have been grown out to full depth and then trimmed ("pruned") back to a smaller size in a post-processing step. Post-pruning is (slightly) more complicated to implement, so you're not required to do it for HW1. But you may, if you want to examine the tradeoffs between the two approaches.
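
    If you do experiment with post-pruning, one common scheme is reduced-error pruning against a held-out validation set: grow the full tree, then, bottom-up, replace a subtree with a leaf whenever doing so does not hurt validation accuracy. A rough sketch, reusing the hypothetical Leaf/Split classes from the sketch above (again, an assumption-laden illustration, not a required design):

    def to_leaf(node):
        # collapse a node to a leaf that predicts its training-majority class
        leaf = Leaf([])
        leaf.n_pos, leaf.n_neg = node.n_pos, node.n_neg
        return leaf

    def predict(node, x):
        while isinstance(node, Split):
            node = node.right if x[node.feature] == 1 else node.left
        return 1 if node.n_pos >= node.n_neg else 0

    def errors(node, X, y):
        return sum(predict(node, xi) != yi for xi, yi in zip(X, y))

    def prune(node, X_val, y_val):
        """Reduced-error pruning: keep a subtree only if it beats a single
        leaf on the held-out validation data."""
        if isinstance(node, Leaf):
            return node
        left  = [(x, y) for x, y in zip(X_val, y_val) if x[node.feature] == 0]
        right = [(x, y) for x, y in zip(X_val, y_val) if x[node.feature] == 1]
        node.left  = prune(node.left,  [x for x, _ in left],  [y for _, y in left])
        node.right = prune(node.right, [x for x, _ in right], [y for _, y in right])
        collapsed = to_leaf(node)
        if errors(collapsed, X_val, y_val) <= errors(node, X_val, y_val):
            return collapsed
        return node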

    Extra credit 1: Extend your DT learner to support arbitrary discrete features and class labels. (I.e., your learner should be able to accept as input either discrete numeric features/labels -- 0, 1, ..., 53, ... -- or arbitrary strings -- "apple", "orange", "kumquat". In either case, you can assume a finite, but a priori unbounded, number of feature values/labels.) You may require an auxiliary input that denotes the type of each feature/label. You must provide test data and example output demonstrating the correctness of your code.
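
    If you attempt extra credit 1, the two pieces that change are the entropy computation (multi-class rather than binary) and the split (one branch per observed feature value rather than two). A hedged sketch of just those pieces, with illustrative names:

    from collections import Counter
    import math

    def entropy_multiclass(labels):
        """Entropy of an arbitrary discrete label list (ints or strings)."""
        n = len(labels)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_discrete(features, labels, j):
        """Entropy gain of a multi-way split on discrete feature j."""
        n = len(labels)
        groups = {}
        for x, y in zip(features, labels):
            groups.setdefault(x[j], []).append(y)
        remainder = sum((len(g) / n) * entropy_multiclass(g) for g in groups.values())
        return entropy_multiclass(labels) - remainder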

    Extra credit 2: Extend your DT learner to support real-valued features (not necessarily labels). You must support at least single-precision floating point representations. You may require an auxiliary input that denotes the type of each feature/label. You must provide test data and example output demonstrating the correctness of your code.
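
    If you attempt extra credit 2, the usual trick is to turn each real-valued feature into binary tests of the form "x[j] <= t", trying candidate thresholds at the midpoints between consecutive sorted values. A sketch of that threshold search, assuming the binary-label entropy() from the first sketch:

    def best_threshold(features, labels, j):
        """Best threshold t (by entropy gain) for the test x[j] <= t on a
        real-valued feature j. Returns (gain, t)."""
        values = sorted(set(x[j] for x in features))
        best = (0.0, None)
        n = len(labels)
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0          # candidate threshold at the midpoint
            left  = [y for x, y in zip(features, labels) if x[j] <= t]
            right = [y for x, y in zip(features, labels) if x[j] > t]
            gain = entropy(labels) - ((len(left) / n) * entropy(left)
                                      + (len(right) / n) * entropy(right))
            if gain > best[0]:
                best = (gain, t)
        return best

    The returned (gain, t) then competes with the other features' gains when the learner picks a split.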

Students enrolled in the 529 section must also do the following:
  1. Show that entropy gain is concave (i.e., anti-convex).
  2. Show that a binary, categorical decision tree, using information gain as a splitting criterion, always increases purity. That is, show that information gain is non-negative for all possible splits, and is zero only when the split leaves the class distribution unchanged in both leaves. (Standard definitions of entropy and information gain are restated after this list for reference.)
  3. Prove that the basic decision tree learning algorithm (i.e., the greedy, recursive, tree-growing algorithm from class, with no pruning), using the information gain splitting criterion, halts.
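
  For reference in these proofs, the standard definitions (stated here in LaTeX notation; check that they match the conventions used in lecture before relying on them) are

    \[ H(S) = -\sum_{k} p_k \log_2 p_k, \qquad \mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \]

  where p_k is the fraction of instances in S with class k, and S_v is the subset of S on which attribute A takes value v.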