In case you need to reach me, my home number is (505) 243-6329, office number is (505) 277-7833. Email address is vanbelle@cs.unm.edu.

Code Factoring using Genetic Programming

Here are two new graphs from the same experiment that Dave and I wrote up in "Code Factoring and the Evolution of Evolvability" (GECCO 2002).

The experiment was symbolic regression on the formula y = Asin(Ax) where A is a real constant that changes every five generations.

Evolvability is all about getting the most improvement for the least amount of effort. In the original paper I measured effort in time (generations). We can also measure it in terms of the amount of change to the code.

Fitness is measured in hits, which are the number of test points (out of 200) that are within 0.01 of the correct value.

Hit gain is the average improvement in fitness for best of population from the start of an epoch to its end.

Hit gain per node changed is the hit gain divided by the average number of nodes that have changed in the best of population genome over the current epoch.

Hit gain per percentage change is the hit gain divided by the average percentage of the tree that has changed over the current epoch.

Improving Package Structure

Java classes are divided into packages. I've studied how to improve the package structure for three real-world applications. These graphs pertain to Jikes RVM, an open-source research Java virtual machine. Since Jikes doesn't use the "package" keyword, I look at how the files are divided up into directories. Each directory is a unique "package".

I've invented two metrics to measure how good a package structure is, based on changes I derive from the CVS history of the project. Each change is a subset of the files that have been touched during an hour.

The breadth of a package structure is the average number of packages touched by a change.

The weight of a package structure is the average of the total number of files in the touched packages.

Breadth is trivially optimized by putting all files into one package, but this pessimizes weight.

Weight is trivially optimized by putting each file into its own package, but this pessimizes breadth.

Best of all would be to minimize both.

I apply two algorithms to the job of optimizing breadth and weight. In one approach, I apply a clustering to correlations between files, seeking to group highly correlated files into the same package.

I also apply a variant of Kernighan-Lin, which is a modified hillclimbing algorithm that is allowed to go into "valleys" of the fitness landscape, in hopes that it might find higher ground on the other side. Fitness is computed via a linear combination of breadth and weight.

Here are the performances of Clustering and Kernighan-Lin with various parameter settings (threshold for Clustering, coefficient for Kernighan-Lin). The red dot is the score for the current Jikes package structure.

Here I apply Kernighan-Lin, starting with the current package structure (upper-right point). The movement down and left towards (1, 1) indicates improvement in breadth and weight.