Structure & Strangeness

 

: Self-Organizing Maps (a.k.a. Kohonen maps) :

: A self-organizing map (SOM) is a kind of neural network that implements
: what is essentially a k-means clustering algorithm. In effect, the neural
: network maps the *topology* of whatever input space it's exposed to. SOMs
: are amazing tools for analyzing high-dimensional data sets by clustering.
: They've been applied to texture discrimination, feature detection/selection,
: genetic activity mapping, drug discovery, cloud classification, and natural
: language (voice recognition, etc.), among others.
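
: To make the k-means connection concrete, here is a minimal sketch of one
: sequential SOM training step in plain MATLAB (no toolbox). The 4-node chain,
: the input vector, the learning rate and the radius are all made-up values
: for illustration. If the neighborhood radius shrinks to zero, only the
: best-matching node moves, which is the online k-means update.

```matlab
% One sequential SOM training step on a hypothetical 1-D chain of 4 nodes.
% All numbers below are illustrative assumptions, not toolbox defaults.
codebook = [0 0; 1 0; 2 0; 3 0];   % node weight vectors, one row per node
grid_pos = (1:4)';                 % node positions on the map lattice
x        = [2.2 0.5];              % a single input vector
alpha    = 0.5;                    % learning rate (assumed)
radius   = 1.0;                    % neighborhood radius (assumed)

% 1. Find the best-matching unit (BMU): the node closest to x.
d2 = sum(bsxfun(@minus, codebook, x).^2, 2);
[~, bmu] = min(d2);

% 2. Gaussian neighborhood, measured on the lattice (not in data space).
h = exp(-(grid_pos - grid_pos(bmu)).^2 / (2*radius^2));

% 3. Pull every node toward x, weighted by its neighborhood value.
codebook = codebook + alpha * bsxfun(@times, h, bsxfun(@minus, x, codebook));
```

: The BMU moves the most; its lattice neighbors follow by smaller amounts,
: which is what keeps the map's topology ordered.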

: SOMs were created by Kohonen, whose laboratory has published a free
: (under a GNU license) toolbox called the SOM Toolbox for MATLAB. For
: the purpose of this tutorial(?), I'll be employing that toolbox running on
: an SGI Octane box named Ginger.

: Getting Started :

: Because self-organizing maps are rather esoteric and can be very difficult
: to penetrate for high-dimensional and/or real-world data sets, I've constructed
: a simple data set with three clusters of data points. The clusters are uniformly
: distributed about the points (0,0,0), (3,3,3) and (9,0,0) with a maximum
: deviation from the center of 1 unit along each axis. We'll label them cluster1,
: cluster2, and cluster3 and give them 10, 15 and 20 points respectively.

: MATLAB code :
: >> cluster1 = 2*(rand(10,3) - 0.5); %10 points around (0,0,0)
: >> cluster2 = 2*(rand(15,3) - 0.5) + 3; %15 points around (3,3,3)
: >> cluster3 = 2*(rand(20,3) - 0.5); cluster3(:,1) = cluster3(:,1) + 9; %20 points around (9,0,0)
: >> data = [cluster1; cluster2; cluster3]; %join cluster data

: >> point_labels = cell(size(data,1),1); %one label per input vector
: >> for i=1:size(data,1) %generate and store input vector indices
: >> point_labels{i} = num2str(i);
: >> end
: >> sD = som_data_struct(data,'name','Data','comp_names',...
: >> {'x','y','z'},'labels',point_labels); %generate som data struct

: >> plot3(sD.data(:,1),sD.data(:,2),sD.data(:,3),'+r') %plot data
: >> view(3), axis tight, view(-46,28)


som_data3dplot.gif

: Initializing and Training the SOM :

: To initially train the SOM, I use mostly the default settings. If you're
: interested in playing with the toolbox, I strongly suggest you read through
: the help files, as they're excellent documentation. In this example,
: because the dimensionality of the input space (3) is larger than the
: dimensionality of the SOM (2, it's just a flat sheet), the map will try to
: balance the competing errors in how well it maps the data points vs. how
: well it maps the topology (imagine trying to bend a sheet of paper to
: fill the interior of an empty cube).

: I'll let the som_make() function determine the size of the map
: (it does this by calculating the two largest eigenvalues of the data set
: (sD) and using their ratio to set the ratio of the map's side lengths).
: If the data range were particularly skewed in one dimension (those values
: were much larger than the other values), we would need to normalize the
: data to prevent that component from dominating the map topology.
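
: As a rough illustration of why a skewed component matters, the sketch below
: compares the two largest covariance eigenvalues before and after scaling
: each component to zero mean and unit variance. It is plain MATLAB on a
: made-up 4x3 matrix D (not the tutorial's data), and the z-scoring shown is
: just one common normalization, not necessarily what the toolbox applies.

```matlab
% Hypothetical data: the third component has a much larger range.
D = [0.1 0.4 500;
     0.9 0.2 100;
     0.5 0.8 300;
     0.3 0.6 900];

% Variance normalization (z-score) of each component.
mu    = mean(D, 1);                 % per-component mean
sigma = std(D, 0, 1);               % per-component standard deviation
Dn    = bsxfun(@rdivide, bsxfun(@minus, D, mu), sigma);

% Eigenvalues of the covariance matrix, largest first.
lambda_raw  = sort(eig(cov(D)),  'descend');
lambda_norm = sort(eig(cov(Dn)), 'descend');

% Ratio of the two largest eigenvalues, before and after normalization.
ratio_raw  = lambda_raw(1)  / lambda_raw(2);
ratio_norm = lambda_norm(1) / lambda_norm(2);
disp([ratio_raw ratio_norm])
```

: Before normalization the third component accounts for nearly all of the
: variance, so the eigenvalue ratio is enormous; afterwards the components
: contribute on an equal footing.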

: >> sMap = som_make(sD); %size, initialize and train the map with default settings

: The map trains itself pretty quickly because the smart (and altruistic)
: guys at CIS programmed a batch training method. I could also have used
: a sequential training method, but the batch seems better all-round.
: Because the data was from 3d space, I can visualize the map by simply
: plotting it in the same 3d space I plotted the clusters above.

: >> colormap(gray);
: >> som_grid(sMap,'Coord',sMap.codebook,...
: >> 'Markersize',2,'Linecolor','k','Surf',sMap.codebook(:,3))
: >> axis tight, view(-46,28)


som_map3dplot.gif

: It's a little hard to see with this graphic, but the map distributes nodes to
: clusters in proportion to the share of the data contained within each
: cluster, i.e. cluster 1, which has the fewest data points, receives
: the fewest map nodes.
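
: For the curious, the batch training mentioned above can be sketched in a
: few lines of plain MATLAB. This is a generic batch SOM update on made-up
: toy data (a 3-node chain and four 2-d points), not the toolbox's own code;
: the radius is an arbitrary choice.

```matlab
% One batch-training epoch on hypothetical toy data.
D      = [0 0; 0.2 0.1; 3 3; 3.1 2.9];   % data vectors, one per row
M      = [0 0.5; 1 1; 3 2];              % current codebook, 3 nodes
pos    = (1:3)';                         % node positions on a 1-D lattice
radius = 0.5;                            % neighborhood radius (assumed)

n = size(D, 1);
m = size(M, 1);

% 1. Find the best-matching unit (BMU) of every data vector at once.
dist2 = zeros(n, m);
for j = 1:m
    dist2(:, j) = sum(bsxfun(@minus, D, M(j, :)).^2, 2);
end
[~, bmu] = min(dist2, [], 2);

% 2. Neighborhood weight between node j and the BMU of sample i (m x n).
H = exp(-bsxfun(@minus, pos, pos(bmu)').^2 / (2*radius^2));

% 3. New codebook: neighborhood-weighted mean of all data vectors.
M_new = bsxfun(@rdivide, H * D, sum(H, 2));
```

: Unlike the sequential method, the whole data set is presented before any
: node moves, so an epoch is a handful of matrix operations rather than a
: loop over samples.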

: Analysis and Visualization :

: Now the real power of SOMs comes into play.
: With our toy data set, it's easy to see the clustering without any fancy tools,
: but imagine trying to visualize the clustering of a 4-dimensional data set, or a
: 77-dimensional data set! (I've done that; it's hard.) Regardless, this is where
: the big guns come out to play.

: The basic analysis tool available is the so-called U-matrix. I've plotted it
: below, along with a component map for each dimension of our data set.

: >> colormap(gray)
: >> som_show(sMap,'umat','all','comp',[1:size(sD.data,2)],'norm','d');


som_uMatrix.gif

: Understanding the U-matrix and Component Maps :

: The SOM Toolbox graphics are truly very informative. Each of the above
: plots displays the Euclidean distance between neighboring map nodes, where
: dark colors indicate smaller distances (clustering), while lighter colors indicate
: empty space. You can already see that there are three dark spots on the
: map corresponding to our three clusters.
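
: To show what "distance between neighboring map nodes" means in practice,
: here is a simplified sketch in plain MATLAB. It computes, for each node of
: a made-up 2x3 rectangular map, the mean Euclidean distance to its lattice
: neighbors. The toolbox's own U-matrix may differ in lattice and layout, so
: treat this as an illustration of the idea only.

```matlab
% Hypothetical 2x3 map with 2-d weight vectors stored as rows x cols x dim.
rows = 2; cols = 3;
W = zeros(rows, cols, 2);
W(:, :, 1) = [0 0.1 5; 0 0.1 5];        % first component of each node
W(:, :, 2) = [0 0 0; 0.1 0.1 0.1];      % second component of each node

U = zeros(rows, cols);                  % per-node mean neighbor distance
for r = 1:rows
    for c = 1:cols
        % Candidate 4-neighbors on the lattice; drop those off the map.
        nb = [r-1 c; r+1 c; r c-1; r c+1];
        ok = nb(:,1) >= 1 & nb(:,1) <= rows & nb(:,2) >= 1 & nb(:,2) <= cols;
        nb = nb(ok, :);
        d  = zeros(size(nb, 1), 1);
        for k = 1:size(nb, 1)
            diff_vec = W(r, c, :) - W(nb(k,1), nb(k,2), :);
            d(k) = sqrt(sum(diff_vec(:).^2));
        end
        U(r, c) = mean(d);
    end
end
disp(U)
```

: The left column of nodes sits in a tight cluster (small values, dark in a
: gray colormap), while the nodes bordering the gap to the right-hand column
: get large values (light).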

: Additionally, we can begin to pick out more information about the clusters.
: The x-map's dark area indicates that those nodes are pretty close to each
: other, i.e. there's a distinct cluster in the x-dimension (cluster 3). It also
: indicates that the opposite side of the map (light area) is about 9 units away
: (cluster 3), while the average node is 4.5 units away from other nodes.
: 4.5 units is roughly how far cluster 1 is from cluster 2, and how far cluster
: 2 is from cluster 3 - but we don't know that yet! The y- and z-maps have two
: clusters (dark spots) both of which are about 3.3 units away from the other
: side of the map.

: So far, we can say concretely that there are two clusters evident in the y- and
: z-dimensions. There is definitely at least one well-defined cluster in the
: x-dimension as well, but further analysis will reveal how this all adds up.

: More Self-Organizing Maps! (it gets better)

 


© Aaron Clauset

updated 7.17.01