March 29, 2010
The trouble with community detection
Attention conservation notice: this is a posting about a talk I'm giving tomorrow at Dalhousie University in Nova Scotia.
For most of this week, I'll be visiting the math department of Dalhousie University in Halifax Nova Scotia, as a speaker in the Modelling and Mining of Network Information Spaces seminar series and a guest of Jeannette Janssen.
For my part, I'm giving a talk (see below) on the results of my summer student Ben Good's project on the difficulties of identifying dense "communities" (or "modules", or "compartments") in networks using topological information alone. I'm pleased to say that this paper was recently accepted at Physical Review E. 
The problem of detecting communities in networks has received an enormous amount of attention (more than it deserves, in my opinion), and there are now literally dozens of reasonable-sounding ways to find the "clusters" in networks. To give you a sense of just how much attention, the first few papers in the field have received hundreds of citations and a few have even received thousands. And yet, I'm increasingly skeptical that all this effort has produced much of lasting value. On the up side, it's produced lots of clever methodological tricks and insights, and certainly I've enjoyed chewing on these problems myself . But, I'm increasingly pessimistic about the goal of automatically extracting meaningful "clusters" from interaction data alone. In short, I don't believe there is a universally useful definition of a network cluster and I'm skeptical that any of the community detection methods currently available actually produce results that can be trusted.
Current methods do okay on trivial test cases of various kinds, but they all have methodological problems (some of them quite severe) that make it difficult to unambiguously interpret the scientific significance of their output. And, every method makes assumptions that are almost surely highly unrealistic for almost any system you might care to think about. On the other hand, some standard data analysis methods have similar problems (e.g., hierarchical clustering algorithms for spatial data) but still manage to be useful. I think this is partly because we understand pretty well how these methods fail, and thus how their output should be interpreted and under what conditions they can be expected to perform unambiguously. I don't think we're there yet with network clustering methods, but perhaps one day we'll get there.
If you're in the Halifax area and are interested in the talk, here are the details:
Date: Tuesday March 30, 2010 at 2:30 p.m.
Location: Jacob Slonim Conference Room (430), 6050 University Ave., Halifax
Coffee and cookies will be provided, courtesy of Faculty of Computer Science.
The trouble with community detection
Although widely used in practice, the performance of the popular network clustering technique called "modularity maximization" is not well understood when applied to networks with unknown modular structure. In this talk, I'll show that precisely in the case we want it to perform the best--that is, on modular networks--the modularity function Q exhibits extreme degeneracies, in which the global maximum is hidden among an exponential number of high-modularity solutions. Further, these degenerate solutions can be structurally very dissimilar, suggesting that any particular high-modularity partition, or statistical summary of its structure, should not be taken as representative of the other degenerate solutions. These results partly explain why so many heuristics do well at finding high-modularity partitions and why different heuristics can disagree on the modular composition the same network. I'll conclude with some forward-looking thoughts about the general problem of identifying network modules from connectivity data alone, and the likelihood of circumventing this degeneracy problem.
Update 31 March 2010: For those of you interested in reproducing our results or applying our methods to your own networks, Ben has placed implementations online here for his simulated annealing code for sampling the local optima of the modularity function and his code for taking those sampled optima and reconstructing the 3D visualization of the modularity landscape.
Update 15 April 2010: Updated the journal ref.
 B. H. Good, Y.-A. de Montjoye and A. Clauset. " The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).
 My most cited paper, by far, is my first paper on detecting communities by maximizing modularity using a greedy agglomerative algorithm.