« Workshop: Statistical modelling and inference for networks | Main | As a real doctor... »

## March 29, 2010

### The trouble with community detection

*Attention conservation notice: this is a posting about a talk I'm giving tomorrow at Dalhousie University in Nova Scotia.*

For most of this week, I'll be visiting the math department of Dalhousie University in Halifax Nova Scotia, as a speaker in the Modelling and Mining of Network Information Spaces seminar series and a guest of Jeannette Janssen.

For my part, I'm giving a talk (see below) on the results of my summer student Ben Good's project on the difficulties of identifying dense "communities" (or "modules", or "compartments") in networks using topological information alone. I'm pleased to say that this paper was recently accepted at *Physical Review E*. [1]

The problem of detecting communities in networks has received an enormous amount of attention (more than it deserves, in my opinion), and there are now literally dozens of reasonable-sounding ways to find the "clusters" in networks. To give you a sense of just how much attention, the first few papers in the field have received hundreds of citations and a few have even received thousands. And yet, I'm increasingly skeptical that all this effort has produced much of lasting value. On the up side, it's produced lots of clever methodological tricks and insights, and certainly I've enjoyed chewing on these problems myself [2]. But, I'm increasingly pessimistic about the goal of automatically extracting meaningful "clusters" from interaction data alone. In short, I don't believe there is a *universally useful* definition of a network cluster and I'm skeptical that *any* of the community detection methods currently available actually produce results that can be trusted.

Current methods do okay on trivial test cases of various kinds, but they all have methodological problems (some of them quite severe) that make it difficult to unambiguously interpret the scientific significance of their output. And, every method makes assumptions that are almost surely highly unrealistic for almost any system you might care to think about. On the other hand, some standard data analysis methods have similar problems (e.g., hierarchical clustering algorithms for spatial data) but still manage to be useful. I think this is partly because we understand pretty well how these methods fail, and thus how their output should be interpreted and under what conditions they can be expected to perform unambiguously. I don't think we're there yet with network clustering methods, but perhaps one day we'll get there.

If you're in the Halifax area and are interested in the talk, here are the details:

Date: Tuesday March 30, 2010 at 2:30 p.m.

Location: Jacob Slonim Conference Room (430), 6050 University Ave., Halifax

Coffee and cookies will be provided, courtesy of Faculty of Computer Science.

The trouble with community detection

Although widely used in practice, the performance of the popular network clustering technique called "modularity maximization" is not well understood when applied to networks with unknown modular structure. In this talk, I'll show that precisely in the case we want it to perform the best--that is, on modular networks--the modularity function *Q* exhibits extreme degeneracies, in which the global maximum is hidden among an exponential number of high-modularity solutions. Further, these degenerate solutions can be structurally very dissimilar, suggesting that any particular high-modularity partition, or statistical summary of its structure, should not be taken as representative of the other degenerate solutions. These results partly explain why so many heuristics do well at finding high-modularity partitions and why different heuristics can disagree on the modular composition the same network. I'll conclude with some forward-looking thoughts about the general problem of identifying network modules from connectivity data alone, and the likelihood of circumventing this degeneracy problem.

**Update 31 March 2010**: For those of you interested in reproducing our results or applying our methods to your own networks, Ben has placed implementations online here for his simulated annealing code for sampling the local optima of the modularity function and his code for taking those sampled optima and reconstructing the 3D visualization of the modularity landscape.

**Update 15 April 2010**: Updated the journal ref.

-----

[1] B. H. Good, Y.-A. de Montjoye and A. Clauset. " The performance of modularity maximization in practical contexts." *Physical Review E* 81, 046106 (2010).

[2] My most cited paper, by far, is my first paper on detecting communities by maximizing modularity using a greedy agglomerative algorithm.

posted March 29, 2010 08:24 AM in Self Referential | permalink

## Comments

I think it's best to simultaneously be pessimistic and see what one is able to do anyway, especially when the problem is interesting to study. (And I definitely agree that it's crucial to understand to the extent possible when and how methods fail!) Ultimately, what makes me tick scientifically is that the problem is interesting to study, and it's true that I have some internal hope that things might eventually be genuinely useful as well. I think the greatest hope for these methods is to narrow down possibilities so that, e.g., maybe the experimentalists can look at 50 things carefully instead of 5000. (I took the specific numbers out of my ass, but I think it is possible in principle to use these methods, or more advanced versions thereof, for such narrowing down.) But at the end of the day my scientific interest in this business is mathematical first and everything else second. (And, granted, that is a luxury that mathematicians have more than most other scientists.)

As for something getting more attention "than it deserves", surely this is a scientific principle that has a name? People often one-by-one (or at least few-by-few) will often first get excited about something (power laws or community detection or whatever) and will often later observe saturation and then one-by-one (or few-by-few) feel that something has gotten too much attention---for various reasons, including that of failed promise (though I am not as pessimistic as you are here... there is a lot of work to do, though...).

Anyway, I'll see you in Maryland, and we can talk about variation of information and other things.

Posted by: Mason Porter at March 29, 2010 04:36 PM

From a mathematical perspective, I definitely think the community detection problem is interesting and a lot of fun. I guess the thing I'm pessimistic about is whether all the mathematical fun we've had with this problem is helping anyone else actually answer scientific questions. It's standard to claim so in the introduction of clustering papers (and I've written my fair share of such introductions), but lately I've really begun to wonder just how realistic a claim that really is. Maybe you're right that some genuine usefulness is there...

Personally, I see a lot of parallels with spatial clustering methods. Have you seen Kleinberg's paper titled "An Impossibility Theorem for Clustering"? It's a real gem, and I suspect there's an analogous impossibility theorem for network clustering.

I don't get to Maryland until Tuesday afternoon, so I'm afraid I'll miss your talk! It's a shame, because I really like that paper.

Posted by: Aaron at March 30, 2010 08:08 AM

I have seen that paper but haven't read it in detail. I agree with you that there ought to be a similar result for network clustering.

We can always talk separately about the follow-ups (both theoretical and to applications) that we're now doing. I don't think I'll mention those in the talk anyway.

I think there is some genuine usefulness, but the question for me is if the promise will ever be fulfilled. I agree with you about the introductions, by the way. I've been trying to tone that down in recent papers...

Posted by: Mason Porter at March 30, 2010 01:23 PM

i saw this http://realitytv.about.com/library/pages/bl-aaron.htm

If average joe publishes a paper in Science and Nature in single year, i wonder which set you are using to calculate the average.

Posted by: none at April 1, 2010 03:04 PM