
January 31, 2007

My kingdom for a good null-model

The past few days, I've been digging into the literature on extreme value theory, which is a rather nice branch of probability theory that describes how the largest (or smallest) value in a sample is distributed. This exercise has been mostly driven by a desire to understand how it connects to my own research on power-law distributions (I'm reluctant to admit that I'm actually working on a lengthy review article on the topic, partially in an attempt to clear up what seems to be substantial confusion over both their significance and how to go about measuring them in real data). But that's a topic for a future post. What I really want to mention is an excellent example of good statistical reasoning in experimental high energy physics (hep), as related by Prof. John Conway over at CosmicVariance. Conway is working on the CDF experiment (at Fermilab), and his story kicks off with the appropriate quip "Was it real?" The central question Conway faces is whether or not a deviation / fluctuation in his measurements is significant. If it is, then it's evidence for the existence of a particular particle called the Higgs boson - a long sought-after component of the Standard Model of particle physics. If not, then it's back to searching for the Higgs. What I liked most about Conway's post is the way the claims of significance - the bump is real - are carefully vetted against both theoretical expectations of random fluctuations and a desire to not over-hype the potential for discovery.

In the world of networks (and power laws), the question of "Is it real?" is one that I wish were asked more often. When looking at complex network structure, we often want to know whether a pattern or the value of some statistical measure could have been caused by chance. Crucially, though, our ability to answer this question depends on our model of chance itself. This point is identical to the one that Conway faces; for hep experiments, however, the error models are substantially more precise than what we have for complex networks. Historically, network theorists have used either the Erdős-Rényi random graph or the configuration model (see cond-mat/0202208) as the model of chance. Unfortunately, neither of these looks anything like a real-world network, and thus both probably provide terrible over-estimates of the significance of any particular network pattern. As a modest proposal, I suggest that hierarchical random graphs (HRGs) could serve as a more robust null-model, since they can capture a wide range of the heterogeneity that we observe in the real world, e.g., community structure, skewed degree distribution, high clustering coefficient, etc. The real problem, of course, is that a good null-model depends heavily on what kind of question is being asked. In the hep experiment, we know enough about what the results would look like without the Higgs that, if it does exist, we'd expect to see large (i.e., statistically large) fluctuations at a specific location in the distribution.
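To make the dependence on the model of chance concrete, here is a minimal sketch in plain Python of the usual recipe: measure a statistic on the observed network, re-measure it on many graphs drawn from a null-model, and report how often the null matches or exceeds the observation. Everything in it is an illustrative assumption rather than anything from Conway's analysis or the cited paper: the toy network (two 5-cliques joined by one edge), the choice of transitivity as the statistic, the function names, and the use of degree-preserving edge swaps as a stand-in for the configuration model. HRGs are not implemented here.

```python
import random


def transitivity(n, edges):
    """Fraction of connected triples (paths of length 2) that close into triangles."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    closed = triples = 0
    for u in range(n):
        nbrs = sorted(adj[u])
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for i in range(k):
            for j in range(i + 1, k):
                if nbrs[j] in adj[nbrs[i]]:
                    closed += 1  # each triangle is counted once per center node
    return closed / triples if triples else 0.0


def erdos_renyi_gnm(n, m, rng):
    """Null-model 1: uniform random simple graph with n nodes and m edges."""
    all_pairs = [(u, v) for u in range(n) for v in range(u + 1, n)]
    return rng.sample(all_pairs, m)


def degree_preserving_rewire(edges, rng, swaps_per_edge=10):
    """Null-model 2 (assumed stand-in for the configuration model):
    randomize by double-edge swaps, keeping every node's degree fixed."""
    edges = [tuple(sorted(e)) for e in edges]
    present = set(edges)
    for _ in range(swaps_per_edge * len(edges)):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick one of the two possible rewirings at random
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if e1 in present or e2 in present:
            continue  # swap would create a multi-edge
        present.difference_update((edges[i], edges[j]))
        present.update((e1, e2))
        edges[i], edges[j] = e1, e2
    return edges


def empirical_p_value(observed, null_values):
    """One-sided p-value; the +1 correction keeps it from being exactly zero."""
    at_least = sum(1 for x in null_values if x >= observed)
    return (at_least + 1) / (len(null_values) + 1)


# Toy "observed" network (hypothetical): two 5-cliques joined by a single edge.
n = 10
toy = [(u, v) for u in range(5) for v in range(u + 1, 5)]
toy += [(u, v) for u in range(5, 10) for v in range(u + 1, 10)]
toy.append((4, 5))

rng = random.Random(2007)  # fixed seed so the sketch is repeatable
observed = transitivity(n, toy)
null_er = [transitivity(n, erdos_renyi_gnm(n, len(toy), rng)) for _ in range(200)]
null_deg = [transitivity(n, degree_preserving_rewire(toy, rng)) for _ in range(200)]
p_er = empirical_p_value(observed, null_er)
p_deg = empirical_p_value(observed, null_deg)
print(f"observed transitivity = {observed:.3f}")
print(f"p-value vs. G(n,m) null: {p_er:.3f}")
print(f"p-value vs. degree-preserving null: {p_deg:.3f}")
```

The only thing that changes between the two p-values is the null-model, which is the point: a pattern that looks wildly significant against one model of chance may look unremarkable against another that already accounts for, say, community structure.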

Looking forward, the general problem of coming up with good null-models of network structure, against which we can reasonably benchmark our measurements and their deviations from our theoretical expectations, is hugely important, and I'm sure it will become increasingly so as we delve more deeply into the behavior of dynamical processes that run on top of a network (e.g., metabolism or signaling). For instance, what would a reasonable random-graph model of a signaling network look like? And how can we know if the behavior of a real-world signaling network is within statistical fluctuations of its normal behavior? How can we tell whether two metabolic networks are significantly different from each other, or whether their topology is identical up to a small amount of noise? Put another way, how can we tell when a metabolic network has made a significant shift in its behavior or structure as a result of natural selection? One could even phrase the question of "What is a species?" as a question of whether the difference between two organisms is within statistical fluctuations of a canonical member of the species.

posted January 31, 2007 12:25 AM in Scientifically Speaking | permalink