« ISI Workshop: Theoretical Aspects and Models of Large Complex and Open Information Networks | Main | Just in time for the holidays »

December 02, 2007

ISI Workshop: the summary

The talks at the Institute for Scientific Interchange's (ISI) workshop spanned quite a few areas of network research. One of the more popular themes (four talks) related to tag-networks, using data from places like Flickr, CiteULike and Bibsonomy, with much of the work presented being funded by an EU project called Tagora. The most interesting talk of this group was by Ciro Cattuto (abstract) on the semantics of tags. The thing I liked about this work was the use of known semantic relationships (constructed laboriously by human linguists) to infer the semantic relationships of the tags. This combination allowed them to constructed a system that seems to be able to identify a reasonable (to a human) set of synonyms to the tags a user gives a resource. Given the difficulty of algorithmically defining such semantic relations, this is a nice result.

Another good talk was by Filippo Menczer (abstract) on the behavior of web surfers as it relates to the simple model of behavior that PageRank (the basic algorithm that most modern search engines use) assumes. This idea is interesting on one level because you might imagine that you want search engine results to be based on an accurate model of user behavior. That is, you might want the search engine to give results equivalent to polling the world for its collective opinion about which page is most relevant to your query. But, as we know (from years of human-computer-interaction research), being realistic may not be the best way to build the most useable tool. In short, it may be the case that the "best" kind of search engine is one based on an inaccurate model of surfer behavior (obviously, the key here is knowing what "good" search results are).

Anyway, with that caveat in mind, Fil showed that several assumptions of the PageRank (or, more correctly, eigenvector centrality) are qualitatively inaccurate models of real user behavior. The first, that all edges have equal weight, is violated as users follow some links much more preferentially than others. Fil showed us the distribution of weights and claimed it was a power-law distribution. The fact that he used a logarithmic binning scheme makes this conclusion less believable, since a judicious choice of bins can make non-power-law distributions look more power-law-like (fortunately, there are better methods available). Similarly, the probability of a surfer "teleporting" to another page is typically assumed to be uniform among pages, but Fil showed us that some pages are much more likely to be "start" pages for a browsing session (the ones you would expect: Google, Yahoo, etc.). In a sense, these facts are unsurprising -- assuming that humans do anything uniformly at random almost always proves to be wrong -- but probably worth pointing out. I wonder if a version of eigenvector centrality, corrected for these differences in real user behavior, would be more useful as a basis for search engine results...

The third talk I really enjoyed was by Luca Dall'Asta (abstract) on his recent work generalizing some of my own work with Cris Moore on the bias of traceroute to multi-source studies. That is, it's now been known for a couple of years that when you sample a network's topology using short-path probes from a single source, you get a very biased view of the topology (so biased that the degree distribution of the real network may be concentrated around a single value, while the distribution of the sampled network has no characteristic value, i.e., looks like a power-law). A couple of years ago, Cris and I showed, using numerical studies, that the marginal utility of additional sources can be very low, and that almost all nodes need to be sources in the study before the observed network accurately represents the real one. Luca generalized our differential equations approach to multiple sources, and showed some very nice results for Erdos-Renyi random graphs (where every edge exists independently, but with the same probability). We talked briefly afterward about his preliminary results for random graphs with power-law degree distributions, and I'm hopeful that his approach can analytically explain the character of our numerical results.

The point of this kind of work is that most empirical data we have on network topology are actually derived from some kind of sampling scheme. That is, only in a few cases do we know the full network explicitly (e.g., the airport network). For the Internet (both the IP and the BGP), samples of the topology are derived from sampling paths through the network, and thus probably exhibit the kind of bias that Luca talked about. The ultimate goal here is to understand the bias well enough that we can invert the problem, converting a biased sample to an unbiased estimate of the underlying topology. This is a hard problem though, made worse by the fact that the sampling bias seems to map different underlying topologies to similar observed topologies. The importance of the Internet's topology may seem a bit esoteric, but similar kinds of biases exist in our samples of biological networks. Given the amount of interest in understanding the "systems-level" structure of things like the protein-interaction network (important if, say, you want to design drugs with few side-effects), you might imagine that there's a lot of work being done to understand (and invert) the sampling biases present there. But, sadly, you'd be wrong. On the other hand, the scale of the problem for the protein-interaction network is enormous: current estimates of our accuracy here suggest we only have 10% of the topology correct.

posted December 2, 2007 12:09 PM in Conferences and Workshops | permalink


"PageRank (the basic algorithm that most modern search engines use)"

This statement raised an eyebrow. In the spirit of bringing together two blogs, I invite you to visit Daniel Lemire's recent thread on why this claim is highly suspect.

Posted by: Fernando Diaz at December 2, 2007 07:59 PM