
December 28, 2007

2007: a year in review

As the last days of 2007 dwindle, here's a look back on my year, by the numbers.

Papers published on which I was first author or a major contributor: 3
Papers currently under review: 4
Manuscripts near completion: 1
Software packages released: 2
Half-baked projects unlikely to be completed: 5
Projects in-the-works: 3
Manuscripts reviewed for various journals / conferences: >10
Papers read: >120
Research talks given: 15
Invited talks: 7
Conferences / workshops / summer schools attended: 15
Number of those at which I delivered a research talk: 11
Students advised: 2

Conferences / workshops organized: 1
Grants awarded: 1
Grants pending: 2

Emails sent: >2588
Emails received (non-spam): >6104 (these numbers are still depressingly high)
Number of those about work-related topics: >4293
Plane flights: 44 (woe is my carbon footprint)

Laptops stolen: 2 (one was my replacement; stolen off the back of a FedEx truck)
Blog entries written: 80 (this number is only marginally less depressing than last year's)
Movies via Netflix: 46
Pictures posted on Flickr: 169
Major life / career decisions: 0
Fun trips with friends / family: >15
Foreign countries visited: 2 (Italy and South Korea)
Weddings attended: 1
Yards landscaped: 2

And, I met Oprah.

posted December 28, 2007 10:33 AM in Self Referential | permalink | Comments (0)

December 19, 2007

Just in time for the holidays

There's excellent coverage elsewhere of the recent trashing of science funding by Congress and the White House, for instance, at Cosmic Variance, Science Magazine (with gruesome details about who didn't get what), and the Computer Research Policy Blog (with details on all the devious tricks pulled to make the funding changes look less terrible than they actually are).

The past few years have seen support for basic research decline quite a bit at the federal level (for instance, the DoD basically eliminated all of its basic research funding, which forced many of its previous researchers to turn instead to the already overburdened NSF for funds). The funding decrease this year would have produced only a lot of grumbling and complaining, and probably a lot of upset letters, had it not been for the promises by Congress and the White House of funding increases this year (via the America COMPETES initiative; whitehouse.gov link).

In other, happier news, I just had my first paper accepted at Nature. The review process was considerably more painful than I expected. The final product is certainly improved over the initial one, on account of changes we made to address some of the questions raised by the reviewers (along with changes based on feedback from the many talks I've given on the topic over the past year). It's comforting to know that good, interdisciplinary work can actually get published in a vanity journal like Nature. That being said, I really prefer to write papers in which I can actually discuss the technical details in the main body of the paper, rather than hiding them all in appendices that no one actually reads. I also like writing papers longer than 15 paragraphs.

With that, happy holidays to you all. I head east in the morning for the usual family festivities, which will induce yet another hiatus from blogging for me. I'll return to beautiful New Mexico just in time to do my usual year-end blog wrap up here.

posted December 19, 2007 09:05 PM in Rant | permalink | Comments (3)

December 02, 2007

ISI Workshop: the summary

The talks at the Institute for Scientific Interchange's (ISI) workshop spanned quite a few areas of network research. One of the more popular themes (four talks) related to tag networks, using data from places like Flickr, CiteULike and Bibsonomy, with much of the work presented being funded by an EU project called Tagora. The most interesting talk of this group was by Ciro Cattuto (abstract) on the semantics of tags. The thing I liked about this work was the use of known semantic relationships (constructed laboriously by human linguists) to infer the semantic relationships among the tags. This combination allowed them to construct a system that seems able to identify a reasonable (to a human) set of synonyms for the tags a user gives a resource. Given the difficulty of algorithmically defining such semantic relations, this is a nice result.
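
For a flavor of the data-driven half of such a system, here's a minimal sketch (all tags and annotations below are hypothetical, and this is my guess at one ingredient, not Ciro's actual pipeline): rank candidate related tags by how often they co-occur on the same resources. The real system, as I understood it, goes further by checking such candidates against human-curated semantic relations.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: each resource is annotated with a set of tags,
# as on Flickr or CiteULike.
annotations = [
    {"networks", "graphs", "complexity"},
    {"networks", "graphs", "visualization"},
    {"networks", "complexity", "dynamics"},
    {"photography", "visualization"},
]

# Count how often each tag appears, and how often each pair of tags
# co-occurs on the same resource.
counts = defaultdict(int)
cooccur = defaultdict(int)
for tags in annotations:
    for t in tags:
        counts[t] += 1
    for a in tags:
        for b in tags:
            if a < b:
                cooccur[(a, b)] += 1

def relatedness(a, b):
    """Cosine similarity of two tags based on co-occurrence counts."""
    key = (a, b) if a < b else (b, a)
    return cooccur[key] / sqrt(counts[a] * counts[b])

# Rank candidate relatives of "networks"; a real system would then
# filter these candidates against a curated semantic lexicon.
candidates = sorted(
    (t for t in counts if t != "networks"),
    key=lambda t: relatedness("networks", t),
    reverse=True,
)
print(candidates[:3])
```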

Another good talk was by Filippo Menczer (abstract) on how the behavior of real web surfers relates to the simple model of behavior that PageRank (the basic algorithm behind most modern search engines) assumes. This idea is interesting on one level because you might imagine that you want search engine results to be based on an accurate model of user behavior. That is, you might want the search engine to give results equivalent to polling the world for its collective opinion about which page is most relevant to your query. But, as we know (from years of human-computer interaction research), being realistic may not be the best way to build the most usable tool. In short, it may be the case that the "best" kind of search engine is one based on an inaccurate model of surfer behavior (obviously, the key here is knowing what "good" search results are).
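
For concreteness, here's a minimal sketch of the random-surfer model that PageRank assumes (the tiny web graph and damping value are illustrative, not from Fil's talk): with probability d the surfer follows a uniformly random out-link; otherwise she "teleports" to a uniformly random page.

```python
def pagerank(links, d=0.85, iters=100):
    """Power-iteration sketch of the random-surfer model behind PageRank.
    Note the two behavioral assumptions: out-links are chosen uniformly
    at random, and teleportation lands on a uniformly random page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}    # uniform teleportation
        for p in pages:
            out = links[p]
            if out:
                share = d * rank[p] / len(out)      # uniform link choice
                for q in out:
                    new[q] += share
            else:  # dangling page: redistribute its mass uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Toy web graph (illustrative only).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(links))
```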

Anyway, with that caveat in mind, Fil showed that several assumptions of PageRank (or, more correctly, eigenvector centrality) are qualitatively inaccurate models of real user behavior. The first, that all edges have equal weight, is violated because users follow some links much more often than others. Fil showed us the distribution of weights and claimed it was a power-law distribution. The fact that he used a logarithmic binning scheme makes this conclusion less believable, since a judicious choice of bins can make non-power-law distributions look more power-law-like (fortunately, there are better methods available). Similarly, the probability of a surfer "teleporting" to another page is typically assumed to be uniform across pages, but Fil showed us that some pages are much more likely than others to be "start" pages for a browsing session (the ones you would expect: Google, Yahoo, etc.). In a sense, these facts are unsurprising -- assuming that humans do anything uniformly at random almost always proves to be wrong -- but they are probably worth pointing out. I wonder if a version of eigenvector centrality, corrected for these differences in real user behavior, would be more useful as a basis for search engine results...
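
On the "better methods" point: a minimal, binning-free sketch, assuming a lower cutoff xmin has already been chosen, is the maximum-likelihood estimate of the exponent. The data below are synthetic, generated purely to show that the estimator recovers a known exponent without any choice of bins.

```python
import math
import random

def mle_alpha(data, xmin):
    """Continuous maximum-likelihood estimate of a power-law exponent:
    alpha = 1 + n / sum(ln(x_i / xmin)), over the tail x >= xmin.
    Unlike a least-squares fit to (log-)binned histograms, the result
    does not depend on an arbitrary binning scheme."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Synthetic check: draw from a known power law (alpha = 2.5) by
# inverse-transform sampling, then recover the exponent.
random.seed(0)
alpha_true, xmin = 2.5, 1.0
sample = [xmin * (1.0 - random.random()) ** (-1.0 / (alpha_true - 1.0))
          for _ in range(100000)]
print(mle_alpha(sample, xmin))  # prints something close to 2.5
```

And as for the corrected eigenvector centrality I'm wondering about: in the PageRank sketch above, it would presumably amount to weighting each out-link by its empirical click frequency and replacing the uniform teleportation term with the empirical distribution of start pages.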

The third talk I really enjoyed was by Luca Dall'Asta (abstract) on his recent work generalizing to multi-source studies some of my own work with Cris Moore on the bias of traceroute sampling. That is, it's now been known for a couple of years that when you sample a network's topology using shortest-path probes from a single source, you get a very biased view of the topology (so biased that the degree distribution of the real network may be concentrated around a single value, while the distribution of the sampled network has no characteristic value, i.e., looks like a power law). A couple of years ago, Cris and I showed, using numerical studies, that the marginal utility of additional sources can be very low, and that almost all nodes need to be sources in the study before the observed network accurately represents the real one. Luca generalized our differential-equations approach to multiple sources, and showed some very nice results for Erdos-Renyi random graphs (where every edge exists independently, each with the same probability). We talked briefly afterward about his preliminary results for random graphs with power-law degree distributions, and I'm hopeful that his approach can analytically explain the character of our numerical results.
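
If you want to see the single-source bias for yourself, here's a toy numerical version of the experiment (it assumes networkx, and the graph size and density are arbitrary choices): collect shortest paths from one source in an Erdos-Renyi graph, whose true degrees are tightly concentrated, and compare the observed degree distribution with the true one.

```python
import networkx as nx
from statistics import mean

# Toy single-source sampling experiment. True Erdos-Renyi degrees are
# concentrated around the mean; the union of shortest paths from one
# source (a crude stand-in for traceroute probes) is not.
n, mean_degree = 5000, 6
G = nx.gnp_random_graph(n, mean_degree / (n - 1), seed=1)

sampled = nx.Graph()
for path in nx.shortest_path(G, source=0).values():
    sampled.add_edges_from(zip(path, path[1:]))  # edges along each path

true_degs = [d for _, d in G.degree()]
obs_degs = [d for _, d in sampled.degree()]
print("true:    mean %.1f, max %d" % (mean(true_degs), max(true_degs)))
print("sampled: mean %.1f, max %d" % (mean(obs_degs), max(obs_degs)))
# The sampled graph is (nearly) a tree: mostly degree-1 leaves plus a
# few very high-degree nodes near the source -- a heavy-tailed look,
# even though the underlying degrees have a sharp characteristic value.
```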

The point of this kind of work is that most empirical data we have on network topology are actually derived from some kind of sampling scheme. That is, only in a few cases do we know the full network explicitly (e.g., the airport network). For the Internet (at both the IP level and the BGP level), samples of the topology are derived from sampling paths through the network, and thus probably exhibit the kind of bias that Luca talked about. The ultimate goal here is to understand the bias well enough that we can invert the problem, converting a biased sample into an unbiased estimate of the underlying topology. This is a hard problem, though, made worse by the fact that the sampling bias seems to map different underlying topologies onto similar observed topologies. The importance of the Internet's topology may seem a bit esoteric, but similar kinds of biases exist in our samples of biological networks. Given the amount of interest in understanding the "systems-level" structure of things like the protein-interaction network (important if, say, you want to design drugs with few side effects), you might imagine that there's a lot of work being done to understand (and invert) the sampling biases present there. But, sadly, you'd be wrong. On the other hand, the scale of the problem for the protein-interaction network is enormous: current estimates of our accuracy here suggest we have only 10% of the topology correct.

posted December 2, 2007 12:09 PM in Conferences and Workshops | permalink | Comments (1)