Network Analysis and Modeling (CSCI 5352)
This semester I developed and taught a new graduate-level course on network science, titled Network Analysis and Modeling (listed as CSCI 5352 here at Colorado). As usual, I lectured from my own set of notes, which total more than 150 pages of material. The class was aimed at graduate students, so the pace was relatively fast and I assumed they could write code, understood basic college-level math, and knew basic graph algorithms. I did not assume they had any familiarity with networks.
To some degree, the course followed Mark Newman's excellent textbook Networks: An Introduction, but I took a more data-centric perspective and covered a number of additional topics. A complete listing of the lecture notes is below, with links to the PDFs.
I also developed six problem sets, which, happily, the students tell me were challenging. Many of the problems were also drawn from Newman's textbook, although often with tweaks to make them more suitable for this particular class. It was a fun class to teach, and overall I'm happy with how it went. The students were enthusiastic throughout the semester and engaged well with the material. I'm looking forward to teaching it again next year, and using that time to work out some of the kinks and add several important topics I didn't cover this time.
Lecture 1,2 : An introduction and overview, including representing network data, terminology, types of networks (pdf)
Lecture 3 : Structurally "important" vertices, degree-based centralities, including several flavors of eigenvector centrality (pdf)
Lecture 4 : Geometric centralities, including closeness, harmonic and betweenness centrality (pdf)
Lecture 5 : Assortative mixing of vertex attributes, transitivity and clustering coefficients, and reciprocity (pdf)
Lecture 6 : Degree distributions, and power-law distributions in particular (pdf)
Lecture 7 : Degree distributions, and fitting models, like the power law, to data via maximum likelihood (pdf)
Lecture 8 : How social networks differ from biological and technological networks, small worlds and short paths (pdf)
Lecture 9 : Navigability in networks (discoverable short paths) and link-length distributions (pdf)
Lecture 10 : Probabilistic models of networks and the Erdos-Renyi random graph model in particular (pdf)
Lecture 11 : Random graphs with arbitrary degree distributions (the configuration model), their properties and construction (pdf)
Lecture 12 : Configuration model properties and its use as a null model in network analysis (pdf)
Lecture 13 : The preferential attachment mechanism in growing networks and Price's model of citation networks (pdf)
Lecture 14 : Vertex copying models (e.g., for biological networks) and their relation to Price's model (pdf)
Lecture 15 : Large-scale patterns in networks, community structure, and modularity maximization to find them (pdf)
Lecture 16 : Generative models of networks and the stochastic block model in particular (pdf)
Lecture 17 : Network inference and fitting the stochastic block model to data (pdf)
Lecture 18 : Using Markov chain Monte Carlo (MCMC) in network inference, and sampling models versus optimizing them (pdf)
Lecture 19 : Hierarchical block models, generating structurally similar networks, and predicting missing links (pdf)
Lecture 20 : Adding time to network analysis, illustrated via a dynamic proximity network (pdf)
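As a flavor of the maximum-likelihood fitting covered in Lecture 7, here is a minimal Python sketch. The estimator is the standard closed-form MLE for a continuous power law with a known lower cutoff xmin; the sampling step and all the numbers are my own illustrative choices, not code from the course.

```python
import math
import random

def plaw_mle_alpha(xs, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x/xmin)),
    using only the observations with x >= xmin."""
    tail = [x for x in xs if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Sanity check: draw from a power law with alpha = 2.5 via inverse
# transform sampling, x = xmin * u^(-1/(alpha-1)), then recover alpha.
random.seed(1)
xmin, alpha = 1.0, 2.5
sample = [xmin * (1.0 - random.random()) ** (-1.0 / (alpha - 1.0))
          for _ in range(50000)]
print(round(plaw_mle_alpha(sample, xmin), 2))  # close to 2.5
```

On real degree sequences the hard part is choosing xmin, which is part of what makes fitting these models to data (the subject of Lecture 7) interesting.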
Problem set 1 : Checking the basics and a few centrality calculations (pdf)
Problem set 2 : Assortativity, degree distributions, more centralities (pdf)
Problem set 3 : Properties and variations of random graphs, and a problem with thresholds (pdf)
Problem set 4 : Using the configuration model, and edge-swapping for randomization (pdf)
Problem set 5 : Growing networks through preferential or uniform attachment, and maximizing modularity (pdf)
Problem set 6 : Free and constrained inference with the stochastic block model (pdf)
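To give a taste of the edge-swapping randomization in Problem set 4, here is a small self-contained sketch (my own illustrative implementation, not the assignment's solution). A double edge swap picks two edges (u,v) and (x,y) and rewires them to (u,y) and (x,v), which preserves every vertex's degree while scrambling the rest of the structure:

```python
import random
from collections import Counter

def double_edge_swap(edges, nswaps, seed=0):
    """Degree-preserving randomization: repeatedly pick two edges
    (u,v) and (x,y) and rewire them to (u,y) and (x,v), rejecting any
    swap that would create a self-loop or a multi-edge."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = tries = 0
    while done < nswaps and tries < 100 * nswaps:  # cap rejected attempts
        tries += 1
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        if i == j:
            continue
        (u, v), (x, y) = edges[i], edges[j]
        if rng.random() < 0.5:       # consider both possible pairings
            x, y = y, x
        if len({u, v, x, y}) < 4:
            continue                 # swap would create a self-loop
        if frozenset((u, y)) in present or frozenset((x, v)) in present:
            continue                 # swap would create a multi-edge
        present -= {frozenset((u, v)), frozenset((x, y))}
        present |= {frozenset((u, y)), frozenset((x, v))}
        edges[i], edges[j] = (u, y), (x, v)
        done += 1
    return edges

# Degrees are preserved exactly on a small example graph:
G = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 0)]
H = double_edge_swap(G, 100)
degrees = lambda es: Counter(v for e in es for v in e)
print(degrees(G) == degrees(H))  # True
```

Repeating many such swaps gives a degree-preserving null model against which empirical network statistics can be compared, which is the point of the problem set exercise.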
Upcoming networks meetings
There are a number of great-looking meetings about networks coming up in the next 12 months [1,2]. Here are the ones I know of so far, listed in order of occurrence. If you know of more, let me know and I'll add them to the list.
- Santa Fe Institute's Short Course on Complexity: Exploring Complex Networks
Location: Austin TX
Date: 4-6 September, 2013
Organized by the Santa Fe Institute.
Confirmed speakers: Luis Bettencourt (SFI), Aaron Clauset (Colorado, Boulder), Simon DeDeo (SFI), Jennifer Dunne (SFI), Paul Hines (Vermont), Bernardo Huberman (HP Labs), Lauren Ancel Meyers (UT Austin), and Cris Moore (SFI).
Registration: until full (registration link here)
- Workshop on Information in Networks (WIN)
Location: New York University, Stern Business School
Date: 4-5 October 2013
Invited speakers: Lada Adamic (Facebook), Christos Faloutsos (CMU), James Fowler (UCSD), David Lazer (Northeastern), Ilan Lobel (NYU), John Padgett (Chicago), Sandy Pentland (MIT), Patrick Perry (NYU), Alessandro Vespignani (Northeastern) and Duncan Watts (MSR).
Application deadline (for oral presentations): already passed
Early registration deadline: 13 September 2013
- SAMSI Workshop on Social Network Data: Collection and Analysis
Location: SAMSI, Research Triangle Park, NC
Date: 21-23 October 2013
Invited speakers: 12-15 invited talks (not clear who!)
Registration: 13 September 2013
- Network Frontier Workshop
Location: Northwestern University (Evanston, IL)
Date: 4-6 December 2013
Invited speakers: lots (14), including many notables in the field.
Application deadline (for oral presentations): 18 October 2013
- Frontiers of Network Analysis: Methods, Models, and Applications
Location: NIPS 2013 Workshop, Lake Tahoe NV
Date: December 9 or 10 (TBD), 2013
Call for papers: here
Submission deadline: 23 October 2013 (extended deadline; via EasyChair, see workshop webpage for link)
- Mathematics of Social Learning
Location: Institute for Pure & Applied Mathematics (IPAM) at UCLA
Date: 6-10 January 2014
Confirmed speakers: lots (24), including the organizers and many notables in the field.
Application deadline (for travel support): 11 November 2013 (application here)
- Workshop on Complex Networks (CompleNet)
Location: University of Bologna (Bologna, Italy)
Date: 12-14 March 2014
Invited speakers: Juyong Park (KAIST), Mirko Degli Esposti, Stephen Uzzo (NY Hall of Science & NYU), Giorgio Fagiolo, Adriano Barra
Paper submission: 6 October 2013
Early registration: 10 December 2013
- Les Houches Thematic School on Structure and Dynamics of Complex Networks
Location: Les Houches, France
Date: 7-18 April 2014
Confirmed speakers: lots (14), including the organizers and many notables in the field.
Registration deadline: 15 December 2013 (opens in October)
- Mathematics Research Communities on Network Science
Location: Snowbird UT
Date: 24-30 June 2014
Format: this is a workshop aimed at graduate students and early postdocs, focused around small group research projects on network science.
Applications open: 1 November 2013 (see American Mathematical Society page for the link)
Updated 4 September 2013: Added the Les Houches meeting.
Updated 6 September 2013: Added the Northwestern meeting.
Updated 9 September 2013: Added SAMSI, WIN, and CompleNet meetings.
Updated 9 October 2013: Updated the NIPS 2013 workshop deadline.
It has not escaped my attention that I am either an organizer or speaker at many of these meetings.
If birds are this smart, how smart were dinosaurs?
I continue to be fascinated and astounded by how intelligent birds are. A new paper in PLOS ONE [1,2,3] by Auersperg, Kacelnik and von Bayern (at Vienna, Oxford and the Max-Planck-Institute for Ornithology, respectively) uses Goffin’s Cockatoos to demonstrate that these birds are capable of learning and performing a complex sequence of tasks in order to get a treat. Here's the authors describing the setup:
Here we exposed Goffins to a novel five-step means-means-end task based on a sequence of multiple locking devices blocking one another. After acquisition, we exposed the cockatoos to modifications of the task such as reordering of the lock order, removal of one or more locks, or alterations of the functionality of one or more locks. Our aim was to investigate innovative problem solving under controlled conditions, to explore the mechanism of learning, and to advance towards identifying what it is that animals learn when they master a complex new sequential task.
The sequence was sufficiently long that the authors argue the goal the cockatoos were seeking must have been an abstract goal. The implication is that the cockatoos are capable of a kind of abstract, goal-oriented planning and problem solving that is similar to what humans routinely exhibit. Here's what the problem solving looks like:
The authors' conclusions put things nicely into context, and demonstrate the appropriate restraint with claims about what is going on inside the cockatoos' heads (something we humans could do better at in interpreting each others' behaviors):
The main challenge of cognitive research is to map the processes by which animals gather and use information to come up with innovative solutions to novel problems, and this is not achieved by invoking mentalistic concepts as explanations for complex behaviour. Dissecting the subjects’ performance to expose their path towards the solution and their response to task modifications can be productive; even extraordinary demonstrations of innovative capacity are not proof of the involvement of high-level mental faculties, and conversely, high levels of cognition could be involved in seemingly simple tasks. The findings from the transfer tests allow us to evaluate some of the cognition behind the Goffins’ behaviour. Although the exact processes still remain only partially understood, our results largely support the supposition that subjects learn by combining intense exploratory behavior, learning from consequences, and some sense of goal directedness.
So it seems that the more we learn about birds, the more remarkably intelligent some species appear to be. Which raises a question: if some bird species are so smart, how smart were dinosaurs? Unfortunately, we don't have much of an idea, because we don't know how much bird brains have diverged from their ancestral non-avian dinosaur form. But I'm going to go out on a limb here and suggest that they may have been just as clever as modern birds.
[1] Auersperg, Kacelnik and von Bayern, Explorative Learning and Functional Inferences on a Five-Step Means-Means-End Problem in Goffin’s Cockatoos (Cacatua goffini). PLOS ONE 8(7): e68979 (2013).
[2] Some time ago, PLOS ONE announced that they were changing their name from "PLoS ONE" to "PLOS ONE". But confusingly, on their own website, they give the citation to new papers as "PLoS ONE". I still see people use both, and I have a slight aesthetic preference for "PLoS" over "PLOS" (a preference perhaps rooted in the universal understanding that ALL CAPS is like yelling on the Interwebs, and every school child knows that yelling for no reason is rude).
[3] As it turns out, these same authors have some other interesting results with Goffin's Cockatoos, including tool making and use. io9 has a nice summary, plus a remarkable video of the tool-making in action.
Small science for the win? Maybe not.
(A small note: with the multiplication of duties and obligations, both at home and at work, writing full-length blog entries has shifted to a lower priority relative to writing full-length scientific articles. Blogging will continue! But selectively so. Paper writing is moving ahead nicely.)
In June of this year, a paper appeared in PLOS ONE [1] that made a quantitative argument for public scientific funding agencies to pursue a "many small" approach in dividing up their budgets. The authors, Jean-Michel Fortin and David Currie (both at the University of Ottawa) argued from data that scientific impact is a decreasing function of total research funding. Thus, if a funding agency wants to maximize impact, it should fund lots of smaller or moderate-sized research grants rather than fund a small number of big grants to large centers or consortia. This paper made a big splash among scientists, especially the large number of us who rely on modest-sized grants to do our kind of science.
Many of us, I think, have felt in our guts that the long-running trend among funding agencies (especially in the US) toward making fewer but bigger grants was poisonous to science [2], but we lacked the data to support or quantify that feeling. Fortin and Currie seemed to provide exactly what we wanted: data showing that big grants are not worth it.
How could such a conclusion be wrong? Despite my sympathy, I'm not convinced they have it right. The problem is not the underlying data, but rather overly simplistic assumptions about how science works and fatal confounding effects.
Fortin and Currie asked whether certain measures of productivity, specifically the number of publications or citations, vary as a function of total research funding. They fit simple allometric or power functions [3] to these data and then looked at whether the fitted function was super-linear (increasing return on investment; Big Science = Good), linear (proportional return; Big Science = sum of Small Science) or sub-linear (decreasing return; Big Science = Not Worth It). In every case, they identified a sub-linear relationship, i.e., increasing funding by X% yields a progressively smaller proportional increase in productivity. In short, a few Big Sciences are less impactful and less productive than many Small Sciences.
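For concreteness, fitting a power function y = c*x^A of this kind is just ordinary least squares on log-transformed data. Here is a minimal sketch with made-up numbers; the function name and the synthetic sub-linear exponent of 0.7 are my own illustrative choices, not the authors' data or code:

```python
import math

def loglog_fit(xs, ys):
    """Ordinary least squares on (log x, log y): returns (A, B) for
    log y = A * log x + B, i.e. y = exp(B) * x**A."""
    lx = [math.log(v) for v in xs]
    ly = [math.log(v) for v in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    A = sxy / sxx            # slope on the log-log plot
    B = my - A * mx          # intercept on the log-log plot
    return A, B

# Synthetic "funding vs. output" data with a built-in exponent of 0.7
# (hypothetical numbers, purely for illustration):
funding = [10 ** k for k in range(1, 7)]
papers = [3.0 * f ** 0.7 for f in funding]
A, B = loglog_fit(funding, papers)
print(round(A, 3), round(math.exp(B), 3))  # 0.7 3.0
```

A fitted exponent A < 1 is the sub-linear case: doubling funding less than doubles the predicted output, which is the pattern Fortin and Currie report.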
But this conclusion only follows from the data if you believe that all scientific questions are equivalent in size and complexity, and that any paper could in principle be written by a research group of any funding level. These beliefs are certainly false, and this implies that the conclusions don't follow from the data.
Here's an illustration of the first problem. In the early days of particle physics, small research groups made significant progress because the equipment was sufficiently simple that a small team could build and operate it, and because the questions of scientific interest were accessible using such equipment. Small Science worked well. But today the situation is different. The questions of scientific interest are largely only accessible with enormously complex machines, run by large research teams supported by large budgets. Big Science is the only way to make progress because the tasks are so complex. Unfortunately, this fact is lost in Fortin and Currie's analysis because their productivity variables do not expose the difficulty of the scientific question being investigated, and are likely inversely correlated with difficulty.
This illustrates the second problem. The cost of answering a question seems largely independent of the number of papers required to describe the answer. There will only be a handful of papers published describing the experiments that discovered the Higgs boson, even though its discovery was possibly one of the costliest physics experiments so far. Furthermore, it is not clear that the few papers produced by a big team should necessarily be highly cited, as citations are produced by the publication of new papers, and if there are only a few research teams working in a particularly complex area, the number of new citations is not independent of the size of the teams.
In fact, by defining "impact" as counts related to paper production [4,5], Fortin and Currie may have inadvertently engineered the conclusion that Small Science will maximize impact. If project complexity correlates positively with grant size and negatively with paper production, then a decreasing return in paper/citation production as a function of grant size is almost sure to appear in any data. So while the empirical results seem correct, they are of ambiguous value. Furthermore, a blind emphasis on funding researchers to solve simple tasks (small grants), to pick all that low-hanging fruit people talk about, seems like an obvious way to maximize paper and citation production because simpler tasks can be solved faster and at lower cost than harder, more complex tasks. You might call this the "Washington approach" to funding, since it kicks the hard problems down the road, to be funded, maybe, in the future.
From my own perspective, I worry about the trend at funding agencies away from small grants. As Fortin and Currie point out, research has shown that grant reviewers are notoriously bad at predicting success [6] (and so are peer reviewers). This fact alone is a sound argument for a major emphasis on funding small projects: making a larger number of bets on who will be successful brings us closer to the expected or background success rate and minimizes the impact of luck (and reduces bias due to collusion effects). A mixed funding strategy is clearly the only viable solution, but what kind of mix is best? How much money should be devoted to big teams, to medium teams, and to small teams? I don't know, and I don't think there is any way to answer that question purely from looking at data. I would hope that the answer is chosen not only by considering things like project complexity, team efficiency, and true scientific impact, but also issues like funding young investigators, maintaining a large and diverse population of researchers, promoting independence among research teams, etc. We'll see what happens.
Update 1 July 2013: On Twitter, Kieron Flanagan asks "can we really imagine no particle physics w/out big sci... or just the kind of particle physics we have today?" To me, the answer seems clear: the more complex the project, the legitimately bigger the funding needs are. That funding manages the complexity, keeps the many moving parts moving together, and ultimately ensures a publication is produced. This kind of coordination is not free, and those dollars don't produce additional publications. Instead, their work allows for any publication at all. Without them, it would be like dividing up 13.25 billion dollars among about 3,000 small teams of scientists and expecting them to find the Higgs, without pooling their resources. So yes, no big science = no complex projects. Modern particle physics requires magnificently complex machines, without which, we'd have some nice mathematical models, but no ability to test them.
To be honest, physics is perhaps too good an example of the utility of big science. The various publicly-supported research centers and institutes, or large grants to individual labs, are a much softer target: if we divided up their large grants into a bunch of smaller ones, could we solve all the same problems and maybe more? Or, could we solve a different set of problems that would more greatly advance our understanding of the world (which is distinct from producing more papers or more citations)? This last question is hard because it means choosing between questions, rather than being more efficient at answering the same questions.
[1] J.-M. Fortin and D.J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6): e65263 (2013).
[2] By imposing large coordination costs on the actual conduct of science, by restricting the range of scientific questions being investigated, by increasing the cost of failure to win a big grant (because the proposals are enormous and require a huge effort to produce), by channeling graduate student training into big projects where only a few individuals actually gain experience being independent researchers, etc. Admittedly, these claims are based on what you might call "domain expertise," rather than hard data.
[3] Functions of the form log y = A*log x + B, which look like a straight line on a log-log plot. These bivariate functions can be fitted using standard regression techniques. I was happy to see the authors include non-parametric fits to the same data, which seem to mostly agree with the parametric fits.
[4] Citation counts are just a proxy for papers, since only new papers can increase an old paper's count. Also, it's well known that the larger and more active fields (like biomedical research or machine learning) produce more citations (and thus larger citation counts) than smaller, less active fields (like number theory or experimental particle physics). From this perspective, if a funding agency took Fortin and Currie's advice literally, they would cut funding completely for all the "less productive" fields like pure mathematics, economics, computer systems, etc., and devote that money to "more productive" fields like biomedical research and whoever publishes in the International Conference on World Wide Web.
[5] Fortin and Currie do seem to be aware of this problem, but proceed anyway. Their sole caveat is simply thus: "Campbell et al. discuss caveats of the use of bibliometric measures of scientific productivity such as the ones we used."
[6] For instance, see Berg, Nature 489, 203 (2012).
Design and Analysis of Algorithms (CSCI 5454)
Returning from paternity leave last semester, it is back into the fire this semester with my large graduate-level algorithms class, Design and Analysis of Algorithms (CSCI 5454). This year, I am again shocked to find that it is the largest grad class in my department, with 60 enrolled students and 30 students on the waiting list. In the past, enrollment churn at the beginning of the semester [1] has been high enough that everyone on the wait list has had an opportunity to take the class. I am not sure that will be true this year. This year will also be the last time I teach it, for a while.
My version of this course covers much of the standard material in algorithms, but with a few twists. Having tinkered considerably with the material a year ago, I'm going to tinker a little less this year. That being said, improving is a never-ending process and there are a number of little things on the problem sets and lectures that I want to try differently this time. The overall goals remain the same: to show students how different strategies for algorithm design work (and sometimes don't work), to get them thinking about boundary cases and rigorous thinking, to get them to think carefully about whether an algorithm is correct or not, and to introduce them to several advanced topics and algorithms they might encounter out in industry.
For those of you interested in following along, I will again be posting my lecture notes on the class website.
[1] Caused in part by some students enrolling thinking the class would be easy, and then dropping it when they realize that it requires real effort. To be honest, I am okay with this, although it does make for scary enrollment numbers on the first day.
2012: a year in review
This is probably it for the year, so here's a look back at 2012, by the numbers.
Papers published (or accepted): 4 (the pipeline is moving again)
Pre-prints posted on the arxiv: 5
Other publications: 0
Papers currently under review: 3
Manuscripts near completion: 7
Rejections: 9 (includes rejection without review)
New citations to past papers: 1495 (+20% over 2011)
Projects in-the-works: too many to count
Half-baked projects unlikely to be completed: already forgotten
Papers read: >202
Research talks given: 9
Invited talks: 8
Visitors hosted: 1
Conferences, workshops organized: 0
Conferences, workshops, summer schools attended: 4
Number of those at which I delivered a research talk: 3
Number of times other people have written about my research: >11
Number of times Nate Silver wrote about my research: 1 (cool)
Number of interviews about my research: 1
Students advised: 13 (7 PhD, 1 MS, 3 BS, 2 rotation students)
Students graduated: 1 MS
Thesis/dissertation committees: 6
Number of recommendation letters written: 10
Summer school faculty positions: 2
University courses taught: 1 (repeated)
Students enrolled in said courses: 51 grad
Number of problems assigned: 85
Pages of student work graded: 5282 (roughly 103 pages per student, with 2 graders)
Number of class-related emails received: >1174
Number of conversations with the faculty honor council liaison: 0
Guest lectures for colleagues: 0
Proposals refereed for grant-making agencies: 15
Manuscripts refereed for various journals, conferences: 16
Words written per referee report: 1528 (average)
Referee requests declined: 50 (-4% over 2011)
Conference program committees: 1
Grant proposals submitted: 6 (totaling $2,212,343)
Proposals rejected: 2
New grants awarded: 3 (totaling $1,379,260)
Proposals pending: 3
New proposals in the works: 2
Emails sent: >8061 (+9% over 2011)
Emails received (non-spam): >15503 (+18% over 2011)
Fraction about work-related topics: 0.89 (same as 2011)
Emails received about power-law distributions: 146 (3 per week)
Unique visitors to professional homepage: 31,000
Hits overall: 79,000
Fraction of visitors looking for power-law distributions: 0.63 (wow)
Unique visitors to blog: 11,500
Hits overall: 18,000
Most popular blog post among those visitors: Our ignorance of intelligence (from 2005)
Blog posts written: 14 (-30% from last year)
Most popular 2012 blog post: A crisis in higher education?
Number of twitter accounts: 1, my first (I blame peer pressure)
Retweets: 330ish (remarkably)
New followers on Twitter: >346 (astonishingly)
Number of computers purchased: 1
Movies/shows via Netflix: 32 dvds, 100 instant
Books purchased: 11
Songs added to iTunes: 148
Photos added to iPhoto: 874
Jigsaw puzzle pieces assembled: >5,000
Major life / career changes: 2 (see next two entries)
Houses purchased: 1
Babies: 1, naturally born
Semesters of paternity leave: 1
Photos taken of baby so far: >683 (about 5 per day)
Fun trips with friends / family: 9
Half-marathons completed: 0
Trips to Las Vegas, NV: 1
Trips to New York, NY: 0
Trips to Santa Fe, NM: 7
States visited (in the US): 4
Foreign countries visited: 1 (Germany)
Other continents visited: 1
Airplane flights: 22
Here's to a great year, and hoping that 2013 is even better.
BioFrontiers is hiring
And, in more hiring news, the BioFrontiers Institute here at the University of Colorado Boulder is hiring new tenure-track faculty this year. One of the great things about this hiring line is that the candidate basically gets to select which of nine departments (including Computer Science, Physics, Applied Math., and a variety of biology departments) they want to call home. Another is that there are genuinely interesting and exciting things happening in research and education within it. In short: it's a great place to work.
We invite applications for a tenure-track faculty position from candidates seeking to develop an innovative research program that addresses significant problems in biology or medicine at the interface with the physical sciences, mathematics, and/or computer science. Researchers in the area of biological engineering are asked to look for an advertisement next year, when we anticipate searching for faculty in that area.