2010 in review

The artificial intelligence engine at WordPress (who hosts this page) sent me an email with some stats on how the site has been doing since I set it up back in June. According to the analysis, the page is “fresher than ever”, so I’m delighted. The email even had a convenient button to post the whole thing right at the bottom. And since I haven’t posted anything for a while I thought, “why not”.

No review of my online 2010 would be complete, however, without mentioning the Twittermood project I did with Alan Mislove, YY Ahn, JP Onnela, and Niels Rosenquist. That project earned us 302 713 views on YouTube (at the time of writing) and global press attention with large amounts of TV, radio, print, and internet coverage (click here for full details). Recently, the visualization was mentioned first among Mashable’s best infographics of 2010, which generated a mini-surge of traffic for the YouTube video.

Anyway, the unedited message is below:

The stats helper monkeys at WordPress.com mulled over how this blog did in 2010, and here’s a high level summary of its overall blog health:

Healthy blog!

The Blog-Health-o-Meter™ reads Fresher than ever.

Crunchy numbers


A helper monkey made this abstract painting, inspired by your stats.

A Boeing 747-400 passenger jet can hold 416 passengers. This blog was viewed about 3,600 times in 2010. That’s about 9 full 747s.

In 2010, there were 13 new posts, not bad for the first year! There were 38 pictures uploaded, taking up a total of 53mb. That’s about 3 pictures per month.

The busiest day of the year was July 22nd with 207 views. The most popular post that day was Worlds Colliding. Part II.

Where did they come from?

The top referring sites in 2010 were twitter.com, ccs.neu.edu, barabasilab.com, iq.harvard.edu, and barabasilab.neu.edu.

Some visitors came searching, mostly for sune lehmann, sune lehman, sune, lehmann sune, and sune lehmann nature.

Attractions in 2010

These are the posts and pages that got the most views in 2010.


Worlds Colliding. Part II July 2010


About June 2010


Press June 2010


Visualizing Link Communities November 2010


Mood, twitter, and the new shape of America July 2010

Visualizing Link Communities

When YY Ahn, Jim Bagrow, and I published our paper on communities of links in complex networks, we did share the code for the algorithm, but one of the essentials missing from our package was a good way to visualize the highly overlapping link communities.

Link-communities Visualization

Thus, I’m delighted to report that Rob Spencer over at Scaled Innovation has done a great job of visualizing the detected link communities (including a new client-side implementation, I might add). The technical details are interesting and available.

The example displayed above is lifted from Scaled Innovation and shows the network of characters in The Wizard of Oz. In addition to the central visualization reproduced above (see below for details), the page also shows the full link dendrogram and many other treats; everything is beautifully crafted. Note the community assignment matrix on the right, which is a neat way of probing the issue of nested communities. On the page, Rob has a number of interesting observations regarding visualization of the link communities and explains the layout above in further detail. I quote:

The good news is that the ABL method is powerful and flexible. The challenge is that the communities it reveals are of links, not nodes, and therefore not as obvious to portray and interpret. So far the literature method is to use a traditional force-based network diagram and color the lines between the dots, rather than color the dots. Not bad, but this has the limitations of force-directed network diagrams have always had: a big “wow factor” but of limited practical interpretive use because of the spaghetti of crossing lines. So here you’ll find outright experiments, and that means that some will be different!

In the upper circular graph the dots are the nodes and the polygons show community membership of those nodes (the colors match the table and dendrogram); line crossing is minimized by working around in cluster-joining order (same as the ROYGBIV color order). Communities are equally distributed around the circle with anchor points shown as black-centered dots; each node is placed as the weighted sum of its coordinates of each anchor to which it belongs, plus some random jitter to separate nodes with single community membership. The community ordering and coloring has an interesting result: the diagram gets simpler to see as the number of communities is increased, even far above the partition density “optimum”.

The method is fast because it’s completely deterministic and drawn in one pass, i.e. it’s not an iterative force-relaxation method.
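To make the layout description concrete, here is a minimal sketch of how I read it; this is not Rob's code. The function name, the weights (for instance, the number of a node's links in each community), and the jitter size are all my own assumptions, and the weighted sum is normalized so that it becomes a weighted average.

```python
import math
import random

def anchor_layout(memberships, community_order, jitter=0.05, seed=0):
    """Place nodes inside a unit circle from their community memberships.

    memberships: {node: {community: weight}}; the weights are an assumption
    (e.g. the number of the node's links that fall in that community).
    community_order: communities in the order they go around the circle.
    """
    rng = random.Random(seed)  # fixed seed keeps the jitter reproducible
    n = len(community_order)
    # Anchor points: communities spread equally around the unit circle.
    anchors = {
        c: (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
        for k, c in enumerate(community_order)
    }
    positions = {}
    for node, weights in memberships.items():
        total = sum(weights.values())
        # Weighted average of the anchors the node belongs to.
        x = sum(w * anchors[c][0] for c, w in weights.items()) / total
        y = sum(w * anchors[c][1] for c, w in weights.items()) / total
        if len(weights) == 1:
            # Single-community nodes would all sit on the anchor; nudge them apart.
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        positions[node] = (x, y)
    return anchors, positions
```

A node that belongs equally to every community lands in the center, which is where an ego node would end up; a node with a single membership stays next to its anchor. Everything is computed in one pass, with no iteration.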

Pervasive overlap and visualizations

While Rob’s visualization shows tremendous progress on a number of fronts (just compare it to our own – primitive – first stab at visualizing the network of characters in Les Miserables), I still think that node-based visualizations of the link communities work best when we study ego-networks (a single person and her neighbors).

As we point out in the paper, we can visualize the ego-network precisely because the central node’s communities are largely non-overlapping. So in the example above, Dorothy is the Ego, placed in the center of the visualization, while the various non-overlapping story lines appear as communities surrounding her.

One of the consequences of pervasive overlap (when every node is a member of multiple communities) is that we can no longer display the communities as block structures in the network adjacency matrix. Roughly speaking, to form a block structure, we need a single block per node. Some overlap is possible within the framework of block modeling, but when we can have more communities than nodes, this approach breaks down.

A similar problem arises in visualization. My guess is that any strategy for visualizing pervasive overlap where nodes are the basis of the visualization will ultimately turn out to be problematic for a full network. One possible solution is to follow the example of CFinder and construct a visualization based on the network of communities but with the ability to zoom into each community. At the local level, Rob’s visualization would be perfect.

Comments/ideas are welcome. Note – this post can also be found at the Complexity and Social Networks Blog.

Twittermood 2: Election special

The midterm elections are coming up, so we decided to create our own little twitter mood election center.

“Twitter has grown to become an important aspect of public debate and leading up to Tuesday’s midterms, the Twitterverse is abuzz with conversations on the topics that will decide the individual races.

It is well known that the state you live in plays a role in deciding what issues you care about. By utilizing the fact that conversations on twitter are public, we can geocode individual tweets, and study where Americans are talking about specific issues.

In this way, Twitter allows us to extrapolate from millions of water cooler conversations and show where the conversations are taking place right now.”

Check it out by clicking on one of the images below:

Standard representation

Basically, the idea was to play around with the Twitter stream and do something in real-time for the midterm elections. So we decided to dig into where people are talking about the various issues that are shaping the debate leading up to the election.

See the page for full details.

The end of Supporting Material?

Maybe this is how it happens: You see an interesting (seemingly innocuous) paper and decide to read it. Upon finding it very information-dense, you decide to take a look at the supporting information (SI) and notice that the SI has a word count greater than that of an average PhD thesis. Or maybe it’s when you decide to print the SI and realize something unusual is going on when your printer is still spitting out paper after half an hour.

However you have become aware of it, scientific practice has been changing in the last few years. If I remember correctly, supporting information packages started becoming the norm for papers (at least in some journals) only a few years ago and the average SI length has been growing steadily ever since.

Now something interesting has happened. From November 1st onwards, The Journal of Neuroscience (JNS), a leading journal in that field, will no longer allow authors to include supplemental material when submitting new manuscripts (JNS agrees to link to non-peer reviewed supporting material on the author’s own site). The decision is explained in detail by Editor-In-Chief John Maunsell, who presents a lucid and interesting argument. He explains that on the one hand, the decision was made to make the task of peer reviewing a paper more manageable, i.e. to help the referees:

Although [JNS], like most journals, currently peer reviews supplemental material, the depth of that review is questionable. Most well qualified reviewers are overburdened with requests to review manuscripts, and many feel that it is too much to ask them to also evaluate supplemental material that can be as extensive as the article itself. It is obvious to editors that most reviewers put far less effort (often no effort) into examining supplemental material. Nevertheless, we certify the supplemental material as having passed peer review.

This surely is an accurate description of the situation many referees find themselves in. Going over every equation and argument in a 100-page SI takes several days, an amount of time that most academics simply don’t have available. (In fact the current state of peer review, even without mammoth SI’s, has been argued to be suffering from serious problems.)

On the other hand the decision is also intended to protect the authors.

Another troubling problem associated with supplemental material is that it encourages excessive demands from reviewers. Increasingly, reviewers insist that authors add further analyses or experiments “in the supplemental material.” These additions are invariably subordinate or tangential, but they represent real work for authors and they delay publication. Such requests can be an unjustified burden on authors. In principle, editors can overrule these requests, but this represents additional work for the editors, who may fail to adequately referee this aspect of the review.

Reviewer demands in turn have encouraged authors to respond in a supplemental material arms race. Many authors feel that reviewers have become so demanding they cannot afford to pass up the opportunity to insert any supplemental material that might help immunize them against reviewers’ concerns.

The “supplemental material arms race” described eloquently above is another element that I, as an author, can relate to, and I suspect that many others feel the same.

With no room for peer reviewed SI, each manuscript must be self contained and convincing on its own merits:

A change is needed if we are to maintain the integrity and value of peer-reviewed articles. We believe that this is best accomplished by removing the supplemental material from the peer review process and requiring that each submission be evaluated and approved as a complete, self-contained scientific report […] With this change, the review process will focus on whether each manuscript presents important and compelling results.

I think most scientists can agree that large SI’s present a challenge to the scientific method as we know it. As is argued by JNS, large SI’s present a challenge to referees and authors alike and carry the risk of a harmful “SI arms race”.

But let’s consider the suggested solution. In my interpretation, the proposed solution is to introduce more trust into the process. By eliminating the peer reviewed SI, the Editor-In-Chief is effectively stating that referees should trust that the authors have done their legwork (data preprocessing, programming, statistical analysis, and other “boring” elements underlying the main results) properly.

Of course, the entire foundation of peer review is trust. As referees we begin our task trusting that authors have done their work properly and presented their results honestly. Even a good referee can only be expected to catch mistakes and problems in the material presented to him. So why not a little additional trust?

Personally, I am unsure what to think. On one side, I wholeheartedly agree that there are important problems with the current state of affairs. But, on the other side, I think that there are important arguments against allowing too much of the ‘legwork’ to be left out of the peer review process. Firstly, examples of scientific misconduct are many and the elimination of peer reviewed SI will make sloppy or dishonest science easier. Secondly, and more importantly, as John Timmer at Ars Technica has recently pointed out, the increasing use of computers could potentially put an end to the entire concept of scientific reproducibility (precisely because of extensive preprocessing of data, etc). Without peer reviewed SI, this problem will be even more difficult to counter.

Regardless of the pros and cons, this is an interesting move by JNS. Since JNS allows fairly long articles (typically over ten pages), getting rid of the SI might be easier for JNS and other journals aimed at specific scientific disciplines, than for highly cited interdisciplinary journals – say Science or Nature – where word-count restrictions for main text are taken very seriously.

It will be interesting to see if this policy of “no supporting material” catches on.

Bipartite Network gets a Makeover

I guess my research is slowly changing focus and is more and more about some kind of data science (although I still bill myself as a physicist turned network scientist). While statistics and mathematical models are still driving this type of research, an increasingly important part of data science is visualization – finding neat ways to display subtle and complicated mathematical concepts in a way that is immediately understandable.

Sometimes, however, visualization can be completely gratuitous eye-candy. Last week, I played around with displaying a weighted bipartite network. One of the default layouts looked something like this:

Adding Bezier curves, more pleasing node shapes, and a little color, the final network comes across slightly more pleasing to the eye (in my opinion, anyway):

Stay tuned for the next episode of ‘Pimp my Network’.

Worlds Colliding. Part II

Back in March, I wrote a post entitled Worlds Colliding explaining the failure of Google Buzz as a failure to understand the fundamental structure of complex networks.

Buzz received a large amount of criticism for automatically adding the most contacted people from your inbox to your Buzz follower list. My post explained that because individuals in social networks are members of many social contexts (family, work, friends, etc.), adding nodes from all of these to a single list would cause these contexts to collide (e.g. adding both your wife and your (no longer) secret mistress to your list of followers).

Over the last couple of days, the following talk (from July 1st) by Paul Adams, a User Experience Researcher at Google, has been very visible on the interwebs.

From the looks of it, the good people at the Googleplex have either been reading my blog and the accompanying scientific paper and are scrambling to keep up (I consider this scenario highly unlikely), or the User Experience Group at Google was never in touch with the group behind Buzz.

Let me repeat that last part for dramatic effect: the User Experience Group at Google was never in touch with the group behind Buzz. The knowledge about pervasive overlap and overlapping communities was present within Google, but never diffused to their initial social networking attempt. So the failure of Buzz was in some sense due to separate worlds within Google not communicating properly. That strikes me as a textbook case of tragic irony.

Update, July 15th

I’ve included YY’s recent slides from the New Frontiers in Complex Networks conference as a quick intro to our thinking regarding pervasive overlap.

The proper reference is Link communities reveal multiscale complexity in networks. Nature (2010), doi:10.1038/nature09182.

Mood, twitter, and the new shape of America

Twitter is a gigantic repository for our collective state of mind.

Every second, thousands of tweets reveal what everybody and their mother had for lunch, what Justin Bieber is up to, or what magnificent link you should be checking out right now. Individually, each tweet is mostly interesting to friends/fans of the tweeter, but taken together they add up to something more.

In analogy to individual neurons firing together to add up to the human consciousness, the billions of tweets have meaningful macro-states that contain information about the whole system rather than the individual tweeters. But we need to do a little data mining to extract meaningful information about these states, to expose our collective states of mind.

As a proof-of-concept we’ve¹ been studying the mood² of all of the public tweets. While there are many services that will allow you to study the mood of your own tweets (and also a neat little DIY project to show you the global average of twitter), much less effort has gone into studying how the mood breaks down according to geography. Below, I show a brand new video displaying the pulsating 24-hour twitter mood cycle of the United States (I’ll explain just what you’re looking at in the following).

In the video, green corresponds to a happy mood and red corresponds to a grumpier state of mind. The area of each state is scaled according to the number of tweets originating in that state. Note how the East Coast is consistently 3 hours ahead of the West Coast, so when we’re sleeping in Boston, the Californians are tweeting away. It’s also interesting that better weather seems to make you happier (or rather, that better weather is correlated with happier tweets): Florida and California seem to be consistently in a better mood than the rest of the US. Also note how New Mexico and Delaware behave very differently from their neighbors. Full results, individual maps, and a high-res poster can be found on the dedicated Twitter Mood website.

How to construct the mood map

Since many twitter users list their location, we’ve assigned every tweet in our (massive) database to a US county and extracted its mood. This allows us to average over tweets and plot the mood of the US as a function of geography (and time). However, since the US is unevenly populated, the resulting maps are boring since only a few counties (the centers of cities) contain most of the tweets (not too many tweets in Ellsworth, Nebraska yet).
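The averaging step can be sketched roughly as follows. This is an illustration rather than our actual pipeline: the tuple format, the function name, and the idea that each tweet already carries a numeric mood score are assumptions, and the county labels in the example are made up.

```python
from collections import defaultdict

def mean_mood(tweets):
    """Average mood score per (county, hour of day).

    tweets: iterable of (county, hour, score) tuples. The score is assumed
    to be a per-tweet mood value computed elsewhere; how it is computed is
    not shown here.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for county, hour, score in tweets:
        sums[(county, hour)] += score
        counts[(county, hour)] += 1
    # One average per (county, hour) key that received at least one tweet.
    return {key: sums[key] / counts[key] for key in sums}

# Hypothetical data: two tweets from one county, one from another.
example = [("county A", 9, 6.0), ("county A", 9, 4.0), ("county B", 9, 5.0)]
print(mean_mood(example))
```

With real data, most of the tweets behind these averages would come from a handful of urban counties, which is exactly the problem described above.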

Luckily, brilliant people have come up with a cool way of solving this problem using a technique called density equalizing maps³ (or cartograms). The idea here is simple: warp the map so that certain features of shape are conserved while the (population) density becomes the same everywhere. The resulting maps look like something from an alternate universe and allow us to show the US mood much more clearly.


  1. The twittermood project members are Alan Mislove, YY Ahn, JP Onnela, Niels Rosenquist, and the undersigned.
  2. For a deeper explanation of how we evaluate the mood of tweets, see the Twitter Mood website.
  3. An easily accessible explanation of the density equalizing maps is posted on the Twitter Mood website.

Erdös Number

The scientific version of the Bacon number is the Erdös number. Via a post on Finn Nielsen’s blog, I learned that I have a reasonably low Erdös number – three. (I also learned that Finn is one of the few people with a finite Erdös-Bacon number.) The reason for both Finn’s and my own low Erdös number is that my PhD advisor Lars Kai Hansen has co-authored a (highly cited) paper with Peter Salamon, who has an Erdös number of one. The links are:

  • P. Salamon and P. Erdös. The Solution to a Problem of Grünbaum, Canadian Mathematical Bulletin, 31: 129-138 (1988).
  • L.K. Hansen and P. Salamon. Neural Network Ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12: 993-1001 (1990).
  • S. Lehmann, M. Schwartz, and L.K. Hansen. Biclique communities, Physical Review E, 78: 016108 (2008).
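The chain above is just a shortest path in the co-authorship network, so it can be checked with a breadth-first search. The sketch below is illustrative: the function name is my own, and the input contains only the three papers listed, so the result is an upper bound on the true distance.

```python
from collections import deque

def collaboration_distance(papers, source, target):
    """Shortest co-authorship distance between two authors (BFS)."""
    # Build the co-authorship network: authors of the same paper are linked.
    coauthors = {}
    for authors in papers:
        for a in authors:
            coauthors.setdefault(a, set()).update(b for b in authors if b != a)
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return distance[current]
        for neighbor in coauthors.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return None  # no chain of co-authorships connects the two

# Author sets of the three papers listed above.
papers = [
    {"Salamon", "Erdös"},
    {"Hansen", "Salamon"},
    {"Lehmann", "Schwartz", "Hansen"},
]
print(collaboration_distance(papers, "Erdös", "Lehmann"))  # prints 3
```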

With respect to the Erdös-Bacon number, I could make the case that I should have a number of four. The reason is that I actually appear in the documentary Connected – The power of six degrees (it’s just an uncredited half-second shot of me sitting at my computer), which features my ex-boss and renowned scientist Albert-Laszlo Barabási. Here’s the trailer:

But since I don’t appear on IMDb, I guess it doesn’t really count…

Pervasive Overlap

Just recently, I came across the following video showing LinkedIn chief scientist DJ Patil explaining the egocentric networks (networks consisting of an individual and their immediate friends) for a few individuals based on their LinkedIn connections.

Although the individuals in the center of these egocentric networks are unusual (in the sense that they have many more LinkedIn connections than the average user), the video clearly shows that each person is a member of multiple communities where the communities are dense and almost fully connected, while there are fewer connections between the communities. (If any of this sounds familiar, it’s because I wrote about this subject a couple of months ago on the Complexity and Social Networks Blog).

This notion of social structure implies that, seen from the perspective of a single node, everything is relatively simple: the world breaks neatly into easily recognizable parts (e.g. family, co-workers, and friends). There are few or no links between the communities because we actively work to keep them separate (more here, on why this is the case).

I’ve been thinking about the consequences of this local structure for a while, and recently coauthored a paper on this subject with YY Ahn and Jim Bagrow [1]. Here, and in an upcoming blog post, I’ll be writing about some insights from that work.

The idea I hope to explore here has to do with the global structure that arises when all nodes in a network have multiple community affiliations, when there is pervasive overlap. In the follow up, I’ll explore how a single hierarchical organization of the network can exist in the presence of pervasive overlap.

Untangling the hairball

In the standard view of communities in networks, the global structure is modular [2]. This situation is shown below (left), where the communities are labeled using different colors (image from gephi.org). Modular structure on the global level implies, however, that individual nodes can have only a single community affiliation!

If every node is a member of more than one community (and this is clearly the case in the LinkedIn example, as well as in real social networks), then the global structure of the network is not at all modular. Rather, the network will be a dense mess with no visually discernible structure. The network will look like a ball of yarn … or a hairball (above, right). In fact, this is precisely the type of structure which has recently been discovered in empirical investigations of a comprehensive set of large networks (social and otherwise) [2, 3].

So the question becomes: How do we find network communities in the hairball? This is the question YY, Jim and I answer in Ref [1]. The trick is that although nodes have many community memberships, each link is mostly uniquely defined. For example, the link you have to one coworker is similar to the link you have to other coworkers. Thus, by formulating community detection as a question of categorizing links rather than nodes, we are able to detect communities in networks with pervasive overlap.
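Here is a minimal sketch of the idea of categorizing links instead of nodes; it is not the released code. It assumes that two links sharing a node are compared through the Jaccard overlap of the neighborhoods of their other endpoints (each neighborhood including the node itself), and it groups links with a fixed, illustrative threshold instead of building a full dendrogram. The function name and the threshold value are my own choices for the example.

```python
from itertools import combinations

def link_communities(edges, threshold=0.5):
    """Group links whose similarity exceeds a threshold (single linkage).

    Assumes a simple undirected graph: no self-loops, no duplicate edges.
    """
    edges = [tuple(sorted(e)) for e in edges]
    # Inclusive neighborhood: a node together with its neighbors.
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, {u}).add(v)
        nbrs.setdefault(v, {v}).add(u)

    parent = {e: e for e in edges}  # union-find forest over links

    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]  # path halving
            e = parent[e]
        return e

    for e1, e2 in combinations(edges, 2):
        shared = set(e1) & set(e2)
        if len(shared) != 1:
            continue  # only links that share exactly one node are compared
        (i,) = set(e1) - shared
        (j,) = set(e2) - shared
        similarity = len(nbrs[i] & nbrs[j]) / len(nbrs[i] | nbrs[j])
        if similarity > threshold:
            parent[find(e1)] = find(e2)  # merge the two link groups

    groups = {}
    for e in edges:
        groups.setdefault(find(e), set()).add(e)
    return list(groups.values())

# Two triangles that share node "c": a tiny network with overlap.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "e")]
communities = link_communities(edges)
```

In this toy network every link ends up in exactly one of two communities (one per triangle), while node "c" has links in both, so it inherits two memberships. The overlap between communities comes out naturally, without any node having to be split.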

Using our algorithm, for example, we show that dense hairball-networks, such as the word association network (which is what is pictured above, right), contain highly organized internal structure with well defined and pervasively overlapping communities. We’re hoping that our algorithm will help reveal new insights about some of the many highly overlapping social networks, such as the LinkedIn data shown above.

Code for our algorithm may be downloaded here; that site also features a neat interactive visualization of the link clustering algorithm.

Note: This entry was originally posted on the Complexity and Social Networks Blog.