Erdös Number

The scientific version of the Bacon number is the Erdös number. Via a post on Finn Nielsen’s blog, I learned that I have a reasonably low Erdös number – three. (I also learned that Finn is one of the few people with a finite Erdös-Bacon number.) The reason for both Finn’s and my own low Erdös number is that my PhD advisor Lars Kai Hansen has co-authored a (highly cited) paper with Peter Salamon, who has an Erdös number of one. The links are:

  • P. Salamon and P. Erdös. The Solution to a Problem of Grünbaum, Canadian Mathematical Bulletin, 31: 129-138 (1988).
  • L.K. Hansen and P. Salamon. Neural Network Ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12: 993-1001 (1990).
  • S. Lehmann, M. Schwartz, and L.K. Hansen. Biclique Communities, Physical Review E, 78: 016108 (2008).
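For the record, an Erdös number is nothing more than a shortest-path distance in a co-authorship graph, so a chain like the one above is easy to verify in code. Here is a minimal sketch that encodes just the three papers listed above as edges; a real calculation would, of course, require a full bibliographic database:

```python
# Minimal sketch: the Erdös number is the shortest-path distance to Erdös
# in a co-authorship graph. The edge list below encodes only the chain from
# this post; a real calculation would use a full bibliographic database.
import networkx as nx

coauthorship = nx.Graph()
coauthorship.add_edges_from([
    ("P. Erdös", "P. Salamon"),     # Canadian Mathematical Bulletin (1988)
    ("P. Salamon", "L.K. Hansen"),  # IEEE TPAMI (1990)
    ("L.K. Hansen", "S. Lehmann"),  # Physical Review E (2008)
])

print(nx.shortest_path_length(coauthorship, "S. Lehmann", "P. Erdös"))  # 3
```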

With respect to the Erdös-Bacon number, I could make the case that I should have a number of four. The reason is that I actually appear in the documentary Connected – The Power of Six Degrees (just an uncredited half-second shot of me sitting at my computer), which features my ex-boss and renowned scientist Albert-László Barabási. Here’s the trailer:

But since I don’t appear on IMDb, I guess it doesn’t really count…

Pervasive Overlap

Just recently, I came across the following video, in which LinkedIn chief scientist DJ Patil explains the egocentric networks (networks consisting of an individual and their immediate friends) of a few individuals, based on their LinkedIn connections.

Although the individuals in the center of these egocentric networks are unusual (in the sense that they have many more LinkedIn connections than the average user), the video clearly shows that each person is a member of multiple communities; each community is dense and almost fully connected internally, while there are far fewer connections between communities. (If any of this sounds familiar, it’s because I wrote about this subject a couple of months ago on the Complexity and Social Networks Blog).

This notion of social structure implies that — seen from the perspective of a single node — everything is relatively simple: the world breaks neatly into easily recognizable parts (e.g. family, co-workers, and friends). There are few or no links between the communities because we actively work to keep them separate (more here, on why this is the case).
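This local picture is easy to see in code: an egocentric network is just the subgraph induced by a node and its neighbors, and removing the ego is what exposes the separate clumps. A minimal sketch, with a random toy graph standing in for real (and private) LinkedIn-style data:

```python
# Toy sketch of an egocentric network: the chosen node ("ego") plus its
# immediate neighbors and all links among them. A random graph stands in
# for real social-network data here.
import networkx as nx

G = nx.erdos_renyi_graph(n=200, p=0.05, seed=1)
ego = 0
ego_net = nx.ego_graph(G, ego, radius=1)  # ego plus direct neighbors
print(ego_net.number_of_nodes(), ego_net.number_of_edges())

# Removing the ego itself is what reveals the structure Patil shows in the
# video: in a real social graph the remaining neighbors tend to split into
# a few dense clumps (family, co-workers, friends, ...).
neighbors_only = ego_net.copy()
neighbors_only.remove_node(ego)
print(nx.number_connected_components(neighbors_only))
```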

I’ve been thinking about the consequences of this local structure for a while, and recently co-authored a paper on this subject with YY Ahn and Jim Bagrow [1]. Here, and in an upcoming blog post, I’ll be writing about some insights from that work.

The idea I hope to explore here has to do with the global structure that arises when all nodes in a network have multiple community affiliations, i.e. when there is pervasive overlap. In the follow-up, I’ll explore how a single hierarchical organization of the network can exist in the presence of pervasive overlap.

Untangling the hairball

In the standard view of communities in networks, the global structure is modular [2]. This situation is shown below (left), where the communities are labeled using different colors (image from gephi.org). Modular structure on the global level implies, however, that individual nodes can have only a single community affiliation!

If every node is a member of more than one community — and this is clearly the case in the LinkedIn example, as well as in real social networks — then the global structure of the network is not at all modular. Rather, the network will be a dense mess with no visually discernible structure. The network will look like a ball of yarn … or a hairball (above, right). In fact, this is precisely the type of structure that has recently been discovered in empirical investigations of a comprehensive set of large networks (social and otherwise) [2, 3].

So the question becomes: How do we find network communities in the hairball? This is the question YY, Jim and I answer in Ref. [1]. The trick is that although nodes have many community memberships, each link typically corresponds to a single context. For example, the link you have to one coworker is similar to the links you have to other coworkers. Thus, by formulating community detection as a question of categorizing links rather than nodes, we are able to detect communities in networks with pervasive overlap.
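To give a flavor of what categorizing links means in practice, here is a stripped-down illustration (not a substitute for the released implementation linked below): two links that share a node can be compared by asking how much the neighborhoods of their non-shared endpoints overlap, for example via a Jaccard similarity. Links that score high on such a measure end up grouped together.

```python
# Illustrative only: compare two links that share a node by the Jaccard
# similarity of the (inclusive) neighborhoods of their non-shared endpoints.
# Links to co-workers tend to score high against each other and low against
# links to family members, which is what makes link-based grouping possible.
import networkx as nx

def link_similarity(G, edge_a, edge_b):
    """Similarity of two edges that share exactly one node."""
    shared = set(edge_a) & set(edge_b)
    if len(shared) != 1:
        raise ValueError("edges must share exactly one node")
    i = (set(edge_a) - shared).pop()
    j = (set(edge_b) - shared).pop()
    n_i = set(G[i]) | {i}  # inclusive neighborhood of i
    n_j = set(G[j]) | {j}  # inclusive neighborhood of j
    return len(n_i & n_j) / len(n_i | n_j)

G = nx.karate_club_graph()
print(link_similarity(G, (0, 1), (0, 2)))
```

The snippet only shows a single pairwise comparison; grouping all of a network's links on the basis of such similarities is what the full algorithm (and the code linked below) is for.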

Using our algorithm, for example, we show that dense hairball networks, such as the word association network (which is what is pictured above, right), contain highly organized internal structure with well-defined and pervasively overlapping communities. We’re hoping that our algorithm will help reveal new insights about some of the many highly overlapping social networks, such as the LinkedIn data shown above.

Code for our algorithm may be downloaded here; that site also features a neat interactive visualization of the link clustering algorithm.

Note: This entry was originally posted on the Complexity and Social Networks Blog.

References

High Throughput Humanities: Final Call for Abstracts

Note: This post was originally posted on the Complexity and Social Networks Blog.

A quick reminder that April 30th is the final chance to submit an abstract to the High Throughput Humanities Workshop that I’m organizing along with Riley Crane, Gourab Ghoshal, and Max Schich at this year’s European Conference on Complex Systems in Lisbon this September (I wrote about this in more detail a couple of months ago).

We have an amazing Program Committee that includes:

  • Albert-László Barabási, CCNR Northeastern University, USA.
  • Guido Caldarelli, INFM-CNR Rome, Italy.
  • Gregory Crane, Tufts University, USA.
  • Lars Kai Hansen, Technical University of Denmark.
  • Bernardo Huberman, HP Laboratories, USA.
  • Martin Kemp, Trinity College, Oxford, UK.
  • Roger Malina, Leonardo/ISAST, France.
  • Franco Moretti, Stanford University, USA.
  • Didier Sornette, ETH Zurich, Switzerland.

Full details can be found at the workshop website http://hth.eccs2010.eu/. There’s even a neat little introductory video (from our talk at Ignite Boston 7):

We hope you will submit an abstract!

Worlds Colliding

Note: This post was originally posted on the Complexity and Social Networks Blog.

During a press conference at last week’s SXSW, Todd Jackson, product manager of Google’s Gmail team, revealed an interesting bit of information about the company’s problem-ridden new service, Google Buzz:

Jackson told the crowd, as he’s previously said to reporters, that too much was assumed about how Buzz would work best and be received based on Google’s internal testing. Google employees didn’t have a strong use case for “muting” their fellow Google employees, and the people they’d want to follow and be followed by closely matched up to their contact lists. In general, too, Jackson suggested that Google underestimated the impact of “having a social, public service appear inside … what is a very private thing (email) for some people” [1].

So by testing their social service inside a single context (Google employees only), the developers failed to notice that in real life, people participate in multiple contexts (family, work, friends, etc.) that they actively work to keep separate. The reasons for keeping these groups apart range from hiding an illicit affair from a spouse to political activists in oppressive regimes hiding certain connections from the government [2]. Another important reason to keep our communities separate is that we often play different roles – and communicate differently – in different contexts, as illustrated beautifully in the following clip from TV’s Seinfeld:

So, ironically, the key problem for Buzz, Google’s social network service, was that the engineers at the Googleplex had failed to understand an essential property of real-world social networks. Figure 1 illustrates the problem:

[Figure 1: google_vs_real.jpg]

Figure 1A shows a cartoon version of Google’s internal testing situation. It’s clear that in this situation, since an individual (the gray node) belongs to only a single social context, sharing contact information with his neighbors reveals no new information to his social network. However, an ego-centered network in the wild looks more like the situation depicted in Figure 1B. Here, the gray node is a member of several communities (nodes with different colors), with very little communication between the communities. Now, because people typically manage all of their ‘worlds’ from their email inbox, what Google did when it created Buzz’s automatic friends lists was to implicitly link people’s worlds, revealing precisely the information that people work to suppress – sometimes with serious implications.

It is interesting to consider what the structure displayed in Figure 1B implies for the full graph. For an individual, the world breaks neatly into a small set of social contexts, but when every single node is in this situation, the resulting total structure becomes very different from many of the model networks that are currently in use. In my own corner of the complex networks world, this has serious implications for the rapidly growing field of community detection [3]. Currently, most algorithms are designed to search for densely connected sets of nodes that are weakly connected to the rest of the network, and while some methods do include the possibility of community overlap, most break down if the overlap constitutes more than a small fraction of the number of nodes. If Figure 1B is correct and overlap is present for all nodes, then the idea of communities as weakly connected to the remainder of the network is false — since communities will have many more links to the outside world than to the inside.
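To see why, consider a deliberately crude toy model (all numbers invented for illustration): assign every node to a handful of communities, wire up each community densely, and then count, for a single community, the links that stay inside it versus the links its members send elsewhere.

```python
# Back-of-the-envelope check of the claim above: when every node belongs to
# several communities, each community ends up with more links leaving it than
# staying inside it. Toy parameters, for illustration only.
import itertools
import random

import networkx as nx

random.seed(42)
n_nodes, n_comms, memberships_per_node, p_within = 300, 30, 3, 0.3

members = {c: set() for c in range(n_comms)}
for node in range(n_nodes):
    for c in random.sample(range(n_comms), memberships_per_node):
        members[c].add(node)

G = nx.Graph()
G.add_nodes_from(range(n_nodes))
for c in range(n_comms):
    for u, v in itertools.combinations(members[c], 2):
        if random.random() < p_within:
            G.add_edge(u, v)

c0 = members[0]
internal = sum(1 for u, v in G.edges() if u in c0 and v in c0)
boundary = sum(1 for u, v in G.edges() if (u in c0) != (v in c0))
print(internal, boundary)  # boundary typically exceeds internal
```

With parameters like these, the boundary count comes out well above the internal count, which is exactly the point: once overlap is pervasive, “dense inside, sparse outside” stops being a reliable signature of a community.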

I hope to see more research investigating this problem!

Oh – and George Costanza gets to have the last word…

Update April 3rd, 2010

I’ve just become aware of a few excellent blog posts that discuss problems related to Buzz, drawing on ideas very similar to those I present above. Fred Stutzman writes eloquently about Buzz and colliding worlds, inspired by Erving Goffman, here. That post sparked additional ‘world-colliding’ thoughts from David Truss (via this post from George Siemens).

References

High Throughput Humanities

Note: This post was originally posted on the Complexity and Social Networks Blog.

Along with Riley Crane (of DARPA Challenge and Colbert Report fame), physicist Gourab Ghoshal, and quantitatively minded art historian Max Schich, I’m putting together a workshop on High Throughput Humanities as a satellite meeting at this year’s European Conference on Complex Systems in Lisbon this September. The general idea is to bring together people who ask interesting questions of massive data sets. More specifically – as the title implies – we want to figure out how to use computers to do research in the humanities in a way that extends beyond what can currently be accomplished by human beings.

Entire libraries are in the process of being scanned, and we would like to begin investigating questions like: Are there patterns in history that are currently ‘invisible’ due to the fact that humans have limited bandwidth – that we can only read a small fraction of all books in a lifetime?

We have an exciting program committee, so it should be an interesting day!

Confirmed Programme Committee Members

  • Albert-László Barabási, CCNR Northeastern University, USA.
  • Guido Caldarelli, INFM-CNR Rome, Italy.
  • Gregory Crane, Tufts University, USA.
  • Lars Kai Hansen, Technical University of Denmark.
  • Bernardo Huberman, HP Laboratories, USA.
  • Martin Kemp, Trinity College, Oxford, UK.
  • Roger Malina, Leonardo/ISAST, France.
  • Franco Moretti, Stanford University, USA.
  • Didier Sornette, ETH Zurich, Switzerland.

Practical information can be found at the conference website. Oh, and did I mention that Lisbon is beautiful in September! Sign up and join us. The workshop abstract is reprinted below.

Abstract

The High Throughput Humanities satellite event at ECCS’10 establishes a forum for high throughput approaches in the humanities and social sciences, within the framework of complex systems science. The symposium aims to go beyond massive data acquisition and to present results beyond what can be manually achieved by a single person or a small group. Bringing together scientists, researchers, and practitioners from relevant fields, the event will stimulate and facilitate discussion, spark collaboration, and connect approaches, methods, and ideas.

The main goal of the event is to present novel results based on analyses of Big Data (see NATURE special issue 2009), focusing on emergent complex properties and dynamics, which allow for new insights, applications, and services.

With the advent of the 21st century, increasing amounts of data from the domain of qualitative humanities and social science research have become available for quantitative analysis. Private enterprises (Google Books and Earth, YouTube, Flickr, Twitter, Freebase, IMDb, among others) as well as public and non-profit institutions (Europeana, Wikipedia, DBpedia, Project Gutenberg, WordNet, Perseus, etc.) are in the process of collecting, digitizing, and structuring vast amounts of information, and creating technologies, applications, and services (Linked Open Data, Open Calais, Amazon’s Mechanical Turk, reCAPTCHA, ManyEyes, etc.), which are transforming the way we do research.

Utilizing a complex systems approach to harness these data, the contributors to this event aim to make headway into the territory of traditional humanities and social sciences, understanding history, arts, literature, and society on global, meso, and granular levels, using computational methods to go beyond the limitations of the traditional researcher.