CCCSS Workshop

The program for our workshop on December 15th is finally available for your reading pleasure. We’ll be talking about how to design a great sociometer experiment – and what the most exciting research questions are. Note that the workshop is open to the public, so if you’re in (or near) Copenhagen, do stop by!

We do have limited seating, so please send an email to David (ddl@econ.ku.dk) by December 12th, if you plan on attending.

Workshop:

The Copenhagen Center for Computational Social Science Inaugural Workshop. December 15th, 2011. Organized by Anders Blok, Søren Kyllingsbæk, David Dreyer Lassen, Morten Axel Pedersen and yours truly.

Abstract:

At last year’s Techonomy Conference, former Google CEO, Eric Schmidt, noted: “There was 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing”. This massive increase in the rate of data generation has opened up new possibilities for computational investigations of human behavior. We – a multi-disciplinary team of scholars from the Faculty of Social Sciences at the University of Copenhagen and the Technical University of Denmark – are interested in taking advantage of the recent technological developments in order to push the current boundaries of quantitatively based understandings of social systems.

Specifically, the aim of our proposed research program CCCSS (Copenhagen Center for Computational Social Science) is to record the network of social interactions with very high resolution (both in terms of temporal sampling and number of recorded communication channels) by using smart phones as sensors for sampling a variety of communication channels, e.g. face-to-face via Bluetooth, geolocation via GPS, social network data (Facebook, Twitter) via apps, and telecommunication data via call logs. Based on this highly complex and dynamic network, we want to develop computational (mathematical) approaches to describe the underlying social system. In addition to this overall goal, we are interested in five concrete themes, which will support and inform our efforts to formulate a general theoretical framework spanning different scientific disciplines:

  1. Incomplete data and sampling. The significance of having access to only a small fraction of the full data in a networked system is poorly understood at present. We will use our findings from this high-resolution sample as a tool to understand much larger ‘low resolution’ data sets describing millions of individuals and billions of interactions.
  2. Information stored in relationships. We know, in a casual sense, that it is possible to learn about a person by the company she keeps. We show that we can quantify this notion in a social network and we study to what extent our behavioral patterns are encoded in our social relations.
  3. Influence in social systems. We wish to study how influence spreads in social systems, which is a problematic issue in most datasets. Our experimental setup allows us to probe causal issues by running controlled interventions; we will be able to run field experiments to test our hypotheses.
  4. Methodological experiments and their epistemological effects. For a long time, social scientific methods have been split according to a qualitative/quantitative divide. Based on our experiment, we want to explore how new high-resolution datasets may shift the terms of this debate. As part of this effort, we also wish to investigate what the increasing use of digital setups in social network analysis means for the nature of the (social) scientific experiment.
  5. Privacy and ethics in social network research. We explore the question of privacy and develop novel strategies to ensure that our research (and the research of others working on similar topics) does not violate individual and collective rights to privacy.

Program:

Thursday, December 15th, 2011:

  • 9.30 Coffee
  • 10.00 Sune Lehmann: Introduction
  • 11.00 Martin Raubal: Socially informed location-based knowledge discovery
  • 12.00 Lunch
  • 13.00 Daniele Quercia: Personality and Language in Social Media
  • 14.00 Tea
  • 14.30 Alan Mislove: Privacy in Online Social Networks
  • 15.30 Matt Candea: The quantity and quality of gaps: On the value of not knowing certain things

Note that we’ll follow the format 30 min. + discussion for all talks.

Venue:

The Seminar Room (2nd floor, CSS 26.2.21)
Department of Economics
Building 26, Centre of Health and Society (CSS)
Øster Farimagsgade 5 (http://g.co/maps/3f5dd)
1353 København, Denmark

Google’s generosity goes to zero!

According to the best of my calculations, the growth of Gmail storage is linear in time. Today, I recorded the amount of storage at two different times and found the rate of storage growth to be about 4.06541 bytes per second. This is consistent with Wikipedia’s report that, as of Jan 18th, 2010 Gmail’s storage was increasing at a rate of approximately 0.000004 MB per second. In other words, Google is giving away space at a constant rate.
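For anyone who wants to repeat the measurement, here is a minimal Python sketch of the two-reading estimate. The readings in the example are made up for illustration (they are not the numbers from my account), and the conversion assumes 1 MB = 10^6 bytes.

```python
def growth_rate_bytes_per_second(bytes_a, time_a, bytes_b, time_b):
    """Slope of the straight line through two (time, storage) readings.

    Times are in seconds, storage in bytes.
    """
    if time_b == time_a:
        raise ValueError("the two readings must be taken at different times")
    return (bytes_b - bytes_a) / (time_b - time_a)


# Hypothetical readings taken one hour (3600 s) apart.
rate = growth_rate_bytes_per_second(7_000_000_000, 0.0, 7_000_014_635, 3600.0)

# Convert to MB per second for comparison with the Wikipedia figure,
# assuming 1 MB = 10^6 bytes.
rate_mb_per_s = rate / 1_000_000
```

Two readings only give a slope. Checking that the growth really is linear would take at least three readings spread over a longer period.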

Now, since the price of hard drive storage space seems to drop exponentially (over the last 30 years, space per unit cost has doubled roughly every 14 months, increasing by an order of magnitude every 48 months), this implies that Google is paying exponentially less for their new hard drive space [1]. The only reasonable conclusion is that Google’s generosity is rapidly approaching zero!

Just to be extra silly, I actually plugged in the growth data from my own account and used the regression fit from the site above in order to estimate the cost per Gmail account as a function of time.

Full disclosure: There are a number of problems with the approach of estimating the cost of an account as current storage multiplied by current cost of storage. Let me just mention some of them here for transparency. Firstly, my storage price is based on consumer hardware prices, and I’m betting that Google can probably get some kind of bulk deal. Secondly, I assume that Google has some kind of backup system in place, which increases the need for storage beyond the account size reported by Google. Finally and most importantly, the correct price for storage over time should probably be estimated as accumulated price paid for hardware at time t compared with the total amount of storage offered for free at time t.
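Here is a minimal Python sketch of the naive estimate described above: current storage multiplied by current cost of storage. Only the growth rate (4.06541 bytes per second) and the 14-month doubling time come from the text. The starting storage of 7.5 GB and the starting price of $0.10 per GB are made-up placeholders, not values from the regression fit, and the sketch shares all the problems listed in the disclosure.

```python
import math

DOUBLING_MONTHS = 14.0        # from the text: space per unit cost doubles every ~14 months
RATE_BYTES_PER_S = 4.06541    # from the text: measured linear growth of Gmail storage
SECONDS_PER_MONTH = 30.44 * 24 * 3600  # average month length, an approximation


def months_per_tenfold(doubling_months=DOUBLING_MONTHS):
    """Months needed for space per unit cost to grow by a factor of 10."""
    return doubling_months * math.log2(10)


def price_per_gb(months, price_now=0.10, doubling_months=DOUBLING_MONTHS):
    """Price per GB after `months`; halves every doubling period.

    price_now is a hypothetical placeholder in dollars.
    """
    return price_now * 0.5 ** (months / doubling_months)


def storage_gb(months, storage_now=7.5, rate=RATE_BYTES_PER_S):
    """Linearly growing account size in GB (1 GB = 10^9 bytes).

    storage_now is a hypothetical placeholder.
    """
    return storage_now + rate * months * SECONDS_PER_MONTH / 1e9


def account_cost(months):
    """Naive cost of one account: current storage times current price."""
    return storage_gb(months) * price_per_gb(months)
```

Because the price falls exponentially while the storage only grows linearly, account_cost shrinks toward zero as the number of months grows. The same doubling time gives a tenfold improvement roughly every 46.5 months, close to the 48 months quoted above.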

And there’s one final problem with the linear growth of storage. The issue becomes extra noticeable because all this cheap storage also applies to our personal computers … and to the average attachment size, which is probably growing in proportion to the size of the hard drive it was sent from [2]. What this means is that we’re likely to use up Gmail storage space at a rapidly increasing rate.

I’m not saying that this is a violation of the “don’t be evil” maxim. It’s just that I’m running out of inbox space and don’t want to pay for additional storage.

Footnotes

[1] See also http://ns1758.ca/winch/winchest.html for more info on historical hard drive pricing.

[2] I don’t really have data to support this claim, but it sounds reasonable to me.

On ‘Frictionless Sharing’

If you like sharing everything and if you think that pressing a ‘like’-button is too much work, you’re going to love Facebook’s new frictionless sharing. If you like to steal a private moment once in a while and sometimes try to pretend to be cooler than you are, you might not like it so much.

Ok, first, let’s recap the basic idea behind frictionless sharing: If one of Facebook’s ‘social plugins’ is installed on a site you’re visiting, you’re automatically sending anything you read into your Facebook news feed. And the only clicking you’ll have to do is the actual clicking through to the article. Oh, and the final piece of good news is that you don’t even have to be logged into Facebook for the social plugins to work [update: it seems that issue has actually been fixed].

One reason I think this might end badly is that an essential part of the Facebook experience is the pleasure of carefully creating a gently improved online/external version of who you are. I’m not sure people are going to like when that aspect is slowly eroded away.

My favorite example of when this external persona comes into being is when you explain to people what kind of music you like. It’s nearly impossible (for me at least) not to bring up the coolest music that you listen to, rather than the music you like the most. For example, you might mention Animal Collective’s Merriweather Post Pavilion as your favorite album [1], rather than point out that a mix including Bangles’ Eternal Flame, Christina Aguilera’s Beautiful and Bon Jovi’s Bad Medicine has a power-law-tail-worthy play count in your iTunes player. Not being allowed to construct an idealized version of yourself is a bit like being forced to always wear a t-shirt, jeans, and flip flops.

However, sharing everything has other downsides, the most important of which is that ‘oversharing’ might rob us of the ability to steal a moment once in a while. Let me try and explain why that might be a problem:

A recent post on kottke.org has the title “Why is Sergey Brin so good at Angry Birds”. Kottke writes:

I spent perhaps too much time this morning pondering one of the mysteries of the internet: Sergey Brin’s astronomically high scores on the Google+ version of Angry Birds. For instance, Brin’s high score on the easiest level of the game is 36240. It’s a legit score (here’s a higher one) and he has impressive scores on several other levels.

It’s a neat observation [2], but the crucial point of the story is that everyone is left to wonder: ‘Why is Brin spending his time playing Angry Birds, when he should be at work running Google?’ And most of us aren’t even shareholders.

Or the other day, on my way home from work, I noticed that the fall sunlight was particularly golden — and on a whim, I took a small detour to enjoy a couple of additional minutes outside before returning home to help with tired/moody toddler care (including diaper changes) and other post-work chores. Without frictionless sharing, I can still get away with stuff like that, but I’m wondering what my wife would have thought if Facebook had posted something like ‘Sune took a detour in the sun today’, while she was at home working hard to rein in a tired 1.3 year old.

Now, I could (and would) certainly argue that stealing a moment was a good idea – that a couple of minutes of unplanned meandering once in a while is what keeps me (and, I think, other people … for example Sergey Brin) sane in an increasingly busy world. And I’m also pretty sure that I could have convinced my wife that that detour was not a waste of time. The problem is that having to explain that moment would have kind of ruined it. So if I had known that my stolen moment would be actively shared by Facebook, I probably would have gone straight home.

And that’s the problem: It’s not that you can’t still steal a moment with frictionless sharing. It is the fact that you might have to justify each one that might ruin those moments; perhaps even make you decide not to steal any more moments. And that seems to me like something almost worse than a simple invasion of privacy.

Let me know what you think in the comments!

Notes

[1] Ok, so that’s probably not a hip album anymore, but I’m much too busy to be hip these days.

[2] Also note that Kottke is making excuses for stealing a moment to ponder silly stuff like Sergey Brin’s Angry Birds score.

More on TweetQuakes

A few days ago, I wrote (with Alan Mislove) about our TweetQuake visualization (read the relevant post here). Some of the commenters pointed out that it’s not really surprising that tweets travel faster than earthquakes. Here’s Andrew Gelman (I don’t know if it’s that famous Andrew Gelman, but I think so) commenting on The Monkey Cage Blog:

And he’s right. Information traveling via optical fiber is about as fast as anything you can find in the universe (and as Gelman points out, other important examples of rapid communication technology include telephone/radio communication). This much was even clear to yours truly when I read the xkcd comic no. 723 back in April of 2010. I tweeted:

i guess it’s somewhat trivial, but nonetheless – it seemed profund when i read it: tweets are faster than earthquakes http://bit.ly/a7w0MY

So why did it seem profound when I read the comic? Why is it still interesting that Twitter is faster than an earthquake? The fact that the news of the earthquake on twitter spreads faster geographically than the earthquake itself is something non-trivial and profound.

And I think I can explain why. Until now, we’ve categorized earthquakes among events that happen so quickly that they’re instantaneous for all intents and purposes. An event that propagates at between 6,700 and 11,200 miles per hour is incredibly fast.
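The miles-per-hour figures can be checked with a one-line unit conversion. This small Python sketch assumes they correspond to the 3-5 km/second range for damaging seismic waves quoted from the xkcd strip in the original TweetQuake post; the function name is just for illustration.

```python
KM_PER_MILE = 1.609344  # exact, by definition of the international mile


def km_per_second_to_mph(speed_km_s):
    """Convert a speed in km/second to miles/hour."""
    return speed_km_s * 3600 / KM_PER_MILE


slow_mph = km_per_second_to_mph(3.0)  # lower end of the assumed range
fast_mph = km_per_second_to_mph(5.0)  # upper end of the assumed range
```

Rounded to the nearest hundred, the two results are 6,700 and 11,200 miles per hour.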

So the surprise is not that electronic signals are fast, but that a news medium (i.e. Twitter/Facebook) can deliver news faster than things that used to be instantaneous. That is what is new (and kind of awesome)!

But not that awesome – because even though you know the earthquake is coming before it hits, there’s still not really time to react properly to the threat; the earthquake will still be there in a few seconds’ time. And the Twitter advertisement team picked up on just this fact in their most recent advertisement, embedded below.

The message is clear: You do get the news about the quake arriving, but it doesn’t really change anything.

But let’s dig a little deeper. Last year, when we created the twitter Pulse of the Nation visualization (check it out here if you haven’t seen it), I came up with a highly speculative (and self-important) analogy that I love to talk about.

The general idea is that even though the importance of individual tweets is highly variable, something interesting begins to happen when we look at thousands, millions, or even billions of them. I wrote:

In analogy to individual neurons firing together to add up to the human consciousness, the billions of tweets have meaningful macro-states that contain information about the whole system rather than the individual tweeters. But we need to do a little data mining to extract meaningful information about these states, to expose our collective states of mind. [quoted from here]

Now, I think the earthquake visualization can be thought of as a manifestation of the same kind of phenomenon. If the twitterverse is to be taken seriously as some kind of global-scale nervous system, the earthquake response is not something like the state-of-mind or consciousness that I claimed the mood was.

The earthquake response is something closer to that ultra fast reflex that kicks in right before you’re unavoidably punched in the face. Like the guy in the movie below at around 16 seconds in. Notice him closing his eyes and clenching his facial muscles tightly in anticipation:

He knows something uncomfortable is coming, but has to hang tight and hope that it’s not too tough. And that’s the type of edge that twitter has given us with respect to the earthquake.

Let me know what you think in the comments!

TweetQuake

This is a joint post with Alan Mislove, based on our work with Yong-Yeol Ahn and Chloe Kliman-Silver.

On August 23, 2011, at 1:51 PM EDT a magnitude 5.8 earthquake hit the Piedmont region of the U.S. state of Virginia. Orders of magnitude smaller than the recent earthquake in Japan, this quake was nonetheless the largest in the U.S. east of the Rocky Mountains in 114 years (according to Wikipedia).

But why are we talking about earthquakes? We should be talking about people talking about earthquakes. And people really did some talking. The official twitter account (@twitter) posted three back-to-back tweets on the subject:

Are Tweets faster than seismic waves? We can’t speak to speed of seismic waves, but a Tweet can reach your followers in less than a second. [link]

Within a minute of today’s #earthquake, there were more than 40,000 earthquake-related Tweets. [link]

And, we hit about 5,500 Tweets per second (TPS). For context, this TPS is more than Osama Bin Laden’s death & on par w/ the Japanese quake. [link]

Now, as I am sure many people have already pointed out (e.g. on twitter), this situation was deftly analyzed and anticipated by Randall Munroe, author of the wonderful webcomic xkcd back in April 2010. Here’s the strip:

[Image: seismic_waves.png]

As Munroe points out, the speed of “damaging” seismic waves is around 3-5 km/second, which is much slower than the speed of information spreading on the internet. This simple fact means that if you’re more than 100 km away from the epicenter you can read about the quake on twitter before it hits you.

Now, combine the idea from the xkcd strip with data from the tweetquake and it’s possible to observe this phenomenon in practice. In the visualization below, we’ve generated a video of the mentions of the word “earthquake” in tweets from the gardenhose in the 5 minutes immediately following the earthquake. For simplicity, we have assumed a uniform 4 km/s wave and ignored deformations due to map projections, etc. (we’re not geologists, after all).

The comic strip doesn’t factor in the time it takes to actually write a tweet, and since seconds count, it takes more than 100 km before we see tweets posted outside the wavefront (validating the last frame of the comic strip). It is awe inspiring to see a truly real time news medium in action.
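The wavefront logic can be written down as a back-of-the-envelope Python sketch, assuming the same uniform 4 km/s wave as in the visualization. The function names are illustrative assumptions and this is not the code behind the video.

```python
WAVE_SPEED_KM_S = 4.0  # uniform wave speed assumed in the visualization


def wave_arrival_seconds(distance_km, speed_km_s=WAVE_SPEED_KM_S):
    """Seconds after the quake at which the wave reaches a given distance."""
    return distance_km / speed_km_s


def wavefront_radius_km(seconds_since_quake, speed_km_s=WAVE_SPEED_KM_S):
    """Distance from the epicenter the wave has covered so far."""
    return speed_km_s * seconds_since_quake


def tweet_is_ahead_of_wave(distance_km, seconds_since_quake):
    """True if a tweet from this distance appeared outside the wavefront."""
    return distance_km > wavefront_radius_km(seconds_since_quake)


def break_even_distance_km(delay_seconds, speed_km_s=WAVE_SPEED_KM_S):
    """Distance beyond which a tweet with this total delay beats the wave.

    delay_seconds covers noticing the quake, typing, and delivery.
    """
    return speed_km_s * delay_seconds
```

For example, at 4 km/s the wave takes 25 seconds to travel 100 km, so a first tweet that takes 25 seconds to notice, write, and deliver only gets ahead of the wave for readers more than 100 km from the epicenter.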

Link communities R package

A while ago, I wrote about Rob Spencer over at Scaled Innovation’s implementation of the algorithm for detecting link communities. Today, I am happy to report on another exciting development for the algorithm. Alex Kalinka from the Tomancak lab at the Max Planck Institute (MPI-CBG) has written a great implementation in R, called linkcomm. It is now up on CRAN:

http://cran.r-project.org/web/packages/linkcomm/index.html

While everything is excellent, the graphics are particularly beautiful – much prettier than our own visualizations – check out the colored link dendrogram plot (from the CRAN website)

And the spatial network layout options are great as well; the various community visualizations are simple, elegant, and very pretty:

The panel on the left shows a 'Spencer circle' layout, while the panel on the right shows a Fruchterman-Reingold layout. From the linkcomm documentation.

In addition, there are many neat features. For example, linkcomm allows you to visualize sub-communities by themselves. Alex has also published an Application Note in Bioinformatics about the implementation, so take a look if you’re interested:

http://bioinformatics.oxfordjournals.org/content/early/2011/05/19/bioinformatics.btr311.abstract (open access).

We also link to the package from our link communities download page.

Tu Vuò Fà L’Americano

I’m excited to leave Boston for a bit to participate in ARS’11: The Third International Workshop on Social Network Analysis, Collaboration Networks and Knowledge diffusion: Theory, Data and Methods. It takes place in Naples, Italy this week, and the speaker line-up looks exciting (despite the fact that they invited me) [1].

Here’s a bit of text from the official description:

ARS’11 International Workshop is a follow up to two very successful previous editions (ARS’07 and ARS’09) and will be held on June 23-25, 2011 in Naples (Italy).
Collaboration networks attract a lot of attention in many fields and are considered a key element in the advancement and dissemination of knowledge in scientific as well as in socio-economic domains. The workshop has the objective of presenting the most relevant results and recent developments in the areas of Collaboration Networks, Innovation Networks and Knowledge Diffusion.

The workshop also aims to deepen existing scientific cooperation between Social network analysts, to establish new cooperation between researchers, and to provide a forum for exchange of ideas among them.

The workshop topics include:

  • Collaboration theory
  • Analysis of innovation networks in economics environments
  • Sources of collaboration data
  • Social Network Analysis methods for collaboration data

Notes:

[1] I stole the idea for this elegant, faux self deprecating plug from Aaron Clauset’s blog.

Back in the USA

I’m delighted to report that I’m back in Boston for the summer. The next couple of days (May 31st and June 1st), I’ll be attending the Interdisciplinary Workshop on Information and Decision in Social Networks, which looks to be really exciting.

And for the next couple of months I’ll primarily be at the Center for Complex Network Research at Northeastern University. So do look me up if you’re in town.

Whitman

I recently came across the following Whitman poem:

When I Heard the Learn’d Astronomer

When I heard the learn’d astronomer,
When the proofs, the figures, were ranged in columns before me,
When I was shown the charts and diagrams, to add, divide, and measure them,
When I sitting heard the astronomer where he lectured with much applause in the lecture-room,
How soon unaccountable I became tired and sick,
Till rising and gliding out I wander’d off by myself,
In the mystical moist night-air, and from time to time,
Look’d up in perfect silence at the stars.

This poem beautifully captures the feeling that when you quantitatively analyze something (be it Nature or literature), it often feels like some of the initial beauty and magic of the phenomenon disappears [1].

As a scientist, you run into the position that a scientific viewpoint somehow diminishes ‘beauty and magic’ once in a while, so it’s good to have an answer. My own reply is that while it’s true that analysis tends to strip many phenomena of some kind of immediate (and often trivial) appeal, digging deeper almost always reveals new layers of beauty.

I had developed some examples to go along with this argument, based on my own experiences, but a couple of years ago, I watched an interview with Richard Feynman [2], and his answer is so much better than mine that I’ll leave the rebuttal of Whitman to him:

Postscript

After writing the above, I googled the poem – I guess I should have done that before writing – and found a lot of fun/interesting discussions. One commenter pointed to a modern version of Whitman’s standpoint courtesy of the Insane Clown Posse (from Miracles, 2009):

Water, fire, air and dirt
Fucking magnets, how do they work?
And I don’t wanna talk to a scientist
Y’all motherfuckers lying, and getting me pissed.

Check out the pages below for more. Particularly the comment thread for the first post is a treasure trove:

References

[1] My own favorite example is that – when conditions are good – there are 9110 stars visible to the unaided human eye. I’m pretty sure that bringing up this factoid could ruin a romantic evening under the stars. Anyway, I’m rambling.

[2] From the BBC program Horizon. Interview recorded in 1981 – the whole thing is highly recommended.