Have you recently finished your PhD? And would you like to come to Denmark to work with deep learning on an amazing dataset? Then keep reading. There’s a great opportunity for DTU funding that we can apply for together.
Proposal: Deep learning, network structure, and language on Twitter
Based on a massive dataset (10% of all tweets going back to 2012), we wish to study the interplay between language and network structure. Specifically, we wish to study the interplay between language evolution and network evolution across time (effectively the co-evolution of language and network structure).
As part of the grant application, you will help shape the research questions, but a rough idea would be to use deep learning approaches (word embeddings, LSTMs) to represent the language component, and state-of-the-art network science approaches for the network evolution.
At the time of recruitment (1 July 2017) applicants must not have resided or carried out their main activity in Denmark or at DTU for more than 12 months in the 3 years immediately prior to recruitment (excl. holidays and short visits).
Successful applicants must move to Denmark by the time of employment at the latest.
The applicant must, by the time of recruitment (1 July 2017), be in possession of a doctoral degree or have at least 4 years of full-time equivalent research experience.
Renowned network scientist and creator of InfoMap (probably the world’s best community detection algorithm for complex networks), Martin Rosvall, is visiting Copenhagen. And I’ve managed to convince him to visit DTU to give a talk!
Martin is an associate professor at the department of physics at the university of Umeå (Sweden). He’s an accomplished author of many highly cited papers, and a great speaker. Thus, I strongly recommend you come see his talk.
The details are below:
Time: Wednesday December 7, 11:00am
Place: Technical University of Denmark. Building 321, 1st floor Lab Space.
Title: Maps of sparse Markov chains efficiently reveal community structure in network flows with memory
Abstract: To better understand the flows of ideas or information through social and biological systems, researchers develop maps that reveal important patterns in network flows. In practice, network flow models have implied memoryless first-order Markov chains, but recently researchers have introduced higher-order Markov chain models with memory to capture patterns in multi-step pathways. Higher-order models are particularly important for effectively revealing actual, overlapping community structure, but higher-order Markov chain models suffer from the curse of dimensionality: their vast parameter spaces require exponentially increasing data to avoid overfitting and therefore make mapping inefficient already for moderate-sized systems. To overcome this problem, we introduce an efficient cross-validated mapping approach based on network flows modeled by sparse Markov chains. To illustrate our approach, we present a map of citation flows in science with research fields that overlap in multidisciplinary journals. Compared with currently used categories in science of science studies, the research fields form better units of analysis because the map more effectively captures how ideas flow through science.
I’m super excited to announce that we recently had a new paper published in PNAS. And by ‘we’ I mean my former PhD Student Vedran Sekara (first author), my former PostDoc Arek Stopczynski, along with yours truly.
I’m very proud of the work we’ve done, and somehow we got away with giving the paper the not-so-humble title Fundamental Structures of Dynamic Social Networks. The cool thing is that even though the title is perhaps ostentatious, I actually think that we’re on to something fundamental here. I’ve attempted to write a non-technical explanation below.
Prologue: The connection to communities
Community detection is a big deal in network science. Just look at this plot I created that shows the number of papers about community detection per year.
There are literally thousands of papers that address the topic of finding communities in networks published every single year, so in my world this is an important topic. Detecting communities in complex networks is usually all about finding groups of nodes with many links between them – and only few links to the rest of the network. The typical example network in a community detection paper looks something like this:
Back in 2010, YY Ahn, Jim Bagrow and I wrote a paper where we argue that there’s something fundamentally wrong with this idea of communities. The problem is that the illustration above assumes that each node is a member of only one single community. In that paper we argue that this assumption is often wrong. In most networks, each node is a member of more than one community. In social networks, for example, we are in communities of friends, family, co-workers, sports buddies, etc.
When each node is a member of many communities, the global picture gets messier. The network doesn’t fall apart into neat chunks as above; rather, it looks like a mess of a hairball. [I’ve written a popular explanation of those findings here plus a follow-up here.] The hairball below shows a real social network from the PNAS paper.
Back then, we did not have access to temporal information, but as part of trying to wrap our brains around how this hairball arises, Jim, YY, and I came up with the picture below (Jim actually drew it and impressively figured out how to do the perspective). This illustration – as we shall see below – turned out to be quite prophetic.
The illustration shows that when single individuals (marked in green and turquoise) participate in multiple communities the underlying simplicity is obscured in the aggregated network.
I had forgotten all about communities when my graduate student Vedran and I started looking at the incredibly detailed data my group had just started collecting as part of the Copenhagen Networks Study (CNS). CNS contains 2.5 years of data collected by handing out 1000 smartphones to nearly all the DTU freshman students, collecting physical proximity data (using Bluetooth to measure the distance between pairs of individuals), phone calls, text messages, Facebook interactions, as well as GPS data. All of this with high temporal resolution (e.g. we recorded face-to-face meetings every 5 minutes).
Working as lead hacker-in-residence on top of his data science duties, Arek used a mix of 26-hour days & what I can only describe as pure black magic to start almost from scratch and orchestrate the software infrastructure needed to collect and store all of these data sources in something like six months. With CNS we finally had access to the temporal networks dataset needed to dig deeper.
When we looked at the physical proximity data we noticed that, as we considered finer and finer time resolution, the hair-ball (beautifully) dissolved into meaningful structures.
The green hairball shows everyone who has spent time together across an entire day. The orange network shows physical contacts aggregated over an hour, and the blue network shows the interactions for a five-minute time slice. The exciting thing is that in the blue network, we can directly observe the groups of people hanging out together. No community detection necessary – we had solved the question that those thousands of papers in Figure 1 are addressing, simply by changing the temporal resolution. Said differently, we’ve just identified a case where understanding the network got easier by adding more data (that’s why Renaud’s commentary is called “Rich Gets Simpler”).
Usually it’s the opposite. Things usually get a lot more complex when we have to account for more data. Just check any paper on temporal networks (for example take a look at this excellent review). I take the fact that more data has simplified the problem to mean that we’re on to something: that we’re looking at the network represented at the right temporal resolution.
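To make the effect of temporal resolution concrete, here is a minimal sketch in Python. It is not the code from the paper: the contact format `(time, a, b)`, the function name `groups_per_slice`, the toy people A to E, and the choice to treat each connected component of a slice as one group are all assumptions made for illustration.

```python
from collections import defaultdict

def groups_per_slice(contacts, window):
    """Bin (time, a, b) proximity contacts into slices of `window` seconds and
    return, for each slice index, the connected components of that slice."""
    edges_by_slice = defaultdict(set)
    for t, a, b in contacts:
        edges_by_slice[t // window].add((a, b))

    result = {}
    for s, edges in edges_by_slice.items():
        # Build an adjacency list from the contacts of this slice only.
        adj = defaultdict(set)
        for a, b in edges:
            adj[a].add(b)
            adj[b].add(a)
        seen, groups = set(), []
        for start in adj:
            if start in seen:
                continue
            # A depth-first search collects one connected component.
            stack, comp = [start], set()
            while stack:
                node = stack.pop()
                if node in comp:
                    continue
                comp.add(node)
                stack.extend(adj[node] - comp)
            seen |= comp
            groups.append(frozenset(comp))
        result[s] = groups
    return result

# Toy data (hypothetical): times in seconds, people labelled A to E.
contacts = [
    (0, "A", "B"), (60, "B", "C"),   # A, B, C together in the first five minutes
    (400, "D", "E"),                 # D and E together in the next five minutes
    (3000, "C", "D"),                # C and D meet later in the same hour
]
fine = groups_per_slice(contacts, window=300)     # five-minute slices
coarse = groups_per_slice(contacts, window=3600)  # one-hour slice
```

In the five-minute slices the groups appear separately ({A, B, C}, then {D, E}, then {C, D}), while the one-hour aggregation merges everyone into a single blob, which is a tiny version of the hairball.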
Anyway. We’d just found out how to identify all of the little communities in a timeslice. Now we needed to put the pieces together again. But since we’d figured out the underlying simple principle, we began to study how meetings between people develop over time – simply by matching up groups between neighboring timeslices.
The result is gatherings – the temporal representation of a meeting between a group of individuals that can last anywhere from 15 mins to several hours.
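The matching step can be sketched as follows. This is only an illustration under stated assumptions: the post does not give the matching rule, so the Jaccard overlap, the `threshold` of 0.5, the greedy first-come matching, and the names `jaccard` and `build_gatherings` are all hypothetical choices rather than the paper's method.

```python
def jaccard(a, b):
    """Overlap between two sets of people: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

def build_gatherings(slices, threshold=0.5):
    """Chain groups from consecutive time slices into gatherings.

    `slices` is an ordered list with one entry per time slice; each entry is a
    list of frozensets (the groups seen in that slice). A group extends the
    best-overlapping gathering that was active in the previous slice if the
    overlap is at least `threshold` (an assumed value); otherwise it starts a
    new gathering.
    """
    gatherings = []  # each gathering is a list of (slice_index, group)
    active = []      # gatherings that had a group in the previous slice
    for i, groups in enumerate(slices):
        next_active, used = [], set()
        for group in groups:
            best, best_score = None, 0.0
            for idx, gathering in enumerate(active):
                if idx in used:
                    continue  # each gathering is extended at most once per slice
                score = jaccard(group, gathering[-1][1])
                if score > best_score:
                    best, best_score = idx, score
            if best is not None and best_score >= threshold:
                used.add(best)
                active[best].append((i, group))
                next_active.append(active[best])
            else:
                new = [(i, group)]
                gatherings.append(new)
                next_active.append(new)
        active = next_active
    return gatherings

# Toy slices (hypothetical), one entry per five-minute slice.
slices = [
    [frozenset("ABC")],
    [frozenset("ABC"), frozenset("DE")],
    [frozenset("AB")],   # C left, but the meeting continues
    [],                  # nobody is together
    [frozenset("ABC")],  # the same people meet again later
]
gatherings = build_gatherings(slices)

# With five-minute slices, 15 minutes corresponds to three slices (assumed cutoff).
long_enough = [g for g in gatherings if len(g) >= 3]
```

Here the first three slices are chained into one gathering, while the later reunion of A, B, and C starts a separate one. Repeated gatherings of the same people are what the next step builds on.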
We have a great visualization (with accompanying explainer-video, embedded below) that beautifully describes what gatherings are and how they work. Check out the video, it’s only 90 seconds long.
The visualization was created by Ulf Aslak Jensen, a newly started PhD student in my group. And it is officially awesome: earlier this year it won Science Magazine’s Data Stories Competition!
But, while they’re already great and exciting, gatherings are only the beginning of the story. If a group of people have a real social connection, they meet again and again. We call gatherings that occur repeatedly cores. It is the cores that are the ‘fundamental structures’ that organize/simplify the dynamics we observe on the network. Let’s dig deeper.
First, let’s think about what the network looks like from the perspective of a single node. Below, we show an example from a real (and representative) individual.
Instead of modeling each and every interaction in the network, we now have a framework that allows us to think about a node’s social activity in a different way. We are able to think about the node as participating in a sequence of gatherings, where each gathering is an instance of a core.
The node pictured above is a member of 9 cores, each of which has gathered multiple times. If we plot when in time each core is active, it looks like this:
We call this pattern of interactions a person’s social trajectory, because we can think of the person’s journey through the network as jumping from core to core – from social context to social context.
It is a massive simplification of the hairball from Figure 3. And it is this simplification – the fact that we are now able to think about dynamic social networks in terms of cores and their activations – that I think is the paper’s main contribution.
(Plus, having seen how the cores work, I hope it’s clear why I said that Figure 4 turned out to be a nice representation of what’s actually happening in real networks.)
In the paper we also spend quite a bit of time showing how this simplification is, in fact, useful for a number of purposes. But since this post is probably already a bit tl;dr I’ll save a detailed description of those results for another day. But I’ll summarize them here.
Firstly, we show that we can use cores to predict where people will be in the future. The idea is simple. A core is a ‘real’ object in the network in the sense that when we see a gathering, all of its members must be present. This means that observing a part of a core is a signal that we’ll soon see the remaining members.
In the paper we look at cores of size three and show how a sighting of two core members signals the arrival of the third group member.
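As a rough illustration of that signal (not the analysis from the paper), one can count how often a sighting of exactly two members of a three-person core is followed by the full core shortly afterwards. The function name `arrival_rate`, the `horizon` parameter, and the toy slices below are assumptions made for this sketch.

```python
def arrival_rate(slices, core, horizon=2):
    """Fraction of slices in which exactly two members of `core` are seen
    together and the full core shows up within the next `horizon` slices.

    `slices` is an ordered list of lists of frozensets (groups per slice).
    Consecutive sightings are counted separately, which is a simplification.
    """
    sightings = hits = 0
    for i, groups in enumerate(slices):
        if not any(len(core & g) == 2 for g in groups):
            continue
        sightings += 1
        upcoming = slices[i + 1 : i + 1 + horizon]
        if any(core <= g for later in upcoming for g in later):
            hits += 1
    return hits / sightings if sightings else None

# Toy example (hypothetical people A to E).
core = frozenset("ABC")
slices = [
    [frozenset("AB")],   # two members seen ...
    [frozenset("ABC")],  # ... and the third arrives
    [frozenset("AB")],   # two members seen again ...
    [],                  # ... but nobody follows
    [frozenset("DE")],
]
rate = arrival_rate(slices, core)
```

A rate well above what one would see for three randomly chosen people would be the kind of evidence that the core behaves as a unit.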
Secondly, we realized that social trajectories have a lot in common with spatial trajectories. Spatial trajectories describe how we move from location to location. From ‘home’ to ‘work’ to ‘supermarket’, etc.
The fact that we move through social contexts (cores) just like we move through physical space opens an interesting connection to work on human mobility. Specifically, we connect the work on cores to a seminal paper on Limits of Predictability in Human Mobility, which showed that for most people, given a sequence of past locations, the next location can be predicted with high accuracy.
We find a similar level of predictability given social trajectories, as well as an interesting interplay between the social and geo-spatial predictability (when people are highly unpredictable wrt. their location, they tend to be highly predictable wrt. their social context).
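For readers who want a feel for how predictability can be quantified from a sequence of contexts, here is a simplified sketch. It uses the plain Shannon entropy of visit frequencies, which ignores the order of visits, as a stand-in for the order-aware entropy estimate used in the mobility literature, and then solves Fano's inequality numerically for an upper bound on prediction accuracy. The function names and the toy trajectory are assumptions.

```python
import math
from collections import Counter

def entropy_bits(sequence):
    """Shannon entropy (in bits) of how often each state appears."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def max_predictability(entropy, n_states, tol=1e-9):
    """Solve S = H(p) + (1 - p) * log2(N - 1) for p by bisection.

    H(p) is the binary entropy of p. The right-hand side decreases from
    log2(N) at p = 1/N to 0 at p = 1, so bisection on that interval works.
    """
    if n_states < 2:
        return 1.0  # a single state is perfectly predictable

    def fano(p):
        h = 0.0
        for q in (p, 1 - p):
            if q > 0:
                h -= q * math.log2(q)
        return h + (1 - p) * math.log2(n_states - 1)

    lo, hi = 1 / n_states, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fano(mid) > entropy:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2

# Toy social trajectory (hypothetical core labels).
trajectory = ["family", "study group", "family", "friends",
              "family", "study group", "family", "family"]
s = entropy_bits(trajectory)
bound = max_predictability(s, n_states=len(set(trajectory)))
```

Because the frequency-based entropy ignores ordering, it will generally understate how predictable a routine-heavy sequence is. The same two functions can be applied to a sequence of locations or to a sequence of cores.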
There is much more in the actual paper. For example, we talk about how the cores leave traces in other communication channels. And the paper also contains the technical details (although a lot of them are contained in the massive Supporting Information document). I will write more about the predictability results in a later post (since those findings are actually pretty cool as well).
In summary, I hope that I’ve managed to give you a sense of the paper’s central contribution – and perhaps also provided a bit more of an explicit link to the literature (including my own past research) than is readily available from the paper.
 The data was retrieved using the following Google Scholar search query: (“complex network” OR “complex networks” OR “network data”) AND (“community detection” OR “community assignment” OR “network community” OR “network communities” OR “community finding”). The idea for that query comes from Conrad Lee.
 I’m exaggerating a little bit for effect here. The approach we’re discussing only works for systems where people are actually meeting face-to-face. Community detection in phone call networks or Facebook is a different story.
 It’s a little bit confusing because we’re talking about two distinct kinds of predictability. The predictability related to a sequence of location/social contexts has to do with the amount of routine in someone’s behavior.
[Note: Thanks for the many emails on this!! I will post new openings right here on this blog when they arise.]
I’m currently involved in two super-exciting projects that are hiring postdocs, so if you ever thought about moving to Copenhagen to do great science, now is the time. And with all these job postings, you might even be able to bring a (scientifically outstanding) friend. As you probably know, Denmark continues to be the happiest country on the planet, and the food & drink is amazing – with almost too many Michelin stars and a true abundance of hipster beer.
The scientific environment is also pretty great (if I have to say so myself). We’ve built an amazing group around the Copenhagen Networks Study (you may remember that we handed out 1000 smartphones to freshmen at DTU and collected network data for 2.5 years), and recently strengthened the efforts with the Copenhagen Center for Social Data Science (SODAS) at University of Copenhagen (where I’m now associate director). So we have a nice critical mass of interesting graduate students & postdocs with whom to spar, hang out, and grab lunch.
What I love about these projects is that they’re truly Data Science in the Drew Conway Data Science Venn Diagram (see below) sense of the word.
In the language of the illustration above, we want you to have hacking skills + math and stats knowledge. What we offer is projects that are carried out in close collaboration with people that have domain knowledge. (I’m beginning to have experience with this kind of project, and it’s completely amazing and refreshing to have a partner who can actually help place your data-driven results in context.)
The projects are:
Twitter Bots. This is the data-science component of a larger project, directed by political scientist Prof Rebecca Adler-Nissen. The successful candidate will be associated with SODAS and work closely with my group as well as political scientists exploring various qualitative aspects. The full project title is Digital Disinformation: Exploring the Influence of Disinformation on Western Public Debate. This is extra fun because I actually have some practical experience building twitter bots.
Network Analysis of case law from international courts. This one is in collaboration with Henrik Palmer Olsen at the faculty of law, and the successful candidate will be formally associated with both the faculty of law and my research group. Read more here.
We’re also collaborating with natural language processing expert Anders Søgaard on both of these projects. (Both networks have lots of text metadata associated with each node, so it’s kind of fantastic to have an NLP expert on the team.)
Both positions are connected to a specific project where you’re expected to deliver certain results, but we support ‘blue sky’ research and once you’re set up, we welcome your ideas, and participation in ongoing research topics.
Action item: If you’re interested, send me an email!
PS. We’re also looking for a third postdoc in a more NLP-minded project for the Data Transparency Lab, this one helmed by Anders. This project is about how well we can identify an author in a large text corpus based on e.g. their tweets. There’s a great team on this one listed in Anders’ tweet below.
Monday June 13th is shaping up to be an exciting day for data science in Copenhagen. I’ve already announced that Christo Wilson is giving a talk at DTU, but now I’m happy to add Esteban Moro to the speaker line-up for a fantastic double bill. (And Piotr’s PhD defense at 2pm that afternoon will also be quite an event.)
Esteban’s work is creative, inspiring, and always exciting (plus often covered in the press). We are lucky to have him. The details of Esteban’s talk are:
Time: Monday June 13th, 10:45am
Place: DTU, Building 321, 1st floor lab space
Title: Pace of change in urban social networks
Abstract: Urban communities are seen both as highly structured social settings as well as distinctly vibrant environments for interaction, where personal relationships are initiated, consolidated and, eventually, lost and replaced by new relationships. Here we investigate statistical relationships between the social structure of the urban community and the pace at which such structure changes over time. To this end, we analyze the 19-month evolution of the social interactions pertaining to urban communities in England, Wales and Scotland, as described by 700 million mobile phone calls made among 20 million inhabitants. We find that different urban communities display not only distinct social structures but also alter such structures at widely different paces. Furthermore, we investigate the impact of this heterogeneity in the network varying structure on information diffusion processes by simulating SI models. Our results indicate that time to infection can be well predicted using only static variables of the network, such as the number of connections, leading to the conclusion that the observed vibrant mechanics in link creation have a negligible impact on the information diffusion in terms of geographical spreading.
A PhD defence is a great way to bring interesting people to Denmark, and Piotr’s defense on June 13th is no exception. This time we’re lucky to have recent NSF Career grant recipient Christo Wilson from Northeastern University visiting. Christo’s work includes auditing algorithms, security and privacy, and online social networks. Much of his work focuses on using measured data to analyze and understand complex phenomena on the Web. In many cases, he has leveraged the knowledge gained from measurements of the Web to build systems that improve security, privacy, and transparency for users – and getting lots of nice press coverage in the process.
Time: Monday June 13th, 10am
Location: DTU, Building 321, 1st floor lab space
Title: Caught Red Handed: Tracing Information Flows Between Ad Exchanges Using Retargeted Ads
Abstract: Numerous surveys have shown that Web users are seriously concerned about the loss of privacy associated with online tracking. Alarmingly, these surveys also reveal that people are also unaware of the amount of data sharing that occurs between ad exchanges, and thus underestimate the privacy risks associated with online tracking.
In reality, the modern ad ecosystem is fueled by a flow of user data between trackers and ad exchanges. Although recent work has shown that ad exchanges routinely perform cookie matching with other exchanges, these studies are based on brittle heuristics that cannot detect all forms of information sharing, especially under adversarial conditions.
In this study, we develop a methodology that is able to detect client- and server-side flows of information between arbitrary ad exchanges. Our key insight is to leverage retargeted ads as a mechanism for identifying information flows. Intuitively, our methodology works because it relies on the semantics of how exchanges serve ads, rather than focusing on specific cookie matching mechanisms. Using crawled data on 35,448 ad impressions, we show that our methodology can successfully categorize four different kinds of information sharing between ad exchanges, including cases where existing heuristic methods fail.
Ulf Aslak Jensen, who’s writing his M.Sc thesis in my group (well, actually he’s at the Weizmann Institute working with Uri Alon, but that’s another story) has just won Science Magazine’s Data Stories competition with the following video about a cool visualization he created based on SensibleDTU data.
Ulf has gotten lots of nice coverage, both internationally
Next Thursday, we’re lucky to have Dave Choffnes visiting the lab. David Choffnes is an assistant professor in the College of Computer and Information Science at Northeastern University. His research is primarily in the areas of distributed systems and networking, with a recent focus on mobile systems and privacy. Much of his work entails crowdsourcing measurement and performance evaluation of Internet systems by deploying software to users at the scale of tens or hundreds of thousands of users. He earned his PhD from Northwestern (not in the northwest), and completed a postdoc at the University of Washington (in the northwest) prior to joining Northeastern (both in the northeast and northwest). He sees no reason why this should at all be confusing. He is a co-author of three textbooks, and his research has been supported by the NSF, Google, the Data Transparency Lab, VidScale, M-Lab, and a Computing Innovations Fellowship.
Time: Thursday May 19th, 11am
Location: DTU, Building 321, 1st floor lab space
Title: ReCon: Identifying and Controlling Privacy Leaks from Mobile Devices
Abstract: Mobile systems have become increasingly popular for providing ubiquitous Internet access; however, recent studies demonstrate that software running on these systems extensively tracks and leaks users’ personally identifiable information (PII). I argue that these privacy leaks persist in large part because mobile users have little visibility into PII leaked through the network traffic generated by their devices, and have poor control over how, when and where that traffic is sent and handled by third parties.
In this talk, I describe ReCon, a cross-platform system that reveals PII leaks and gives users control over them without requiring any special privileges or custom OSes. Specifically, our key observation is that PII leaks must occur over the network, so we implement our system in the network using a software middlebox. We then use a machine learning approach to efficiently and accurately detect users’ PII without knowing a priori the content that is PII. Further, we develop techniques to block, obfuscate, or ignore the PII leak, by displaying leaks via a visualization tool and letting the user decide how the system should act on transmitted PII. I discuss the design and implementation of the system and evaluate its methodology with measurements from controlled experiments and flows from a user study with more than 100 participants. In addition to revealing and controlling PII leaks, we are using our machine-learning-based techniques to automatically identify and block malware based on network behaviors.
In early December, Alan Mislove (who’s spending his sabbatical here in Copenhagen) and I got the Volvo and headed out to the Amager Campus of University of Copenhagen to pick up Anders Søgaard, a professor of linguistics, to work on a top secret research project.
The project itself is still classified, but one of the things we’re looking into is word-usage in geo-coded tweets across the globe (to begin with, just America). To do this, Alan has trawled through something like 65 billion tweets and extracted the ones with geotags (1-2% of all tweets) further grabbing the ones that are from the US (about a third of those), ending up with a set of around 450 million geotagged tweets.
We couldn’t help ourselves – this dataset was just too cool not to visualize. And because Alan is a wizard, you can try this out for yourself on http://twitter-research.ccs.neu.edu/language/index.html. Once this thing hit twitter, people found lots of fantastic examples, and I’ve included some of my personal favorites below
As we enter the new year, it’s always fun to reflect on the year that’s just passed. And it’s been a good one. So good that I almost entitled this post “Everything is awesome”. Below is a list containing a lot of the stuff I should have written about during the year.
Back in June, Vedran Sekara became the first PhD graduate from my group. His thesis was on Dynamics of High Resolution Networks – a fine piece of work. And we were lucky to have Petter Holme and James Bagrow visit to be on the committee; it was great to see them both again.
Upon graduating, Vedran landed a nice job with Sony (Lund offices) as a data scientist. He’s still a visiting researcher in the group and we’re currently collaborating on a few super interesting projects based on Sony’s LifeLog App data.
Arek @ Google
And Vedran is not the only person with a cool new job. Arek Stopczynski, a senior postdoc in my group (and all-round awesome data scientist) has landed a super exciting job with Google in California.
It's official, next month I will be starting at @google in People Analytics data science team. Should be quite fun…
Arek’s work with Google is (of course) top-secret, but they’re lucky to have him!
Also this year, good friend, brilliant computer scientist, and associate professor at Northeastern University, Alan Mislove (+ family) is spending his sabbatical here in Denmark, with Alan visiting my group. Having him around is not only a lot of fun, but also enlightening … and we have a few exciting projects in the ‘under construction’ phase. And Alan is going to be around for another six months 🙂
For me, it was a big deal to receive the Sapere Aude Young Investigator Grant from the Danish Council for Independent research. The grant title is Microdynamics of Influence in Social Systems, and you can read a popular description of it here (it’s in Danish). This grant is not easy to win, and will keep me in business for the next few years.
There were other fancy speakers, for example the Danish Minister of the Interior (“Social- og indenrigsminister”) Karen Ellemann.
This year, my group received lots of nice press coverage. Below is a selection.
As a first, fun thing I was interviewed on TV for the first time. It was just a local Copenhagen channel, but it was still scary to be right there in a pro studio being interviewed “live on tape”. Oh and the interview (which is in Danish) was about the Science paper Unique in the Shopping Mall by some of our good friends and collaborators at MIT.
There were a couple of additional videos about our work. One was created by DEIC as part of their new e-Science knowledge portal. Watch it here. And German TV also sent a crew to report on the SensibleDTU experiment.
We also received lots of other nice press coverage. I was in the DTU paper talking about how academics can use Twitter. You can find a link in the nice tweet from The Danish Agency for Science, Technology and Innovation (Forsknings og Innovationsstyrelsen).
The full details on all of this can be found on the Press page, when I get around to updating that.
Great exchange visits
This was also the year where two of my PhD students were spending 6 months of their program abroad (this is standard for Danish PhD students). Piotr Sapiezynski visited Jure Leskovec at Stanford and Andrea Cuttone is still visiting Marta Gonzalez at MIT. I feel very lucky to be able to send the guys out to these groups that are among the most exciting places on the planet.
And I also created a Coursera version of my Social Graphs and Interactions course. Here’s a link to the course page: https://dtu.coursera.org/course/02805. The video explains it pretty well.
We were in excellent company – the other grantees were from prestigious universities like Princeton University, Carnegie Mellon University, Northwestern University, Columbia University, and many other fine schools. Here’s a little 40 sec. video explaining the project.