Big Data for Breaking News: Lessons from #Aurora, Colorado

On July 20th we were glued to our computer screens obsessively tracking information coming out of Aurora, Colorado about the terrible shooting that took place. When breaking news events occur, we try to grab as much data as we can, to see what we can learn about the event. One of the simplest plots that we create is hashtag/word usage over time. When we looked at the #Aurora hashtag, we saw a really odd shape:

Why odd? Breaking news, especially of this sort, tends to look quite different. As news spreads about an event, conversation tends to spike and then fall off organically (in mathematical terms, the discrete observed frequencies at various time intervals roughly follow a Zipfian distribution). In this case, the #Aurora hashtag started being used right after the shooting took place (7am UTC), and slowly rose as the US east coast woke up to the news (12pm UTC). Now notice the massive spike that happens around 6pm UTC. It is more than five times the height of the average beforehand, rapidly rising to 1000 tweets per minute!
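For readers who want to reproduce this kind of plot, here is a minimal sketch of how per-minute hashtag counts and the "times the average beforehand" figure could be computed. It is not our production pipeline: the function names, the `at` helper and the toy timestamps are all invented for illustration, and it assumes you already have one UTC timestamp per tweet.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_minute(timestamps):
    """Count tweets in each one-minute bucket (keys are minute-truncated datetimes)."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


def spike_ratio(counts, spike_start):
    """Peak per-minute count at or after spike_start, divided by the mean
    per-minute count before it.

    Note: minutes with zero tweets are absent from the Counter, so the mean
    is taken over non-empty minutes only.
    """
    before = [n for minute, n in counts.items() if minute < spike_start]
    after = [n for minute, n in counts.items() if minute >= spike_start]
    if not before or not after:
        raise ValueError("need data on both sides of spike_start")
    return max(after) / (sum(before) / len(before))


def at(hour, minute, second=0):
    """Hypothetical helper: build a UTC timestamp for the toy data below."""
    return datetime(2012, 7, 20, hour, minute, second, tzinfo=timezone.utc)


# Toy data (invented): 2 tweets per minute for three minutes, then 10 in one minute.
toy = [at(17, m, s) for m in range(3) for s in (0, 30)]
toy += [at(18, 0, s) for s in range(10)]

counts = tweets_per_minute(toy)
ratio = spike_ratio(counts, at(18, 0))
print(ratio)
```

On real data you would probably smooth the baseline (for example with a rolling median) rather than use a plain mean, since a single earlier burst can distort the comparison.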

In the following post we analyze over 150k tweets containing people’s reactions, published on Twitter with the #Aurora hashtag during the first 24 hours after the shooting. Additionally we use trending topics data, and highlight a topic-modeling method we use to identify significant groups of words that represent the different facets of conversation. Finally, we show what Kim Kardashian has to do with all this.

Growth in New Users

One data point we look at is the rate at which new users join a topic, or in this case, the #Aurora hashtag. We’re interested in which types of events trigger an onslaught of new participants, and in what gets more folks engaged in a topic. For example, when looking at conference hashtags, we tend to see a consistent set of users throughout the conference days. With sports events we see massive spikes in new user activity towards the end of a game, as users get excited in anticipation of the final result. In the case of #Aurora, we see a curve that’s very similar to the one above:

There’s a steady rise in new users reacting to the news as the US east coast wakes up (1pm UTC). Similar to the plot above, we see a massive spike between 6 and 7pm UTC. So what happened there?
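Before answering that, a quick note on how a new-user curve like this can be derived. The sketch below is one simple way to do it, under the assumption that each tweet is reduced to a (minutes since start, user id) pair. The function name, the bucket size and the toy data are ours for illustration only.

```python
from collections import Counter


def new_users_per_bucket(tweets, bucket_minutes=60):
    """tweets: iterable of (minute_offset, user_id) pairs, in any order.

    Returns a Counter mapping bucket index -> number of users whose FIRST
    tweet on the hashtag falls inside that bucket.
    """
    first_seen = {}
    for minute, user in tweets:
        # Keep the earliest minute at which we saw each user.
        if user not in first_seen or minute < first_seen[user]:
            first_seen[user] = minute
    return Counter(minute // bucket_minutes for minute in first_seen.values())


# Toy data (invented): users "a" and "c" tweet twice, but are only counted once.
toy_tweets = [(0, "a"), (10, "b"), (65, "a"), (70, "c"), (130, "c"), (125, "d")]
new_users = new_users_per_bucket(toy_tweets)
print(sorted(new_users.items()))
```

Counting each user only at their first appearance is what separates this curve from raw tweet volume: a flat new-user line with rising volume means the same people are talking more, not that the audience is growing.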

Words and Co-occurrence

If we look at the tweets that were published during that 5-7pm UTC time segment, we can get a better sense for what exactly happened there. By creating a word co-occurrence graph we can immediately see the most important words used during that time period.

The graph above makes it clear. The most prominent words that appeared alongside #aurora during that large spike were @CelebBoutique, Shop and Kim. If you weren’t tracking this story, the spike followed an erroneous tweet published by the UK fashion boutique Celeb Boutique. This is a classic case of context collapse: a person in a different location isn’t aware that a contextually different event has invaded the namespace. For Celeb Boutique, #Aurora has always represented a white pleated V-neck frock inspired by Kim Kardashian. When they noticed the hashtag trending they assumed that context. But understandably, the majority of folks posting and consuming the hashtag took offense, and responded angrily to the post.

What’s interesting about this case is the level of attention that the @CelebBoutique mistake accrued compared to news about the actual event. While this analysis only takes into account the #Aurora hashtag, excluding other topically relevant content, the sheer 5x spike in reactions is impressive.
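The co-occurrence counts behind a graph like the one in this section are simple to compute. Here is a rough sketch, assuming tweets are plain strings and that a naive regular expression is good enough for tokenizing. The tokenizer, the function names and the toy tweets are all invented for this example and are not the actual tweets from the dataset.

```python
import re
from collections import Counter
from itertools import combinations

# Naive tokenizer: words, optionally prefixed by # or @ (an assumption,
# real tweet tokenization is messier).
TOKEN = re.compile(r"[#@]?\w+")


def tokenize(text):
    """Lower-case a tweet and return its set of distinct tokens."""
    return set(TOKEN.findall(text.lower()))


def cooccurrence(tweets):
    """Count, for every unordered word pair, the number of tweets containing both."""
    pairs = Counter()
    for text in tweets:
        words = sorted(tokenize(text))  # sorted so each pair has one canonical order
        pairs.update(combinations(words, 2))
    return pairs


def neighbours(pairs, word):
    """Words that co-occur with `word`, weighted by number of shared tweets."""
    out = Counter()
    for (a, b), n in pairs.items():
        if a == word:
            out[b] += n
        elif b == word:
            out[a] += n
    return out


# Toy tweets (invented placeholders).
toy = ["#aurora shop kim", "#aurora kim dress", "#aurora theater"]
pairs = cooccurrence(toy)
top = neighbours(pairs, "#aurora").most_common(1)
print(top)
```

The pair counts are the edge weights of the graph; restricting the input to tweets from a single time window is what lets the dominant words for that window stand out.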

What else can we learn about people’s reactions?

Pointwise Mutual Information

An insightful method we use to map out a topic space is called Pointwise Mutual Information (PMI), an information theory approach to finding collocations – a measure of how much one word tells us about the other, how much information we gain about one word from another. For example, we see a high PMI value for the words ‘fast’ and ‘food’, since they tend to appear together often. As an attempt to identify the different topics folks were attentive to, we calculated PMI values for words that appeared in tweets with #Aurora, and saw the following outcome:

The larger a node, the more central it is within the topic (i.e. the more times it was used with all the other words). Colors represent modularity: similarly colored nodes have appeared together more often than average, and in this case are considered part of the same sub-topic. We can easily identify the most prominent grouping of words (bright green, 26%), which includes words such as colorado, shooting, movie, victims, theater, prayers, 71. Clearly, these are folks responding to information about the event (71 was the reported number of people shot).

Another prominent word group (purple, 10%) includes words such as people, love, victim, #ripjessica, daughter and died. As reported by the Denver Post, friends honored Jessica Ghawi’s death by using the hashtag #RIPJessica and managed to make it trend on Twitter.

Finally, we identify a group (yellow, 4%) that accounts for the @CelebBoutique debacle, including words such as shop, new and inspired.

This method gives us the ability to quantify the complexity of the event, in terms of the language being used by people who talk about it. In this case, we clearly see that even though the #Aurora Celeb Boutique/Kim K tweet generated an incredibly high number of responses (close to 16% of all the tweets using the hashtag that day), it only accounts for a very small cluster (4%) when looking at the breadth of words and conversations that were happening that day. Effectively, the responses to @CelebBoutique were using very similar language (most likely RTs), versus the other word groupings, which were more nuanced and diverse.
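For the curious, the PMI measure used in this section has a compact definition: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), where the probabilities are estimated from how often words appear, alone and together. Below is a minimal sketch that estimates them from tweet-level counts. That estimation choice, the function name, the `min_count` parameter and the toy documents are assumptions made for illustration and may differ from the exact setup behind the graph above.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(docs, min_count=1):
    """docs: list of token lists (one per tweet).

    Returns {(w1, w2): PMI in bits}, with p(w) estimated as the fraction of
    tweets containing w and p(w1, w2) as the fraction containing both.
    Pairs seen in fewer than min_count tweets are skipped.
    """
    n = len(docs)
    word_df = Counter()  # number of tweets containing each word
    pair_df = Counter()  # number of tweets containing each word pair
    for tokens in docs:
        unique = sorted(set(tokens))
        word_df.update(unique)
        pair_df.update(combinations(unique, 2))

    scores = {}
    for (a, b), n_ab in pair_df.items():
        if n_ab < min_count:
            continue
        p_ab = n_ab / n
        p_a, p_b = word_df[a] / n, word_df[b] / n
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy corpus (invented): "fast" and "food" always appear together,
# while "the" appears almost everywhere.
toy_docs = [["fast", "food", "the"], ["fast", "food"], ["the", "car"], ["the", "dog"]]
scores = pmi_scores(toy_docs)
print(scores[("fast", "food")], scores[("fast", "the")])
```

In the toy corpus, 'fast' and 'food' score positively because they co-occur more than chance would predict, while 'fast' and 'the' score negatively. PMI tends to inflate pairs of very rare words, which is why a frequency cutoff such as `min_count` is usually worth raising on real data.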

Trending Topics

Another way we learn about an event is by tracking where and when it trends. Twitter’s trending topics (TTs) give us an interesting view into what topics are relatively important in cities across the world (we’ve written extensively about TTs). When a term trends in a geographic location, it is used significantly more than the norm for that location. In the case of #Aurora, it is not surprising that the hashtag first trended in Denver, Colorado:

By hovering over the plot above you can see how the trend spread across US cities first, and then to Europe and the rest of the world.

A trend dispersion plot is another way to take a look at the same data. Here we see a dot every time the hashtag trends in a city, over a period of 1000 minutes (since it first trended in Denver). We see it quickly spread to cities such as Minneapolis, Pittsburgh and Portland. In Los Angeles, we see it trend slightly that night, then a gap (most folks asleep), yet the next day the trend persists continuously. This makes a lot of sense given Hollywood and the highly anticipated release of the movie. Trend persistence is a good proxy for what populations are attentive to, especially trends that persist throughout the day.

Other US cities where the trend persisted include Tallahassee, San Antonio, Memphis and Boston. European countries where #Aurora persisted include the Netherlands and Sweden.
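Trend persistence can be turned into a number in several ways. One simple option, sketched here, is the fraction of minutes in the observation window during which the hashtag was trending in a city, together with the longest unbroken streak. This assumes the trend data has already been reduced to (city, minutes since first trend) observations; the function names and toy data are invented for illustration.

```python
from collections import defaultdict


def trend_persistence(observations, window_minutes=1000):
    """observations: iterable of (city, minute_offset) pairs, one per minute
    in which the hashtag was seen trending in that city.

    Returns {city: fraction of the window's minutes with the trend present}.
    """
    minutes = defaultdict(set)
    for city, minute in observations:
        if 0 <= minute < window_minutes:
            minutes[city].add(minute)  # a set, so duplicate samples count once
    return {city: len(seen) / window_minutes for city, seen in minutes.items()}


def longest_run(minutes):
    """Length of the longest streak of consecutive trending minutes."""
    best = run = 0
    prev = None
    for m in sorted(set(minutes)):
        run = run + 1 if prev is not None and m == prev + 1 else 1
        best = max(best, run)
        prev = m
    return best


# Toy data (invented): "City A" trends for the whole 10-minute window,
# "City B" only in minutes 0, 1 and 5.
toy = [("City A", m) for m in range(10)] + [("City B", m) for m in (0, 1, 5)]
persistence = trend_persistence(toy, window_minutes=10)
print(persistence, longest_run([0, 1, 5]))
```

The two numbers answer different questions: the fraction captures overall attention across the window, while the longest streak separates a city that trended continuously from one that flickered in and out.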

Closing

The techniques described above help us quantify the level of attention certain topics receive during an event. It is important not only to look at the total number of responses, a.k.a. “buzz”, but also to take into account the complexity of the language being used, as well as the dominant frame. Distilling the multitude of conversational threads from data is critical for answering questions such as: what are the different ways in which folks are responding to the event? Which of these are drawing more engagement? What are users more attentive to and how does this change over time?

In his fantastic blog post, Ethan Zuckerman proposed a metric for attention that would give us the ability to compare between events. Call it the nanoKardashian or whatever you want, we’ve reached a point where this can be tracked in real time, over time, and can be a potentially powerful metric for analyzing events as they unfold.

Trend data also helps us grasp how strongly topics resonate within certain cities or countries around the world. In the case of #Aurora, users in Los Angeles were persistently engaged with the topic throughout the day, much more than other major US cities. Additionally, people in the Netherlands and Sweden showed higher than average levels of attention towards the story. Once we build a model for how each city behaves, we can track this over time, and draw valuable insight on how engaging certain events were amongst different geographical regions.

Remember, just because something is popular doesn’t mean it is significant. But by having a better understanding of the landscape we can improve our judgements. Data coming from social spaces tends to reinforce our hunches and intuition. Now we have the ability to quantify them, in real time.

Questions? Feel free to ping me on Twitter – @gilgul
