Whatever Happened To The Denominator? Why We Need To Normalize Social Media


Kalev Leetaru

One of the most important but least told stories of the “big data” revolution is the way in which the denominator has all but vanished from data science. Today we tally tweets, inventory Instagram posts and survey searches, reporting them all as raw volume counts without any understanding of whether the trends they depict are real or merely statistical artifacts of opaque datasets. We no longer even know how big our datasets are, nor how fast or slowly they are growing. The companies behind those datasets refuse to offer even the most basic details that would help us add a denominator back into our equations to normalize and contextualize our findings. In short, we have traded more data for less understanding of what is in that data. How might this complicate our understanding?
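As a minimal illustration of what restoring the denominator changes, consider the following sketch. The figures are invented for illustration: `topic_tweets` stands in for a topic's raw match counts and `total_tweets` for the platform-wide volume that companies rarely publish.

```python
import pandas as pd

# Hypothetical daily counts for one topic, alongside the total tweet
# volume for the same days -- the denominator the platforms rarely share.
data = pd.DataFrame({
    "topic_tweets": [1000, 1500, 2250],
    "total_tweets": [1_000_000, 2_000_000, 4_500_000],
})

# Raw counts suggest the topic is surging...
print(data["topic_tweets"].tolist())  # [1000, 1500, 2250]

# ...but dividing by the denominator shows its share is actually falling.
share = data["topic_tweets"] / data["total_tweets"]
print(share.tolist())  # [0.001, 0.00075, 0.0005]
```

The raw count more than doubles while the normalized share halves: the "trend" in the numerator is an artifact of the growing denominator.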

One of the most interesting findings of yesterday’s analysis of the speed of Twitter, beyond the fact that Twitter doesn’t actually regularly “beat the news,” was just how much of the Twitter activity about both scheduled and unexpected news was retweets. Whether a preplanned press conference heavily advertised in advance or an out-of-the-blue natural disaster, the vast majority of the signal received from Twitter was in the form of retweeting other people’s posts.

Rather than on-the-ground witnesses and participants live chronicling what they are experiencing in realtime, it seems that at least when it comes to government press conferences and natural disasters, Twitter is about forwarding from afar.

In other words, Twitter is a behavioral dataset, filled with the equivalent of “likes” and “forwards” rather than a content-based platform filled with new perspectives, contexts and details to help us understand breaking events from those involved.

This is a huge finding that calls into question the utility of Twitter as a data source for understanding the very kinds of breaking news events it has been most promoted for.

Implicit in this argument is the idea that heavy retweeting is unique to breaking news events.

What if this instead merely reflected a broader Twitter-wide trend towards retweeting? If the totality of Twitter were shifting from novel commentary to mere retweeting, this finding would no longer suggest that the way in which we are handling breaking news is changing.

In other words, while its findings would hold, its significance as something distinct and special to how we understand breaking news would change.

One way of examining this would be to look over a longer period of time at a societally polarizing topic that involves both breaking events and steady background conversation. If heavy retweeting is distinct to breaking news events, we should see a mostly flat trendline in the percent of original content, punctuated by sharp, temporary dips around major breaking climatic events.

If the high level of retweeting instead reflected a Twitter-wide trend towards less and less original content, we would expect to see a steady downward trendline in the percent of non-retweets over the years.

The timeline below shows the final results, plotting the percent of global warming and climate change tweets that were not retweets, from July 2009 to present, using data from Crimson Hexagon.

[Chart: Percent of global warming and climate change tweets that were not retweets, July 2009 to present. Credit: Kalev Leetaru]

From nearly 90% of climatic tweets that were original content in July 2009 to less than 25% today, we can see that at least for these two topics, Twitter has trended steadily and surely over the past decade from a content service into a retweeting service.

In fact, plotting a wide range of different topics we see exactly the same curve for every one of them, suggesting this trend is not distinct to climatic or even scientific content, but rather something more existential to changing trends in how we use Twitter.

Comparing the trendlines above to Twitter as a whole, we find they are correlated at r=0.90 and, using a 7-day rolling average to smooth the data, that correlation rises to r=0.96.
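The smoothing-and-correlating step can be sketched as follows. The two series below are synthetic stand-ins (a shared downward trend plus independent daily noise), not the actual Crimson Hexagon data, so the printed correlations will not match the article's figures; the point is only that a 7-day rolling average suppresses day-to-day noise and raises the measured correlation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real series: a shared downward trend in
# percent-original-content, plus independent day-to-day noise per series.
rng = np.random.default_rng(0)
days = pd.date_range("2009-07-01", periods=365, freq="D")
trend = np.linspace(90, 25, len(days))
s_base = pd.Series(trend + rng.normal(0, 5, len(days)), index=days)
s_topic = pd.Series(trend + rng.normal(0, 5, len(days)), index=days)

# Daily correlation between the topic and the platform as a whole...
r_raw = s_topic.corr(s_base)

# ...and the same correlation after a 7-day rolling average, which
# averages out the independent daily noise and so raises the correlation.
r_smooth = s_topic.rolling(7).mean().corr(s_base.rolling(7).mean())
print(f"raw r={r_raw:.2f}, smoothed r={r_smooth:.2f}")
```

Pandas drops the leading NaN values the rolling window produces when computing the pairwise correlation, so no explicit alignment is needed.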

In short, that significant finding about breaking news being defined by retweets is not a finding about breaking news at all, but rather merely a reflection of Twitter as a whole.

What about the heavy linking activity found in climatic tweets? For both press conferences and breaking news events, a large percentage of the tweets included URLs, either links to external websites or embedded images.

Looking longitudinally at the percent of global warming and climate change tweets that included a URL from 2009 to present, we see that this time two distinct trendlines emerge: the relatively steady 80% linking rate of climate change, and the curve for global warming, which drops sharply, rebounds and then levels off. Looking closely, climate change appears to curve slightly at the same time as global warming, but not nearly as significantly. The two trendlines are not highly correlated, at just r=0.36.

[Chart: Percent of global warming and climate change tweets that included a URL, 2009 to present. Credit: Kalev Leetaru]

With two very different trendlines, which is the most significant finding?

From a visual and statistical standpoint, it might seem that the greater variance of the global warming trend would be the most interesting, since it has changed significantly over time, whereas climate change has remained relatively unchanged.

Indeed, based purely on comparing these two graphs, most data analyses would likely select the high variance of global warming linking activity as the most significant finding.

However, truly determining which trendline is the most significant requires normalization: comparing both against the underlying trends of the dataset as a whole to see which deviates furthest from Twitter-wide behavior.

In this case, the steady state of climate change link tweets is correlated at just r=0.22 with Twitter as a whole, while global warming tweets are correlated at r=0.70.

Thus, from a significance standpoint, global warming’s ebbing and flowing merely mirrors the broader evolution of how people have used Twitter for sharing links, whereas climate change’s steady flow of links has deviated significantly from the rest of Twitter, suggesting there is something distinct about that community.
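The selection logic here reduces to picking the trendline least correlated with the platform baseline. A sketch, reusing the two correlation values reported above:

```python
# Correlation of each topic's link-sharing trendline with Twitter-wide
# link-sharing behavior (the r values reported in the article).
correlations = {"climate change": 0.22, "global warming": 0.70}

# After normalization, the more significant finding is the trendline
# that deviates MOST from platform-wide behavior -- i.e. the one with
# the LOWEST correlation -- even though it shows the lower variance.
most_distinctive = min(correlations, key=correlations.get)
print(most_distinctive)  # climate change
```

Note how this inverts the naive variance-based choice: the visually dramatic global warming curve is the less significant finding once the platform baseline is taken into account.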

Putting this all together, we find that without normalization, it is impossible to tell which of the findings we derive from social media are significant and which merely reflect the background changes of the platforms we are measuring.

For their part, the companies themselves offer little help with normalization. When asked about changes in tweeting and retweeting behavior, two different Twitter spokespersons declined to comment, and one also declined to comment on why the company doesn't release such numbers.

The result is a strange world in which we apply highly sophisticated algorithms to large volumes of data without any understanding of whether the results we receive are actually meaningful in any way to the questions we asked.

We tout the accuracy and precision of our algorithms without acknowledging that all of that accuracy is for naught when we can’t distinguish what is real and what is an artifact of our opaque source data.

It is true that retweeting and link forwarding are defining characteristics of how Twitter users cover climatic press conferences and breaking news events. Yet, the significance of that finding is muddied by the fact that this is simply how Twitter as a whole is used, rather than a statement about breaking news.

In many ways this is no different from a finding that climate-related tweets are rare in areas of the world where there is no electricity and no connectivity. While technically a true statement, the meaningfulness of that finding is diminished by the fact that it is a reflection of Twitter as a whole, rather than something specific to climate change.

Looking across all of the social media analyses performed each day, it is the rare analysis indeed that actually normalizes its findings to determine their significance. Instead, the idea of denominators and normalization has been relegated to a quaint memory lost to the foggy mists of history.

Social media is not alone in this regard. Media analyses have been plagued for decades by their failure to normalize, which has meant that many of their conclusions have been wrong.

In the end, rather than increase our accuracy, the big data revolution has led us to trade accuracy for size. It is truly remarkable for a field so steeped in statistics that we see nothing wrong with reporting raw counts from opaque black boxes that are changing right before our eyes in entirely unknown ways.

Perhaps it is finally time to restore the denominator to data science.