Was The Era Of 'Big Data' Social Media Based on False Hype?

This article is more than 5 years old.

One of the most surprising findings when looking back at Twitter’s evolution from 2012 to 2018 is just how small social media turns out to be. For years the public narrative around social platforms has been that they were the flag bearers of the “big data” revolution, holding some of the largest datasets in the world that could offer unprecedented views into the heart of human society. In reality, it seems social media is far smaller and more limited than we ever realized.

Extrapolating from Twitter’s 2012-2015 trajectory, the service was once estimated to be not much larger than the online news media it was supposed to replace. Twitter’s actual size over the last seven years shows that those original estimates were far too generous. Twitter’s slow decline and rising retweet rate mean the total amount of unique content per day in Twitter’s firehose is becoming smaller and smaller.

Since Twitter's founding 13 years ago, it appears there have been only around 1.1 to 1.2 trillion tweets, and their small size means the total size in bytes of all the unique textual content that has ever flowed across Twitter’s servers is extraordinarily small. Over the last seven years, covering Twitter’s peak growth period, there has been less than 33TB of mineable original text in all.

On a typical day at the start of 2012 there was around 11GB of novel text posted to Twitter, rising to around 20.5GB of text a day by July 2013. Yet that number has steadily shrunk, reaching just 10.5GB a day by last October, and it is on a downward trajectory.
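
As a rough consistency check on these figures, the seven-year total can be converted into an average daily volume and compared with the daily numbers quoted above. The sketch below is only illustrative: it assumes decimal units (1TB = 1,000GB) and a 365.25-day year, neither of which is stated in the analysis, and the function name is hypothetical.

```python
# Rough sanity check: does "less than 33TB over seven years" fit with
# daily volumes of roughly 10.5GB to 20.5GB of novel text?
# Assumptions (not stated in the article): decimal units and 365.25-day years.

GB_PER_TB = 1000          # assumed decimal terabytes
DAYS_PER_YEAR = 365.25    # assumed average year length


def average_daily_gb(total_tb: float, years: float) -> float:
    """Average GB per day implied by a total volume spread over a span of years."""
    return total_tb * GB_PER_TB / (years * DAYS_PER_YEAR)


# Figures quoted in the article.
upper_bound_avg = average_daily_gb(total_tb=33, years=7)
lowest_daily_gb = 10.5    # last October
peak_daily_gb = 20.5      # July 2013

print(f"Implied average: under {upper_bound_avg:.1f} GB/day")
print(f"Quoted daily range: {lowest_daily_gb} to {peak_daily_gb} GB/day")
```

Under these assumptions the implied average comes out at a little under 13GB a day, which sits inside the quoted daily range, so the total and the daily figures are at least mutually plausible.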

Newly emerging statistics from Facebook's research dataset collaboration suggest it too is vastly smaller than believed. The company's vaunted hyperlink dataset containing “almost all public URLs Facebook users globally have clicked on, when, and by what types of people” is almost three times smaller than a similar news-based link dataset, despite being collected over twice the timespan.

For all these years we have stood outside the walled gardens of the major social platforms, projecting onto them our own dreams and imaginations of what their datasets must look like.

Instead, as we are getting our first glimpses inside their immense data estates, we find this promise was all hype. In reality these vast social media archives aren't much larger than the traditional data streams that came before.

Twitter’s total textual output over the last seven years is just a few tens of terabytes, while Facebook’s master link dataset is just a fraction of a single news linking dataset.

Why then do we think of Twitter and Facebook as being so massive?

The answer is that Silicon Valley has mastered the art of the reality distortion field, encouraging us to project our own dreams upon them while avoiding anything that might bring us back to reality.

In the case of Twitter and Facebook, both companies released copious statistics in their early high-growth periods, chronicling their every milestone. As growth slowed, they pulled back on these statistics and eventually both companies largely halted regular releases of detailed growth statistics. Even when pressed, the companies steadfastly decline to release any kind of statistics, from volume counts to the false positive rates of their algorithms.

In the absence of official statistics, users have been free to imagine the companies holding immeasurably large archives of human behavior.

The implications of this false narrative we have created around social media are profound.

First and foremost, the incredibly small actual size of these social platforms reminds us that the insights we receive from social media are not nearly as representative as we’ve been led to believe. In the case of Twitter, the total number of tweets is shrinking, tweets are increasingly mere retweets and the accounts posting all of that content are older and older. Geography is becoming less and less available and less and less precise. The few statistics available for Facebook, from the size of its link dataset to the size of the training data it uses for counter-terrorism, suggest that it too is vastly smaller than we have believed.

Putting this all together, our “big data” signal turns out to have been quite a small data signal, and it is getting smaller, less precise and less representative by the day.

In the end, rather than continuing to put social media up on a pedestal of our imagination, perhaps it is finally time for society to accept reality and look for new, more representative and more privacy-protecting ways of understanding ourselves before the public begins to lose faith in the “big data” world.