
Do We Even Care What Our Data Says Anymore?

This article is more than 5 years old.


One of the most striking aspects of the reaction this week to my 2012-2018 Twitter documentary has been how many researchers acknowledged anecdotally observing the same patterns in their own Twitter work over the years, yet never pursued them further or adjusted their workflows accordingly. More worryingly, most said they were unlikely to make any changes to their analytic workflows or the questions they ask of Twitter, even with these new findings invalidating much of the way they use the platform. In a world in which we seem increasingly uninterested in what our data actually says, is there any hope left for the future of “big data?”

Over the last seven years, Twitter has been changing in profound ways that upend many of the assumptions upon which social media analytics is based. The platform’s decreasing volume, increasing retweet rate and centralization around elites mark a transition from a content platform to a behavioral platform. Event-stream behavioral indicators require very different algorithms and analytic techniques than mining textual posts does, yet we still largely treat Twitter as a collection of short documents.

The shift away from novel textual posts towards behavioral retweeting is particularly damaging to those using Twitter for sentiment or topic mining, yet there has been little shift away from either over the last seven years. In fact, usage of Twitter for such text mining has become mainstream.

Despite a total collapse of GPS-tagged tweet volume, precision mapping of Twitter has never been more popular. Many social media analytics platforms heavily promote their ability to provide real-time, GPS-level mapping of keywords, while academic exploitation of geotagged tweets has rapidly expanded.

Twitter’s existential longitudinal change means the absolute volume counts that form the basis of nearly all current social media analytics are misleading and often dead wrong. Yet, there has been no movement towards normalization or other efforts to address its rapidly changing underpinnings. We still treat Twitter as a static and unchanging data source.
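The normalization the paragraph above calls for can be sketched in a few lines. The figures here are entirely hypothetical, chosen only to illustrate the point: raw mention counts for a keyword can look flat or declining while the keyword's *share* of a shrinking platform is actually rising.

```python
# Hypothetical yearly figures, for illustration only.
keyword_mentions = {2012: 1_000, 2015: 1_200, 2018: 1_100}  # raw tweet counts mentioning a keyword
total_tweets = {2012: 180e9, 2015: 160e9, 2018: 120e9}      # assumed total platform volume

# Normalize: divide each year's raw count by that year's total volume,
# so changes in the platform itself don't masquerade as topical trends.
normalized = {year: keyword_mentions[year] / total_tweets[year]
              for year in keyword_mentions}

for year in sorted(normalized):
    print(f"{year}: raw={keyword_mentions[year]:>5}  share={normalized[year]:.2e}")
```

In this invented example the raw count barely moves between 2012 and 2018, yet the normalized share nearly doubles because total platform volume fell, which is exactly the kind of signal absolute counts hide.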

Remarkably, many of the academics I spoke with who study social media and internet technologies as their specialty could not understand why a changing Twitter might impact their work. For example, researchers who have published studies touting increasing retweeting rates of particular topics as methodologically important did not see why it mattered that overall Twitter-wide retweeting rates precisely matched the topical retweeting changes they had documented. Despite a perfect alignment of the topical and baseline retweeting curves, those researchers simply could not grasp the concept of denominators and normalization.
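The denominator problem described above can be made concrete with a toy calculation. The rates below are invented for illustration: a topic's retweet rate rises steadily, which looks like a finding, until it is divided by the platform-wide baseline and the ratio turns out to be flat.

```python
# Hypothetical retweet rates (retweets as a fraction of all tweets) per year.
topic_rt_rate = [0.10, 0.15, 0.20, 0.25]     # assumed rates for one topic
baseline_rt_rate = [0.20, 0.30, 0.40, 0.50]  # assumed platform-wide rates

# Normalize the topical curve against the baseline curve year by year.
ratios = [t / b for t, b in zip(topic_rt_rate, baseline_rt_rate)]
print(ratios)  # prints the topical/baseline ratio for each year

# If the ratio barely moves, the "topical trend" is just the platform trend.
trend_is_topical = max(ratios) - min(ratios) > 0.01
print(trend_is_topical)  # False: no topic-specific effect remains
```

The topical curve in isolation grows 2.5x, yet every bit of that growth is explained by the baseline, which is precisely the alignment the researchers in the anecdote above dismissed.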

This reflects the unfortunate fact that many researchers who work with large datasets are actually qualitative researchers reluctantly dipping their toes into the quantitative world while lacking the statistical literacy to understand the techniques they use.

A number of researchers noted that the lack of funding support and publication venues for data description studies meant they had been unable to pursue formally the anecdotal trends they had observed over the years in their own Twitter analyses.

Most researchers I heard from noted that they would not be changing their Twitter analytic workflows or research agendas in any substantive way, even with Twitter’s changes completely invalidating their approaches. Nor would they be seeking out new datasets that better reflected the geographies, communities and questions they were most interested in.

A common refrain was that Twitter is simply the dataset you have to work with today to stay relevant. One major funding agency, several of whose recent RFPs have mandated the use of Twitter data, put it more starkly: analyzing social media is the only way to get attention these days, and we want our grantees to get attention and high-profile publications.

The fact that researchers will not be rushing to adjust their analytic workflows and research agendas in response to Twitter’s changing landscape reminds us that data science has become more about hype than results.

The “big data” era was supposed to be about giving voice to our massive datasets.

It seems we are less and less interested in listening to those voices.

In their place, we have become fixated on who we are listening to rather than what they are telling us. It no longer matters whether Twitter is the right dataset to answer our questions, it matters only that we have the word “Twitter” in the title of our results.

Putting this all together, big data has sadly become nothing more than a shiny buzzword like “deep learning” and “social media,” to be sprinkled like fairy dust on grant applications and publications to dramatically increase their likelihood of success.

We no longer seem to care what our data actually says, whether we are asking the right questions of it or whether it has any relevance to our work. It has become merely a password to unlock funding and publication success rather than something that actually influences our findings.

After all, if our data actually mattered, we would be spending the time to understand it and adjusting our methodologies and research agendas to match its strengths and limitations.