Is Twitter's Spritzer Stream Really A Nearly Perfect 1% Sample Of Its Firehose?

This article is more than 5 years old.

One of the leading factors in Twitter’s popularity among data scientists is its set of JSON APIs. Among those APIs is its “Spritzer” or “Sample” stream, a realtime stream of a random sample of roughly 1% of all tweets. Unlike Twitter’s premium data offerings, the Spritzer stream is available free of charge and has become a mainstay of macro-level social media analytics and research. Previous work has thoroughly deconstructed the sampling mechanism used by Twitter to create the 1% stream and found it to be largely methodologically sound, though vulnerable to manipulation. However, there are unanswered questions about how closely the 1% stream hews to the complete Twitter firehose in practice and whether the two yield nearly identical results.

Using Crimson Hexagon to search Twitter for tweets mentioning climate change from January 2012 to October 2018 yields a total of 83.6 million tweets, while a search for global warming over the same period yields around 26.1 million tweets and the IPCC garners around 2.1 million tweets. On an average day there were around 33,500 climate change tweets, 10,500 global warming tweets and 850 IPCC tweets. This trio of topics, one medium, one small and one micro, offers a perfect testbed to determine how closely these results from the Twitter firehose match those of the free 1% stream over the same time period.

Comparing the results from Crimson Hexagon against the free 1% Twitter stream over the same period, all three timelines were highly correlated. Climate change was correlated at r=0.96, global warming at r=0.94 and IPCC at r=0.94.
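For readers who want to run this kind of comparison themselves, here is a minimal sketch of the Pearson correlation between two daily count timelines. The daily counts are invented purely for illustration and the function name `pearson_r` is hypothetical; this is not the code or the data behind the figures above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Hypothetical daily tweet counts for one topic (NOT real data):
# one list from the firehose, one from the 1% stream, same days in order.
firehose_daily = [31000, 35500, 29800, 40200, 33100]
sample_daily = [305, 362, 291, 410, 328]

r = pearson_r(firehose_daily, sample_daily)
print(f"r = {r:.3f}")
```

Because Pearson's r ignores scale, the 1% counts do not need to be multiplied by 100 before comparing them with the firehose counts.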

In short, from a purely volumetric trend detection standpoint, you will get the same story whether searching the firehose or the 1% stream, even for topics that average under a thousand tweets a day.

Searching for a much rarer topic like carbon sequestration yields just 60,500 tweets over the entire period and averages just 24 tweets on a typical day, with nearly a third of the period having 0 to 10 tweets per day. Even this extremely low volume topic still offered a timeline that was correlated at r=0.79 between the 1% and firehose streams.

Obviously the rarer a topic, the more exclusion error there will be in the 1% stream, simply due to the statistics of how likely any one of that topic’s tweets is to be found in a 1% sample. As the numbers above show, though, even topics with just a handful of tweets per day can still yield a representative signal.
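The arithmetic behind that exclusion error can be sketched in a few lines. The sketch assumes each tweet lands in the sample independently with probability 1%, which is an idealization rather than a description of Twitter's actual sampling mechanism, and the function names are hypothetical.

```python
from math import sqrt

RATE = 0.01  # assumed inclusion probability for each tweet

def expected_in_sample(n_tweets, rate=RATE):
    """Expected number of a topic's tweets that appear in the sample."""
    return n_tweets * rate

def prob_none_sampled(n_tweets, rate=RATE):
    """Chance that none of n_tweets appears in the sample."""
    return (1 - rate) ** n_tweets

def relative_noise(n_tweets, rate=RATE):
    """Standard deviation of the sampled count divided by its mean (binomial)."""
    return sqrt((1 - rate) / (n_tweets * rate))

# Average daily firehose volumes quoted in the article.
for topic, per_day in [("climate change", 33500), ("IPCC", 850),
                       ("carbon sequestration", 24)]:
    print(topic,
          round(expected_in_sample(per_day), 2),
          round(prob_none_sampled(per_day), 4),
          round(relative_noise(per_day), 2))
```

Under this assumption a topic with 24 firehose tweets on a given day has roughly a 79% chance of not appearing in the sample at all that day, while a topic with 33,500 tweets a day sees sampling noise of only about 5% of its sampled count.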

Looking at the 1% stream as a whole over the period January 2012 to October 2018, its volume has remained steady at very close to 1% of the full firehose across the nearly seven years of the comparison. Prior to February 2015 the 1% stream averaged around 0.99% of firehose volume, but after an adjustment made by Twitter in February 2015 its volume dropped to around 0.98% overall.
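A share like this is simply the sample's total tweet count divided by the firehose's total over the same days. The sketch below computes it on either side of a cutoff date. The daily counts are invented for illustration, and the cutoff of February 1, 2015 is an assumption, since only the month of the adjustment is known here.

```python
from datetime import date

def sample_share(pairs):
    """Sample volume as a percentage of firehose volume.

    pairs is a list of (sample_count, firehose_count) tuples.
    """
    sample_total = sum(s for s, _ in pairs)
    firehose_total = sum(f for _, f in pairs)
    if firehose_total == 0:
        raise ValueError("no firehose tweets in this period")
    return 100.0 * sample_total / firehose_total

def share_before_and_after(daily, cutoff):
    """Split (day, sample_count, firehose_count) rows at cutoff and compare."""
    before = [(s, f) for d, s, f in daily if d < cutoff]
    after = [(s, f) for d, s, f in daily if d >= cutoff]
    return sample_share(before), sample_share(after)

# Hypothetical daily totals (NOT real data).
daily = [
    (date(2015, 1, 30), 4950, 500000),
    (date(2015, 1, 31), 4960, 501000),
    (date(2015, 2, 1), 4900, 500000),
    (date(2015, 2, 2), 4910, 501000),
]

# Assumed cutoff: the exact day of the adjustment is not given in the article.
before, after = share_before_and_after(daily, date(2015, 2, 1))
print(f"before: {before:.2f}%  after: {after:.2f}%")
```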

The daily tweet volume between the 1% stream and the firehose over those seven years was almost perfectly correlated at r=0.987. The total volume of non-retweets was correlated at r=0.996 and tweets with URLs were correlated at r=0.986. Verified tweets, being far rarer, were correlated at only r=0.877.

In short, from the standpoint of daily volume counts, the findings here suggest that the results will be nearly identical whether one uses the 1% stream or the firehose. Rarer topics and less common attributes like verified users will exhibit more error due to the extreme sampling, but still yield almost identical trends.

This tight alignment between the contents of the free 1% and commercial firehose Twitter streams is critically important in that it suggests that findings derived from the 1% stream are likely to validly reflect genuine Twitter trends. Given how much of published academic research on Twitter relies on the free 1% stream, it suggests that the body of literature on Twitter is not unduly skewed by relying on the 1% sample.

In turn, this suggests that the kinds of advanced Twitter analyses that are not supported by realtime social media analytics platforms, like fine-grained linguistic analysis and even neural language modeling of topics, can be safely performed on the 1% data.

For example, one could examine the linguistic evolution of commentary on a topic like climate change and compare that to the baseline of the totality of Twitter to understand which words are statistically significant to climate-related conversations.
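One common way to run such a comparison is a log-likelihood "keyness" test, which scores how much more often a word appears in the topic's tweets than the baseline would predict. The sketch below is one possible approach, not a method taken from the analysis above. The token lists are toy data, the function names are hypothetical, and the 3.84 threshold is the standard chi-square critical value for p < 0.05 with one degree of freedom.

```python
from collections import Counter
from math import log

def log_likelihood(a, b, c, d):
    """G2 score for a word seen a times in c topic tokens and b times in d baseline tokens."""
    # Expected counts if the word were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * log(a / e1)
    if b > 0:
        g2 += b * log(b / e2)
    return 2 * g2

def keywords(topic_tokens, baseline_tokens, threshold=3.84):
    """Words overrepresented in the topic corpus, highest score first."""
    topic = Counter(topic_tokens)
    base = Counter(baseline_tokens)
    c = sum(topic.values())
    d = sum(base.values())
    scored = []
    for word, a in topic.items():
        b = base[word]  # Counter returns 0 for unseen words
        # Keep only words that are relatively more frequent in the topic corpus.
        if a / c > b / d:
            g2 = log_likelihood(a, b, c, d)
            if g2 >= threshold:
                scored.append((word, g2))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy token lists standing in for tokenized tweets (NOT real data).
topic_tokens = ["carbon"] * 30 + ["the"] * 70
baseline_tokens = ["carbon"] * 5 + ["the"] * 995

result = keywords(topic_tokens, baseline_tokens)
print(result)
```

In practice the baseline tokens would come from the full 1% stream and the topic tokens from the subset of that stream matching the query, so both sides share the same sampling.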

Putting this all together, we see that the Twitter Spritzer/Sample stream offers a fairly stable and robust 1% sample of the totality of Twitter and that this sample has remained stable over at least the past seven years. Volume timelines derived from this sample closely match those produced from the firehose at both the micro scale of a single query and the macro scale of Twitter as a whole. Even very low volume queries that generate just a handful of tweets per day still yield daily attention timelines highly correlated with their firehose results.

In the end, it seems the Twitter 1% stream can be reliably and robustly used as a proxy for Twitter as a whole.