Visualizing Seven Years Of Twitter's Evolution: 2012-2018

This article is more than 5 years old.

Kalev Leetaru

In 2012 I published one of the first in-depth explorations of the geography of Twitter, examining where we were talking from and about in Twitter’s early days of explosive growth. At the time Twitter was spreading rapidly across the world and was highly correlated with historical NASA Night Lights imagery, leading to my conclusion that “where there is electricity there is Twitter.” I followed that in 2015 with a look back at three years of Twitter as its growth leveled off: the platform was entrenching rather than expanding, and the number of daily tweets and tweeting users had remained constant since July 2013. Four years later it is time to look back over seven years of Twitter’s growth to explore the evolution of one of the world’s most influential social networks.

During its heyday of exponential growth, Twitter used to provide detailed growth statistics at regular intervals chronicling its latest milestones of tweet counts and unique users per day and the latest geographies and languages it had added. The company largely halted these updates as usage leveled off and the platform stopped expanding, leaving an information vacuum around even the most basic statistics like how many tweets per day are sent or how many of those tweets are retweets, include geotags or contain links.

Since its growth leveled off, Twitter has provided few insights into the state of its platform. While a spokesperson did confirm that geotagged tweets have consistently averaged around 1-2% of total tweet volume since 2012, the company declined to comment when asked about broader platform metrics like how many total tweets had been sent or how retweeting behavior was changing. It seems that any question that might shed light on Twitter’s growth trajectory is met with silence.

Thankfully, Twitter makes a free realtime stream of 1% of all tweets available, known as its Spritzer or Sample stream. This stream is nearly perfectly correlated with the full firehose, making it an ideal proxy through which to study the longitudinal evolution of the platform.

To explore Twitter’s development over the last seven years, the Twitter 1% stream from January 1, 2012 through October 31, 2018 was examined. Due to technical issues on the monitoring side, some days have incomplete data that would otherwise skew the results. To remove them, the total number of tweets sent each day was compared to the average of the two days before and two days after. Any day whose tweet volume was more than 10% higher or lower than the surrounding average was dropped from analysis.
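
The filtering step described above can be sketched roughly as follows. This is a minimal illustration rather than the code used for the analysis: the function name `filter_partial_days`, the handling of the first and last days (which simply use whatever neighbors exist) and the treatment of a zero baseline are all assumptions.

```python
from statistics import mean


def filter_partial_days(daily_counts, threshold=0.10):
    """Return the indices of days that pass the neighbor-average check.

    daily_counts: tweet counts for consecutive days, oldest first.
    Each day is compared to the mean of up to two days before and
    up to two days after it (assumption: edge days use fewer neighbors).
    """
    kept = []
    for i, count in enumerate(daily_counts):
        neighbors = daily_counts[max(0, i - 2):i] + daily_counts[i + 1:i + 3]
        if not neighbors:
            # A lone day has nothing to be compared against; keep it.
            kept.append(i)
            continue
        baseline = mean(neighbors)
        # Days with a zero baseline are dropped (assumption).
        if baseline > 0 and abs(count - baseline) / baseline <= threshold:
            kept.append(i)
    return kept


# Hypothetical counts: the fourth day is 20% below its neighbors.
print(filter_partial_days([100, 100, 100, 80, 100, 100, 100]))
```

One property of this kind of filter is that a severely incomplete day also lowers the baseline of its neighbors, so adjacent days can be dropped along with it.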

This yielded a final 1% dataset that is nearly perfectly correlated with the daily volume of the firehose at r=0.987 and has remained at nearly a perfect 1% sample over the entire seven years.

After removal of partial days there were around 7.9 billion tweets in the sample, with around 437 million distinct tweeting users, of which around 0.07% of users were “verified” accounts.

Twitter exhibits a strong bias towards a small number of users accounting for a disproportionate amount of total tweet volume. The top 1% most prolific users sent 28% of all tweets in the sample during the period, with the top 5% accounting for 57% of tweets, the top 10% accounting for 71% of tweets and the top 15% of users accounting for 80% of all tweets. These ratios do not appear to have changed appreciably over the seven years, suggesting they are defining characteristics of such a digital public forum in which a small central set of elites drive the overall conversation.
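
For readers who want to reproduce this kind of concentration statistic on their own data, a minimal sketch follows. The function name `top_share` and the rounding rule for how many users fall in the "top" group are assumptions, and the per-user counts are made up for illustration.

```python
def top_share(tweets_per_user, top_fraction):
    """Fraction of all tweets sent by the most prolific users.

    tweets_per_user: one tweet count per distinct user.
    top_fraction: e.g. 0.01 for the top 1% of users.
    """
    counts = sorted(tweets_per_user, reverse=True)
    # Assumption: round to the nearest whole user, with at least one user.
    n_top = max(1, round(len(counts) * top_fraction))
    return sum(counts[:n_top]) / sum(counts)


# Hypothetical ten-user population, not data from the 1% stream.
counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
print(top_share(counts, 0.1), top_share(counts, 0.2))
```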

The timeline below shows the daily tweet volume of the 1% sample multiplied by 100 to reflect the estimated total daily tweet volume of Twitter 2012-2018. Here we can see the explosive growth period of my original 2012 study, the steady state period of my 2015 study, a sharp decrease in late 2015 and a slow steady decline through the end of October 2018. The two gaps in late 2014 and early 2015 represent periods where technical issues on the monitoring side precluded monitoring the complete 1% stream; to avoid skewing the results with partial data, these days were removed by the outlier filtering described above.

By the end of October 2018, the total daily tweet volume of the 1% stream was roughly the same as it was in June 2012, more than six years prior.

Projecting from these trends we can reasonably assume that around 1.1 to 1.2 trillion tweets have been sent since the service’s founding in 2006.

As Twitter volume has edged downward, has the active user base decreased with it? A lower volume of tweets could suggest that there are the same number of users as always, but they are simply saying less than they used to. Alternatively, Twitter users could be saying just as much as they always have, but there could simply be fewer users tweeting each day.

Dividing the number of daily tweets by the number of daily distinct users sending those tweets in the 1% stream yields a ratio that has remained fairly consistent over time, suggesting that the platform is not further centralizing towards a small number of power users and instead is simply being used less overall. Indeed, daily tweet volume and the number of distinct users tweeting in the 1% stream is correlated at r=0.96.

Dividing the estimated daily firehose volume from above by the daily tweet/user ratio above, we get the following estimate of the total number of distinct users in Twitter as a whole sending one or more tweets per day.
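
The arithmetic behind this estimate can be written out as a short sketch. The function name and the example figures are hypothetical, and the calculation inherits the assumption made above that the tweets-per-user ratio observed in the 1% stream carries over to the full firehose.

```python
def estimate_daily_tweeting_users(sample_tweets, sample_users, scale=100):
    """Estimate distinct users tweeting per day across all of Twitter.

    sample_tweets: tweets seen in the 1% stream on one day.
    sample_users: distinct users sending those tweets.
    scale: multiplier from the 1% sample to the full firehose.
    """
    estimated_firehose_tweets = sample_tweets * scale
    tweets_per_user = sample_tweets / sample_users
    return estimated_firehose_tweets / tweets_per_user


# Hypothetical day: 4 million sampled tweets from 2.5 million users.
print(estimate_daily_tweeting_users(4_000_000, 2_500_000))
```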

These numbers are significantly lower than Twitter’s official estimates of “monthly active users.” The reason is that the timeline above estimates only users who actually published one or more tweets each day, whereas Twitter counts active users as “users who logged in or were otherwise authenticated and accessed Twitter through our website, mobile website, desktop or mobile applications, SMS or registered third-party applications or websites.”

In other words, Twitter counts all users who log into Twitter, whereas the graph above estimates users who actually contribute to the platform by posting public tweets.

In information science parlance we might say that Twitter counts both contributors and lurkers, while the graph above counts only contributors. While lurkers are important from the standpoint of monetization, since they can receive ads, it is contributors that are the fuel of a platform, generating the content that makes people return for more.

What about verified users? Some social networks elsewhere in the world have experienced a centralization effect as they aged, where the platforms increasingly revolved around a set of elite users posting and the rest of the platform reacting or sharing those posts, rather than publishing their own content.

The timeline below shows the percent of daily tweets in the 1% stream and the percent of daily distinct tweeting users in the 1% stream that were verified. The percentage appears to have largely leveled off and has barely risen above half a percent over the last seven years.

Looking more closely, the impact of Twitter’s November 2017 pause to its verification program can be clearly seen.

One limitation of the graph above is that it reflects only the percentage of tweets sent by verified users themselves. In contrast, the behavior seen in some other platforms is that the daily output of elite users remains the same, but the rest of the user community increasingly shifts to merely sharing those elite posts, rather than posting their own original content.

To explore the degree to which this is occurring in Twitter, the timeline below of the 1% stream combines tweets sent by verified users with retweets of verified users’ tweets sent by ordinary users. While the graph above only included tweets sent directly by verified users, the graph below captures the far more common scenario: a verified user’s tweet going viral and being retweeted by the rest of Twitter.

Immediately clear from this timeline is that Twitter is following in the footsteps of previous social networks in gradually centralizing around a core of elites whose views are then amplified and forwarded by the masses. From less than 1% of Twitter seven years ago to almost 11% of total tweet volume today, verified users are rapidly taking over the social conversation, though there appears to have been a slowdown of late.

Of course, these graphs only examine verified users. It is likely that an even greater fraction of Twitter volume consists of non-verified but high-follower-count users being similarly retweeted.

This raises the question of whether Twitter is becoming less of a place to share original content and more of a place to merely retweet the content of others.

In other words, is Twitter transitioning from a content platform to a behavioral platform?

The timeline below would appear to support this hypothesis. It shows the percentage of all daily tweets in the 1% stream that are retweets. From around 20% of Twitter consisting of retweets in January 2012 to around 53% by October 2018, retweeting has increased nearly linearly over time.

After removing hyperlinks, user mentions and the capitalized phrase “RT” by itself with a space after it, there were only around 5.4 billion unique tweet messages in the 1% stream over the seven years.

In short, only around 65% of the total 1% stream tweet volume over this period was actual unique mineable textual content. For social analytics companies, this means they can achieve significant computational savings by hashing the contents of each tweet and only processing new tweets.
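
A minimal sketch of that hashing approach is shown below. The regular expressions, the choice of SHA-1 and the whitespace collapsing are assumptions made for illustration; they approximate the normalization described above (removing hyperlinks, user mentions and a standalone "RT " marker) rather than reproduce it exactly.

```python
import hashlib
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\bRT ")  # capitalized "RT" followed by a space


def normalize(text):
    """Strip links, mentions and the RT marker, then collapse whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = RT_RE.sub("", text)
    return " ".join(text.split())


def text_key(text):
    """Stable fingerprint of the normalized tweet text."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def new_messages(texts):
    """Yield only tweets whose normalized text has not been seen before."""
    seen = set()
    for text in texts:
        key = text_key(text)
        if key not in seen:
            seen.add(key)
            yield text


tweets = [
    "Big news today https://t.co/abc",
    "RT @alice Big news today https://t.co/xyz",
    "Something else",
]
print(list(new_messages(tweets)))
```

At stream scale the in-memory `set` would likely be replaced by a persistent or probabilistic structure, but the principle of processing each distinct message only once is the same.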

The steady rise in retweets could come from either a small number of users sending a large number of retweets while others send fewer tweets or all users transitioning from original content to retweets. The timeline below shows the percentage of unique tweeting users in the 1% stream that sent at least one retweet each day. The graph’s near identical trajectory to that of overall retweeting suggests this is a Twitter-wide behavioral change.

This rapid transition from original commentary towards simple retweeting has profound implications for how we think about Twitter as a data source.

Rather than as a source of first person commentary on the events of the moment, Twitter is increasingly becoming a place where users share the commentary of others.

Discussions on Twitter can also take the form of replies in which one user responds directly to the tweet of another user. Replies have also trended downwards in the 1% stream, from around 26% in January 2012 to a low of 14% in May 2017, but have trended slowly back up in the year since, to around 18% by October 2018 (using the reply information contained in the tweet JSON record).

As a percent of distinct daily tweeting users in the 1% stream, replying users follow largely the same trajectory as total reply volume, once again suggesting this is a broad behavioral shift on the platform.

In addition to retweets and replies, tweets can also simply mention another user. In all, just over 263 million distinct users were mentioned by name in the 1% stream over the seven years. The top 15 users mentioned most often are seen below, with most being celebrities.

YouTube 29762560
BTS_twt 20393700
justinbieber 10771692
NiallOfficial 7371952
realDonaldTrump 6707246
Harry_Styles 6378236
Real_Liam_Payne 4836841
weareoneEXO 4133300
Louis_Tomlinson 4013812
zaynmalik 3435909
ArianaGrande 3278748
Luke5SOS 3046707
camerondallas 2839582
onedirection 2839545
Gurmeetramrahim 2787755

Tweets that mention other users have been steadily increasing over the years in the 1% stream, from around 57% of tweets seven years ago to around 72% today.

One notable feature of the graph above is that the density of user-mentioning tweets remains fairly steady in the first quarter of the timeline during Twitter’s growth period. As Twitter’s tweet volume stops growing and levels off, user mentions increase. As Twitter’s daily tweet volume begins to decrease, the percent of tweets mentioning other users begins to increase rapidly and steadily.

This suggests that, as with retweets, Twitter is becoming a place to talk to and about others, rather than offer one’s own reflections about the world.

What about the connection between Twitter and the outside world?

A core finding of Twitter’s evolution has been the changing number of hyperlinks shared across the service. From 2009 through late 2011 link sharing decreased rapidly Twitter-wide, but increased beginning in 2012 and leveled off in mid-2016, before decreasing again beginning in late 2017.

The 2012-2018 portion of this evolution can be seen in the graph below, showing the percent of all tweets in the 1% stream that contained a URL.

The rapid transition of Twitter links from HTTP to HTTPS can be seen in the graph below, which shows the percentage of URL-containing tweets in the 1% stream that contained at least one HTTP URL. Initially the transition to HTTPS was gradual, picking up speed in early 2013. In April 2015 the site began rapidly progressing away from HTTP. Prior to October 19, 2015 the site was averaging around 84% of URL-containing tweets having an HTTP link. On October 19 that percent dropped to 69%, on October 20th it dropped to 12% and by the 21st it had dropped to just 8%.

Over the course of just 48 hours Twitter transitioned its entire platform to HTTPS links in tweets, showing the power of centralized platforms to transition web-scale practices. Twitter’s transition follows Google’s concerted push of the at-large web towards HTTPS through its Chrome browser warnings.

URLs in tweets can come from two primary sources: embedded images and links to external content. Twitter actually distinguishes between these two cases in the underlying tweet JSON record, classifying each URL as a media object or link. The timeline below shows the percent of all tweets in the 1% stream that fell into these two categories, using the metadata Twitter includes with each tweet record. It is important to remember that Twitter only classifies tweet-embedded images as media objects, meaning that a URL link to an image on an external photo sharing site will be counted as a link, rather than an image in the graph below.
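
As a rough sketch of how that distinction can be read from a tweet record, the snippet below checks the `urls` and `media` lists under the tweet's `entities` object, which is where Twitter's classic tweet JSON places them. The function name and the sample records are hypothetical, and the exact fields examined in the analysis are not stated in the text.

```python
def classify_urls(tweet):
    """Report whether a tweet carries external links and/or embedded media.

    tweet: a dict parsed from a tweet JSON record.
    """
    entities = tweet.get("entities") or {}
    return {
        "has_link": bool(entities.get("urls")),
        "has_media": bool(entities.get("media")),
    }


# Hypothetical, heavily trimmed tweet records.
link_tweet = {"entities": {"urls": [{"expanded_url": "https://example.com/a"}]}}
photo_tweet = {"entities": {"urls": [], "media": [{"type": "photo"}]}}
print(classify_urls(link_tweet), classify_urls(photo_tweet))
```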

Looking only at external links in the 1% stream, just under 1.7 billion links were shared, of which 981 million were unique (though many were shortened even in their “expanded” form in the JSON record, making it impossible to know how many true unique external URLs were pointed to). The expanded URLs listed in the Twitter JSON record hailed from 8.7 million distinct domains, reflecting the incredible variety of the web captured in Twitter’s 1% stream outlinks.

The extremely high density of links shared on Twitter each day suggests it could be a powerful resource for web archives to monitor. At the same time, the enormous variety of domains being shared suggests many of these links may be to other social media platforms, product pages and other content that may be of less relevance to resource-constrained archives.

The top 30 domains of the expanded URLs in the 1% stream are seen below. Many expanded URLs merely point to other shortening services, but overall there are numerous social media platforms represented.

twitter.com 277512131
bit.ly 149263669
fb.me 87000043
goo.gl 58421122
youtu.be 53516567
instagram.com 53033977
du3a.org 41671416
dlvr.it 30291869
youtube.com 27615134
ift.tt 26443383
vine.co 24864516
ow.ly 21376084
ask.fm 18990059
tmblr.co 16746731
tinyurl.com 13704254
instagr.am 13344307
gigam.es 11811292
facebook.com 10558347
d3waapp.org 9783098
twitpic.com 9771329
amzn.to 8954377
4sq.com 6373457
google.com 6182142
qurani.tv 5483571
tumblr.com 5349766
wp.me 5148355
buff.ly 5142870
fllwrs.com 4919621
twcm.me 4791687
path.com 4657660

Many websites offer their own bespoke URL shorteners and compiling an exhaustive list of all shorteners offered worldwide is beyond the scope of this analysis. However, manual review suggests that domains of 7 or fewer characters are most often shortening services. In the 1% stream such domains account for 6% of all domains shared in Twitter links and 47% of all links. Remember that this is after Twitter’s own URL expansion, meaning users are often sharing already-shortened links, rather than asking Twitter to shorten the link directly.
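
The length heuristic can be sketched in a few lines. Two details here are assumptions rather than anything stated above: the length is measured over the full domain string including the dot (so "bit.ly" counts as 6 characters), and a leading "www." is stripped before measuring.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Extract the lowercased host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def looks_like_shortener(domain, max_len=7):
    """Heuristic only: very short domains are most often link shorteners."""
    return 0 < len(domain) <= max_len


for url in ["https://bit.ly/2abc", "http://www.youtube.com/watch?v=1"]:
    domain = domain_of(url)
    print(domain, looks_like_shortener(domain))
```

The heuristic is deliberately crude: it will miss longer shorteners such as youtu.be or tinyurl.com and may flag short domains that are not shorteners at all.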

Of the remaining 1% stream links to domains longer than 7 characters, 4.8% of domains and 69.8% of links are to domains in the Alexa Top 1 Million list circa May 2018. Around 1.7% of domains and 66.8% of links were to domains in the top 250,000 most popular sites listed in the Alexa Top 1 Million list.

News media, on the other hand, does not appear to constitute a majority of links shared on Twitter, accounting for just 0.18% of domains and 5.4% of all links.

What is the combined impact of all of this on the total volume of textual content flowing through Twitter’s firehose each day to data miners and social analytics companies?

Removing hyperlinks (since shortened URLs aren’t useful for text mining), user mentions (since those are treated as connections rather than mineable text) and retweets (since they are the equivalent of a “like” and useful as a behavioral signal rather than novel content to be mined), the timeline below shows the average characters per tweet and average bytes per tweet in the 1% stream over time. Twitter uses UTF-8 encoding, meaning that for languages outside the ASCII character set, a single letter can require multiple bytes.
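
The difference between characters and bytes is easy to see in a small sketch. Here "characters" are counted as Unicode code points, which is how Python's `len` works on strings; whether the analysis counted characters the same way is an assumption.

```python
def text_sizes(text):
    """Return (characters, UTF-8 bytes) for a piece of tweet text."""
    return len(text), len(text.encode("utf-8"))


print(text_sizes("hello"))       # ASCII: one byte per character
print(text_sizes("こんにちは"))    # Japanese: three bytes per character
print(text_sizes("😀"))          # emoji: four bytes for one character
```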

Interestingly, the average tweet length globally in the 1% stream has remained unchanged, at around 45 characters per tweet, while the number of bytes per tweet has trended steadily upwards. This suggests Twitter has become more multilingual over time, spreading to languages that require multiple bytes per character or has adopted elements like emojis that require additional storage.

What does this mean for the total volume of novel textual content posted to Twitter each day? Multiplying the total bytes per day of text in the 1% stream (after removal of links, user mentions and retweets) by one hundred, the graph below shows the estimated total volume of novel textual content flowing across Twitter’s firehose each day.

From an estimated 11GB of text per day in January 2012 to a peak of around 20.5GB of text per day in July 2013 down to around 10.5GB per day in October 2018, the estimated total volume of daily text posted to Twitter is quite small.

In total, an estimated 33TB of text was posted to Twitter as a whole over the sample period, after links, user mentions and retweets are removed.

How does the total volume of tweet text compare to worldwide online news output as monitored by the open data GDELT Project? Over the period November 2014 to October 2018 GDELT monitored just over 2.8TB of text while the Twitter firehose contained an estimated 16.5TB of text after links, user mentions and retweets are removed.

That would mean that the estimated totality of all mineable Twitter text was just 5.8 times larger than news content. That ratio would be even lower if hashtags were excluded.

So why do we talk about social media as “big data” if it is not actually that much larger than traditional digital archives like news?

The reason is that when we talk about small-message social media platforms like Twitter as “big data” we typically fixate on their total number of posts. Twitter’s trillion-tweet archive certainly sounds at first glance like an enormous archive of content. However, by their very design, small message platforms consist of large volumes of very short posts, meaning the actual total volume of text contained in those trillion tweets is quite small.

Given that many text mining tools operate exclusively in English, how much of Twitter is in English?

While Google’s CLD2 language detection library is not well-tuned for the brief snippets of text that define Twitter, its open source availability and high processing speed mean it has a history of being used as a rapid screener for tweets even if its error rate is higher than purpose-built tools.

To determine the language of each tweet in the 1% stream, hyperlinks and user mentions were removed and the remaining text was sent through CLD2 with its extended language set enabled. The pound symbol was removed, but actual hashtag tags were left in place to ensure the maximal amount of text for each tweet, even though hashtags may skew language classification since non-English tweets may use English hashtags and vice-versa.
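
The text preparation described above might look something like the following sketch. The regular expressions and function name are assumptions, and the language detection call itself is omitted: the cleaned string would simply be handed to CLD2 or another detector afterwards.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def prepare_for_language_id(text):
    """Remove links and mentions, drop '#' but keep the hashtag words."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")
    return " ".join(text.split())


print(prepare_for_language_id("@bob Loving the #WorldCup final https://t.co/x1"))
```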

The resulting timeline below suggests English language tweets do not constitute a majority of tweets and that they have been falling steadily as a percentage of all tweets, at least in the 1% stream. However, again, these results must be taken with a grain of salt given CLD2’s imperfect alignment with the small text snippets of Twitter.

Hashtags are used as a linguistic bridge on Twitter, bringing together all of the tweets about a particular topic or event using a common metadata tag, despite differences in their language or word choices.

In all, just over 106 million distinct hashtags appeared in Twitter’s 1% stream during the analysis period (using the Twitter-extracted hashtag list in the entities field of the JSON record). The top 10 hashtags appear to be largely technology and entertainment related.

iHeartAwards 12400969
gameinsight 11550723
MTVHottest 10301625
BestFanArmy 9038510
RT 7733509
androidgames 6537899
android 6090706
izmirescort 5326195
MTVStars 5209870
BTS 4844403

Intriguingly, hashtag use appears to have leveled off and is actually decreasing on the platform. The timeline below shows the percent of all 1% tweets that contained at least one hashtag per their JSON record, showing a rapid increase from late 2012 through mid-2016, then a sharp drop and a steady decline through October 2018.

It is unclear what might be driving this reduction in hashtag use, but their historical popularity as an organizing and search tool raises questions about whether some new form of organizing behavior is emerging on Twitter to replace hashtags or if users are simply finding ways to express themselves without the summarization of hashtags.

Given that Twitter usage is slowing, both in terms of total tweets and unique tweeting users and that retweets are steadily increasing, especially of verified users, this raises the question of whether Twitter is aging.

By taking each tweet and computing the number of elapsed days between that tweet being sent and the date the user account sending the tweet was created and averaging that delta across all tweets sent each day, we can compute the average “age” of Twitter’s tweeting user accounts over time.

If Twitter was steadily adding new users as it lost users, but merely added slightly fewer users each day than it lost (for example adding 90 new users for each 100 users it lost), the service would experience a decline in overall tweets, but would be kept “young” by a constant influx of new users constantly modernizing it. In such a case, one would expect the average “age” of user accounts to remain relatively constant as older accounts are steadily replaced with newer accounts over time.

Alternatively, what if Twitter largely consisted of a core of existing users that was steadily aging, with an ever-slowing rate of user change? Like an aging social club without new members, the platform would continue to be widely used, but the slowing growth in new users would mean the site would become increasingly insular. In such a case the average age of user accounts would be expected to increase linearly over time as more and more of the service’s tweets are written by an older and older established community of user accounts.

The timeline below shows the final result, computing the average “age” in days of all user accounts sending tweets in the 1% stream each day. Since averages can be skewed by outliers, the timeline shows both the mean and median ages.
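
A minimal sketch of the account "age" calculation follows. It assumes each tweet record carries a `created_at` timestamp for the tweet and another under `user` for the account, both in the classic Twitter timestamp format, and that the default English month and weekday names are in effect when parsing; the function names and sample records are hypothetical.

```python
from datetime import datetime
from statistics import mean, median

# Timestamp layout used in classic tweet JSON records,
# e.g. "Wed Oct 31 12:00:00 +0000 2018".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def account_age_days(tweet):
    """Whole days between account creation and the tweet being sent."""
    sent = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    joined = datetime.strptime(tweet["user"]["created_at"], TWITTER_TIME_FORMAT)
    return (sent - joined).days


def daily_age_summary(tweets):
    """Mean and median account age across one day's tweets."""
    ages = [account_age_days(t) for t in tweets]
    return mean(ages), median(ages)


# Two hypothetical tweets sent on the same day by accounts of different ages.
day = [
    {"created_at": "Wed Oct 31 12:00:00 +0000 2018",
     "user": {"created_at": "Tue Oct 31 12:00:00 +0000 2017"}},
    {"created_at": "Wed Oct 31 12:00:00 +0000 2018",
     "user": {"created_at": "Mon Oct 01 12:00:00 +0000 2018"}},
]
print(daily_age_summary(day))
```

Because the average is taken over tweets rather than over distinct users, prolific accounts weigh more heavily in the result, which matches the per-tweet description above.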

The timeline’s steady march upwards reflects a platform with an established user base that is aging over time without a sufficient influx of new users. This can be seen in the timeline below, which plots the median creation year of all accounts tweeting in the 1% stream each day. In keeping with the timeline above, this graph shows a gradual steady aging of the platform.

Putting all these timelines together we see a social media platform that is evolving from a place for ordinary citizens to express themselves into a place to forward and share the commentary of others, especially older and elite accounts.

From an algorithmic standpoint, Twitter is transitioning from a content dataset filled with text to be mined using natural language processing algorithms into a behavioral dataset containing attentional information in the form of shares. Behavioral data can offer an extremely powerful lens onto global attention but requires very different methodological and algorithmic approaches from the content-based ones we have been using to date.

What about the geography of Twitter?

Each of the graphs above charts Twitter’s progression through time, but for a platform that is supposed to let us hear from the world, it is space that is even more important to understand.

Geography on Twitter comes in two forms: the space from which a tweet is sent and the space about which a tweet focuses. These two geographies are distinct.

A tweet might be sent by someone standing in Central Park in New York City, discussing their views on a sporting event taking place in London. Someone else actually sitting in the stadium in London watching the match in person might also tweet their thoughts. While both tweets refer to the same event, knowing that one was sent by someone witnessing it firsthand, while the other represents someone somewhere in the world commenting from afar, might lend more credence to the details of the former.

Mapping the locations tweets are about is quite powerful but tells us nothing about whether the users talking about those places are actually physically there or merely speaking about them from afar. Any large corpus of text can be mined for mentions of place.

Instead, it is the geography of creation that formed the greatest promise of Twitter in its early years. Twitter embodied the idea that suddenly ordinary citizens anywhere on earth could use their smartphones to live report what they are experiencing and feeling as they go about their lives. In turn, breaking news could be experienced from afar through the eyes of those witnessing and participating in the events.

Powering this vision was the idea that all across the world there are ordinary people equipped with GPS-enabled smartphones and the Twitter app. By recording those GPS coordinates with each tweet, it becomes possible to see precisely where a user is when they send a tweet, allowing us to map the global discourse in a way never possible before with any previous dataset at this resolution.

To put it another way, every piece of information is created in a specific geographic location. The mobile-centric nature of Twitter meant this geography of creation could be preserved and was envisioned as a primary index over Twitter, powered by GPS-equipped smartphones and users willing to share their locations.

So how precisely does one place tweets on a map?

There are three primary sources of user geography on Twitter.

The most accurate are GPS-tagged tweets in which a user permits their smartphone to attach their precise device-provided GPS coordinates to their tweet as they send it. This permits that tweet to be mapped down to the side of a specific street intersection they were standing on when they sent it.

Less accurate are “Place” geotags in which the phone’s GPS coordinates can be used to provide the user a predefined list of nearby locations to select from. Instead of broadcasting their GPS coordinates, the tweet might offer only that it was sent from “London” or “The United Kingdom” or at best a particular business address.

Finally, users can also specify their location in free text in either their User Location or User Description fields. These are freeform text, however, meaning users can type anything they want, including fantasy locations like “The Moon” or “The Shire.”

Freeform textual locations in the Location and Description fields are not verified sources of information. Users can type any location they want and activism campaigns may even encourage users to change their location to a given city or country to draw attention to events there.

The timeline below shows the total percent of tweets and distinct tweeting users that had a non-empty Location field over time in the 1% stream. From a high of 70% in January 2012 the percentage of accounts providing a Location value dropped to around 47% before recovering to around 62% as of October 2018. In total around 51% of distinct users populated the Location field over the seven years, representing around 58% of tweets in the 1% stream.

Though the Location field can be readily geocoded, its unverified nature and imprecise resolution mean it is little more useful than other forms of textual geography.

Indeed, Twitter itself cautioned in 2009 that “since anything can be written in this field, it’s interesting but not very dependable.”

This is the reason that Twitter officially introduced geotagged tweets in 2009 in which tweets could be directly assigned the GPS coordinates where they were sent. Mindful of the privacy concerns in publishing precise GPS coordinates, Twitter Places followed shortly thereafter, allowing users to share a coarser geographic indicator such as a city or specific business. Rather than offer a precise GPS coordinate, Twitter Places offer a bounding box defining the total area of interest, from which an approximate centroid can be computed for mapping.
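
One simple way to derive such a centroid is sketched below. It assumes the place record exposes its bounding box as a GeoJSON-style polygon of `[longitude, latitude]` pairs and takes the midpoint of the extremes; the function name and sample box are hypothetical, and boxes that straddle the 180° meridian would need special handling that is omitted here.

```python
def bounding_box_centroid(place):
    """Approximate (latitude, longitude) center of a Place bounding box.

    place: dict with bounding_box.coordinates as [[[lon, lat], ...]].
    """
    ring = place["bounding_box"]["coordinates"][0]
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return (min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2


# Hypothetical box loosely covering a city-sized area.
place = {
    "bounding_box": {
        "type": "Polygon",
        "coordinates": [[[-0.5, 51.3], [0.3, 51.3], [0.3, 51.7], [-0.5, 51.7]]],
    }
}
print(bounding_box_centroid(place))
```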

With the debut of Places, users could choose to share no location information at all with their tweets (the default), share their precise GPS coordinates (the most sensitive) or meet halfway and share only the country, city, neighborhood or business they sent the tweet from.

For most mapping uses, it is enough to know that a tweet about a football game came from someone physically sitting in that stadium watching that game in person rather than someone watching it on television on the other side of the world. Knowing the specific seat they were sitting in, courtesy of their GPS coordinates, would be overkill for most purposes. Similarly, a consumer product brand just needs to know that someone in the Wall Street area of Manhattan enjoys their product; they don’t need to know which street corner the person was standing on when they commented about it.

At the same time, that sports stadium might find it useful to aggregate fan feedback by stadium section, to understand what fans in each part of the stadium talked the most about. Similarly, a brand might wish to know which of their physical stores a customer was in when they commented about it being run down or having a negative interaction with staff.

Thus, there is a natural tension between Twitter’s core design of placing tweets on a map, for which city-level data is likely perfectly fine in most cases and the precision needs of commercial brands wishing to harness tweet geography for enhanced understanding of their customers.

So how common are geotagged tweets?

The timeline below shows the percent of all tweets in the 1% stream that were geotagged, including both GPS-tagged tweets and Place-tagged tweets.

Kalev Leetaru

Immediately clear is the steady decrease in geotagged tweets over the past seven years, from a high of just over 3.5% of tweets to around 1.5% of tweets since early 2017.

It appears that over the last seven years Twitter’s user base has been less and less interested in sharing their location in any form.

This has fascinating implications for how we think about locative privacy in the digital era. Given the choice of broadcasting their precise GPS coordinates, sharing only their city or country or sharing no verified location information at all, it appears that Twitter users have decided more and more not to share any form of verified locative information, even coarse city-level information. As companies increasingly look to collect, monetize and even outright sell our realtime locations collected from our phones, it is noteworthy that at least Twitter’s user base has made it clear that they view their verified location as something increasingly too sensitive to share.

This unwillingness to share even coarse verified locations has particular ramifications for proposals to combat “deep fakes” by requiring imagery and video to be digitally signed with the user’s GPS coordinates to prove that their smartphone camera was actually at that precise geographic place when it recorded an image. If users aren’t willing to allow their tweets to reveal even their GPS-verified city, why would they permit GPS-based signing of the content they generate for other purposes?

Where does Twitter’s geotagged geography come from? The timeline below shows the three primary sources of geotagged tweets in the 1% stream.

Kalev Leetaru

Before Twitter formally supported GPS-tagged tweets, some applications would record the GPS coordinates in the user’s textual Location field. In January 2012 this actually constituted the majority of all geotagged tweets in the 1% stream. Use of the Location field to record GPS coordinates decreased rapidly from the start of the analysis period through early 2015 and as of October 2018 accounted for around 0.08% of all tweets in the 1% stream. This field was largely missed by most early studies of Twitter’s geography, leading to a significant underrepresentation of geotagged tweets.
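As a rough illustration of how coordinates embedded in the free-text Location field might be recovered, the sketch below searches for a "latitude,longitude" decimal pair. The pattern, the latitude-first ordering, the function name and the sample strings are all assumptions made for illustration; real clients used a variety of formats that this would not fully cover.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: two signed decimal numbers separated by a comma.
_COORD_RE = re.compile(r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def parse_location_coords(location: Optional[str]) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) if the text holds a plausible coordinate pair, else None."""
    if not location:
        return None
    match = _COORD_RE.search(location)
    if match is None:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    # Reject pairs that cannot be valid latitude/longitude values.
    if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
        return (lat, lon)
    return None
```

A study that only inspected the dedicated coordinates field would skip every tweet this function matches, which is one way the early undercount described above could arise.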

Rather than record GPS coordinates in the textual Location field, Twitter clients are supposed to record them in a dedicated machine-readable geographic coordinates field in the tweet’s JSON record. Tweets containing this field steadily increased through late 2014, then decreased sharply, before falling vertically on April 28, 2015.

GPS-tagged tweets fell overnight from around 1.5% of all tweets in the 1% stream to 0.5% on April 28, 2015, and had declined further to around 0.27% by October 2018. The overnight drop coincides with a policy shift Twitter made regarding Place-geotagged tweets.

A Twitter spokesperson confirmed that prior to April 28, 2015 when a user elected to share only a coarse Twitter Place as their location instead of their precise GPS coordinate, Twitter would silently record their precise GPS coordinates in the tweet’s JSON record but would only display the Place in its public interface. Thus, a user that chose to report their tweet as being sent from “San Francisco” would see only “San Francisco” reported on Twitter’s website and apps, but in reality, their precise GPS coordinate had been quietly recorded into the JSON record of their tweet and distributed via all of Twitter’s commercial and free data streams, meaning data miners, social analytics companies and others could precisely pinpoint them. On April 28, 2015 Twitter switched its data streams so that they no longer included a user’s GPS coordinates unless the user explicitly elected to share their exact GPS-tagged location.

The impact of this transition can be seen in the timeline below, which shows the percent of all geotagged tweets in the 1% stream that included precise GPS coordinates. Almost overnight, the share of geotagged tweets carrying GPS coordinates drops from around 80-90% to just 20%, where it remains through October 2018.

Kalev Leetaru
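For readers who want to reproduce this kind of measurement on their own sample, the following sketch shows one way the GPS share of geotagged tweets could be computed. It assumes each tweet is a parsed JSON dictionary with optional `coordinates` and `place` keys, as in Twitter's v1.1-style tweet records; the function names are hypothetical, and this is not necessarily how the timeline above was produced.

```python
from typing import Any, Dict, Iterable


def geotag_kind(tweet: Dict[str, Any]) -> str:
    """Classify a tweet as 'gps', 'place' or 'none'.

    A tweet carrying both fields counts as 'gps', since the precise
    coordinate is the more specific of the two.
    """
    if tweet.get("coordinates"):
        return "gps"
    if tweet.get("place"):
        return "place"
    return "none"


def gps_share_of_geotagged(tweets: Iterable[Dict[str, Any]]) -> float:
    """Percent of geotagged tweets that include precise GPS coordinates."""
    gps = 0
    geotagged = 0
    for tweet in tweets:
        kind = geotag_kind(tweet)
        if kind == "none":
            continue
        geotagged += 1
        if kind == "gps":
            gps += 1
    return 100.0 * gps / geotagged if geotagged else 0.0
```

Run day by day over the stream, a function like this would be expected to show the same step change on April 28, 2015, because Place-tagged tweets stopped carrying a populated coordinates field.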

To illustrate the impact of this change on the geographic resolution of Twitter, the timeline below shows the total number of distinct coordinates (rounded to two decimals) conveyed by Place-geotagged tweets versus all geotagged tweets in the 1% stream. Geotagged tweets tagged with Place locations have an extremely limited set of locations, averaging only around 10,000 distinct locations on earth each day in the 1% stream. As GPS tagged tweets have continued to decline, the total number of distinct places on earth represented by Twitter’s geotagged tweets continues to decline.

Kalev Leetaru
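The distinct-coordinate count described above can be sketched in a few lines. The example assumes the input is a sequence of (latitude, longitude) pairs already extracted from tweets; the function name and sample points are hypothetical. Rounding to two decimals groups points into cells of roughly a kilometer in latitude.

```python
from typing import Iterable, Set, Tuple


def distinct_locations(
    points: Iterable[Tuple[float, float]], decimals: int = 2
) -> Set[Tuple[float, float]]:
    """Return the set of distinct (lat, lon) pairs after rounding."""
    return {(round(lat, decimals), round(lon, decimals)) for lat, lon in points}


# Hypothetical points: the first two fall in the same rounded cell.
sample_points = [(40.7128, -74.0060), (40.7131, -74.0059), (51.5074, -0.1278)]
distinct_count = len(distinct_locations(sample_points))
```

Because every tweet tagged with a given Place shares one centroid, Place-tagged tweets contribute at most one rounded cell per Place, while GPS-tagged tweets can contribute a new cell with nearly every tweet.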

In the place of precise GPS coordinates, we are now largely relegated to Twitter Places, which define bounding boxes around the given geography of interest.

This transition is significant both from the standpoint of data analysis and what it tells us about how people view their own location privacy.

Precision GPS-tagged posts were one of the selling points of working with Twitter data for researchers interested in mapping societal events and narratives in realtime. Reducing the resolution of tweets from the precision of GPS coordinates down to city-level centroids is devastating to data miners yet is a huge boon for the privacy of Twitter’s users.

To a typical Twitter user, the company’s privacy stance of reducing the resolution of geotagged tweets to city- or business-level bounding boxes affords them a level of locative privacy not available through many other popular social networks.

At the same time, data miners have historically repurposed these geographic coordinates for all sorts of purposes which were rendered obsolete in the space of 24 hours when Twitter made its switch. This reminds us how much of the “big data” world is based on repurposing data for applications it was never intended for and how easily that data can be changed with the flip of a switch in a way that breaks all of those use cases. Our “data repurposing” economy is far more fragile than we often realize.

Thus, since April 2015 the overwhelming majority of available Twitter geography comes in the form of Twitter Places. How different is the geographic picture we receive from Places?

Over the last seven years the Twitter 1% stream has recorded a total of 734,049 unique Place locations and a total of 3,360,101 unique location/language pairings (a given location might have tweets in multiple languages tagged to it). Since Place locations are typically mapped by converting their bounding boxes to point-mappable centroids, the result is 688,796 distinct centroids.

The top 20 locations are seen in the table below.

Place Name                           Number of Tweets
Türkiye                                     1,747,726
Rio de Janeiro, Brasil                      1,674,599
São Paulo, São Paulo                        1,095,383
Los Angeles, CA                               954,601
Rio de Janeiro, Rio de Janeiro                902,016
Turkey                                        867,954
São Paulo, Brasil                             778,227
İstanbul, Türkiye                             765,599
Chile                                         662,416
Houston, TX                                   659,268
Chicago, IL                                   569,738
Manhattan, NY                                 566,372
Porto Alegre, Rio Grande do Sul               560,538
Brasil                                        553,391
Ciudad Autónoma de Buenos Aires, Argentina    480,294
Curitiba, Paraná                              418,697
Buenos Aires, Argentina                       416,561
Texas, USA                                    416,550
Florida, USA                                  409,976
Georgia, USA                                  409,309

Twitter classifies all Place locations into one of five types, ranging from recording just the country the user was in when they sent a tweet down to a specific point of interest like a museum, restaurant, airport or business office.

As the table below makes clear, the vast majority of Place-geotagged tweets in the 1% stream record the person’s location only down to the city level, followed by administrative divisions (such as a US State) and countries. In all, 98.7% of all Place-geotagged tweets in the 1% stream were city-level or coarser. Just 0.43% of Place-geotagged tweets record the actual business address where a user was.

Place Type               Tweets  % of Place Tweets
City                123,489,522              81.53
Admin                16,381,181              10.82
Country               9,658,987               6.38
Neighborhood          1,284,527               0.85
Points Of Interest      650,379               0.43
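The percentages in the table, and the 98.7% and 1.3% figures quoted in the surrounding text, follow directly from the raw counts. The short calculation below reproduces them; the counts are taken from the table, while the variable names are only illustrative.

```python
# Tweet counts per Place type, copied from the table above.
place_counts = {
    "City": 123_489_522,
    "Admin": 16_381_181,
    "Country": 9_658_987,
    "Neighborhood": 1_284_527,
    "Points Of Interest": 650_379,
}

total = sum(place_counts.values())

# Share of all Place-geotagged tweets held by each type, in percent.
shares = {name: 100.0 * count / total for name, count in place_counts.items()}

# City-level or coarser versus finer than city-level.
coarse_share = shares["City"] + shares["Admin"] + shares["Country"]
fine_share = shares["Neighborhood"] + shares["Points Of Interest"]

for name, pct in shares.items():
    print(f"{name:<20}{pct:6.2f}%")
print(f"City-level or coarser: {coarse_share:.1f}%")
print(f"Finer than city-level: {fine_share:.1f}%")
```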

Knowing that someone is somewhere in New York City is useful, but when trying to triage a flurry of tweets about a major building fire that has just broken out in order to alert nearby citizens, it is of far more limited value.

From precision GPS coordinates accurate to which side of a street a person is standing on to just 0.43% of Place-geotagged tweets having even building-level resolution, Twitter has enforced an immense loss of geographic precision, severely limiting its application to topics like realtime disaster response, understanding the street-level flows of urban life and so on.

Given that neighborhoods and points of interest combined represent just 1.3% of all Place-geotagged tweets in the 1% stream, the verified geography of Twitter is for all intents and purposes now limited to studying cities at the level of the city itself, rather than at any finer resolution.

This is particularly significant in that from the standpoint of verified geographic precision, Twitter is now effectively identical to news media, limited to city-level resolution. In fact, news media may actually offer higher resolution for many events, since it can often include street addresses, business names and other sub-city locative identifiers.

What does the centroid-level Place geography of Twitter look like? The map below shows the combined geography of the centroids of every Place found in the Twitter 1% stream 2012-2018, colored by the primary language of tweets from that location, as recorded by the CLD2 algorithm. The color palette was generated using R’s palette generator for the top 100 most common languages and randomized to ensure the best possible visual separation between nearby languages.

The image below is also available in full resolution 4K, 16K and 32K versions.

Kalev Leetaru
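Coloring each location by its primary language amounts to a majority vote per map cell. The sketch below assumes each record is a (latitude, longitude, language code) triple, with the language code already assigned upstream by a detector such as CLD2; it does not call CLD2 itself. The two-decimal cell size and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Cell = Tuple[float, float]


def primary_language_by_cell(
    records: Iterable[Tuple[float, float, str]], decimals: int = 2
) -> Dict[Cell, str]:
    """Map each rounded (lat, lon) cell to its most frequent language code."""
    counts: Dict[Cell, Counter] = defaultdict(Counter)
    for lat, lon, lang in records:
        cell = (round(lat, decimals), round(lon, decimals))
        counts[cell][lang] += 1
    # most_common(1) yields the top (language, count) pair for the cell.
    return {cell: langs.most_common(1)[0][0] for cell, langs in counts.items()}
```

A winner-takes-all rule like this hides minority languages within a cell, so a multilingual location still appears as a single color on the map.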

Compare the image above with that of worldwide news media 2015-2018 in the 65 languages monitored by the GDELT Project (this map uses a different color palette due to its different set of monitored languages). Higher resolution versions of the map are available, including a 16K resolution image.

While the image below reflects the textual geography of news mentions rather than the GPS-derived creation geography of the one above, it offers a useful reference of the places where humans reside and care about (and thus places where we would want geotagged tweets from). The stark difference between the two maps reminds us how much of the human world is absent from Twitter’s geotagged tweets.

Kalev Leetaru

How much did the 80% reduction in GPS-tagged tweets affect the geography of Twitter? The two maps below compare the Twitter 1% stream in the year before and the year after the elimination of most GPS-tagged tweets. While the two images are largely alike, the second is far sparser, reflecting the planned geography of cities rather than the unexpected geography of daily human life.

Kalev Leetaru

Kalev Leetaru

The “before” map is available in 4K, 16K and 32K versions for download and the “after” map is also available for download in 4K, 16K and 32K.

Unfortunately, the geography of Twitter will increasingly trend towards the “after” map, becoming more and more sparse as the rich detail of the GPS era gives way to constrained city centroids of the Twitter Place era.

When asked how Twitter saw this resolution reduction affecting its use cases and whether it was working on ways of encouraging more users to share at least coarse geographic indicators, the company declined to comment.

Setting aside what Twitter is becoming, what did Twitter look like at its geographic prime? If all Place-geotagged tweets are combined with the 80.2 million tweets with GPS coordinates (which together record 69 million distinct locations rounded to two decimals) in the 1% stream from 2012-2018, what would the final map look like?

The image below presents the sum total of every geotagged tweet in the Twitter 1% stream from January 2012 through October 2018, colored by the most common language of tweets sent from that location per CLD2. For unknown reasons, the coordinates field of geotagged Finnish language tweets from 2015-2018 reports a randomized square bounding box over the country, so only Finnish language tweets from January 2012 to December 2014 are included. A similar artifact can be seen at the centroid of Turkey. A Twitter spokesperson clarified that Twitter itself does not perform any jittering or other obfuscation of GPS coordinates, recording them as-is from the device, so it is unclear what might be causing these issues.

Kalev Leetaru

The level of detail captured in this map is exquisite, recording not just the cities and transportation corridors of our shared planet, but the most minute detail about how we go about our lives throughout the world.

National borders are clearly visible across the world, as is the monolingual nature of most countries. Zoom into most cities around the world, however, and you’ll notice a vibrant multilingual core representing the diversity of both residents and tourists in the planet’s urban cores.

Full resolution versions of the map are available for download in 4K, 16K and 32K resolutions.

The 32K resolution image is also available as an online version that can be interactively zoomed/panned.

Since the vibrant palette of colors in the map above can make it hard to get a full sense of Twitter’s reach over the last seven years, the image below reproduces the map but colors all points black for greater contrast.

Kalev Leetaru

Full resolution versions of the map are available for download in 4K, 16K and 32K resolutions. There are also full resolution versions of the map with a black background and white point layer in 4K, 16K and 32K resolutions.

The 32K resolution images are also available as a white background zoomable version and a black background zoomable version.

Immediately noticeable in the by-language map above is how languages tend to have one or more primary locations where they dominate, but that they also seem to cluster in specific secondary areas elsewhere in the world. This verified linguistic geography raises the question of what it might look like to map just a single language at a time, to show the verified areas where a language has been tweeted most frequently over the last seven years.

The map below captures all geotagged Arabic language tweets in the 1% stream January 2012 to October 2018, reflecting both its dominance in the Middle East and its prevalence in central Europe (see also 4K and 16K versions of the map).

Kalev Leetaru

Despite Twitter being officially banned in China, there are still quite a few geotagged Chinese language tweets in the 1% stream. Outside China, the language is largely found regionally and in central Europe and the coastal US (see also 4K and 16K versions of the map).

Kalev Leetaru

Estonian language tweets in the 1% stream span the globe, reflecting that for its small population, Estonia has a global presence (see also 4K and 16K versions of the map).

Kalev Leetaru

Japanese tweets in the 1% stream paint yet another very distinct geographic profile. Notably, Japanese tweets in the US are found most heavily on the West Coast, the North East and clustered in major US cities (see also 4K and 16K versions of the map).

Kalev Leetaru

Portuguese tweets in the 1% stream unsurprisingly cluster primarily in the Western half of Europe and Brazil, but also have high representation across the US and Latin America, as well as Japan and Indonesia (see also 4K and 16K versions of the map).

Kalev Leetaru

Finally, Russian tweets in the 1% stream extend seamlessly from Russia through the former Soviet Republics, with another large cluster in central Europe, but have little representation in Latin America. Like Japanese tweets, they are largely limited to major cities in the US (see also 4K and 16K versions of the map).

Kalev Leetaru

Looking back on each of these maps, they reflect a static snapshot of seven years of Twitter compressed into a single image and thus cannot communicate the way in which that geography has changed over time.

How has Twitter changed linguistically and geographically over these seven years?

The animation below shows all geotagged tweets in the 1% stream by day from January 2012 to October 2018, colored by the primary language of tweets from that location on that day per CLD2. In the space of just 4 minutes and 20 seconds you are seeing seven years of Twitter’s linguistic and geographic evolution go by.

Look closely at the map and you will see that, at least through the eyes of its geotagged 1% stream, Twitter does not expand much geographically beyond the places it was used in January 2012. Through April 2015 it largely entrenches in the places it already existed, spreading across Saudi Arabia, for example.

The impact of the April 2015 loss of GPS coordinates can be seen clearly in the animation as it suddenly becomes far sparser and afterwards exhibits a gradual reduction in overall geotagged geographic information on Twitter, as the map becomes less and less detailed through October 2018.

Putting this all together, what can we learn from all of these images?

First off, it is important to remember that all of the statistics here have come from the Twitter 1% stream. This stream has been found to be highly correlated with Twitter as a whole, but it is possible that there are unknown limitations to that reflection. Twitter’s refusal to provide official statistics, however, means that it represents one of the only external sources we have for understanding the platform as a whole. Moreover, the popularity of the 1% stream with researchers means that regardless of how well it reflects Twitter itself, it defines the view of Twitter that powers a lot of data analysis.

Most obviously we see a platform that is constantly evolving and changing.

Over the last seven years Twitter has undergone enormous change that has fundamentally altered the nature of the signal it provides us about the world.

The problem is that Twitter’s own users and the world’s social media analysts that use the platform to draw insights about the world have no visibility into these changes and so are unable to adapt their usage and algorithms.

For its part, Twitter no longer provides the kind of detailed statistics that would help its users and the data community understand how its platform is changing and adjust their own usage to match. Indeed, the company declined to comment when asked about even the most basic statistics like how many tweets had been sent or how many tweets were retweets.

It is inconceivable that one of the world’s most influential social networks operates entirely as a black box into which we have effectively no visibility. For a service that is used by heads of state themselves to connect with their citizenry and which moves markets, understanding the kinds of trends captured in the images above is absolutely critical.

From a data science standpoint, it is worth noting that we tout our datasets as reflecting an ever-changing society even while the methods we use to analyze them assume fixed and unchanging characteristics. The images here suggest we need more than ever to take the time to step back and understand the data we use each day, rather than blindly rushing forward with the unnormalized insights they give us.

In the end, perhaps the most important takeaway is just how much the black boxes that fuel the “big data” revolution are changing right before our eyes and how fundamentally those changes are altering the nature of the signals those datasets provide us. It is time we acknowledged this rapid rate of change and adapted our methods to match.