
Solving 'The Britney Spears Problem': Why Finding Trends In Data Is Such A Challenge



‘I think, therefore I am.’ ‘I google, therefore I think.’ It is not clear that Descartes would agree with the second. Search engines are not as omniscient as many people assume: anything powered by AI is limited by the statistical methods that humans have devised.

Estimating the online popularity of Britney Spears, for example, or the frequency of searches for the singer compared with other terms, might seem a simple task for AI. But this scenario has presented a significant challenge for researchers since the early days of the internet, for a number of reasons. MIT CSAIL has announced a new hybrid method for getting around the issue, improving the accuracy and speed of frequency estimation and giving Google, Twitter and many other applications the ability to monitor trends even more accurately.

Data explosion

Back in the first decades of the internet, finding trends in data was a lot easier: there were fewer users, websites were simpler and the metrics for measuring user activity were nowhere near as complex as they are today. Now the task represents a significant challenge, colloquially known as ‘the Britney Spears problem’, in which algorithms struggle to precisely measure trends in free-flowing, voluminous data, such as internet traffic to different IP addresses.

Imagine counting the items in a random stream of numbers one by one to determine the length of the sequence. This counting function only needs to look at one item at a time and increment a running total, so it only needs enough memory to hold that single total. Now imagine that you instead want to know how often each number has appeared and which one has appeared the most. That function has to register each distinct number separately and store a count for every one of them, which becomes a processing nightmare for very large streams: there is physically not enough space to hold all of those counters, nor enough processing power to keep the entire stream in active memory for analysis.
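To make the contrast concrete, here is a minimal, illustrative Python sketch (not taken from the MIT work): counting how many items have passed requires only a single running total, while finding the most frequent item exactly forces you to keep a counter for every distinct element you have ever seen.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple

def stream_length(stream: Iterable[Hashable]) -> int:
    """Counting how many items have passed needs only one running total."""
    total = 0
    for _ in stream:
        total += 1  # constant memory, no matter how long the stream is
    return total

def most_frequent_exact(stream: Iterable[Hashable]) -> Tuple[Hashable, int]:
    """Finding the most frequent item exactly needs a counter per distinct item."""
    counts = Counter()  # memory grows with the number of distinct items seen
    for item in stream:
        counts[item] += 1
    return counts.most_common(1)[0]

# Toy usage: for billions of search queries, the second function's
# dictionary of counters quickly becomes too large to hold in memory.
queries = ["britney spears", "weather", "britney spears", "news", "britney spears"]
print(stream_length(queries))        # 5
print(most_frequent_exact(queries))  # ('britney spears', 3)
```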

There have been a number of ways to address this challenge, most of which involve splitting up the stream, analyzing samples independently and estimating the result from those samples. But that does not provide the accuracy that users have come to expect from AI-powered tools like search engines and trending charts, so data giants have turned to the most reliable method available, ‘hashing’, which scrambles incoming elements into a small, fixed set of counters so that the frequency of any requested element can be estimated on demand. This still consumes a lot of processing power and makes it difficult to spot wider patterns in large streams of data (like the buying habits of a group, or a potential DDoS attack).
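The article does not name a specific hashing scheme, but the Count-Min sketch is a classic example of this kind of hash-based frequency estimator, and the toy version below illustrates the trade-off: a small, fixed grid of counters replaces one counter per item, at the cost of occasional overestimates when different items collide.

```python
import hashlib

class CountMinSketch:
    """Toy hash-based frequency estimator (a Count-Min sketch).

    Instead of one counter per item, it keeps a small fixed grid of counters,
    trading exact counts for bounded memory and a small overestimation error.
    """

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item: str):
        # One hash function per row; each maps the item to a column.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate counters, so the minimum across rows
        # is the tightest estimate available.
        return min(self.table[row][col] for row, col in self._buckets(item))

sketch = CountMinSketch()
for term in ["britney spears", "weather", "britney spears"]:
    sketch.add(term)
print(sketch.estimate("britney spears"))  # 2, or slightly higher if items collide
```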

Oops! I split it again

To tackle this problem head-on and gain more accurate, granular insight into large data streams, researchers from MIT CSAIL have employed machine learning (ML) in combination with hashing to predict whether a specific element will appear frequently in a data stream: these ‘frequent’ elements are placed in a bucket for ‘heavy hitters’, while the rest of the stream is handled via hashing. This creates a two-tier filtering system for continuous data streams that ‘prioritize[s] the biggest problems before getting to the smaller ones’, says MIT professor Piotr Indyk, a co-author of the paper on this technique, which will be presented at the International Conference on Learning Representations (ICLR) in New Orleans. The approach allows users to get a ‘sketch’ of what patterns in the data might look like (the system is aptly named LearnedSketch) and ‘do frequency-estimation much more efficiently and with much less error.’ In tests, the system was over 57% more accurate in estimating internet traffic and more than 71% more accurate for trending social media topics.
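The two-tier idea can be sketched in a few lines of Python. The oracle below is just a hard-coded stand-in for the trained model, and the class names are hypothetical: this is an illustration of the structure described above, not the authors' implementation. Predicted heavy hitters get exact, dedicated counters, and everything else falls through to a compact hash-based sketch such as the CountMinSketch above.

```python
from collections import defaultdict

class LearnedFrequencyEstimator:
    """Illustrative two-tier estimator in the spirit of LearnedSketch."""

    def __init__(self, sketch, is_heavy_hitter):
        self.sketch = sketch                    # e.g. the CountMinSketch above
        self.is_heavy_hitter = is_heavy_hitter  # stand-in for the learned oracle
        self.heavy_counts = defaultdict(int)    # exact counters: the 'heavy hitters' bucket

    def add(self, item: str) -> None:
        if self.is_heavy_hitter(item):
            self.heavy_counts[item] += 1  # predicted-frequent elements are counted exactly
        else:
            self.sketch.add(item)         # the rest of the stream is handled via hashing

    def estimate(self, item: str) -> int:
        if item in self.heavy_counts:
            return self.heavy_counts[item]
        return self.sketch.estimate(item)

# Hypothetical usage with a toy oracle that flags one known-popular term.
estimator = LearnedFrequencyEstimator(
    CountMinSketch(),
    is_heavy_hitter=lambda term: term == "britney spears",
)
for term in ["britney spears", "weather", "britney spears", "news"]:
    estimator.add(term)
print(estimator.estimate("britney spears"))  # 2, counted exactly
print(estimator.estimate("weather"))         # ~1, estimated from the hash-based sketch
```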

Aside from frequency estimation on social media or for ISPs, MIT CSAIL’s machine learning-based approach can also be used for a broader class of so-called “streaming” algorithms, which are used in everything from security systems to natural language processing. These algorithms run on an endless ‘stream’ of data, which makes it difficult to find meaningful patterns because there is no complete dataset to compare against. LearnedSketch addresses this by using AI to predict trends from what it has seen of the stream, while running the rest of the stream through a hash function: the learned layer stays representative of the whole set because the remainder of the data is still analyzed, and trends are caught more quickly and more reliably. Because the system relies on a machine learning component, it also improves its predictions over time and can generalize to data it hasn’t seen yet, so data streams of all kinds could theoretically be unpicked without significant error or lag.
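A quick synthetic comparison shows why steering the heaviest items away from the hash table helps. The numbers below are toy data, not the paper’s benchmarks, and the ‘oracle’ is again just a hard-coded set, reusing the illustrative classes sketched above: when memory is tight, the plain sketch’s counters collide with the long tail and overestimate the trending terms, while the two-tier version reports them exactly.

```python
import random

random.seed(0)

# A heavily skewed stream: two 'trending' terms dominate, plus a long tail of one-off queries.
stream = ["britney spears"] * 5000 + ["election"] * 3000
stream += [f"query-{i}" for i in range(20000)]
random.shuffle(stream)

plain = CountMinSketch(width=256, depth=4)  # deliberately small, memory-constrained
learned = LearnedFrequencyEstimator(
    CountMinSketch(width=256, depth=4),
    is_heavy_hitter=lambda t: t in {"britney spears", "election"},  # stand-in for the trained model
)
for term in stream:
    plain.add(term)
    learned.add(term)

for term, truth in [("britney spears", 5000), ("election", 3000)]:
    print(f"{term}: true={truth}, plain sketch={plain.estimate(term)}, learned={learned.estimate(term)}")
```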

Better together

Frequency estimation has become more important in recent years as the amount of data available to companies has exploded and analysis has worked its way into most applications. Chen-Yu Hsu, a co-author of the new paper, states that ‘these kinds of results show that machine learning is very much an approach that could be used alongside the classic algorithmic paradigms like “divide and conquer” and dynamic programming.’

As such, this system could substantially improve frequency estimation if trained and deployed properly by search engines, network operators, e-commerce hosts, cyber-security firms, or any other application with a constant stream of incoming data, and it represents another win for AI against historically mind-bending statistical problems.
