In the Batman movie The Dark Knight, the protagonist turns the city’s smartphones into a massive audio surveillance network, quietly activating every citizen’s microphone and scanning for the Joker’s voice. Silently enabling and scanning the microphones of an entire city would historically have required streaming all of that audio to centralized datacenters, creating bandwidth and processing demands so enormous that it has remained the realm of Hollywood fiction. Yet a little-noticed talk at Facebook’s F8 conference last week, describing the company’s enormous investment in on-device AI content scanning, suggests we are rapidly approaching a world in which this is not only possible, but something Facebook could almost deploy today.
At last week’s F8 conference, one of the AI engineering talks focused on Facebook’s efforts to move its machine-based content moderation scanning to the edge, running its content matching and AI filtering algorithms directly on users’ smartphones, rather than in its own datacenters.
From Facebook’s standpoint, moving its content moderation to users’ own devices will allow it to continue enforcing its acceptable speech regulations even as user content is increasingly encrypted and takes the form of user-to-user private communications rather than public posts.
Yet, once Facebook has robust AI filtering algorithms running directly on users’ phones for images and text, it will face increasing pressure to analyze video as well, beyond its current reliance on content hashing.
Already, in the case of New Zealand, the company acknowledged deploying audio analysis on uploaded videos to attempt to catch reuploads of the attacker’s livestream video.
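For readers curious how that kind of audio matching works in principle, the following is a minimal, purely illustrative sketch, not Facebook’s actual system: each clip is reduced to a coarse “fingerprint” of its dominant frequency per frame, and two clips are compared by the fraction of frames whose dominant frequencies agree. A noisy re-encode of the same audio still matches; unrelated audio does not.

```python
import numpy as np

def fingerprint(signal, frame_size=1024):
    """Coarse perceptual fingerprint: the dominant FFT bin of each frame."""
    n_frames = len(signal) // frame_size
    frames = signal[:n_frames * frame_size].reshape(n_frames, frame_size)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.argmax(spectra, axis=1)

def similarity(fp_a, fp_b):
    """Fraction of frames whose dominant frequency bin matches."""
    n = min(len(fp_a), len(fp_b))
    return float(np.mean(fp_a[:n] == fp_b[:n]))

# Demo: a 440 Hz tone, a noisy "re-encode" of it, and unrelated audio.
rng = np.random.default_rng(0)
t = np.arange(44100) / 44100.0
clip = np.sin(2 * np.pi * 440 * t)                  # original broadcast audio
reupload = clip + 0.05 * rng.normal(size=t.size)    # same audio, re-encoded with noise
other = np.sin(2 * np.pi * 2000 * t)                # unrelated audio

print(similarity(fingerprint(clip), fingerprint(reupload)))  # near 1.0
print(similarity(fingerprint(clip), fingerprint(other)))     # near 0.0
```

Production systems use far more robust features, but the key property is the same: the fingerprint is tiny compared to the audio itself, which is exactly what makes pushing such filters down to billions of phones feasible.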
If Facebook connected those filters directly to users’ phone microphones, it could identify who has watched the video regardless of the medium through which they consumed it.
Imagine a future New Zealand situation in which Facebook now deploys on-device content filtering and is faced with a sudden deluge of users sharing a terrorist video privately in small groups and from person to person using WhatsApp. Since WhatsApp messages are encrypted in transit, there is no way for Facebook to scan those communications to flag sharing of the video using its current central datacenter scanning model.
Faced with another public relations scandal and intense government pressure, Facebook is likely to push an emergency patch to its on-phone WhatsApp content moderation model that recognizes the audio and visual characteristics of the video to delete any attempted sharing of it and alert authorities to anyone who has sent, received or viewed it.
As Facebook’s models grow more robust, it will likely add on-device speech recognition to its video scanning, generating realtime transcripts of videos and livestreams and flagging those that mention certain speech or contain the sounds of gunfire or violence.
In turn, as governments become accustomed to Facebook being able to scan videos and livestreams for violence or certain speech, it is likely they will force the company to add voice and facial identification to flag videos containing the voices or faces of known terrorist leaders, in order to halt the spread of their latest videos.
Once Facebook demonstrates the ability to scan users’ livestreams in realtime on their devices for topics and people of interest, it will not be long until governments leverage these capabilities in times of national emergency.
Imagine a world in which Facebook can respond to the release of a terrorist video within minutes, pushing to its two billion users’ phones a new moderation rule, built around a voiceprint of the terrorist’s voice, that blocks any attempt to share it.
Now imagine a country that receives an intelligence warning that a known terrorist has entered its borders and will be conducting a mass-casualty terror attack within the following week. Under its national counter-terrorism laws, that country could go to Facebook, give it past recordings of the terrorist’s voice and ask it to turn on the microphones of all phones currently within its borders, uploading a voice identification model to run against those microphones in realtime to listen for his voice. Any match would result in an immediate notification back to Facebook with the matching phone’s GPS location and the identity of its owner.
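The matching step such a system would require is, technically, almost trivial. Below is a hypothetical sketch, assuming a speaker-verification model has already reduced audio to fixed-length “voiceprint” embedding vectors (random vectors stand in for real model output here): the phone simply compares the embedding of the live microphone audio against the enrolled target and reports back only on a match.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def on_device_check(live_embedding, target_voiceprint, threshold=0.8):
    """Return True (i.e. 'phone home with a GPS fix') only on a voice match."""
    return cosine(live_embedding, target_voiceprint) >= threshold

# Demo with stand-in vectors: a real system would obtain these embeddings
# from a speaker-verification model run on microphone audio.
rng = np.random.default_rng(1)
target = rng.normal(size=256)                 # enrolled voiceprint of the target
same = target + 0.1 * rng.normal(size=256)    # same speaker, a new utterance
other = rng.normal(size=256)                  # an uninvolved bystander

print(on_device_check(same, target))   # True
print(on_device_check(other, target))  # False
```

Note what never leaves the phone in this design: not the audio, only a single yes/no signal. That is precisely why such scanning would be so cheap to run at population scale, and so hard for anyone to detect.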
Given that Facebook clarified this past February that it reserves the right to access all device streams, including “GPS location, camera or photos” for any purpose it deems necessary, it would not be a stretch to see it connecting those inputs to its AI moderation algorithms to scan them in the background, much as it already scans users’ GPS locations in the background to identify when selected users come near its facilities.
In many countries the government could quite readily obtain a lawful court order forcing Facebook to activate the microphone for all its users within that country’s borders and upload a content filtering model to scan the background audio for a particular voice, noise or words.
We are not yet quite at this point, but the underlying technologies are all in place and as Facebook’s presentation last week demonstrates, the company is aggressively moving towards deploying content moderation to the edge.
Once Facebook has normalized on-device filtering, it will only be a matter of time before governments repurpose those capabilities for more nefarious purposes.
Of course, users could simply choose not to install Facebook’s application on their phone, but once the capability has been proven out, governments are likely to mandate that phone manufacturers themselves install the capability at a firmware level.
Physical privacy would be a thing of the past. A journalist meeting a confidential source at a restaurant might leave their phone at home, but if all of the other diners’ phones are quietly scanning the dining room for the voices of known individuals, the government could readily identify both the journalist and the person they were meeting.
Indeed, Facebook itself could easily deploy this mass surveillance system for its own purposes without government intervention.
The company did not respond to multiple requests for comment regarding its edge content scanning efforts.
Putting this all together, as social media companies increasingly push content moderation to the edge, running their filtering models directly on users’ devices, the intrinsic scalability of this distributed moderation model opens the door to frightening new Orwellian surveillance capabilities.
In the end, 1984’s Telescreens might be closer than we could ever imagine.