More Data, More Power,
Better Performance
The Defense Department, through its Defense Advanced
Research Projects Agency (DARPA),
started funding academic and commercial research into speech recognition in
the early 1970s.
What
emerged were several systems to turn speech into text, all of which
slowly but steadily improved as they were able to work with more data
at faster speeds.
In a brief interview, Dan Kaufman, director of DARPA’s
Information Innovation Office, indicated that the government’s ability to
automate transcription is still limited.
Kaufman says that automated transcription of phone
conversations is “super hard,” because “there’s a lot of noise on the signal”
and “it’s informal as hell.”
“I would tell you we are not very good at that,” he said.
In an ideal environment like a news broadcast, he said,
“we’re getting pretty good at being able to do these types of translations.”
A
2008 document from the Snowden archive shows that transcribing news
broadcasts was already working well seven years ago, using a program called
Enhanced Video Text and Audio Processing:
(U//FOUO) EViTAP is a fully-automated news monitoring
tool. The key feature of this Intelink-SBU-hosted tool is that it
analyzes news in six languages, including Arabic, Mandarin Chinese,
Russian, Spanish, English, and Farsi/Persian. “How does it work?” you
may ask. It integrates Automatic Speech Recognition (ASR) which provides
transcripts of the spoken audio. Next, machine translation of the ASR
transcript translates the native language transcript to English. Voila!
Technology is amazing.
A version of the system the NSA uses is now even
available
commercially.
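The pipeline the memo describes, automatic speech recognition first and machine translation second, can be approximated today with open-source tools. The following is a rough sketch only: EViTAP itself is not public, and the openai-whisper and Helsinki-NLP translation models used here are stand-ins chosen for illustration, not tools the documents identify.

import whisper  # openai-whisper: open-source speech recognition
from transformers import pipeline  # Hugging Face machine translation

# Step 1: automatic speech recognition on an Arabic-language broadcast clip.
# "broadcast_clip.wav" is a hypothetical file name.
asr_model = whisper.load_model("base")
arabic_transcript = asr_model.transcribe("broadcast_clip.wav", language="ar")["text"]

# Step 2: machine-translate the native-language transcript into English.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-ar-en")
english_text = translator(arabic_transcript, max_length=512)[0]["translation_text"]

print(english_text)

The two-step structure mirrors the memo’s own description: the tool “integrates Automatic Speech Recognition (ASR)” to produce a transcript, then machine translation renders that transcript into English.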
Experts in speech recognition say that in the last decade
or so, the pace of technological improvement has been explosive. As
information storage became cheaper and more efficient, technology companies
were able to store massive amounts of voice data on their servers, allowing
them to continually update and improve the models. Enormous processors,
tuned as “deep neural networks” that
detect patterns like human brains do, produce much cleaner transcripts.
And the Snowden documents show that the same kinds of
leaps forward seen in commercial speech-to-text products have also been
happening in secret at the NSA, fueled by the agency’s singular access to
astronomical processing power and its own vast data archives.
In fact, the NSA has been repeatedly releasing new and
improved speech recognition systems for more than a decade.
The first-generation tool, which made keyword-searching of
vast amounts of voice content possible, was rolled out in 2004 and
code-named RHINEHART.
“Voice word search technology allows analysts to find and
prioritize intercept based on its intelligence content,” says an internal
2006 NSA memo entitled “For
Media Mining, the Future Is Now!”
The memo says that intelligence analysts involved in
counterterrorism were able to identify terms related to bomb-making
materials, like “detonator” and “hydrogen peroxide,” as well as place names
like “Baghdad” or people like “Musharaf.”
RHINEHART was “designed to support both real-time
searches, in which incoming data is automatically searched by a
designated set of dictionaries, and retrospective searches,
in which analysts can repeatedly search over months of past traffic,” the
memo explains (emphasis in original).
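The memo’s distinction between real-time dictionary scanning and retrospective searching maps onto a simple two-part design: flag incoming transcripts against a term list as they arrive, and keep an index so analysts can query old traffic later. The sketch below is purely illustrative; the term list, cut identifiers, and helper names are invented, and nothing here reflects how RHINEHART was actually built.

from collections import defaultdict

# Hypothetical analyst-supplied "dictionary" of search terms; the real
# RHINEHART term lists are not public.
DICTIONARY = {"detonator", "hydrogen peroxide", "baghdad"}

# Stand-in for a retrospective index: term -> list of (cut_id, transcript).
archive = defaultdict(list)

def scan_cut(cut_id, transcript):
    """Real-time pass: flag an incoming transcript that hits any dictionary term."""
    text = transcript.lower()
    hits = {term for term in DICTIONARY if term in text}
    for term in hits:
        archive[term].append((cut_id, transcript))  # also feed the retrospective index
    return hits

def retrospective_search(term):
    """Retrospective pass: query previously indexed traffic for a term."""
    return archive.get(term.lower(), [])

print(scan_cut("cut-0001", "We will need more hydrogen peroxide before Baghdad."))
print(retrospective_search("hydrogen peroxide"))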
As of 2006, RHINEHART was operating “across a wide variety
of missions and languages” and was “used throughout the NSA/CSS [Central
Security Service] Enterprise.”
But even then, a newer, more sophisticated product was
already being rolled out by the NSA’s Human Language Technology (HLT)
program office. The new system, called VoiceRT, was first introduced in
Baghdad, and “designed to index and tag 1 million cuts per day.”
The goal, according to
another 2006 memo, was to use voice processing technology to be able to
“index, tag and graph” all intercepted communications. “Using HLT services,
a single analyst will be able to sort through millions of cuts per day and
focus on only the small percentage that is relevant,” the memo states.
A
2009 memo from the NSA’s British partner, GCHQ, describes how “NSA have
had the BBN
speech-to-text system Byblos running at Fort Meade for at least 10 years.
(Initially they also had Dragon.) During this period they have invested
heavily in producing their own corpora of transcribed Sigint in both
American English and an increasing range of other languages.” (GCHQ also
noted that it had its own small corpora of transcribed voice communications,
most of which happened to be “Northern Irish accented speech.”)
VoiceRT, in turn, was surpassed a few years after its
launch. According to the intelligence community’s “Black
Budget” for fiscal year 2013, VoiceRT was decommissioned and replaced in
2011 and 2012, so that by 2013, NSA could operationalize a new system. This
system, apparently called
SPIRITFIRE, could handle more data, faster. SPIRITFIRE would be “a more
robust voice processing capability based on speech-to-text keyword search
and paired dialogue transcription.”
Extensive Use Abroad
Voice communications can be collected by the NSA whether
they are being sent by regular phone lines, over cellular networks, or
through voice-over-internet services. Previously released
documents from the Snowden archive describe enormous efforts by the NSA
during the last decade to get access to voice-over-internet content like
Skype calls, for instance. And other documents in the archive chronicle the
agency’s adjustment to the fact that an increasingly large percentage of
conversations, even those that
start as landline or mobile calls,
end up as digitized packets flying through the same fiber-optic cables
that the NSA
taps so effectively for other data and voice communications.
The Snowden archive, as searched and analyzed by The
Intercept, documents extensive use of speech-to-text by the NSA to
search through international voice intercepts — particularly in Iraq and
Afghanistan, as well as Mexico and Latin America.
For example, speech-to-text was a key but previously
unheralded element of the sophisticated analytical program known as the Real
Time Regional Gateway (RTRG), which started in 2005 when newly appointed NSA
chief Keith B. Alexander, according to
the Washington Post, “wanted everything: Every Iraqi text
message, phone call and e-mail that could be vacuumed up by the agency’s
powerful computers.”
The Real Time Regional Gateway was credited with playing a
role in “breaking up Iraqi insurgent networks and significantly reducing the
monthly death toll from improvised explosive devices.” The indexing and
searching of “voice cuts” was deployed to Iraq in 2006. By 2008, RTRG was
operational in Afghanistan as well.
A slide from a
June 2006 NSA PowerPoint presentation described the role of VoiceRT.
Keyword spotting extended to Iranian intercepts as well. A
2006 memo reported that RHINEHART had been used successfully by
Persian-speaking analysts who “searched for the words ‘negotiations’ or
‘America’ in their traffic, and RHINEHART located a very important call that
was transcribed verbatim providing information on an important Iranian
target’s discussion of the formation of the new Iraqi government.”
According to a 2011 memo, “How
is Human Language Technology (HLT) Progressing?”, the NSA that year deployed
“HLT Labs” to Afghanistan, NSA facilities in Texas and Georgia, and
listening posts in Latin America run by the
Special Collection Service, a joint NSA/CIA unit that operates out of
embassies and other locations.
“Spanish is the most mature of our speech-to-text
analytics,” the memo says, noting that the NSA and its Special Collection
Service sites in Latin America have had “great success searching for
Spanish keywords.”
The memo offers an example from NSA Texas, where an
analyst newly trained on the system used a keyword search to find previously
unreported information on a target involved in drug-trafficking. In another
case, an official at a Special Collection Service site in Latin America “was
able to find foreign intelligence regarding a Cuban official in a fraction
of the usual time.”
In a 2011 article, “Finding
Nuggets — Quickly — in a Heap of Voice Collection, From Mexico to
Afghanistan,” an intelligence analysis technical director from NSA Texas
described the “rare life-changing instance” when he learned about human
language technology, and its ability to “find the exact traffic of interest
within a mass of collection.”
Analysts in Texas found the new technology a boon for
spying. “From finding tunnels in Tijuana, identifying bomb threats in the
streets of Mexico City, or shedding light on the shooting of US Customs
officials in Potosi, Mexico, the technology did what it advertised:
It accelerated the process of finding relevant intelligence when time was of
the essence,” he wrote. (Emphasis in original.)
The author of the memo was also part of a team that
introduced the technology to military leaders in Afghanistan. “From Kandahar
to Kabul, we have traveled the country explaining NSA leaders’ vision and
introducing SIGINT teams to what HLT analytics can do today and to what is
still needed to make this technology a game-changing success,” the memo
reads.
Extent of Domestic Use
Remains Unknown
What’s less clear from the archive is how extensively this
capability is used to transcribe or otherwise index and search voice
conversations that primarily involve what the NSA terms “U.S. persons.”
The NSA did not answer a series of detailed questions
about automated speech recognition, even though an NSA “classification
guide” that is part of the Snowden archive explicitly states that “The
fact that NSA/CSS has created HLT models” for speech-to-text processing as
well as gender, language and voice recognition, is “UNCLASSIFIED.”
Also unclassified: The fact that the processing can sort
and prioritize audio files for human linguists, and that the statistical
models are regularly being improved and updated based on actual intercepts.
By contrast, because they’ve been tuned using actual intercepts, the
specific parameters of the systems are highly classified.
“The National Security Agency employs a variety of
technologies in the course of its authorized foreign-intelligence mission,”
spokesperson Vanee’ Vines wrote in an email to The Intercept.
“These capabilities, operated by NSA’s dedicated professionals and overseen
by multiple internal and external authorities, help to deter threats from
international terrorists, human traffickers, cyber criminals, and others who
seek to harm our citizens and allies.”
Vines did not respond to the specific questions about
privacy protections in place related to the processing of domestic or
domestic-to-international voice communications. But she wrote that “NSA
always applies rigorous protections designed to safeguard the privacy not
only of U.S. persons, but also of foreigners abroad, as directed by the
President in January 2014.”
The presidentially appointed but independent
Privacy and Civil Liberties Oversight Board
(PCLOB) didn’t mention speech-to-text technology in its
public reports.
“I’m not going to get into whether any program does or
does not have that capability,” PCLOB chairman David Medine told
The Intercept.
His board’s reports, he said, contained only information
that the intelligence community agreed could be declassified.
“We went to the intelligence community and asked them to
declassify a significant amount of material,” he said. The “vast majority”
of that material was declassified, he said. But not all — including “facts
that we thought could be declassified without compromising national
security.”
Hypothetically, Medine said, the ability to turn voice
into text would raise significant privacy concerns. And it would also raise
questions about how the intelligence agencies “minimize” the retention and
dissemination of material — particularly involving U.S. persons — that
doesn’t include information they’re explicitly allowed to keep.
“Obviously it increases the ability of the government to
process information from more calls,” Medine said. “It would also allow the
government to listen in on more calls, which would raise more of the kind of
privacy issues that the board has raised in the past.”
“I’m not saying the government does or doesn’t do it,” he
said, “just that these would be the consequences.”
A New Learning Curve
Speech recognition expert Bhiksha Raj likens the current
era to the early days of the Internet, when people didn’t fully realize how
the things they typed would last forever.
“When I started using the Internet in the 90s, I was just
posting stuff,” said Raj, an associate professor at Carnegie Mellon
University’s Language Technologies
Institute. “It never struck me that 20 years later I could go Google
myself and pull all this up. Imagine if I posted something on
alt.binaries.pictures.erotica or something like that, and now that post is
going to embarrass me forever.”
The same is increasingly becoming the case with voice
communication, he said. And the stakes are even higher, given that the
majority of the world’s communication has historically been conducted by
voice, and it has traditionally been considered a private mode of
communication.
“People still aren’t realizing quite the magnitude that
the problem could get to,” Raj said. “And it’s not just surveillance,” he
said. “People are using voice services all the time. And where does the
voice go? It’s sitting somewhere. It’s going somewhere. You’re living on
trust.” He added: “Right now I don’t think you can trust anybody.”
The Need for New Rules
Kim Taipale, executive director of the
Stilwell Center for Advanced
Studies in Science and Technology Policy, is one of several people who
tried a decade ago to get policymakers to recognize that existing
surveillance law doesn’t adequately deal with new global communication
networks and advanced technologies including speech recognition.
“Things aren’t ephemeral anymore,” Taipale told The
Intercept. “We’re living in a world where many things that were
fleeting in the analog world are now on the permanent record. The question
then becomes: what are the consequences of that and what are the rules going
to be to deal with those consequences?”
Realistically, Taipale said, “the ability of the
government to search voice communication in bulk is one of the things we may
have to live with under some circumstances going forward.” But there at
least need to be “clear public rules and effective oversight to make sure
that the information is only used for appropriate law-enforcement or
national security purposes consistent with Constitutional principles.”
Ultimately, Taipale said, a system where computers
flag suspicious voice communications could be less invasive than one where
people do the listening, given the potential for human abuse and misuse to
lead to privacy violations. “Automated analysis has different privacy
implications,” he said.
But to Jay Stanley, a senior policy analyst with the
ACLU’s
Speech, Privacy and Technology Project, the distinction between a human
listening and a computer listening is irrelevant in terms of privacy,
possible consequences, and a chilling effect on speech.
“What people care about in the end, and what creates
chilling effects in the end, are consequences,” he said. “I think that over
time, people would learn to fear computerized eavesdropping just as much as
they fear eavesdropping by humans, because of the consequences that it could
bring.”
Indeed, computer listening could raise
new concerns. One of the
internal NSA memos from 2006 says an “important enhancement under
development is the ability for this HLT capability to predict what
intercepted data might be of interest to analysts based on the analysts’
past behavior.”
Citing Amazon’s ability to not just track but predict
buyer preferences, the memo says that an NSA system designed to flag
interesting intercepts “offers the promise of presenting analysts with
highly enriched sorting of their traffic.”
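Recommendation of this kind is a standard machine-learning problem: score new items by how closely they resemble items a user has engaged with before. Purely as an illustration of the concept, and not of whatever models the memo refers to, a toy version using TF-IDF text similarity might look like this (all transcripts below are invented):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented examples of transcripts an analyst previously flagged as relevant.
flagged_by_analyst = [
    "discussion of detonator shipments near the border",
    "payment instructions for the tunnel under the crossing",
]

# Invented examples of newly intercepted cuts awaiting review.
new_cuts = [
    "weather report and football scores",
    "meeting tomorrow to move the detonators across the border",
]

# Represent each transcript as a TF-IDF vector and score each new cut by its
# closest match to anything the analyst flagged before.
vectorizer = TfidfVectorizer()
flagged_vectors = vectorizer.fit_transform(flagged_by_analyst)
new_vectors = vectorizer.transform(new_cuts)
scores = cosine_similarity(new_vectors, flagged_vectors).max(axis=1)

# Present the queue highest-scoring first, the "highly enriched sorting"
# the memo describes.
for score, cut in sorted(zip(scores, new_cuts), reverse=True):
    print(f"{score:.2f}  {cut}")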
To Phillip Rogaway, a professor of computer science at the
University of California, Davis, keyword-search is probably the “least of
our problems.” In an email to The Intercept, Rogaway warned that
“When the NSA identifies someone as ‘interesting’ based on contemporary NLP
[Natural Language Processing] methods, it might be that there is no
human-understandable explanation as to why beyond: ‘his corpus of discourse
resembles those of others whom we thought interesting'; or the conceptual
opposite: ‘his discourse looks or sounds different from most people’s.'”
If the algorithms NSA computers use to identify threats
are too complex for humans to understand, Rogaway wrote, “it will be
impossible to understand the contours of the surveillance apparatus by which
one is judged. All that people will be able to do is to try your best to
behave just like everyone else.”
Next: The NSA’s
best kept open secret.
Readers with information or insight into these programs
are encouraged to get in touch, either
by email,
or anonymously via SecureDrop.
Documents published with this article:
- RT10 Overview (June 2006)
- For Media Mining, the Future is Now! (August 1, 2006)
- For Media Mining, the Future is Now! (conclusion) (August 7, 2006)
- Dealing With a ‘Tsunami’ of Intercept (August 29, 2006)
- Coming Soon! A Tool that Enables Non-Linguists to Analyze Foreign-TV News Programs (October 23, 2008)
- SIRDCC Speech Technology WG assessment of current STT technology (December 7, 2009)
- Classification Guide for Human Language Technology (HLT) Models (May 18, 2011)
- Finding Nuggets – Quickly – in a Heap of Voice Collection, From Mexico to Afghanistan (May 25, 2011)
- How Is Human Language Technology (HLT) Progressing? (September 26, 2011)
- “Black Budget” — FY 2013 Congressional Budget Justification/National Intelligence Program, p. 262 (February 2012)
- “Black Budget” — FY 2013 Congressional Budget Justification/National Intelligence Program, pp. 360-364 (February 2012)
Research on the Snowden archive was conducted by
Intercept researcher Andrew Fishman.
Illustrations by Richard Mia for The Intercept.