Earlier this summer, at the re:MARS conference — an Amazon-hosted event focused on machine learning, automation, robotics, and aerospace — Rohit Prasad, chief scientist and vice president of Alexa AI, tried to wow the audience with a séance-like parlor trick: talking to the dead. “While AI can’t take away that pain of loss, it can certainly make their memories last,” he said, before showing a short video that begins with a cute boy asking, “Alexa, can Grandma finish reading me The Wizard of Oz?”
The female voice that reads a few sentences from the book sounds grandmotherly enough, but without knowing Grandma, it was impossible to judge the resemblance. And the whole thing struck many observers as more than a little creepy — Ars Technica called the demo “morbid.” But what truly drew gasps was Prasad’s reveal of how the “trick” was performed: Amazon scientists were able to conjure Grandma’s voice from just a one-minute audio clip. And they can easily do the same with just about any voice, a prospect you might find exciting, terrifying, or a bit of both.
Fears of “deepfake” voices that could fool people or voice-recognition systems are not unfounded — in a 2020 case, thieves used an artificially generated voice to persuade a Hong Kong bank manager to release $400,000 in funds before the ruse was discovered. At the same time, as voice interactions with technology become more common, brands want to be represented by unique voices. And consumers seem to want technology that sounds more human (although a Google voice assistant that imitated the “ums,” “mm-hmms,” and other tics of human speech was criticized for being too realistic).
That has sparked a wave of innovation and investment in AI-powered text-to-speech (TTS) technology. A search on Google Scholar turns up more than 20,000 research articles on text-to-speech synthesis published since 2021. Globally, the text-to-speech market is expected to reach $7 billion by 2028, up from about $2.3 billion in 2020, according to Emergen Research.
Today, TTS is most commonly used in digital assistants and chatbots. But emerging voice-identity applications in gaming, media, and personal communications are easy to imagine: custom voices for your virtual personas, text messages read aloud in your voice, voiceovers by absent (or deceased) actors. And the rise of the metaverse is changing the way we interact with technology.
“There will be many more of these virtualized experiences, where the interaction is less and less a keyboard, and more about speech,” said Frank Chang, one of the founders of the AI-focused venture fund Flying Fish in Seattle. “Everyone thinks speech recognition is the hottest thing, but in the end, when you talk to something, don’t you just want it to talk back to you? To the extent that it can be personalized — with your voice or the voice of someone you want to hear — so much the better.” Providing accessibility for people with visual impairments, limited motor function, and other cognitive challenges is another factor driving the development of speech technology, especially for e-learning.
Whether you like the idea of “Grandma Alexa” or not, the demo shows how quickly AI has advanced text-to-speech, suggesting that convincingly human fake voices may be much closer than we think.
The original Alexa, released with the Echo device in November 2014, is said to be based on the voice of Nina Rolle, a voice-over artist based in Boulder (something neither Amazon nor Rolle has ever confirmed), and relied on technology developed by the Polish text-to-speech company Ivona, which Amazon acquired in 2013. But Alexa’s early conversational style left a lot to be desired. In 2017, VentureBeat wrote, “Alexa is pretty smart, but no matter what the AI-powered assistant talks about, there’s no escaping its relatively flat and monotonous voice.”
Early versions of Alexa used a form of “concatenative” text-to-speech, which works by compiling a large library of speech clips recorded by a single speaker; the clips can then be recombined to produce complete words and sounds. Imagine a ransom note, where letters are cut out and glued back together to form new sentences. This approach generates intelligible audio with an authentic-sounding timbre, but it requires many hours of recorded speech data and a lot of fine-tuning — and the reliance on a recorded library of sounds makes it difficult to modify voices. Another technique, known as parametric TTS, doesn’t use recorded speech at all; instead, it starts with statistical models of individual speech sounds, which can be assembled into a sequence of words and sentences and processed through a speech synthesizer called a vocoder. (Google’s “standard” text-to-speech voices use a variation of this technology.) It offers more control over speech output, but has a muffled, robotic sound — not one you’d want reading you a bedtime story.
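The ransom-note analogy can be caricatured in a few lines of code. Everything below — the tiny unit library, the two-letter “units,” the fake audio samples — is invented purely for illustration; a real concatenative system selects from hours of recordings and smooths the joins between clips.

```python
# Toy sketch of concatenative synthesis: a library of pre-recorded
# "units" (here, short lists of numbers standing in for audio clips)
# is looked up and glued together to cover the target word.

# Hypothetical unit library: two-letter fragments -> fake audio samples.
UNIT_LIBRARY = {
    "he": [0.1, 0.3], "el": [0.2, 0.4], "ll": [0.3, 0.1], "lo": [0.4, 0.2],
}

def synthesize(word, library):
    """Cover `word` with overlapping 2-letter units and concatenate clips."""
    clips = []
    for i in range(len(word) - 1):
        unit = word[i:i + 2]
        if unit not in library:
            # The core weakness: anything not in the recorded library fails.
            raise KeyError(f"missing recorded unit: {unit!r}")
        clips.extend(library[unit])  # naive concatenation, no smoothing
    return clips

# "hello" is covered by four units (he, el, ll, lo) -> 8 fake samples.
audio = synthesize("hello", UNIT_LIBRARY)
```

The `KeyError` branch illustrates why such systems are hard to modify: every new word or voice needs new recordings in the library.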
In the effort to create new, more expressive and natural-sounding voices, Amazon, Google, Microsoft, Baidu, and other major players in text-to-speech have all adopted some form of “neural TTS” in recent years. NTTS systems use deep learning neural networks trained on human speech to model audio waveforms from scratch, dynamically converting each text input into smooth-sounding speech. Neural systems can learn not only pronunciation, but also patterns of rhythm, stress, and intonation that linguists call “prosody.” And they can pick up new speaking styles or switch speaker “identities” relatively easily.
Google Cloud’s Text-to-Speech API currently provides developers with more than 100 neural voices in languages ranging from Arabic to Vietnamese (plus regional dialects), along with “standard voices” that use the older parametric TTS. Microsoft’s Azure gives developers access to more than 330 neural voices in more than 110 languages and dialects, with a range of speaking styles including newscast, customer service, shouting, whispering, angry, excited, sad, and terrified. Azure neural voices have also been adopted by companies including AT&T, Duolingo, and Progressive. (In March, Microsoft completed its acquisition of Nuance, a leader in conversational AI and a partner in building Apple’s Siri, whose Vocalizer service provides 120-plus neural chatbot voices in more than 50 languages.) Amazon’s Polly text-to-speech API supports about three dozen neural voices in 20 languages and dialects, in conversational and “newsreader” speaking styles.
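In practice, developers request these voices and speaking styles declaratively. Azure, for instance, exposes its speaking styles through an `mstts:express-as` extension to SSML, the W3C markup language for speech synthesis. The fragment below is a minimal sketch: the voice name and “newscast” style come from Azure’s documented catalog, while the spoken text is invented.

```xml
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts"
       xml:lang="en-US">
  <!-- Pick one of Azure's neural voices... -->
  <voice name="en-US-JennyNeural">
    <!-- ...and one of its supported speaking styles. -->
    <mstts:express-as style="newscast">
      Good evening. Here are tonight's top stories.
    </mstts:express-as>
  </voice>
</speak>
```

Swapping the voice or style is a one-attribute change — the same text can be rendered as a whisper, a shout, or a newscast without re-recording anything.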
The technology underlying the Grandma voice demo was developed by scientists at Amazon’s text-to-speech lab in Gdańsk, Poland. In a research paper, the developers describe their novel approach to cloning a new voice from a very limited sample — a “few-shot” problem, in machine-learning parlance. Essentially, they split the task into two parts. First, the system converts text to “generic” speech, using a model trained on 10 hours of another speaker’s speech. Then a “voice filter” — trained on a one-minute sample of the target speaker’s voice — creates a new speaker identity, changing the characteristics of the generic voice to sound like the target speaker. Very few training samples are needed to build new voices.
Rather than having to build a new text-to-speech model for each new voice, this modular approach turns the process of creating a new speaker identity into the computationally simpler task of transforming one voice into another. On objective and subjective measures, the quality of synthetic speech generated this way was comparable to that of models trained on 30 times more data. That said, it cannot fully mimic a specific person’s speaking style. In an email to londonbusinessblog.com, the Alexa researchers explain that the voice filter changes only the timbre of the speaking voice — its basic resonance. The prosody of the voice — its rhythms and intonation — comes from the generic voice model. So it would sound like Grandma’s voice reading aloud, but without the signature way she would stretch out certain words or take a long pause between others.
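The two-stage split, and its limitation, can be caricatured in a short sketch. Stage one produces “generic” speech frames carrying both prosody and timbre; stage two swaps in the target speaker’s timbre while copying the prosody through unchanged. The classes, field names, and numbers below are all invented for illustration — this is not Amazon’s implementation, just the shape of the idea.

```python
# Caricature of the two-stage few-shot cloning approach:
# stage 1 generates speech in a generic voice; stage 2 re-voices it.
from dataclasses import dataclass

@dataclass
class Frame:
    pitch_hz: float    # prosody: intonation
    duration_s: float  # prosody: rhythm
    timbre_id: str     # speaker-specific resonance

def generic_tts(text):
    """Stage 1: text -> frames spoken in the generic speaker's voice.
    (One fake frame per word, with a made-up rising pitch contour.)"""
    return [Frame(pitch_hz=120.0 + 5 * i, duration_s=0.08, timbre_id="generic")
            for i, _ in enumerate(text.split())]

def voice_filter(frames, target_timbre):
    """Stage 2: swap only the timbre; pitch and rhythm pass through
    untouched — which is exactly why the clone keeps the generic
    model's prosody, not the target speaker's."""
    return [Frame(f.pitch_hz, f.duration_s, target_timbre) for f in frames]

frames = voice_filter(generic_tts("once upon a time"), target_timbre="grandma")
```

Because `voice_filter` never touches `pitch_hz` or `duration_s`, the output has Grandma’s resonance but the generic model’s rhythm — mirroring the trade-off the researchers describe.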
Amazon has not said when the new voice cloning capabilities will be available to developers and the public. In an email, a spokesperson wrote: “Personalizing Alexa’s voice is a highly desired feature from our customers, who could use this technology to create many wonderful experiences. We’re working to improve upon the fundamental science we demonstrated at re:MARS and explore use cases that our customers will love, with the necessary guardrails to prevent potential abuse.”
It’s easy to imagine Amazon offering the option to customize something like Reading Sidekick — an Alexa feature that lets kids take turns reading with Alexa — with the voice of a loved one. And it’s easy to see how the “Grandma’s voice” demo could foreshadow an expanded cast of more customizable celebrity voices for virtual assistants. Alexa’s current celebrity voices — Shaquille O’Neal, Melissa McCarthy, and Samuel L. Jackson — took about 60 hours of studio recording to produce, and they’re somewhat limited in what they can do: they answer questions about the weather, tell jokes, and respond to certain requests, but fall back on the standard Alexa voice for anything outside the system’s comfort zone.
Google Assistant “celebrity voice cameos” by John Legend and Issa Rae — introduced in 2018 and 2019, but not currently supported — similarly combined pre-recorded audio with some improvised responses synthesized using WaveNet technology. The ability to develop more powerful celebrity voices that can read out any text input after a short recording session could be a game changer — and even help boost sluggish smart speaker sales. (According to research firm Omdia, shipments of smart speakers in the U.S. were down nearly 30% last year from 2020, including a nearly 51% drop in shipments of Amazon Alexa smart speakers.)
As the big tech companies continue to invest in text-to-speech, one thing is certain: it will become increasingly difficult to tell whether the voice you hear was made by a human or by a human-made algorithm.