
Why Synthetic Speech Doesn't Sound Right (Yet)

By Matthew Braga


An ever-alluring science fiction trope is the conversational AI – the ever-convivial J.A.R.V.I.S. in the film adaptations of Iron Man, or Kevin Spacey's frequently concerned GERTY in the 2009 drama Moon. And who could forget 2001: A Space Odyssey's methodical HAL? We find these AIs fascinating because they demonstrate an intelligence seemingly on par with our own – so much so that they become characters in their own right.

In the case of Spike Jonze's Her, released last year, the AI is also one of the film's leads: an operating system named Samantha, the titular her. The film's other lead, a man named Theodore Twombly, falls in love with Samantha. That's how real she acts and sounds.

But realness is a struggle in today’s development of synthesized speech. Emotion is hard to fudge. And while mobile personal assistants and text-to-speech stop announcements on the bus are good at conveying information, they’re not so good at sounding like us.

Speech synthesis, particularly in the context of science fiction, is actually a multidisciplinary field – one that, according to a Microsoft Research paper from 2006, "spans Machine Translation, Information Retrieval, Natural Language Processing, Data Resources, and Speech Understanding." In other words, it involves a lot more than just, well, speech. If we're striving for perfection, suggested AT&T Labs in 2003, a computer can't just be smart. It has to pass the Turing test (or the modern equivalent). To speak like a human, a computer has to think like one, too.

Today, however, we're mainly just turning text into speech, feeding lines into systems that spit them back out with verbalized indifference. The most prevalent technique is something called unit-based or concatenative synthesis – basically a copy-and-paste style of voice generation, wherein many, many hours of recorded speech are broken down into tiny units of sound, such as phonemes, and pieced back together in new combinations that form synthesized words.
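To make the copy-and-paste idea concrete, here is a rough sketch in Python of how a concatenative system glues pre-recorded snippets together. The phoneme labels, the toy lexicon and the random placeholder audio are all stand-ins for illustration; a real unit library would be carved out of hours of carefully segmented studio recordings.

```python
# A minimal sketch of concatenative (unit-based) synthesis, assuming the
# recorded speech has already been segmented into phoneme-length snippets.
# The phone set, lexicon and audio below are illustrative placeholders.
import numpy as np

SAMPLE_RATE = 16_000  # assume units were recorded at 16 kHz

# Pretend unit library: each phoneme maps to a short recorded waveform.
unit_library = {
    "HH": np.random.randn(SAMPLE_RATE // 20),  # ~50 ms placeholder "recording"
    "ER": np.random.randn(SAMPLE_RATE // 10),
    "L":  np.random.randn(SAMPLE_RATE // 20),
    "OW": np.random.randn(SAMPLE_RATE // 10),
}

# Toy lexicon mapping words to phoneme sequences.
lexicon = {"hello": ["HH", "ER", "L", "OW"]}

def synthesize(text: str) -> np.ndarray:
    """Copy-and-paste synthesis: look up each word's phonemes and
    concatenate the corresponding recorded units end to end."""
    units = []
    for word in text.lower().split():
        for phoneme in lexicon[word]:
            units.append(unit_library[phoneme])
    return np.concatenate(units)

waveform = synthesize("hello")
print(waveform.shape)  # one flat audio buffer, ready for playback
```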

A text-to-speech company named CereProc famously used this technique to develop a replacement voice for Roger Ebert, and it's the same technique Apple used to give Siri her voice. But there are limits – a noticeable monotony and neutrality to the output, a voice that talks at you, but not with you. In a 2012 paper, faculty from the University of Southern California and Carnegie Mellon University characterized such present-day synthesis as lacking “all the attitude, intention, and spontaneity associated with everyday conversations.” It's intelligible, but a voice without a soul.


The proper term for what most synthesized voices lack is prosody – the expressiveness, emotion and vocal tics that make us who we are. Prosody is why we raise our voices at the end of a question, or enunciate when giving commands to a dog, and why we can easily distinguish between the two.
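In practice, prosody mostly gets specified by hand rather than understood. The sketch below builds SSML, the W3C speech-markup standard many engines accept, to ask for a rising pitch on a question and a flatter, clipped delivery for a command. The specific pitch and rate values are arbitrary choices for illustration, and how faithfully any given engine honors them varies.

```python
# A sketch of hand-annotated prosody using SSML's <prosody> element.
# The percentages are arbitrary; real engines interpret them differently.
def question(text: str) -> str:
    # Raise pitch and slow down slightly, the way a human asks a question.
    return (
        "<speak>"
        f'<prosody pitch="+15%" rate="95%">{text}</prosody>'
        "</speak>"
    )

def command(text: str) -> str:
    # Flatter, louder, more clipped delivery for an instruction.
    return (
        "<speak>"
        f'<prosody pitch="-5%" rate="110%" volume="loud">{text}</prosody>'
        "</speak>"
    )

print(question("Are you coming?"))
print(command("Sit."))
```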

It's also very, very difficult to implement in synthesized speech. Concatenative speech libraries are usually built from neutral, read-aloud speech that purposely lacks emotion – intended to make the copy-paste process of building words and sentences from disparate phonemes sound natural in the widest range of situations possible. Recording a separate speech library for the full range of human emotions is nigh impossible, and we're just not good enough yet at generating purely artificial, realistic-sounding voices from scratch.
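One way to picture the problem: a concatenative library is effectively a grid of recordings indexed by sound and speaking style, and only the neutral column ever gets filled. The sketch below continues the earlier toy example, with invented style names and a simple fall-back rule, to show where the flatness comes from.

```python
# Continuing the toy example above: emotion in concatenative synthesis
# effectively requires a separate recorded library per speaking style,
# and most cells in that grid stay empty. Styles and data are made up.
import numpy as np

styles = ["neutral", "happy", "angry", "sad", "whisper"]

# (phoneme, style) -> recorded unit. Only the neutral column is fully
# populated, as it would be for a typical commercial voice.
unit_library = {
    ("HH", "neutral"): np.random.randn(800),
    ("ER", "neutral"): np.random.randn(1600),
    ("HH", "happy"):   np.random.randn(800),  # partial coverage at best
}

def pick_unit(phoneme: str, style: str) -> np.ndarray:
    # Fall back to the neutral recording when the requested style was
    # never recorded -- which is exactly where the flat, soulless sound
    # of most synthesized voices comes from.
    return unit_library.get((phoneme, style),
                            unit_library[(phoneme, "neutral")])
```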

That’s a problem, because prosody is integral to the development of a future conversational AI. Consider this scenario presented by a working group in 2002:

"If a synthesizer is to be used in seamless or unobtrusive conversational interactions with a human interlocutor, then there will be a need for the expression of personal attitudes, moods, and interest, and more use will be made of non-lexical sounds such as 'grunts', fillers, and laughter. In such cases, the key difference lies in the degree of interaction with the listener and reaction to the contexts of the discourse. Humans raise their voices both to show anger and to adapt to a noisy environment. They whisper when the content is confidential. Conversation is an interactive, two-way process, with the listener also taking an active part in the discourse. The synthesizers that takes the part of a human will be required to express personal feelings and attitudes that are perhaps more in the domain of psychology than linguistics."

Her is a good example of this. You can hear Samantha breathe. She sighs and inhales, and even laughs and sings. All of that requires not just a textual understanding of the content of speech, but a psychological and physical understanding of why humans sound, act and feel the way they do – a whole extra level of prosody. It's why we forget that Samantha is actually an operating system, and what allows Theodore to fall in love. It all feels so real.

Photo credit: Flickr user Rotwang via Creative Commons.

Scientists are already working on introducing at least some prosody awareness into synthesized speech. A technique called hidden Markov model synthesis, or HMM synthesis, has been called "one of the most important recent developments in speech recognition,” and it aggregates recorded voice data from multiple human speakers to create statistical models for accents, emotions and speaking styles. There's more prosodic flexibility than with concatenative synthesis, though it's certainly far from perfect.
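Here is a heavily simplified sketch of the statistical idea behind HMM synthesis: instead of storing recordings, each phoneme gets a small model whose states emit acoustic parameters, and because those parameters are just numbers, two trained models – say, a neutral voice and an excited one – can be blended. The single pitch parameter, the style names and all of the numbers below are invented for illustration; real systems generate full spectral and excitation trajectories.

```python
# A drastically simplified take on HMM-style statistical synthesis:
# one left-to-right model per phoneme, whose states emit a single
# acoustic parameter (pitch in Hz) from a Gaussian. All values invented.
import numpy as np

rng = np.random.default_rng(0)

# Per-state emission parameters (mean pitch, std dev) that a real system
# would learn from many speakers' recordings.
neutral_model = {"means": np.array([110.0, 115.0, 112.0]),
                 "stds":  np.array([4.0, 4.0, 4.0])}
excited_model = {"means": np.array([150.0, 170.0, 160.0]),
                 "stds":  np.array([8.0, 8.0, 8.0])}

def interpolate(model_a, model_b, weight):
    """Blend two trained models' parameters -- the kind of trick that lets
    HMM synthesis morph between speakers, accents or emotions."""
    return {
        "means": (1 - weight) * model_a["means"] + weight * model_b["means"],
        "stds":  (1 - weight) * model_a["stds"]  + weight * model_b["stds"],
    }

def generate_pitch_track(model, frames_per_state=10):
    """Walk the left-to-right states and sample a pitch value per frame."""
    track = []
    for mean, std in zip(model["means"], model["stds"]):
        track.extend(rng.normal(mean, std, size=frames_per_state))
    return np.array(track)

# Halfway between neutral and excited: prosodic flexibility that a fixed
# library of recorded units can't offer.
blended = interpolate(neutral_model, excited_model, weight=0.5)
print(generate_pitch_track(blended)[:5])
```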

Siri has nothing to fear from Samantha just yet.