An ever-alluring science fiction trope is the conversational AI – the convivial J.A.R.V.I.S. of the Iron Man film adaptations, or the quietly concerned GERTY, voiced by Kevin Spacey in the 2009 drama Moon. And who could forget 2001: A Space Odyssey's methodical HAL? We find these AIs fascinating because they demonstrate an intelligence seemingly on par with our own, enough so that they become characters in their own right.
In Spike Jonze's Her, released last year, the AI is also one of the film's leads: an operating system named Samantha, the titular her. The film's other lead, a man named Theodore Twombly, falls in love with her. That's how real she acts and sounds.
But realness is a struggle in today’s development of synthesized speech. Emotion is hard to fudge. And while mobile personal assistants and text-to-speech stop announcements on the bus are good at conveying information, they’re not so good at sounding like us.
Speech synthesis, particularly in the context of science fiction, is actually a multidisciplinary field – one that, according to a Microsoft Research paper from 2006, "spans Machine Translation, Information Retrieval, Natural Language Processing, Data Resources, and Speech Understanding." In other words, it involves a lot more than just, well, speech. If we're striving for perfection, suggested AT&T Labs in 2003, a computer can't just be smart. It has to pass the Turing test (or the modern equivalent). To speak like a human, a computer has to think like one too.
Today, however, we're mainly just turning text into speech, feeding lines into systems that spit them back out with verbalized indifference. The most prevalent technique is something called unit-based or concatenative synthesis – basically a copy-and-paste style of voice generation, wherein many, many hours of recorded speech are broken down into tiny snippets of sound (phones and diphones, roughly corresponding to the phonemes of a language) and put back together in new combinations that form synthesized words.
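The copy-and-paste idea can be sketched in a few lines of code. This is a toy illustration, not a real synthesizer: the "unit database" below is hypothetical, standing in sine-wave snippets for the hours of recorded speech a production system would draw on, and the phoneme labels and function names are invented for the example. The one genuinely representative step is the join: units are concatenated with a short crossfade to smooth the seams.

```python
import numpy as np

SAMPLE_RATE = 16000  # samples per second

def make_unit(freq_hz, duration_s=0.1):
    """Stand-in for a recorded speech unit: a short sine-wave snippet.
    A real system would store actual audio cut from voice recordings."""
    t = np.linspace(0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return 0.5 * np.sin(2 * np.pi * freq_hz * t)

# Hypothetical unit database: one snippet per phoneme-like label.
unit_db = {
    "HH": make_unit(220),
    "EH": make_unit(330),
    "L":  make_unit(440),
    "OW": make_unit(550),
}

def synthesize(labels, crossfade_ms=10):
    """Concatenate stored units, crossfading at each join so the
    transitions between snippets sound less abrupt."""
    fade = int(SAMPLE_RATE * crossfade_ms / 1000)
    out = unit_db[labels[0]].copy()
    for label in labels[1:]:
        unit = unit_db[label]
        # Linear crossfade: ramp the tail of `out` down while the
        # head of the next unit ramps up, then append the rest.
        out[-fade:] = (out[-fade:] * np.linspace(1, 0, fade)
                       + unit[:fade] * np.linspace(0, 1, fade))
        out = np.concatenate([out, unit[fade:]])
    return out

# Roughly the labels for "hello"; write `audio` to a WAV file to hear it.
audio = synthesize(["HH", "EH", "L", "OW"])
```

The hard part in practice isn't the pasting, it's the selection: choosing, from many candidate recordings of each unit, the one whose pitch and context best match its neighbors, which is why the technique is also called unit selection.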