How the Human Voice is Processed for Text-to-Speech

By Wesley Fenlon

How a human voice becomes a (mostly) human-sounding computer that talks back.

Something has changed in the technology of text-to-speech systems within the last 10 years. Actually, a lot of things have changed, but one's especially noticeable--our computers stopped talking to us in computerized voices and started talking to us in human ones. The human voices may still sound a little robotic, but that's because today's personal assistants like Siri and other text-to-speech systems are weaving together individual words from massive databases of speech read by human voice actors. The Verge has a great feature on how this came to be.

Voice actors now spend dozens of hours reading thousands of sentences to create a library of words for text-to-speech systems. It's an interesting process, because the actors don't just read words in a vacuum or read sentences that computer voice companies plan to use verbatim. They read "phonetically rich" sentences that are filled with important words and important phrasing. This is where the technology of text-to-speech systems has really changed in the past decade--computers have grown fast and powerful enough to piece together millions or billions of phonemes on the fly to create words and sentences the voice actors never said.
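The core idea--stitching recorded speech fragments into words that were never spoken--can be sketched in a few lines. This is a deliberately toy illustration, not how Siri or any real engine is implemented: the phoneme labels, the tiny lexicon, and the filenames are all hypothetical stand-ins for the massive unit databases the article describes.

```python
# Toy sketch of concatenative text-to-speech. Every name here is
# hypothetical; a real system indexes enormous databases of recorded
# speech and scores many candidate units per phoneme.

# Each phoneme maps to a recorded audio snippet (represented by a label).
UNIT_DATABASE = {
    "K": "k.wav", "AE": "ae.wav", "T": "t.wav", "S": "s.wav",
}

# A tiny pronunciation lexicon: word -> phoneme sequence.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "cats": ["K", "AE", "T", "S"],
}

def synthesize(text):
    """Concatenate recorded units for each word's phonemes, in order."""
    units = []
    for word in text.lower().split():
        for phoneme in LEXICON[word]:
            units.append(UNIT_DATABASE[phoneme])
    return units

# "cats" was never recorded as a whole word, but its phonemes were,
# so the system can still assemble it on the fly.
print(synthesize("cats"))  # ['k.wav', 'ae.wav', 't.wav', 's.wav']
```

The interesting engineering problem, as the rest of the article explains, is not the concatenation itself but choosing *which* recording of each phoneme fits its neighbors.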

Much of speaking, and understanding speech, comes down to context--context that most people naturally understand and reflect in the way they speak. If a computer can't replicate our natural speech patterns, we're reminded that the voice we're hearing, and talking to, is fake.

"Speaking is unconscious; we do it, we don’t think about how we’re doing it, and we certainly aren’t thinking about the minute fluctuations of stress, intonation, pitch, speed, tongue position, relationships between phonemes, and myriad other factors that allow us to seamlessly and effectively communicate complex ideas and emotions," writes The Verge. "But in order to get a computer to assemble a human-sounding voice, all of those things have to be considered, a task described by one language professor as 'Herculean.' "

Until recently, computers couldn't process human speech to create realistic voices that rise in pitch at the end of a question.

"Take, for instance, the phoneme 'A' as in 'cat.' It will sound slightly different if it’s the center of a syllable, as in 'catty,' versus at the beginning of a syllable, as in 'alligator.' And that 'a' will also sound a little different if it’s in a stressed syllable, as it is in 'catty,' versus a non-stressed syllable, as in the word 'androgynous.' "

Until recently, computers also couldn't correctly pronounce proper nouns. Now they're powerful enough to combine human and synthesized voices in a way that sounds natural. The rest of The Verge feature explores where this technology might be going--from talking microwaves to vanity voices to emotional personal assistants. That last one will be an incredible challenge: the better computer voices get, the more we'll pick out the tiny mistakes that make them not quite human.