Virtual Talking Head Combines Facial and Vocal Emotions

By Wesley Fenlon

Zoe's six primary emotions can be combined to create nuanced facial and vocal expressions, which all look a bit weird, since she's a disembodied head.

Watch your back, Siri--the Department of Engineering at Cambridge is coming for you. University researchers working in collaboration with Toshiba published a story on Tuesday about Zoe, the "face of the future." Zoe is a disembodied talking head--she might bring to mind Mario's cheery mug from the title screen of Mario 64--but the big breakthrough with Zoe is how many emotions she can subtly and not-so-subtly express. The Cambridge team claim she's the most expressive system designed for human-machine interaction, and their eventual goal is to use Zoe's technology to turn anyone's face and voice into a disembodied digital avatar.

To create Zoe, the team enlisted the help of U.K. actress Zoë Lister. She recorded several thousand sentences while expressing six unique emotions--happy, sad, angry, tender, afraid, and neutral. They also used face tracking and computer vision to scan Zoë's face while she talked, then used that data to pair visible emotions with the audio tracks. The resulting computer program can do more than express six emotions--those emotions can be combined to create more nuanced tones, like hurried or nervous.

Given her range, it's easy to see Zoe replacing other voice-based interfaces. She still speaks with a noticeable computerized affectation, but it's also easy to detect emotional shifts in her voice. Zoe's developers say her program is only tens of megabytes in size, and their true goal is making it possible for users to create avatars of themselves with the same expressiveness:

“It took us days to create Zoe, because we had to start from scratch and teach the system to understand language and expression,” [said Professor Roberto Cipolla]. “Now that it already understands those things, it shouldn’t be too hard to transfer the same blueprint to a different voice and face.”

...The framework behind “Zoe” is also a template that, before long, could enable people to upload their own faces and voices - but in a matter of seconds, rather than days. That means that in the future, users will be able to customise and personalise their own, emotionally realistic, digital assistants.

We're likely still a long way off from being able to capture Zoe's range of emotion from a few self-shot photographs and self-recorded voice clips, but the team's success so far is impressive. One of the suggested applications--helping deaf and autistic children read lips and emotions--could be a wonderful use of this technology beyond the cool-but-unnecessary realm of smartphone assistants.

Check out the engineering department's video below to get a look at Zoe's real face--it's not nearly as creepy as her wireframe model--and learn more about how the project came together.