Humans+Robots

AI can create a voice. It’s up to humans to use it responsibly.

From deepfakes to Anthony Bourdain, synthetic voice technology is raising fears. But expert Rupal Patel says it's also helping patients and companies.

How does AI create an uncannily accurate voice? In a controversial scene in the new documentary “Roadrunner: A Film About Anthony Bourdain,” a voice that sounds like Bourdain’s reads an email he had written — but the sound was created posthumously, using “voice cloning” technology.

Synthetic voices aren’t new; they’re already being used in the health care and commercial fields, notes Rupal Patel, a professor in the departments of speech, language pathology, and computer science at Northeastern University and the founder of the voice-generating company VocaliD. Experience editor-in-chief Joanna Weiss spoke to Patel several weeks ago about the mechanics and ethics of synthetic voice technology. This interview is edited and condensed; you can view the full video here.

You were featured in a recent Experience story that imagines future uses of this technology — one man wanted his great-aunt to “read” a bedtime story to his children, even though she’s no longer with us. How close is that to what your company does?
We also do this, in terms of recreating the voice of people who are alive today, or recreating a voice of someone who had archives of audio before. The genesis of VocaliD, though, came from creating voices for people who didn’t have a voice. Maybe someone born with a disorder such as cerebral palsy. Or someone who later in life loses their voice to a condition like ALS or Parkinson’s disease. In the laboratory, we’ve figured out a way to create a unique synthetic voice for someone, despite having very little audio.

How do you create a new voice for a patient?
To create a synthetic voice, you need audio recordings. From those, you train a neural network to learn how to speak like that person. A voice volunteer records their voice on our platform, line by line. An individual [who wants to use the service] records samples of their voice online in a different kind of a studio.

For non-speaking individuals, that person still makes sound. They can’t produce all the consonants and vowels accurately, but their sound has its particular, distinct vocal DNA. We use both recordings together to create the voice. [In the end,] you have a file that can be downloaded onto a mobile phone or an assistive communication device.

When you say “vocal DNA,” what does that mean? What makes each voice unique?
There are actually hundreds of vocal features: The range and the pitch of their voice. The breathiness of their voice. The loudness of their voice. You’re going to blend it with a voice donor that’s similar to them, so accents are important, too.

When we first started, we were focused on making it sound accurate—like it could be that person’s voice. What we’ve learned from the feedback, now that we have hundreds of people using their VocaliD voices on a daily basis, is that accuracy is not always the thing. Many individuals who have lost their voices actually want a younger-sounding voice — I call it a “voice lift.” Or they want a voice that is more pleasant to their ear. You’ve probably heard people say, “Oh, I don’t like the sound of my voice. If I lost my voice, I’d probably have so-and-so’s voice instead.” The common joke is Morgan Freeman’s.
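
The donor-matching idea in this answer (blend the recipient with a voice donor who is similar to them, with accent taken into account) can be illustrated with a small sketch. This is not VocaliD's method. The feature names, the scaling constants, and the rule of preferring same-accent donors are all assumptions made for illustration; the interview only says that pitch, range, breathiness, loudness, and accent are among the features that matter.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    """Hypothetical summary of a voice; real systems use many more features."""
    name: str
    mean_pitch_hz: float
    pitch_range_hz: float
    breathiness: float  # assumed 0..1 score
    loudness_db: float
    accent: str


# Assumed per-feature scales so that no single unit dominates the distance.
FEATURE_SCALES = {
    "mean_pitch_hz": 50.0,
    "pitch_range_hz": 40.0,
    "breathiness": 0.2,
    "loudness_db": 6.0,
}


def distance(a: VoiceProfile, b: VoiceProfile) -> float:
    """Scaled Euclidean distance over the numeric features."""
    return math.sqrt(
        sum(
            ((getattr(a, feature) - getattr(b, feature)) / scale) ** 2
            for feature, scale in FEATURE_SCALES.items()
        )
    )


def closest_donor(target: VoiceProfile, donors: list[VoiceProfile]) -> VoiceProfile:
    """Pick the nearest donor, preferring donors who share the target's accent."""
    same_accent = [d for d in donors if d.accent == target.accent]
    pool = same_accent or list(donors)  # fall back to everyone if no accent match
    if not pool:
        raise ValueError("no donors available")
    return min(pool, key=lambda d: distance(target, d))


target = VoiceProfile("recipient", 200.0, 80.0, 0.30, 60.0, "southern")
donors = [
    VoiceProfile("donor_a", 210.0, 85.0, 0.35, 61.0, "southern"),
    VoiceProfile("donor_b", 200.0, 80.0, 0.30, 60.0, "midwestern"),
    VoiceProfile("donor_c", 120.0, 40.0, 0.10, 65.0, "southern"),
]
best = closest_donor(target, donors)
```

Here `donor_b` is numerically identical to the recipient, but the sketch still picks a same-accent donor first, which reflects the remark that accents are important too.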

Can people bank their voices ahead of time?
Absolutely. In the last couple of years, we’ve been focusing on these individuals, primarily those with head and neck cancer. Many will have the voice box or part of their tongue removed. But if they have banked their voice ahead of time, they might get a little prosthesis that allows them to talk through the stoma [an opening in the voice box]. It’s been a game changer in terms of having them continue to do their jobs and keep their social closeness with family members. Sometimes, even when these people pass, the family members continue to want to use their voice on their assistive communication devices as a means of grieving and social closeness.

Beyond healthcare, what are the applications for this technology?
We’re working with companies to create their own brand voice for applications: phone trees, interactive voice response. Imagine a voice for a brand and you want to make a 15-second audio sample, but that talent is not available. It’s not about dropping the talent — it’s about making sure that you expand the capability or scale that talent, so that they can be available to do all this content. A lot of financial technology companies want to regionalize their voices. It’s the same person, but can you change that voice for those different demographics?

Do you mean you take a Siri-like program and impose a Southern accent or a Midwestern accent?
I would call it more “styles.” When you’re speaking to an older person, or if you’re speaking to someone in a crowded restaurant or a bar, you speak in a different way. So how do you adapt the voice so that it can be understood in those different contexts? One of the earliest projects I did in the lab was a project called “Loud Mouth.” When we talk in a bar, we talk very differently, right? We use those learnings to make the synthesizer speak as if it was speaking in a crowded restaurant, which isn’t making every single word louder or higher in pitch. It’s actually the content words that are highlighted with different kinds of cues in order to make them understood.
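
The point about "Loud Mouth" (emphasize the content words rather than making every word louder) can be sketched in a few lines. This is a toy illustration, not the lab's implementation. The function-word list, the single loudness cue, and the 6 dB value are assumptions; the interview mentions "different kinds of cues" without specifying them.

```python
# Small, assumed list of English function words; a real system would use
# part-of-speech tagging rather than a fixed list.
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "for", "with", "is", "are", "was", "were", "be", "it", "me", "you",
    "he", "she", "we", "they", "i", "my", "your", "that", "this",
})


def emphasis_plan(sentence: str, gain_db: float = 6.0) -> list[tuple[str, float]]:
    """Return (token, gain) pairs, boosting only the content words."""
    plan = []
    for token in sentence.split():
        # Compare on a lowercased copy with surrounding punctuation removed.
        word = token.strip(".,!?;:\"'").lower()
        is_content = bool(word) and word not in FUNCTION_WORDS
        plan.append((token, gain_db if is_content else 0.0))
    return plan


plan = emphasis_plan("Meet me at the bar")
```

For the sample sentence, only "Meet" and "bar" receive the boost, while "me", "at", and "the" are left unchanged.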

Everyone today is familiar with deepfakes. How do you protect against them? You can imagine a world where somebody has banked another person’s voice without their consent, and is using it to say something that they never agreed to say.
It’s a two-part solution, I think — part on the technology side, and part on the consumer side. For the consumer, you have to educate: Make people aware of this new media, and understand what is fake. On the technology side, when we make a voice, we have to have consent. And be more clear: “You’re listening to an avatar of an individual.”

The other thing technology providers can do is build in protections, such as watermarking the audio. But you have to keep evolving, because fraudsters will always find the leak. It’s a whack-a-mole game. So VocaliD, along with other synthetic media companies, created a coalition called AITHOS, for “AI ethos.” We want to think about the unintended consequences before they become a problem. Many of the talks I give these days are centered around this concept: What is our responsibility ethically, as a technology provider, to make sure that we don’t inadvertently create a problem, when we were looking to do something that was a social good? You can’t plead ignorance and say, “We had no idea.” We’ve seen enough to know that we absolutely have an idea.


Photo by Ruby Wallau / Northeastern University
