Five years ago, Lee Mallon’s grandmother gave him his great-aunt’s diaries from the 1930s and 1940s. They transported the 36-year-old developer to a different world. “She was a really nice old English lady,” he recalls. When he was a child, she would give him math problems to do on Sundays and take him to a sweet shop if he solved them.
Mallon thought how lovely it would be if his great-aunt, who died in the late 1990s, could read her diaries to his own children today. But the only recording he has of her is seven seconds of video on a VHS tape of his grandfather’s 50th birthday party, in which she utters just a few words. After Mallon’s nostalgia came an epiphany: What if we could use neural network technology — a method of simulating the workings of a human brain in a computer — to keep the voices of those we love with us forever?
Last year, Mallon, an app developer who’s worked on projects for Amazon’s Alexa voice assistant, launched voiceOK, an app that prompts users to select a short story and read it aloud. Once they have recorded an hour’s worth of audio, they will have essentially created their own “voice vault” — a kind of raw material for voice manipulation.
Voice synthesizing technology is already common. It’s used by digital assistants like Alexa, music producers, and people who have lost their voices due to medical conditions; think Stephen Hawking’s speech synthesizer, which made its way into pop culture. But Mallon is suggesting a step further: creating a synthesized voice that sounds exactly like the original human — and can be programmed to read or say anything, even when that person is no longer alive to speak. It sounds like a sci-fi prospect, but Mallon and others say the technological work is already underway.
Since Hawking started using his robotic speech synthesizer in the 1980s, researchers have been working to improve the quality of synthesized speech, mostly to help people with health issues and disabilities. Rupal Patel, a professor at Northeastern University’s Department of Communication Sciences and Disorders, is the founder of VocaliD, a company that creates customized synthetic voices for people who have lost their speech or were born with speech disorders.
“Our voice is an important part of how we form bonds across generations,” says Patel. And every voice is unique, due to differences in our physiology. A single “aahhh” contains enough vocal DNA to be an identity cue.
VocaliD, which was founded in 2014, set up a bank of 27,000 voices, collected from volunteers who read a passage into a home microphone and sent it to the company. To develop a new synthetic voice for a client, the company analyzes the vocalizations and sounds the person is still able to make — known as a “residual voice” — then uses algorithms to search for acoustic and demographic similarities in the sample bank. The program finds matches based on broad categories such as age and gender, as well as more detailed characteristics such as pitch, tempo, and volume.
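The matching step Patel describes — a broad demographic filter followed by a search for acoustic similarity — can be caricatured in a few lines of code. This is a minimal sketch, not VocaliD’s actual algorithm: the feature names, normalization scales, and donor data below are all illustrative assumptions.

```python
# Toy sketch of voice-bank matching: filter donors by demographic
# category, then pick the nearest acoustic neighbor. All numbers and
# names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Voice:
    donor_id: str
    age_group: str      # e.g. "adult", "child", "senior"
    gender: str
    pitch: float        # average fundamental frequency, Hz
    tempo: float        # speaking rate, syllables per second
    volume: float       # average loudness, dB

# Rough scales used to normalize each feature before comparing,
# so a 25 Hz pitch gap and a 0.5 syll/sec tempo gap weigh similarly.
SCALES = {"pitch": 100.0, "tempo": 2.0, "volume": 10.0}

def distance(a: Voice, b: Voice) -> float:
    """Normalized Euclidean distance over the acoustic features."""
    return sum(
        ((getattr(a, f) - getattr(b, f)) / s) ** 2
        for f, s in SCALES.items()
    ) ** 0.5

def best_match(residual: Voice, bank: list) -> Voice:
    # Broad demographic filter first, then the closest acoustic neighbor.
    candidates = [v for v in bank
                  if v.age_group == residual.age_group
                  and v.gender == residual.gender]
    return min(candidates or bank, key=lambda v: distance(residual, v))

bank = [
    Voice("d1", "adult", "f", 210.0, 4.5, 62.0),
    Voice("d2", "adult", "f", 180.0, 3.8, 58.0),
    Voice("d3", "senior", "m", 120.0, 3.2, 55.0),
]
client = Voice("client", "adult", "f", 185.0, 4.0, 59.0)
print(best_match(client, bank).donor_id)  # prints "d2"
```

In a real system the “features” would be rich acoustic measurements extracted from the client’s residual voice, and the bank would hold 27,000 entries rather than three — but the filter-then-rank shape is the same.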
Once a match is found, an elaborate signal-processing stage follows, in which sounds are analyzed, modified, and mixed. Many acoustic cues reveal a person’s vocal identity, including accent, pronunciation, range of pitch, loudness, rate of speech, breathiness, and resonance, says Patel.
The signal processing software takes the speech-impaired person’s vocalizations and fuses them with the donor speaker’s set of sounds, creating one voice. “After that, we put our artificial intelligence algorithms to use in order to create a voice engine, which can take any text in and produce speech that sounds like that new voice,” says Patel.
That program can be linked to an assistive communication device that allows the user to type a message, either with their hands or eyes. “Our engine essentially takes the text of that message and converts it into speech so that they can speak it out loud,” says Patel.
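The text-in, speech-out path Patel describes can be sketched as a tiny pipeline: normalize the typed message, map it to phoneme-like units, and synthesize with the user’s voice parameters. The phoneme table and “duration model” below are toy stand-ins for the learned models a real engine uses.

```python
# Toy text-to-speech pipeline: text -> phoneme-like units -> timing,
# conditioned on a per-user voice profile. Purely illustrative.
PHONEMES = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def to_phonemes(text):
    """Map each word to phoneme units, falling back to letters."""
    units = []
    for word in text.lower().split():
        units.extend(PHONEMES.get(word, list(word)))
    return units

class VoiceEngine:
    """Holds the custom voice's parameters (here, just a speaking rate)."""
    def __init__(self, tempo_sps):
        self.tempo = tempo_sps  # hypothetical: syllables per second

    def speak(self, text):
        units = to_phonemes(text)
        # Crude duration estimate: ~2 phoneme units per syllable.
        duration = len(units) / (self.tempo * 2)
        return units, duration

engine = VoiceEngine(tempo_sps=4.0)
units, secs = engine.speak("hello world")
print(units)             # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
print(round(secs, 2))    # 1.0
```

A production engine replaces each stage — text normalization, grapheme-to-phoneme conversion, waveform generation — with trained neural models, but the stage boundaries are recognizably the same.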
In addition to people who have lost their speech completely, VocaliD can help people with cancer of the mouth or the tongue, who can bank their voices before they lose them. The company also creates custom voices for companies and organizations that want their own alternative to Alexa or Siri in an application.
Elsewhere in the private sector — including the entertainment industry — the demand for one-of-a-kind synthetic voices is on the rise. In Japan, vocal synthesizer music software, known as vocaloid software, has created several singing voices, including that of international pop music megastar Hatsune Miku, who started off as a singing voice synthesizer and is now also a virtual performer. Even Voices, a global network of voice actors, is using AI to try to create the “perfect synthetic persona.”
Microsoft’s cloud computing service Azure recently launched a text-to-speech feature that allows business customers to create a customized synthetic neural voice for their brands, for use in chatbots and audio programs. And Descript, an audio and podcast platform, has a feature called Overdub, which allows podcasters to create a phonetic clone of their own voices based on a short recording, then use a text editor to remove words — and add words they didn’t actually speak.
Overdub’s technology is powered by Lyrebird AI, a Canadian startup recently acquired by Descript, which has the stated ambition to create the most realistic artificial voices in the world. “We use deep neural networks and transfer learning to clone voices,” says Kundan Kumar, the cofounder of Lyrebird AI and the research lead at Descript. “The AI learns the pattern of human speech from thousands of speakers. With just a few minutes of data from a voice, it can understand the uniqueness of a human voice.”
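Kumar’s point — that a model pretrained on thousands of speakers needs only minutes of new audio — is the core of transfer learning: the big shared model stays frozen, and only a small per-speaker component is fit to the new data. Here is a deliberately tiny sketch of that idea under toy assumptions; real systems learn multi-dimensional speaker embeddings inside deep networks, not a single offset.

```python
# Toy transfer learning: a frozen "base model" captures what's common
# to all speakers; cloning a new voice fits only a small per-speaker
# part (here, one scalar offset) from a handful of samples.
def fit_speaker_offset(samples, base_model, lr=0.1, steps=200):
    """samples: list of (feature, target) pairs from the new speaker."""
    offset = 0.0                      # the only trainable parameter
    for _ in range(steps):
        grad = 0.0
        for x, y in samples:
            pred = base_model(x) + offset
            grad += 2 * (pred - y)    # d/d_offset of squared error
        offset -= lr * grad / len(samples)
    return offset

base = lambda x: 2.0 * x              # stand-in for the frozen base model

# "A few minutes of data": three samples from a new speaker whose voice
# sits a constant +1.5 above the base model's average speaker.
few_samples = [(1.0, 3.5), (2.0, 5.5), (3.0, 7.5)]
print(round(fit_speaker_offset(few_samples, base), 2))  # 1.5
```

The punchline is the parameter count: the base model could have millions of weights, yet adapting to a new speaker here means fitting one number — which is why so little data suffices.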
It’s not a huge leap to think we could soon use the same technology to serve Mallon’s idea: re-creating voices that are no longer with us. And already, there’s a market. Karen Trainer, a 40-year-old teacher and mother of two from Bournemouth, England, recently signed up for Mallon’s voiceOK app because she wants to preserve the legacy of her parents, who are both in their mid- to late 70s. “I may need to live 40 years more without my parents,” she says. “Half my life, or a third of my life, without them is not a very, very short time in the grand scheme of things.” The prospect of someday hearing their voices in new contexts brings her comfort. “A photo, even a video of my parents, is just a single memory,” she says. “But having them tell me and my children lots of different stories, that’s a precious thing.”
Not all of Mallon’s ambitions for voiceOK involve death. In the future, Mallon says, a parent who works night shifts or is in the armed forces may opt to have their voice synthesized. If they have also installed a virtual-assistant artificial-intelligence technology like Amazon’s Alexa, their child may be able to enjoy Harry Potter in Dad’s voice even if Dad is away. Journalists and other writers could digitize their voices, allowing readers to listen to a newly released article in the writer’s voice, without a recording session. Isolated people with dementia could hear the voice of their nephew talk about the day’s weather or a daughter comment on the news.
Mallon says voiceOK is working on a voice-synthesis project that may fulfill some of these goals, but he isn’t ready to release details yet. “It’s a good way to keep people’s legacy going,” he says.
But as voice-synthesis technology advances, ethical challenges arise. Theoretically, anyone with access to a deceased person’s data could virtually resurrect them as a deepfake or avatar, without the person ever having consented to such posthumous uses. The data could even encroach on living people’s privacy, since it may include interactions they had with the deceased while he or she was alive. Are we equipped to deal with the legal and ethical dilemmas of keeping our departed loved ones with us in a digital urn?
And if your voice can go on in perpetuity after your demise, what will it get to say when you are not alive to control it? “This is a tricky part,” says Patel. “Digital technologies can live on forever. It would be great if my grandmother who passed away could read to me a recipe while I’m cooking, but it’d be weird for me to read a letter that was never written by her in her voice.” The fact that synthetic speech is getting better and better, making it difficult to tell real from fake, gives rise to an unprecedented ethical conundrum: “You have put words into their artificial mouth,” says Patel, who urges that we put some kind of constraints on what can be said with our “eternalized” voices before the technology develops further.
The chillingly convincing deepfakes of Tom Cruise that went viral on TikTok in February 2021 — in which he appeared to show off his CD collection and play a Dave Matthews Band song on guitar — alerted many to the eerie era we are already in, where anyone can virtually make a fake video of us and circulate it online. “Deep fakes, impersonation, identity theft are problems that have not been an issue before. They are creating a gray area,” says Mallon. “What happens if one impersonates you and contacts your bank?” he wonders. (He is quick to offer solutions, such as two-step verification.) “Like with anything else, technology will just keep going to catch up to the problem,” he concludes.
Protecting against misuses such as audio deepfakes might involve embedding information in the audio files to indicate that they are synthetically generated, says Patel. It might also involve creating standards or educating the public about deepfakes, she continues. That said, bad actors will always find ways to bypass constraints, she cautions: “I don’t think there is one solution, and any technical solution will require continual vigilance.”
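One concrete version of the disclosure idea Patel raises — embedding information in the audio itself so tools can flag it as synthetic — is a watermark hidden in the samples. The sketch below stashes a tag string in the least-significant bit of 16-bit PCM sample values; it is a classroom-level illustration, and as Patel warns, trivially removable. Real provenance schemes are far more robust, and the tag, sample values, and layout here are all assumptions.

```python
# Toy provenance watermark: hide a byte string in the least-significant
# bits of PCM audio samples, one bit per sample. Illustrative only --
# a bad actor could strip this in one line.
TAG = b"SYNTH"

def embed_tag(samples, tag=TAG):
    """Return a copy of samples with tag's bits written into the LSBs."""
    bits = [(byte >> i) & 1 for byte in tag for i in range(8)]
    out = list(samples)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit   # overwrite the LSB only
    return out

def read_tag(samples, length=len(TAG)):
    """Reassemble the hidden bytes from the samples' LSBs."""
    byts = []
    for b in range(length):
        val = 0
        for i in range(8):
            val |= (samples[b * 8 + i] & 1) << i
        byts.append(val)
    return bytes(byts)

audio = [1000 + n for n in range(64)]  # stand-in PCM sample values
marked = embed_tag(audio)
print(read_tag(marked))                # b'SYNTH'
```

Because only the lowest bit of each sample changes, the marked audio is perceptually identical to the original — which is both the appeal of the approach and, since the mark survives nothing more hostile than re-encoding, its weakness.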
Perhaps machine intelligence will become an ally in our never-ending battle to bring death to its knees — even on a symbolic level. “Alexa, give me a recipe for my favorite noodle soup,” you could soon ask your virtual assistant AI. “Yes, lovely,” your long-gone grandmother will say, before explaining the recipe, the inflections and familiar warmth in her voice all there. Mallon says he created his invention in the hopes of preserving voices for 500 years. In this case, a part of you could fool death for at least 20 generations.