How VocaliD, a new technology, is changing the lives of people who cannot speak
Last November, Joe Morris, a 31-year-old film-maker from London, noticed a sore spot on his tongue. He figured he’d bitten himself in his sleep and thought nothing more about it until halfway through the winter holidays, when he realised the sore was still with him. He Googled “cut on tongue won’t heal” and, after sifting through pages of medical information on oral cancer, he decided to call his doctor.
The cut was nothing, Joe was sure: he was a non-smoker with no family history of cancer. But he’d make an appointment, just in case.
I’m sure it’s nothing, the doctor said. You’re not a smoker, and you’re 31 years old. But see a specialist, just in case.
I’m sure it’s nothing, the specialist said, you don’t check any of the boxes, but we’ll do a biopsy, just in case.
When the biopsy results came back positive for cancerous cells, the specialist said that the lab must have made a mistake. The second time Joe’s biopsy results came back positive, the specialist was startled. Now Joe was transferred to Guy’s hospital, which has one of the best oral cancer teams in Britain.
The oncologists at Guy’s reassured Joe again: the cancerous spot was small, and cancer of the tongue typically starts on the surface and grows inward. This tiny sore could likely be nipped out without much damage to the rest of his tongue. They’d take an MRI to make sure there wasn’t any serious inward growth, and then schedule the surgery.
The image revealed a tumour like an iceberg. It was rooted deep in the base of Joe’s tongue, mounding upward and out, its tip breaking the surface just where the telltale sore was located. “When the doctor told me the news, there was a work email that was bugging me and I still had that on my mind,” Joe wrote to me last summer. “As he was explaining to me that I was going to lose my tongue, I was redrafting a reply in my head.”
“You’re going to lose two-thirds of your tongue,” the doctor was telling him. “This is going to seriously affect your ability to eat. And your speech.”
Joe wanted to know how the surgery would affect his speaking. Would he have a lisp?
The doctor hesitated, and then looked at his hands. “Your family will still be able to understand you.”
A week before the surgery, Joe started to panic: he realised that he might never speak again. Even if he did, he would no longer sound like himself. Knowing that he was about to lose a huge part of his identity, Joe asked a friend to film an interview with him so that he would have a permanent record of his voice.
In the video, Joe’s speech is already beginning to falter: he has a bit of a lisp, and he has to sip water frequently and take breaks to withstand the strain of talking. He is dressed in a black knitted V-neck sweater, and seated near a window through which you can see the London skyline at dusk. He is pale, with sunken, blue eyes, dark, shaggy hair and three days of stubble. He looks a little unwell, a little sad, and a little rueful, as if he’s unsure about being the centre of attention. He keeps ducking his head and looking away, or making jokes. When asked to state the date, he smirks and says, with wry formality, “The date is, I believe, the 24th of February, the year of our Lord 2017.”
Speaking to the camera, Joe struggles genially to articulate what it feels like to contemplate losing his voice for ever. “I’m not what you’d call a vain man,” he says, quietly. “Usually it’s very far into the day before I’ve looked in the mirror. I don’t care about any of that.” He takes a moment. “But I am human. And the idea that I’m going to not look like me or sound like me … terrifying.” He swallows. “And also my job, my life, is all about communication, all about talking. I love talking,” he says feelingly, with a little smile. “I’ve got a few things to say.”
Shortly before this video was taken, the friend behind the camera had come to Joe with some news. He had found a company outside Boston called VocaliD, which creates custom digitised voices for people who use devices to help them speak. The company could use recordings of Joe speaking to recreate his own voice on a computer for him to keep and use for ever.
When they contacted VocaliD’s founder, a speech pathologist named Rupal Patel, she explained that it would be possible to digitally reconstruct Joe’s voice if, before his surgery, he was able to “bank” his voice. This meant recording the few thousand sentences that VocaliD has developed to capture all the phonemes in the English language.
Joe agreed to try. He recorded several hundred sentences and then, realising the magnitude of the task, stopped for several days. “This was my last week of freedom and I had a lot of stuff to do, people to see, life to live (and steaks to eat),” he wrote to me. Two days before the surgery, he started again. Banking his voice was slow and painful – by then, it was excruciating to talk, and he was trying to be at his most eloquent. On the final day, he recorded late into the night.
The next morning, Joe went back to hospital and had his tongue cut out, joining the ranks of those who cannot, in any traditional sense, speak.
There are surprisingly many ways for the power of speech to fail. There are disorders such as stuttering or apraxia, in which syllables are scrambled; motor neurone disease and cerebral palsy, which rob people of the muscle control required to articulate; traumatic brain injury; stroke; anatomical excisions like Joe’s; multiple sclerosis; autism. In the US, more than 2 million people require digital “augmentative and alternative communication” (AAC) methods to help compensate for speech deficits. A 2008 study by the disability charity Scope estimated that 1% of people in Britain use or need AAC.
Modern augmentative and alternative communication often involves the type of device made famous by Stephen Hawking – a small computer or tablet that plays aloud words typed into it. Before the invention of the first modern text-to-speech communication device in 1969, people with muscular or vocal disorders had to use “sip-and-puff” typewriters, which were operated by inhaling and exhaling through a straw. By 1986, when Hawking began using a speech device, AAC technology had improved significantly. The program he used, known as the Equalizer, allowed him to press a switch to select words or phrases on a desktop computer and, later, a smaller computer mounted to his wheelchair.
The 2014 biopic of Hawking’s life, The Theory of Everything, contains a stark reminder of the loss that this technology tries to amend. When Hawking and his first wife, Jane, first hear what will be Hawking’s new voice, they are stunned. After a moment of speechlessness, Jane offers a timid objection: “It’s American.” The moment is played for laughs, but it marks a trauma. Our voices are coded with information by which others know us – age, gender, nationality, hometown, personality, mood – but they are also coded with the information by which we know ourselves. When your voice is no longer English, what part of your Englishness do you lose?
Hawking’s case is one of the most striking examples of the way a person’s voice shapes their identity. Though the robotic quality of his digital voice (and the American accent) felt inappropriate at first, it came to be his trademark. Hawking reshaped himself around his new voice, and years later, when he was offered the opportunity to use a new voice that was smoother, more human-sounding, and English, he refused. This felt like “him” now.
The “Stephen Hawking voice” doesn’t belong only to Hawking. In the years since it was created, the same voice has also been used by little girls, old men, and people of every racial and ethnic background. This is one of the stranger features of the world of people who rely on AAC: millions of them share a limited number of voices. While there is more variety now than before, only a few dozen options are widely available, and most of them are adult and male.
“Walk into a classroom of children with voice disorders and you’ll hear the exact same voice all around you,” Rupal Patel of VocaliD told me. Ten years ago, she was at a speech disorders conference when she came upon a little girl and a man in his mid-50s who were using their devices to have a conversation. They were speaking in the same adult, male voice. Patel was horrified. “This is just continuing to dehumanise people who already don’t have a voice to talk,” she told me.
The film critic Roger Ebert, whose jaw was removed to treat cancer, wrote in 2009 about how frustrating it was to use one of these generic voices: “I sound like Robby the Robot. Eloquence and intonation are impossible.” He was tired of being ignored in conversations or coming across like “the village idiot”. He went on: “We put men on the moon, people like to say about such desires: Why can’t I have a voice of my own?”
This is the problem Patel has set out to solve. In 2007, she began researching technology that would allow her to make customised digital voices that sounded more like the humans they would represent. By 2014, the technology was sufficiently developed for Patel and her team to set up what they claim is the world’s first “voice bank”, an online platform where anyone with an internet connection can “donate” their voice by recording themselves reading aloud on to the VocaliD Voicebank, which is programmed with stories crafted to capture all the phonemes in the English language. (Early donors were required to upload 3,487 sentences; now, Geoff Meltzner, VocaliD’s vice-president of research, can create a voice with as few as 1,000 sentences, though more material makes for a more human-sounding voice.)
Each donation is catalogued in a library of voices that VocaliD can then use when crafting a new voice for a client. The company offers clients “BeSpoke” voices – custom-made voices that combine the sound of a client’s own voice with the vocabulary supplied by a donor. This way, a teenager could use his brother’s donated voice, or a perfect stranger’s from the Voicebank, whichever is the closest approximation to their imagined vocal quality. (Clients like Joe bank their voices for a purpose VocaliD calls “voice legacy”: they record themselves for later, and then, when the time comes, are given back a digital file of their own voice.)
Creating a new digital voice like this requires splitting two elements of the human voice that normally function as one: the source and the filter. “Source” is the term for the vocal cords, larynx and throat muscles – the part of the body responsible for making sound when we laugh, yell or talk. As Geoff Meltzner, VocaliD’s vice-president of research, explained, your source is like a fingerprint. “There’s enough identity in each source alone to make it unique among all other sources.” The voice’s “filter” is the muscles (tongue, lips, pharynx etc) that shape those sounds into discrete, discernible words.
VocaliD’s technology works by capturing a few seconds of vowel sound (the source) from the recipient, and applying it to the filter provided by a donor. The combination allows the production of a voice that’s largely “of” the recipient. By tweaking his algorithms, Meltzner can offer a voice that is “warmer” (more nasal) or more “authoritative” (lower pitch) or “brighter” (full of high overtones).
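For readers who want to see the source-and-filter split in concrete terms, here is a minimal sketch of the textbook source-filter model of speech. It is not VocaliD's algorithm, which the company has not described in this detail. It assumes the recipient's "source" can be reduced to a single pitch and the donor's "filter" to a handful of resonances (formants); the function names, the sample rate and the formant values are all illustrative assumptions.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)


def pulse_train(pitch_hz, duration_s):
    """Source: one impulse per vocal-cord cycle, silence in between."""
    n = int(SAMPLE_RATE * duration_s)
    period = round(SAMPLE_RATE / pitch_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, centre_hz, bandwidth_hz):
    """Filter: a two-pole resonator that emphasises one formant."""
    r = math.exp(-math.pi * bandwidth_hz / SAMPLE_RATE)  # pole radius, below 1
    a1 = 2 * r * math.cos(2 * math.pi * centre_hz / SAMPLE_RATE)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def blend_voice(recipient_pitch_hz, donor_formants, duration_s=0.5):
    """Pass the recipient's source through each of the donor's formants."""
    signal = pulse_train(recipient_pitch_hz, duration_s)
    for centre_hz, bandwidth_hz in donor_formants:
        signal = resonator(signal, centre_hz, bandwidth_hz)
    return signal


# Hypothetical values: a 200 Hz recipient pitch shaped by rough "ah" formants.
DONOR_AH = [(700, 110), (1220, 120), (2600, 160)]
samples = blend_voice(200, DONOR_AH)
```

Changing the pitch while keeping the formants fixed alters who the voice seems to belong to without changing the vowel being spoken, which is the separation the two terms describe.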
When a new voice is completed, it is added as a plug-in to whatever speech device its owner already uses. Recently, VocaliD added a feature on their own app that allows clients to adjust their voice to get exactly the timbre and volume they want. The system is designed to be convenient, but it still occasionally falls victim to glitches. One time, a teenage client called Patel in a panic because she had updated the software on her iPhone and lost her voice.
Donating your voice – unlike, say, a kidney – usually takes a few days, and you’re awake for all of it. There’s no screening process and no equipment involved except a laptop and an internet connection. One lazy day last winter, I decided to donate my voice from bed, which is how I found myself pitched forward with my laptop, mouth to the built-in microphone, insisting: “That tiramisu is to die for! That tiramisu is to die for!”
VocaliD’s Human Voicebank runs on an internet browser and is set up to look slightly like a video game: an indigo-blue backdrop frames a jaunty cartoon mouth with arms and legs standing next to a line of text, which you read aloud. Once you’re satisfied with your delivery, you click to upload the sentence to the bank, and a new line emerges. A bar at the bottom of the screen tracks your progress.
There are so many sentences to read that people usually spread their donation out over days and weeks, doing just a few hours at a time. In an attempt to make the long exercise entertaining, VocaliD gives options to read material according to your interests: poetry, say, or science fiction. The sentences I read ranged from the proverbial (“It ain’t over till the fat lady sings!”) to the banal (“Did you see it on Twitter?”) to the sobering (“This is an emergency. Get help now”). Some felt too private. Donating any part of your body is intimate. It strikes at the core of something we appreciate about ourselves: we are each single-editions. The voice is perhaps a uniquely personal gift. It’s both physical and metaphysical. It’s the emissary between our corporeal selves and the rest of the world.
When “I love you” came up, I re-recorded it over and over in a slight panic. (Other people, I’ve since learned, come upon this line and burst into tears.) What kind of I-love-you was this? Should I be speaking as if to a lover, a parent, or a pet? Was this I-love-you an assurance of the strength of the feeling (“I love you”) or its target (“I love you!”)? Should it have the butterflies tone of a first I-love-you, like a shy declaration, or of warm confirmation, like a mother saying goodnight to her child? Now sweating slightly, I recorded the phrase in what I hoped was a warm, neutral tone, deemed it not too stiff on playback, squeezed my eyes shut, and clicked. It was my last donation of the day.
Shortly after making my donation, I visited Rupal Patel at VocaliD’s offices in the western suburbs of Boston. Patel is slight and energetic, with bright eyes, a pert, chin-length bob, and brilliant enunciation. She is ardent about how miraculous a personalised voice can be for someone who has been unvoiced. When disabled people have communication impairments, she explained, it increases the likelihood that they’ll be removed from the workforce, isolated socially, mistakenly identified as cognitively impaired, or rendered invisible.
Humans respond with special attention and empathy to other human voices, and unconsciously equate the ability to speak with presence of mind. In 2010, medical anthropologist Mary Wickenden wrote a thesis on teenage AAC users titled Teenage Worlds, Different Voices, in which she pointed out: “If you cannot talk, it may be hard to prove that you [think] … language expressed out loud makes our subjectivity ‘more real’.”
Those who cannot speak are constantly reminded of their “unreality” in the eyes of society. Of the seven voices VocaliD completed in its first year, six were for children or teenagers with cerebral palsy, many of whom complained that strangers tend to either ignore them entirely, directing any questions or conversation to their parents, or speak to them as though they were toddlers.
Type-to-talk technology varies widely based on the needs of the individual user: people who have muscular command of their fingers can simply type on a traditional keyboard and hear the words echoed out over a speaker. More common is a version in which the user scrolls through and chooses from a selection of words, phrases or symbols on screen, using a lever or switch positioned closest to the limb they have best control over. For those who can’t use levers, there are screens that track eye movement and are programmed to read aloud a phrase or symbol when the user has stared at it long enough.
Even for people who become fluent users of type-to-talk technology, the devices can be frustrating. Often, you have to wait for a cursor to pass over more than a dozen letters or symbols before it arrives at the one you need – and if you miss, you have to wait for the cursor to cycle all the way through again. Until recently, many devices didn’t even come with words or symbols for female genitalia, leaving no easy way to talk candidly with a friend or partner about sex, or to alert a caregiver about, say, a urinary tract infection, or abuse.
Preprogrammed voices are often inappropriate to their user’s age, or frustratingly robotic. Patel told me about one of her clients, a teenager named Sara Young, for whom they were building a new voice. At the time, Sara was using the same voice (“Heather”) as her mother’s GPS and some bank ATMs. Several girls with speech disabilities in Sara’s class at school use Heather, which means that in group settings it’s nearly impossible to distinguish who is saying what unless you’re looking closely. Like lots of her peers, Sara often played around with the different preprogrammed voices in her device – trying on different options for a day or two, or speaking in an adult male voice to give herself a laugh – but still she was frustrated. When I visited the office, Patel and Meltzner were putting the finishing touches on Sara’s BeSpoke voice, which they were constructing using a few “ahhh” sounds that Sara had recorded, and a donated voice. They were hoping for it to be ready by Christmas.
On my second day with Patel, I accompanied her to a technology fair at the Cotting School in Lexington, Massachusetts, a private school for special-needs students, several of whom are VocaliD clients. The company often does outreach at schools, both to offer their products to children using AAC, and to recruit new donors – they are always short on young donor voices. The fair was full of parents and children with cerebral palsy, including Sara. Like many kids with cerebral palsy, Sara is exceptionally small for her age, because eating requires muscular control she doesn’t always have. Her hair is dark and wavy and dyed with teal streaks, and when we met, she was wearing a light pink, long-sleeved shirt, the bag affixed to her motorised wheelchair was pink, and the foot she uses to drive (her only limb with reliable fine motor control) was shod in a pink sneaker.
As is often the case for people with movement and muscle disorders, Sara’s body moves in spasms. Her tongue flexes in and out of her mouth, and her neck twists from side to side. Her arms curl and unfurl like leaves. She can’t eat or shower or use the bathroom without help. She uses silicone straws to drink, because she bites uncontrollably when she sucks, which destroys regular straws. (Before her parents figured out about the silicone, they cut short lengths of fish-tank tubing for her.) She uses her left foot to do her homework assignments on an iPad and – with the help of duct tape and markers – to draw. When she speaks, it’s through an AAC device mounted on her wheelchair, which senses her eye movements as a substitute for typing.
Her bodily presence, which seemed at first glance childlike, belied her personality, which was classic teen. She kept her wheelchair still, occasionally shuffling back and forth a bit listlessly, like someone rocking back and forth on her heels. When she got bored, she made a little loop. She has a blue-and-pink nose ring, and she hates having to carry an outmoded cell phone. (“BLACKBERRY SUCKS,” she told me.) She has ferocious eyebrows and sharp, dark, funny eyes, which she rolls frequently.
Because she is a strong communicator, Sara has become an occasional ambassador for the other kids in her AAC community. At this technology fair, Sara and her mother, Amy Young, took the stage to give the keynote address. Sara spoke first, giving a few sentences of introduction she had written in advance on her device. Her voice did indeed sound just like an ATM. “HELLO EVERYONE, MY NAME IS SARA. I AM 16. WHEN I DON’T HAVE A [DEVICE], PEOPLE BABY TALK TO ME OR JUST TALK TO MY MOM. SOMETIMES I AM SLOW TO SPEAK AND SO THEY JUST TALK OVER ME. THEY DON’T KNOW HOW TO WAIT FOR MY ANSWER.”
The truth of this became clear later in the day, when Amy and Sara conducted a joint Q&A. When Sara was asked what she uses her iPad for, she began staring at her screen with particular focus, twisting her head to keep her gaze levelled at the device regardless of her neck spasms. Thirty seconds passed, then 60. Everyone sat in silence, looking at her. Ninety seconds later, the computer spoke in a fluid stream: “HWFACEBOOKIGSNAPCHATMUSIC.”
Amy translated. “Homework, Facebook, Instagram, Snapchat and music.”
In the hour-long Q&A, Sara spoke fewer than 30 words. As is their custom, Amy did the majority of the speaking, partly for the sake of time and partly because Sara routinely relies on her mother to understand and translate her non-verbal cues. “It’s a tremendous amount of energy for her,” Amy explained to me later. “And while we encourage people to talk to her directly, sometimes she’ll reply by looking at me, like, ‘Can you just answer that?’”
Sara has a canny sense of humour, but her speaking style and pace lend themselves more to well-placed interjection. In the middle of Amy’s careful explanation of why Apple’s bluetooth systems are incompatible with the motorised wheelchair, Sara cut in to phrase things more succinctly: “IDIOTS.” She peppered her mother’s sentences with brief exclamations in her own voice: “Yeah!” When she was out of the spotlight and speaking to people who knew how to adapt to her ways of speaking, conversation flowed more naturally. After the Q&A, scrolling through Instagram with a young aide at her high school, Sara let out a series of amused hoots. The aide shook her head in mock disapproval at the screen and said conspiratorially: “Your class is crazy.”
Sara laughed. “YOU DON’T EVEN KNOW.”
The disconnect between the spirit of Sara’s words and the robotic deadpan of their delivery was jarring. “The digital voice gets kind of lost,” said Amy. “When we heard about VocaliD, we thought: ‘How cool it would be to create something a little more natural.’ Sara hasn’t had the experience of having her voice change as she ages, and so that would be neat, too. If the voice is that much more natural, I’m hoping it won’t get lost as much.”
When I told Patel about this conversation, her eyes sparkled. “I really want people to be able not just to hear Sara, but in hearing her, to see her and experience her. When she belts out that ‘Yeah!’ or that ‘No!’ or whatever she says with her natural voice and then transitions to her device, boy wouldn’t it be nice if those two types of communication felt fluid. In an ideal world, she’d never have to use that thing. She’d be wearing a pair of glasses and an oculus and she’d have all these messages that she’d typed out. It wouldn’t be stigmatising. She wouldn’t have to be seen as an ‘other’ communicator. That’s the wave of the future.”
For Joe, the transition from being able-bodied and seamlessly verbal to someone who seems physically – and, to the careless observer, mentally – disabled has been astonishing, and deeply painful. When he woke up from the surgery, it was the first time he was really, truly speechless. The doctors had excised most of his tongue – “and you have to remember that a large portion of your tongue you can’t see, because it’s in your throat,” Joe reminded me – and then taken a long strip of his quadriceps and attached the muscle to the flap that was left in his throat. They hoped that he would eventually gain enough control of the new muscle to swallow and, in time, form words.
For the first week and a half, the tracheotomy tube diverted all air from his windpipe out through his neck, so that even if he tried to speak, no sound would come. “I felt completely trapped, a prisoner in my own body,” he told me via email. He could write things down to let the medical staff know if he was hungry or in pain. “But in terms of meaningful communication, you’re locked out completely.” His friends came to visit, and for the first time, he couldn’t join in the conversations, couldn’t interject with thoughts and jokes. He sat there, silent. “I love debating and arguing and being heard,” he told me. “And joking – that was really hard. You can’t have much of a wit when you need to write everything down. You miss the moment.”
This is one thing you lose when you’re taken outside the flow of conversation. The other, as Joe would discover, is the privilege of being included as an equal. “People treat you differently,” Joe wrote. “They don’t mean to, but they patronise you, treat you like a child.”
In the months that have passed since his glossectomy, Joe has made slow and steady progress with his physical therapy. The natural timbre of his voice is lower than it was before the surgery, though it may rise as the swelling continues to go down. “I fear I’ve said my last ‘S’ sound,” he typed to me over the summer. Ls and Js are also difficult, which frustrates him, because he struggles to say his own name, and that of his wife, Louisa.
When I spoke to him in late November, he cheerfully reported that his Ss were nearly back, if a little bit reminiscent of Sean Connery. He prefers to get along with his occasionally muddled natural speech, but he has found his digital voice helpful as a reference in his speech and language sessions. Recently, he started a new job at an ad agency and used his VocaliD version to show his new colleagues his “old voice”.
Joe may not end up using an AAC device every day, but he insists that it was important to him that, no matter what, his voice exists somewhere. “My wife is a Harry Potter fan, so I joke that this is my Horcrux,” he said, referring to the object in which a wizard can hide a part of his soul, and so attain a kind of immortality. He saw it as an act of self-preservation. “I was worried that as I grew older and this became more and more a past event, I might start to forget the sound of my own voice.”
People often use the Voicebank this way, Patel says. Early on, she noticed that a disproportionate number of the people who “banked” their voices were transgender people in some stage of transition, often at the beginning, before starting hormone-replacement therapy. For them, as for Joe, the bank might serve as a vault for safeguarding an old self. The record is there to find, just in case: this is who I was.
For others, the digital voice is not a remnant of who they were, but a promise of who they will be. Sara Young got her new voice just before Christmas, in the VocaliD offices. Patel and Meltzner shifted nervously on their feet as they stood before Sara and Amy Young, making smalltalk and cueing up the two voices Meltzner had designed for her to choose between. He played the first, using a sentence he had pre-programmed. “HI, MY NAME IS SARA. I’M 16 YEARS OLD, AND I’M AWESOME.”
It sounded tinny and halting, like “Heather’s” younger sister, but with a trace of something idiosyncratic and human at the base of the sound.
Sara laughed delightedly. “OK,” Patel said. “Now we’ll hear the second one.”
“HI, MY NAME IS SARA. I’M 16 YEARS OLD, AND I’M AWESOME.” The second voice sounded clearer, more bell-like. It seemed older than the first in its assurance, but younger in its vitality.
“OK, which do you like?” Patel asked Sara.
After a long pause, Sara asked for the second.
“Oh, phew!” Patel laughed. “That was our favourite, too. What do you like about number two?”
After a long pause, Sara said: “IT’S SPUNKY.”
They downloaded it on to her device. Patel pointed out to me later that the moment when a person’s new voice is turned on can be anti-climactic, because they’re not quite sure how to respond to it. The really interesting stuff, she says, comes in the days and weeks after, when the client notices how they’re treated differently, or in watching how they psychologically internalise the experience of having a voice that sounds like them.
As the voice was loading, Patel asked Amy how she felt. “Great – as long as Sara feels great!” She paused. “It’ll take some getting used to. It was 12 years with the other voice, with Heather. It’s foreign in some way, like when my son’s voice changed.”
When Sara’s voice had finished its transfer and was finally in her control, the team gathered to hear what her first words in the new voice would be.
“THANK – THANK YOU FOR ALL YOUR WORK,” she said. “I KNEW YOU COULD DO IT.”
Patel laughed. “Thank you for giving us this chance!” The adults stood around aimlessly for a second, looking at Sara. “Do you want to say anything else?”
She thought for a moment, and then stared fixedly at her screen.
Illustrations by Bratislav Milenković