Podcast: How AI is giving a woman back her voice


Voice technology is one of the biggest trends in the healthcare space. We look at how it might help care providers and patients, from a woman who’s losing her speech, to documenting healthcare records for doctors. But how do you teach AI to learn to communicate more like a human, and will it lead to more efficient machines?

We meet: 

  • Kenneth Harper, VP & GM, Healthcare Virtual Assistants and Ambient Clinical Intelligence at Nuance
  • Bob MacDonald, Technical Program Manager, Project Euphonia, Google
  • Julie Cattiau, Project Manager, Project Euphonia, Google
  • Andrea Peet, Project Euphonia user
  • David Peet, Attorney, husband of Andrea Peet
  • Hod Lipson, Professor of Innovation in the Department of Mechanical Engineering; Co-Director, Maker Space Facility, Columbia University.


  • The Exam of the Future Has Arrived – via YouTube


This episode was reported and produced by Anthony Green with help from Jennifer Strong and Emma Cillekens. It was edited by Michael Reilly. Our mix engineer is Garret Lang and our theme music is by Jacob Gorski.

Full transcript:


Jennifer: Healthcare looks a little different than it did not so long ago… when your doctor likely wrote down details about your condition on a piece of paper…

The explosion of health tech has taken us all sorts of places… digitized records, telehealth, AI that can read x-rays and other scans better than people, and just medical advancements that would have seemed like science fiction until quite recently.

We’re at a point where it’s safe to say healthcare is Silicon Valley’s next battleground… with all the biggest names in tech jockeying for position.

And squarely positioned among the biggest trends in this space… is voice technology… and how it might help care providers and patients.

Like a woman rapidly losing her speech to communicate with smart devices in her home.

Andrea Peet: My smartphone can understand me.

Jennifer: Or… a doctor who wants to focus on patients, and let technology do the record keeping.

Clinician: Hey Dragon, start my standard order set for arthritis pain.

Jennifer: Voice could also change how AI systems learn… by replacing the 1’s and 0’s in training data with an approach that more closely mirrors how children are taught.

Hod Lipson: We humans, we don’t think in words. We think in sounds. It’s a somewhat controversial idea, but I have a hunch, and there’s no data for this, that early humans communicated with sounds way before they communicated with words.

Jennifer: I’m Jennifer Strong and, this episode, we explore how AI voice technology can make us feel more human… and how teaching AI to learn to communicate a little more like a human might lead to more efficient machines.


OC:…you have reached your destination.

Ken Harper: In healthcare specifically, there’s been a major problem over the last decade as they’ve adopted the electronic health systems. Everything’s been digitized, but it has come with a cost in that you’re spending lots and lots of time actually documenting care.

Ken Harper: So, I’m Ken Harper. I’m the general manager of the Dragon Ambient Experience, or DAX as we like to refer to it. And what DAX is, it’s an ambient capability where we will listen to a provider and patient having a natural conversation with one another. And based on that natural conversation, we will convert that into a high quality clinical note on behalf of the physician.

Jennifer: DAX is A-I powered… and it was designed by Nuance, a voice recognition company owned by Microsoft. Nuance is one of the world’s leading players in the field of natural language processing. Its technology is the backbone of Apple’s voice assistant, Siri. Microsoft paid nearly 20-billion dollars for Nuance earlier this year, mainly for its healthcare tech. It was the most expensive acquisition in Microsoft’s history… after LinkedIn.

Ken Harper: We’ve, probably, all experienced a scenario where we go see our primary care provider or maybe a specialist for some issue that we’re having. And instead of the provider looking at us during the encounter, they’re on their computer typing away. And what they’re doing is they’re actually creating the clinical note of why you’re in that day. What’s their diagnosis? What’s their assessment? And it creates an impersonal experience where you don’t feel as connected. You don’t feel as though the provider is actually focusing on us.

Jennifer: The goal is to pass this administrative work off to a machine. His system records everything that’s being said, transcribes it, and tags it based on individual speakers.

Ken Harper: And then we take it a step further. So this is not just speech recognition. You know, this is actually natural language understanding where we will take the context of what’s in that transcription, that context of what was discussed, our knowledge of what’s medically relevant, and also what’s not medically relevant. And we will write a clinical note based on some of those key inputs that were in the recording.

Jennifer: Under the hood, DAX uses deep learning… which is heavily dependent on data. The system is trained on a variety of different interactions between patients and physicians… and their medical specialties.

Ken Harper: So the macro view is how you get an AI model that understands by specialty, generally, what needs to be documented. But then on top of that, there’s a lot of adaptation on the micro view, which is at the user level, which is looking at an individual provider. And as that provider uses DAX for more and more of their encounters, DAX will get that much more accurate on how to document accurately and comprehensively for that individual provider.

Jennifer: And it does the processing… in real time.

Ken Harper: So if we know that a heart murmur is being discussed, and here’s the information about the patient on their history, this could enable a lot of systems to provide decision support or evidence-based support back to the care team on something that maybe they should consider doing from a treatment perspective or maybe something else they should be asking about and doing triage on. The long-term potential is you understand context. You understand the signal of what’s actually being discussed. And the amount of innovation that can happen, once that input is known, it’s never been done before in healthcare. Everything in healthcare has always been retrospective, or you put something into an electronic health record and then some alert goes off. If we could actually bring that intelligence into the conversation where we know something needs to be flagged or something needs to be discussed, or there’s a suggestion that needs to be surfaced to the provider. That’s just going to open up a whole new set of capabilities for care teams.

Julie Cattiau: Unfortunately these voice enabled technologies don’t always work well today for people who have speech impairments. So that’s the gap that we were really interested in filling and addressing. And so what we believe is that making voice enabled assistive technology more accessible can help people who have these kinds of conditions be more independent in their daily lives.

Julie Cattiau: Hi, my name is Julie Cattiau. I’m a product manager in Google Research. And for the past three years, I’ve been working on Project Euphonia, whose goal is to make speech recognition work better for people who have speech disabilities.

Julie Cattiau: So the way that the technology works is that we’re personalizing the speech recognition models for individuals who have speech impairments. So in order for our technology to work, we need individuals who have trouble being understood by others to record a certain number of phrases. And then we use these speech samples as examples to train our machine learning model to better understand the way they speak.

Jennifer: The project started in 2018, when Google began working with a non-profit searching for a cure for ALS. It’s a progressive nervous system disease that affects nerve cells in the brain and the spinal cord… often leading to speech impediments.

Julie Cattiau: One of their projects is to record a lot of data from people who have ALS in order to study the disease. And as part of this program, they were actually recording speech samples from people who have ALS to see how the disease impacts their speech over time. So Google had a collaboration with ALS TDI to see if we could use machine learning to detect ALS early. But some of our research scientists at Google, when they listened to those speech samples, asked themselves the question: could we do more with these recordings? And instead of just trying to detect whether someone has ALS, could we also help them communicate more easily by automatically transcribing what they’re saying? We started this work from scratch and since 2019, about a thousand different people, individuals with speech impairments, have recorded over a million utterances for this research initiative.

Andrea Peet: My name is Andrea Peet and I was diagnosed with ALS in 2014. I run a non-profit.

David Peet: And my name is David Peet. I’m Andrea’s husband. I’m an attorney for my day job, but my passion is helping Andrea run the foundation, the Team Drea Foundation, to end ALS through innovative research.

Jennifer: Andrea Peet started to notice something was off in 2014… when she kept tripping over her own toes during a triathlon.

Andrea Peet: So I started going to neurologists and it took about eight months. But I was diagnosed with ALS, which typically has a lifespan of two to five years, and so I’m doing amazingly well, that I’m still alive and, talking and walking, with a walker, seven years later.

David Peet: Yeah, I second everything you said about really just feeling lucky. Um, that’s probably the best, the best word for it. When we got the diagnosis and I’d started doing research, that two to five years was really the average, we knew from that diagnosis date in 2014, we would be lucky to have anything after May 29th, 2019. And so to be here and to still see Andrea competing in marathons and out in the world and participating in podcasts like this one, it’s a real blessing.

Jennifer: One of the major challenges of this disease… it impacts people in very different ways. Some lose motor control of their hands and can’t lift their arms, but would still be able to give a speech. Others can still move their limbs but have difficulty speaking or swallowing… as is the case here.

Andrea Peet: People can understand me most of the time. But when I am tired or when I am in a loud place, it is harder for me to uh, um…

David Peet: It’s harder for you to pronounce, is it?

Andrea Peet: To project…

David Peet: Ahh, to pronounce and project words.

Andrea Peet: So Project Euphonia, basically, live captions what I’m saying on my phone so people can read along with what I am saying. And it’s really helpful when I am giving presentations.

David Peet: Yeah, it’s really helpful when you’re giving a presentation or when you are out speaking publicly to have a platform that captures in real time the words that Andrea is saying so that she can project them out to those who are listening. And then the other huge help for us is that Euphonia syncs up what’s being captioned to our Google Home, right? And so having a smart home that can understand Andrea and then allow her different functionality at home really gives her more freedom and autonomy than she otherwise would have. She can turn the lights on, turn the lights off. She can open the front door for someone who’s there. So, being able to have a technology that allows them to function using only their voice is really essential to allowing them to feel human, right? Continue to feel like a person and not like a patient that needs to be waited on 24 hours a day.

Bob MacDonald: I didn’t come into this with a professional speech or language background. I actually became involved because I heard that this team was working on technologies that were inspired by people with ALS, and my sister’s husband had passed away from ALS. And so I knew how profoundly helpful it would be if we could make tools that would help ease communication.

Jennifer: Bob MacDonald also works at Google. He’s a technical program manager on Project Euphonia.

Bob MacDonald: A big focus of our effort has been improving speech recognition models by personalizing them. Partly because that’s what our early research has found gives you the best accuracy boost. And you know, it’s not surprising that if you use speech samples from just one person, you can kind of fine tune the system to understand that one person a lot better. For someone who doesn’t sound exactly like them, the improvements tend to get washed out. But then as you think about, well, even for one person, if their voice is changing over time, because the disease is progressing or they’re aging, or there’s some other issue that’s going on. Maybe even they’re wearing a mask or there’s some temporary factor that’s modulating their voice, then that will definitely degrade the accuracy. The open question is how robust are these models to those kinds of changes. And that’s very much one of the other frontiers of our research that we’re pursuing right now.

Jennifer: Speech recognition systems are largely trained on western, English-speaking voices. So it’s not just people with medical conditions who have a hard time being understood by this tech… it’s also challenging for those with accents and dialects.

Bob MacDonald: So the challenge really is going to be how do we make sure that that gap in performance doesn’t stay wide or get wider as we span larger population segments and really try to maintain a useful level of performance. And that all gets even harder as we move away from the primary languages that are used in products that most commonly have these speech recognizers embedded. So as you move to countries or parts of countries where languages have fewer speakers, the data becomes even harder to come by. And so it’s going to require just a bigger push to make sure that we maintain that kind of a reasonable level of equity.

Jennifer: Even if we’re able to solve the speech diversity problem, there’s still the issue of the massive amounts of training data needed to build reliable, universal systems.

But what if there was another way… one that takes a page from how humans learn?

That’s after the break.


Hod Lipson: Hi. My name is Hod Lipson. I’m a roboticist. I’m a professor of engineering and data science at Columbia University in New York. And I study robots, how to build them, how to program them, how to make them smarter.

Hod Lipson: Traditionally, if you look at how AI is trained, we give very concise labels to things and then we train an AI to predict one for a cat, two for a dog. That’s how all the deep learning networks today are being trained, with these very, very compact labels.

Hod Lipson: Now, if you look at the way humans learn, it looks very different. When I show my child pictures of dogs, or I show them our dog or a dog, other people’s dogs walking outside, I don’t just give them one bit of information. I actually enunciate the word “dog.” I might even say dog in different tones and I might do all sorts of things. So I give them a lot of information when I label the dog. And that got me to think that maybe we are teaching computers in the wrong way. So we said, okay, let’s do this crazy experiment where we are going to train computers to recognize cats and dogs and other things, but we’re going to label it not with the one and the zero, but with a whole audio file. In other words, the computer needs to be able to say, articulate, the word “dog”. The whole audio file. Every time it sees a dog. It’s not enough for it to say, you know, thumbs up for dog, thumbs down for cat. You actually have to articulate the whole thing.
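A toy sketch may help picture the difference between the two kinds of label Lipson describes. It is an illustration under loud assumptions, not his team's code: the four-number "clips" stand in for whole audio files, and the names `AUDIO_LABELS`, `COMPACT_LABELS`, `decode`, and `imperfect_output` are invented for this example. The only point it makes is that a rich target can be read back by asking which reference sound a prediction is closest to, rather than by checking a single number.

```python
import math

# Conventional compact labels: a single number per class.
COMPACT_LABELS = {"dog": 1, "cat": 0}

# Hypothetical stand-ins for spoken-word labels. In the experiment described
# above the target is a whole audio file; here each "clip" is just a short
# list of made-up numbers so the example stays self-contained.
AUDIO_LABELS = {
    "dog": [0.9, 0.1, 0.4, 0.7],
    "cat": [0.2, 0.8, 0.6, 0.1],
}


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def decode(predicted_clip):
    """Return the class whose reference clip is closest to the prediction."""
    return min(
        AUDIO_LABELS,
        key=lambda name: distance(predicted_clip, AUDIO_LABELS[name]),
    )


# A model trained against audio targets would emit something clip-shaped.
# This slightly-off output is invented purely for illustration.
imperfect_output = [0.8, 0.3, 0.5, 0.6]
print(decode(imperfect_output))
```

With a compact label a prediction is simply right or wrong, while here a prediction that is somewhat off can still land nearest the intended word.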

Jennifer: To the surprise of him and his team… it worked. It identified images… just as well as using ones and zeros.

Hod Lipson: But then we noticed something very, very interesting. We noticed that it could learn the same thing with a lot less information. In other words, it could get the same amount, quality of result, but it’s seeing about a tenth of the data. And that is in itself very, very valuable, but we also noticed something that’s potentially even more interesting, which is that when it learned to distinguish between a cat and a dog, it learned it in a much more resilient way. In other words, it was not as easily fooled by, you know, tweaking a pixel here and there and making the dog look a little bit more like a cat and so on. To me it feels like, you know, there’s something here. It means that maybe we’ve been training neural networks the wrong way. Maybe we’ve been stuck in 1970s thinking where we are, you know, stingy about data. We’ve moved forward incredibly fast when it comes to the data we use to train the system, but when it comes to the labels, we’re still thinking like the 1970s, with the ones and zeros. So that may be something that can change the way we think about how AI is trained.

Jennifer: He sees the potential for helping systems gain efficiency, train with less data or just be more resilient. But he also believes this could lead to AI systems that are more individualized.

Hod Lipson: Maybe it’s easier to go from an image to audio than it is with a bit. A bit, it’s kind of unforgiving. It’s either right or wrong. Whereas an audio file, there’s so many ways to say dog, then maybe it’s more forgiving. So a lot of speculation about why that is, things that are easier. Maybe they’re easier to learn. Maybe, this is a really interesting hypothesis, maybe the way we say dog and cat is actually not a coincidence. Maybe we have chosen it evolutionarily. We could’ve called, you know, we could have called a cat, you know, a smog instead of a dog. Okay. A cat. It would be too close to a dog and it would be confusing and nobody. It would take kids longer to tell the difference between a cat and a dog. So we humans have evolved to choose language and enunciations that are easy to learn and are appropriate, and so maybe that touches also on kind of the history of language.

Jennifer: And he says, the next stage of development?… could be allowing AI to produce its own language in response to the images it’s shown.

Hod Lipson: We humans choose particular sounds partly because of our physiology and the kind of frequencies we can emit and all sorts of physical constraints. But if the AI can produce sounds in other ways, maybe it can produce its own language that is both easier for it to communicate and think, but also maybe it’s easier for it to learn. So, if we show it a cat and a dog and then it’s going to see a giraffe that it never saw before, I want it to come up with a name. And there’s a reason for that maybe because, you know, it’s based on how it looks in relation to a cat and a dog, and we’ll see where it goes from there. So if it learns with less data and if it’s more resilient and if it can make analogies more efficiently and, you know, see if it’s just a happy coincidence or if there’s really something deep here. And that’s, I think, the kind of question that we need to answer next.


Jennifer: This episode was reported and produced by Anthony Green with help from me and Emma Cillekens. It was edited by Michael Reilly. Our mix engineer is Garret Lang and our theme music is by Jacob Gorski.
Thanks for listening, I’m Jennifer Strong.

