WEBVTT - Speech Recognition 0:00:00.200 --> 0:00:07.240 Brought to you by Toyota. Let's go places. Welcome to 0:00:07.400 --> 0:00:17.520 Forward Thinking. Welcome everyone to Forward Thinking, the podcast where 0:00:17.520 --> 0:00:20.400 we think about the future. I'm Jonathan Strickland, I'm Lauren 0:00:20.440 --> 0:00:22.920 Vogelbaum, and I'm Joe McCormick. And today we want 0:00:23.000 --> 0:00:28.480 to talk about the evolution of speech recognition software and hardware, 0:00:28.800 --> 0:00:31.479 and to talk about how did we get to where 0:00:31.480 --> 0:00:34.200 we are and where are we going from here? Because 0:00:34.240 --> 0:00:37.840 clearly this is a thing. I mean, we're seeing speech 0:00:37.880 --> 0:00:44.600 recognition in lots of different devices, including computers and mobile devices. 0:00:44.960 --> 0:00:47.080 I know that my phone allows me to talk to 0:00:47.120 --> 0:00:50.800 it and ask questions, and occasionally I get the correct response. 0:00:51.479 --> 0:00:56.680 I also have programs that will create an automatic transcript 0:00:56.760 --> 0:01:01.680 of voicemails. So how did we get to this point? 0:01:01.680 --> 0:01:05.400 How did we get to a time where computers can, 0:01:05.640 --> 0:01:09.880 at least on the surface level, appear to understand speech? 0:01:10.520 --> 0:01:12.280 And to really understand this, we have to go back 0:01:12.319 --> 0:01:18.000 a ways, and by a ways, I mean seventeen seventy three. What? Yeah, 0:01:18.080 --> 0:01:21.840 all right. So seventeen seventy three, shortly before the Atari 0:01:21.840 --> 0:01:24.920 twenty six hundred came out, by a couple of centuries, 0:01:25.360 --> 0:01:30.120 there was a Russian scientist named Christian Kratzenstein, and he 0:01:30.319 --> 0:01:35.480 actually... Kratzenstein? Kratzenstein. Kratzenstein. That's a great name, it's an 0:01:35.480 --> 0:01:40.000 awesome name. Right. 
Well, during this time people were starting 0:01:40.040 --> 0:01:43.840 to get really interested in the nature of sound and 0:01:43.920 --> 0:01:49.400 ways of producing sound, and Kratzenstein actually created something very interesting. 0:01:49.480 --> 0:01:53.920 He created a machine that was capable of producing vowel 0:01:54.040 --> 0:01:59.840 like sounds using organ pipes and resonance tubes. Wow. So 0:02:00.760 --> 0:02:04.279 totally synthetic. Totally synthetic. And in this case we're talking 0:02:04.320 --> 0:02:08.360 about a machine producing sounds, not a machine taking in 0:02:08.520 --> 0:02:13.359 sounds and analyzing them. Exactly. This is a machine. But the 0:02:13.880 --> 0:02:17.919 history of speech recognition is also a history of designing 0:02:18.000 --> 0:02:21.360 machines that talk back to us. They don't just listen 0:02:21.400 --> 0:02:23.680 to what we have to say, but they can communicate 0:02:23.760 --> 0:02:26.560 back to us. So this is one of those earliest 0:02:27.520 --> 0:02:31.320 versions of that. And he wasn't alone. In seventeen ninety 0:02:31.360 --> 0:02:36.360 one, Wolfgang von Kempelen in Vienna, he built the 0:02:36.480 --> 0:02:41.480 Acoustic Mechanical Speech Machine. Oh, two for two. Yeah, yeah, 0:02:41.520 --> 0:02:47.200 and then let's just skip the entire nineteenth century. But 0:02:47.840 --> 0:02:50.400 wait a minute, actually, wasn't it... I think that 0:02:50.480 --> 0:02:54.079 Alexander Graham Bell, his wife was deaf, correct? And he 0:02:54.360 --> 0:02:57.240 originally, when he was starting to play around with sound, 0:02:57.560 --> 0:03:00.360 was trying to transform, trying to create the device 0:03:00.360 --> 0:03:03.280 that would transform audible words into a visual output that 0:03:03.360 --> 0:03:07.440 a deaf person could interpret. And I mean he wound 0:03:07.520 --> 0:03:10.960 up creating pictures with sound, but his wife never really 0:03:11.000 --> 0:03:15.000 managed to interpret them. 
However, that research started going into 0:03:15.000 --> 0:03:18.280 things like the telephone. Well, and also there were early 0:03:18.639 --> 0:03:22.160 attempts when the gramophone first became a thing, when they 0:03:22.200 --> 0:03:25.720 started to use wax cylinders to record sound in a 0:03:25.720 --> 0:03:29.200 physical medium. Was that Edison or somebody before him? 0:03:29.440 --> 0:03:32.000 Bell did some of this as well. So we're talking 0:03:32.080 --> 0:03:34.760 about... there are actually quite a few inventors who were 0:03:34.760 --> 0:03:37.160 working on this sort of technology. But there were people 0:03:37.200 --> 0:03:41.560 who had created these different devices to record sound on 0:03:41.600 --> 0:03:44.360 a physical medium, and there were already people thinking, well, 0:03:44.360 --> 0:03:46.040 if we can do this, is there some way where 0:03:46.080 --> 0:03:49.080 we can reverse the process, where we take the physical 0:03:49.120 --> 0:03:52.400 medium and make that into an input of some type? 0:03:52.440 --> 0:03:55.040 And some people were even thinking maybe we can make 0:03:55.320 --> 0:04:00.720 an automatic typewriter. Once the mechanical typewriter came into being, 0:04:00.720 --> 0:04:03.600 there were thoughts that if there's some way to make 0:04:03.800 --> 0:04:07.440 this same process, where we're recording sound onto a physical 0:04:07.440 --> 0:04:11.000 medium, then turn into a way of actually transmitting this 0:04:11.040 --> 0:04:14.360 into text, that would be amazing. No one quite figured 0:04:14.400 --> 0:04:16.480 it out at that point, but it got a lot 0:04:16.520 --> 0:04:19.800 of people thinking, and in fact, Bell Laboratories was one 0:04:19.839 --> 0:04:24.640 of the leading companies or leading research firms that was 0:04:24.720 --> 0:04:28.200 really concentrated on this speech recognition problem. And in the 0:04:28.279 --> 0:04:32.400 nineteen thirties there was a guy named Homer Dudley. 
I 0:04:32.400 --> 0:04:35.719 guess probably not quite as colorful as Kratzenstein or Wolfgang. 0:04:36.040 --> 0:04:39.360 It's okay, it's a more, uh, what would you say, 0:04:39.400 --> 0:04:44.120 it's a more Americana kind of name. Homer Dudley. Yeah, 0:04:44.120 --> 0:04:47.280 Homer Dudley. He was at Bell Labs. He proposed a 0:04:47.360 --> 0:04:51.520 system model for speech analysis and synthesis, and he also 0:04:51.600 --> 0:04:55.760 designed the Voice Operating Demonstrator, also known as the VODER, 0:04:56.279 --> 0:04:59.480 which was a speech synthesizer. And this was essentially building 0:04:59.480 --> 0:05:02.279 on that same work that the other guys had done 0:05:02.640 --> 0:05:07.560 centuries before, but in an electronic capacity as opposed to mechanical. 0:05:08.720 --> 0:05:11.480 Then we get into this era where the researchers were 0:05:11.480 --> 0:05:13.640 starting to try and figure out how to make machines 0:05:13.720 --> 0:05:17.520 actually understand speech, at least on a surface level. And 0:05:17.839 --> 0:05:22.920 the early emphasis was on phonetics, which is the sounds 0:05:22.960 --> 0:05:25.440 we make in our language. You know, each language 0:05:25.440 --> 0:05:29.880 has its own list of phonemes that we generate in 0:05:29.960 --> 0:05:32.640 order to make the words. These are, yeah, the building 0:05:32.640 --> 0:05:36.520 blocks, kind of, of speech. It's the individual sounds, and they're 0:05:36.560 --> 0:05:39.880 similar across languages. But yeah, for example, English has 0:05:39.920 --> 0:05:43.760 about forty. Linguists are actually kind of in a disagreement 0:05:43.760 --> 0:05:46.600 about the exact number. It all depends on where you are, 0:05:46.760 --> 0:05:50.080 like in the South. Heck, we produce sounds in the South, 0:05:50.400 --> 0:05:52.839 like we can make a one syllable word into at 0:05:52.920 --> 0:05:57.159 least three or four syllables, y'all. 
So we have the 0:05:57.200 --> 0:06:00.320 ability to insert sounds where no sound was before, which 0:06:00.320 --> 0:06:02.920 is why things like hooked on phonics does not necessarily 0:06:02.960 --> 0:06:06.440 work as well as advertised, not necessarily unless you create 0:06:06.480 --> 0:06:09.960 a regionalized version, in which case that would be interesting. 0:06:10.360 --> 0:06:13.560 But you've got to understand how hard this is for 0:06:13.640 --> 0:06:17.760 computers to hear, right, So we're used to it. I mean, 0:06:17.800 --> 0:06:19.880 we talk to people all the time. But just think 0:06:19.880 --> 0:06:23.880 about like when you're on the phone and say somebody 0:06:23.960 --> 0:06:26.920 is reading something to you, like spelling out a word 0:06:27.040 --> 0:06:29.719 or something, and you did you say P or B? 0:06:30.720 --> 0:06:34.000 Did you say P? I? Could you know what? That's 0:06:34.040 --> 0:06:37.240 why you need the Yankee Hotel fox trot kind of sure, 0:06:37.320 --> 0:06:42.400 sure stuff. So when you it's so easy for computers 0:06:42.400 --> 0:06:45.840 to mess this up, well yeah, and beyond that, even 0:06:45.880 --> 0:06:49.720 if you are enunciating clearly, the speed at which you 0:06:49.760 --> 0:06:55.920 say a word can completely make a computer misunderstand you, absolutely, 0:06:56.040 --> 0:06:59.000 because if you've programmed the computer program in such a way, 0:06:59.040 --> 0:07:01.480 you've built the computer pro in such a way that 0:07:02.400 --> 0:07:06.080 it analyzes a word based on a sequence of sounds, 0:07:06.120 --> 0:07:09.359 and it expects each part of that sequence to be 0:07:09.360 --> 0:07:11.880 a certain length. If you pronounce that word at a 0:07:11.920 --> 0:07:14.800 different speed than someone else, then the computer might have 0:07:14.880 --> 0:07:18.080 trouble figuring out that the two versions of that was 0:07:18.120 --> 0:07:22.760 the same word. 
And this can vary within an individual speech. 0:07:23.360 --> 0:07:26.160 I could say the same word twice in one paragraph, 0:07:26.520 --> 0:07:29.880 and the way I say it each time might be 0:07:29.960 --> 0:07:32.880 different enough to cause problems. Right, So these are non 0:07:32.880 --> 0:07:35.520 trivial problems, and in these early days they were mostly 0:07:35.600 --> 0:07:37.760 focused on just trying to figure out how to teach 0:07:37.800 --> 0:07:42.480 a computer to recognize those basic sounds. In nineteen fifty two, 0:07:42.480 --> 0:07:46.360 Bell Labs introduced the Audrey system and that could recognize 0:07:46.440 --> 0:07:50.080 spoken digits, which made it a little easier because you 0:07:50.240 --> 0:07:53.520 eliminate everything that's not a digit, right, you're just going 0:07:53.560 --> 0:07:58.080 through a series of what ten, twenty? It was probably 0:07:58.080 --> 0:08:01.080 only nine actually, because you usually do one at a time. 0:08:01.800 --> 0:08:04.560 Maybe maybe ten. If you include zero there as well, 0:08:05.160 --> 0:08:07.440 I mean you might not. It all depends. They discovered 0:08:07.440 --> 0:08:10.360 the number zero in the nineteen fifty They did, but 0:08:10.440 --> 0:08:13.800 they lost it for a while. Oh yeah, so, but 0:08:13.960 --> 0:08:16.040 I think by nineteen fifty two they re found it. 0:08:16.120 --> 0:08:18.160 The Mayans had it at least. Yeah, it was you know, 0:08:18.280 --> 0:08:22.720 twenty thirteen, we had to have it. I mean yeah. 
0:08:22.760 --> 0:08:26.080 Bell Labs ended up having this Audrey system, and by 0:08:26.360 --> 0:08:29.440 limiting it to just digits, it meant that they could 0:08:29.520 --> 0:08:34.959 work very hard on a drastically simplified version of speech recognition, 0:08:35.120 --> 0:08:37.480 because again, you just throw out anything that's non digit, 0:08:37.920 --> 0:08:41.240 and it means the computer it can concentrate on which 0:08:41.840 --> 0:08:45.440 digit did that sound like the most, based upon the 0:08:45.440 --> 0:08:48.640 phonemes that are needed to say whatever that digit is. 0:08:49.400 --> 0:08:51.680 In nineteen sixty two, so this is a decade later, 0:08:51.960 --> 0:08:55.480 IBM demonstrated the shoe box machine at the World's Fair 0:08:56.400 --> 0:09:01.080 and it could understand sixteen words spoken in English. Another 0:09:01.160 --> 0:09:04.920 good point is that speech recognition, some of these systems 0:09:04.960 --> 0:09:08.600 are language specific. It's not that it can adapt to 0:09:08.679 --> 0:09:12.440 any language. It is most of the program's programmed specifically 0:09:12.480 --> 0:09:15.280 for a certain one, right exactly. So, Again, if the 0:09:15.320 --> 0:09:18.400 phonemes that we produce here are different from ones and 0:09:18.520 --> 0:09:22.840 say China, then it's not going to give you like 0:09:23.040 --> 0:09:26.240 whatever it produces is not going to be the response 0:09:26.320 --> 0:09:31.160 that someone who's speaking Chinese would want, right So generally, 0:09:31.200 --> 0:09:34.400 in the nineteen sixties, Japanese labs began to work on 0:09:34.920 --> 0:09:38.560 vowel recognition phonemes, and they also did some early work 0:09:38.559 --> 0:09:41.840 and continuous speech recognition. 
Now this is important because again 0:09:41.880 --> 0:09:45.000 those early speech recognition programs, even when they got to 0:09:45.040 --> 0:09:47.319 the point where they could recognize full words, you had 0:09:47.320 --> 0:09:55.960 to put long pauses between each word or else it 0:09:56.000 --> 0:09:58.600 never would. And unless you're William Shatner, that's not really 0:09:58.679 --> 0:10:03.840 a natural way of speaking. Or Christopher Walken. Yeah. Either way. 0:10:05.120 --> 0:10:08.480 Also in the sixties, Fry and Denes, two researchers at 0:10:08.480 --> 0:10:12.120 the University College in England, designed a phoneme recognizer 0:10:12.640 --> 0:10:16.640 that could recognize four vowel sounds and nine consonant sounds, 0:10:16.960 --> 0:10:20.800 and they used statistical data on phoneme sequences found in 0:10:20.840 --> 0:10:24.080 English to help the system recognize more words than it 0:10:24.200 --> 0:10:27.440 normally would. And this is kind of interesting. What you 0:10:27.480 --> 0:10:30.200 do is you say, all right, there are a certain 0:10:30.720 --> 0:10:36.280 limited number of sounds typically found in the spoken English language. 0:10:37.080 --> 0:10:40.040 But those sounds are not, you know, it's not that 0:10:40.040 --> 0:10:42.160 those are completely interchangeable and that you're going to find 0:10:42.200 --> 0:10:45.280 every single combination of those sounds in an English word. 0:10:45.280 --> 0:10:49.320 There's certain sounds that are rarely, if ever, going to 0:10:49.360 --> 0:10:52.160 go together. So if you start to take those sounds 0:10:52.160 --> 0:10:55.280 out and then concentrate on the words that do use 0:10:55.559 --> 0:10:58.800 the sounds that are left, you have reduced the number 0:10:58.880 --> 0:11:03.319 of possibilities and thus made the system more efficient and reliable. 
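The pruning idea described above, throwing out sound combinations that rarely or never occur together, can be sketched in a few lines of Python. This is an illustration only, not the actual Fry and Denes design: the toy lexicon, the phoneme labels, and the function names `train_pair_counts` and `is_plausible` are all assumptions made up for this example.

```python
from collections import Counter

def train_pair_counts(pronunciations):
    """Count how often each adjacent phoneme pair occurs in known words."""
    counts = Counter()
    for phones in pronunciations:
        counts.update(zip(phones, phones[1:]))
    return counts

def is_plausible(candidate, counts, min_count=1):
    """Keep a candidate only if every adjacent pair was seen at least min_count times."""
    return all(counts[pair] >= min_count for pair in zip(candidate, candidate[1:]))

# Hypothetical three-word lexicon with made-up phoneme labels.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "tack": ["t", "ae", "k"],
    "act": ["ae", "k", "t"],
}
counts = train_pair_counts(LEXICON.values())

# Three guesses the recognizer might produce for one noisy utterance.
candidates = [["k", "ae", "t"], ["t", "k", "ae"], ["ae", "k", "t"]]
kept = [c for c in candidates if is_plausible(c, counts)]
print(kept)
```

In this toy setup the guess that starts with "t" followed directly by "k" is dropped, because that pair never appears in the lexicon, so the recognizer has fewer possibilities left to choose between.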
0:11:03.800 --> 0:11:05.880 So now are we starting to get into an era 0:11:06.120 --> 0:11:10.040 of what you're talking about here where the machines are 0:11:10.080 --> 0:11:14.680 doing some analysis, Yes, to uh, to figure out what 0:11:14.760 --> 0:11:18.400 the language means, right, well, really what it means even 0:11:18.440 --> 0:11:20.400 just to figure out what the word is exactly. Yes, 0:11:20.440 --> 0:11:23.800 that's what I meant. To interpret the sounds into words. 0:11:24.200 --> 0:11:27.439 It's not just drawing on things that have been directly 0:11:27.520 --> 0:11:31.280 programmed into it, you know, the hard coded, right, understanding 0:11:31.320 --> 0:11:35.160 that it's using statistical analysis. Yes, and and I mean 0:11:35.240 --> 0:11:37.800 clearly this would be important if you're talking about any 0:11:37.840 --> 0:11:42.520 sort of dictation software, right, because with dictation software, to 0:11:42.880 --> 0:11:47.800 program every single word in the English language into a 0:11:48.040 --> 0:11:53.679 vocabulary for this program and to do every variation of 0:11:53.720 --> 0:11:57.319 the pronunciation of that word would be pretty that'd be 0:11:57.360 --> 0:11:59.440 a lot of work. Yeah. So if you can create 0:11:59.559 --> 0:12:03.360 a system that can analyze the phonemes and then, based 0:12:03.400 --> 0:12:06.760 upon the certain statistical analysis, figure out or make a 0:12:06.760 --> 0:12:09.719 best guess at what that word is, you've fixed a 0:12:09.720 --> 0:12:11.840 lot of the problems. And in fact best guest becomes 0:12:11.920 --> 0:12:15.360 really important in just a few decades. So in nineteen 0:12:15.440 --> 0:12:18.960 seventy one, oh wait, I'm sorry, let me back up. 0:12:19.480 --> 0:12:23.080 Late sixties early seventies, researchers start to look into non 0:12:23.360 --> 0:12:27.960 uniform timescale approaches to speech recognition, which is what I 0:12:28.000 --> 0:12:30.960 was talking about earlier. 
The fact that not everyone speaks 0:12:31.080 --> 0:12:33.360 the same words at the same speed or uses the 0:12:33.360 --> 0:12:37.760 same emphasis. So you have to figure out a way 0:12:37.800 --> 0:12:42.240 of analyzing that and accounting for that. And it's called 0:12:42.320 --> 0:12:46.959 the... it's called dynamic time warping, which is not a 0:12:47.080 --> 0:12:48.800 jump to the left and a step to the right. 0:12:49.600 --> 0:12:53.760 I'm disappointed, Jonathan. I'm sorry. Dynamic to me, H, well, 0:12:53.800 --> 0:12:56.559 you know, I'll take you to a movie on Friday. 0:12:56.920 --> 0:13:00.680 Nineteen seventy one, the United States Department of Defense Advanced 0:13:00.880 --> 0:13:05.520 Research Projects Agency, also known as DARPA, initiates a program 0:13:05.559 --> 0:13:10.480 called Speech Understanding Research, or SUR, and it funded 0:13:10.520 --> 0:13:15.280 several projects, including one by Carnegie Mellon University called Harpy, 0:13:15.840 --> 0:13:19.200 which is just charming. But yes, there's a speech understanding 0:13:19.200 --> 0:13:23.240 system which could understand one thousand and eleven words, which 0:13:23.240 --> 0:13:25.000 I said was about the same as a vocabulary of 0:13:25.040 --> 0:13:28.440 a three year old. And it used something called beam 0:13:28.720 --> 0:13:31.880 search to narrow down the possibilities of what a spoken 0:13:32.040 --> 0:13:36.400 sound could be by comparing it to the statistical data 0:13:36.400 --> 0:13:38.600 and going with the most likely results. So it's going 0:13:38.600 --> 0:13:41.679 with probabilities. And so this is really interesting to me 0:13:41.800 --> 0:13:44.320 because it doesn't necessarily mean it's going to produce the 0:13:44.400 --> 0:13:47.200 correct result. It's making a best guess, based upon the 0:13:47.200 --> 0:13:50.560 input that it got, at what it was you said. 
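The dynamic time warping idea mentioned above can be sketched briefly in Python. This is the textbook form of the algorithm comparing two sequences of single numbers; recognizers of that era compared frames of acoustic measurements instead, and the sample sequences and the name `dtw_distance` are assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] is the cheapest way to align the first i items of a
    with the first j items of b, letting either side stretch.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b holds (b is stretched)
                cost[i][j - 1],      # b advances, a holds (a is stretched)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]

fast = [1, 3, 4, 1]               # a word said quickly
slow = [1, 1, 3, 3, 4, 4, 1, 1]   # the same shape said at half speed
other = [4, 1, 1, 3]              # a different shape

print(dtw_distance(fast, slow))
print(dtw_distance(fast, other))
```

The fast and slow versions come out at distance zero because the alignment is allowed to hold one sequence still while the other catches up, which is exactly the "same word, different speed" problem described above. The differently shaped sequence cannot be aligned for free.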
So 0:13:50.679 --> 0:13:53.199 in this case, if I were having the conversation with you, Joe, 0:13:53.280 --> 0:13:55.200 and I said a letter and you weren't sure if 0:13:55.240 --> 0:13:57.560 it was P or B, instead of you asking me, 0:13:57.600 --> 0:13:59.079 you just say, well, I think it probably was 0:13:59.120 --> 0:14:01.839 a P. I'm just gonna write that. Well, I mean, if 0:14:02.200 --> 0:14:04.120 your computer is smart enough and it has a large 0:14:04.200 --> 0:14:09.680 enough dictionary, it might understand that, say, the word 0:14:09.720 --> 0:14:12.080 starting with a P sound makes sense here, but the 0:14:12.080 --> 0:14:14.400 word starting with a B sound does not. So like 0:14:14.800 --> 0:14:17.679 I ate a pear or I ate a bear, and 0:14:17.960 --> 0:14:21.960 now some days, some days the bear eats you. Exactly. 0:14:22.560 --> 0:14:25.200 But of course I'd imagine the machine at that time 0:14:25.360 --> 0:14:29.240 didn't have the resources to, say, go figure out if 0:14:29.280 --> 0:14:31.960 I ate a pear or I ate a bear made 0:14:31.960 --> 0:14:34.680 more sense. Right, right. We need to remember that this 0:14:34.760 --> 0:14:38.160 is, you said, early seventies, so this is when, you know, 0:14:38.240 --> 0:14:40.880 computers were the size of like three of my cars 0:14:40.960 --> 0:14:43.080 at least, you know. Right. Well, and there were, I mean, 0:14:43.280 --> 0:14:46.200 there was no Internet yet, and that'll come in a 0:14:46.400 --> 0:14:49.120 big way in a little bit here. By the seventies 0:14:49.120 --> 0:14:54.280 they had Arpanet, but that was very limited and that 0:14:54.440 --> 0:14:55.760 didn't have anything to do with it. They had 0:14:55.800 --> 0:14:58.240 no web to draw on, no web at all 0:14:58.280 --> 0:15:02.800 for massive sampling of data. 
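The "I ate a pear" versus "I ate a bear" guess discussed above can be sketched as a small scoring step in Python. Everything here is an assumption for illustration: the counts in `CONTEXT_COUNTS` are invented, the acoustic scores are invented, and the name `pick_word` is hypothetical. The point is only that a weak acoustic preference can be overruled by how often a word shows up in that context.

```python
# Hypothetical counts of how often each noun follows a given verb
# in some collection of text. These numbers are made up.
CONTEXT_COUNTS = {
    ("ate", "pear"): 40,
    ("ate", "bear"): 2,
}

def pick_word(previous_verb, acoustic_scores):
    """Return the candidate with the best combined acoustic and context score."""
    def combined(word):
        # Add one so an unseen pairing is unlikely rather than impossible.
        seen = CONTEXT_COUNTS.get((previous_verb, word), 0) + 1
        return acoustic_scores[word] * seen
    return max(acoustic_scores, key=combined)

# The audio alone slightly favors "bear" in this made-up case.
scores = {"pear": 0.45, "bear": 0.55}

after_ate = pick_word("ate", scores)
after_saw = pick_word("saw", scores)
print(after_ate, after_saw)
```

With context for "ate" available, the sketch settles on "pear" even though the audio leaned toward "bear". For a verb with no counts at all, it falls back to whatever the audio preferred, which is the situation an early seventies system with no large text sample to draw on would have been in.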
So nineteen seventy six, 0:15:03.080 --> 0:15:07.520 the SUR program that DARPA had concludes. There were a 0:15:07.520 --> 0:15:11.200 couple of other agencies that had tried to create speech 0:15:11.320 --> 0:15:16.000 understanding algorithms and hardware, but had not quite met the 0:15:16.080 --> 0:15:18.280 requirements by the end of the program to really count 0:15:18.320 --> 0:15:21.600 as a success, but they did end up contributing quite 0:15:21.600 --> 0:15:26.360 a bit to future endeavors. So then we've got the 0:15:26.480 --> 0:15:31.280 nineteen eighties, that typically follows the seventies, and that's when 0:15:31.320 --> 0:15:35.040 they introduced a statistical method that was based on the 0:15:35.160 --> 0:15:38.640 hidden Markov model. Have you guys heard of this? The HMM? 0:15:38.760 --> 0:15:42.360 All right, it's a little complicated and it's difficult to 0:15:42.400 --> 0:15:47.760 really explain without the benefit of complicated graphics behind me, 0:15:47.800 --> 0:15:52.840 but I will try. So it's a probability model. And 0:15:53.520 --> 0:15:57.360 let's say that you've got, let's say you've got three 0:15:58.280 --> 0:16:01.200 urns in front of you. Okay, three vases are 0:16:01.240 --> 0:16:04.320 in front of you. They're solid, you can't see through them, 0:16:04.600 --> 0:16:07.960 but you see that you've put a certain number of 0:16:08.480 --> 0:16:11.840 orange ping pong balls in each. The first one has 0:16:11.880 --> 0:16:14.240 the most. You put a certain number of white ping 0:16:14.280 --> 0:16:16.000 pong balls in each. The middle one has the most 0:16:16.000 --> 0:16:18.040 of those, and you put a certain number of yellow 0:16:18.120 --> 0:16:20.760 ping pong balls in each. 
The third one has the 0:16:20.760 --> 0:16:23.720 most of those, and then you already know the states 0:16:23.760 --> 0:16:26.280 of the... you're actually watching as you draw these ping 0:16:26.360 --> 0:16:28.440 pong balls out, and then you're combining them to get 0:16:28.480 --> 0:16:32.400 some sort of response at the end. It doesn't matter 0:16:32.400 --> 0:16:34.920 what the response is, but you're drawing a ping pong 0:16:34.960 --> 0:16:38.480 ball out from each, combining them together, and you 0:16:38.640 --> 0:16:41.160 see the whole process. Now, that's a normal Markov model 0:16:41.200 --> 0:16:44.760 because you know the state of each of those draws 0:16:44.960 --> 0:16:50.200 from the vases, all right, so you observe the state. Now, 0:16:50.280 --> 0:16:52.400 let's say those vases are in one room and 0:16:52.400 --> 0:16:54.480 you're in another room, and you cannot see into the 0:16:54.880 --> 0:16:56.600 other room. You just get to see the output of 0:16:56.640 --> 0:16:58.840 the three ping pong balls as they come out of 0:16:58.880 --> 0:17:03.000 this process. So you don't see which one's drawn from 0:17:03.000 --> 0:17:05.320 which urn, but you know that one is drawn from 0:17:05.320 --> 0:17:08.520 each one, and you see what the result is. Now, 0:17:08.840 --> 0:17:12.000 you don't know the state of those individual urns, but 0:17:12.040 --> 0:17:14.680 you do see the result, which gives you enough information 0:17:14.760 --> 0:17:18.520 to draw some conclusions about the state of the urns 0:17:18.560 --> 0:17:20.959 inside the room. Not enough for you to know for certain, 0:17:21.359 --> 0:17:24.280 but you can get sort of a probability of what 0:17:24.680 --> 0:17:27.240 happened in there to get the result that you have. 0:17:27.400 --> 0:17:31.560 That's a hidden Markov model, and that is an oversimplification 0:17:31.760 --> 0:17:34.080 of the hidden Markov model. 
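The urn picture above can be turned into a small worked example in Python. This is a sketch under stated assumptions, not a model from any real recognizer: the probabilities in `START`, `TRANS`, and `EMIT` are invented, and it uses the standard Viterbi procedure to guess the most likely sequence of hidden urns from the ball colors that come out of the room. The hosts' analogy does not say how the draw moves from urn to urn, so the transition table is purely an assumption.

```python
STATES = ["urn1", "urn2", "urn3"]

# Assumed: the first draw is equally likely to come from any urn.
START = {s: 1 / 3 for s in STATES}

# Assumed: the next draw tends to come from the same urn as the last one.
TRANS = {s: {t: (0.5 if s == t else 0.25) for t in STATES} for s in STATES}

# Each urn holds mostly one color, as in the description above.
EMIT = {
    "urn1": {"orange": 0.8, "white": 0.1, "yellow": 0.1},
    "urn2": {"orange": 0.1, "white": 0.8, "yellow": 0.1},
    "urn3": {"orange": 0.1, "white": 0.1, "yellow": 0.8},
}

def viterbi(observations):
    """Return (probability, path) for the most likely hidden urn sequence."""
    # best[s] holds the best probability and path that end in state s.
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for color in observations[1:]:
        updated = {}
        for s in STATES:
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            updated[s] = (prob * EMIT[s][color], path + [s])
        best = updated
    return max(best.values(), key=lambda item: item[0])

prob, path = viterbi(["orange", "orange", "yellow"])
print(path, prob)
```

The observer in the other room never sees an urn, only colors, yet the numbers still point to the first two balls coming from the mostly orange urn and the third from the mostly yellow one. It is a best guess with a probability attached, not a certainty, which matches the description above.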
So anyone out there who 0:17:34.119 --> 0:17:38.200 actually works with systems that use this is screaming that's 0:17:38.400 --> 0:17:41.560 way too simplistic, I know, but this is the easiest 0:17:41.560 --> 0:17:43.800 way for me to explain it, Okay. But so basically 0:17:43.920 --> 0:17:47.560 what you're saying is that it uses it looks at 0:17:47.560 --> 0:17:51.600 the statistical prevalence of these three different colors appearing into 0:17:51.640 --> 0:17:56.600 the room, and by that it makes judgments about how 0:17:56.680 --> 0:18:00.800 common they probably are in the vases more or less. 0:18:00.840 --> 0:18:04.800 And so these models are used a lot in things 0:18:04.880 --> 0:18:08.520 that require a lot of interpretation on machines. Part voice 0:18:08.520 --> 0:18:10.440 recognition is a big part of that, but it's not 0:18:10.480 --> 0:18:15.840 just voice recognition, gesture recognition, handwriting recognition, anything where you know, 0:18:15.960 --> 0:18:19.000 two people could try and make the same result, but 0:18:19.119 --> 0:18:22.800 because we are individuals and because we do think slightly differently, 0:18:23.280 --> 0:18:26.200 even though we're both creating the same result, we're doing 0:18:26.200 --> 0:18:28.199 it in a different way. The computer has to be 0:18:28.200 --> 0:18:31.000 able to interpret that, right. So it's because it's taking 0:18:31.440 --> 0:18:35.720 sort of ambiguous analog data from the world, sure, yeah, 0:18:35.760 --> 0:18:38.119 and it has to be able to react to that 0:18:38.240 --> 0:18:42.280 and create a meaningful result. 
So once people started to 0:18:42.440 --> 0:18:47.960 concentrate on this form of statistical analysis, voice recognition pretty 0:18:48.040 --> 0:18:51.959 much hit its peak as far as recognizing individual words, 0:18:52.520 --> 0:18:55.760 not necessarily knowing what the context is or what the 0:18:55.800 --> 0:18:58.520 meaning is, but it meant that if you were speaking 0:18:58.600 --> 0:19:02.959 into a machine that had this kind of software in it, 0:19:02.960 --> 0:19:06.560 it could determine with relative ease what it was you 0:19:06.640 --> 0:19:10.200 were saying, not what it meant, but what the actual 0:19:10.240 --> 0:19:12.639 words were. So, for example, if it's a simple 0:19:12.800 --> 0:19:16.960 speech to text program, it'd be fairly accurate, and it 0:19:17.000 --> 0:19:19.840 got more accurate as time went on. In nineteen eighty two, 0:19:19.880 --> 0:19:23.960 that's when a certain Ray Kurzweil got involved. Our old 0:19:24.800 --> 0:19:29.000 Kurzweil's a well known futurist, one of those evangelists for 0:19:29.080 --> 0:19:32.879 the oncoming singularity, a fellow who I think is hoping 0:19:32.920 --> 0:19:38.200 to achieve immortality through technology in some method or another. Personally, yes, definitely. 0:19:38.680 --> 0:19:42.120 So he created in nineteen eighty two the Kurzweil Applied 0:19:42.240 --> 0:19:47.600 Intelligence division, company, really, and it was all about creating 0:19:47.640 --> 0:19:52.119 computer based speech recognition. And in nineteen eighty seven it 0:19:52.240 --> 0:19:56.960 introduced a commercial speech recognition system. And Kurzweil was really 0:19:57.000 --> 0:20:02.320 applying his expertise in two areas, computer science and pattern recognition. 0:20:02.800 --> 0:20:05.560 He was really interested in the way that computers can 0:20:05.680 --> 0:20:10.520 identify patterns and respond to them, and speech was certainly 0:20:10.680 --> 0:20:14.000 part of that. 
So he applied that knowledge and that 0:20:14.040 --> 0:20:18.200 expertise and really made some big contributions in the speech 0:20:18.240 --> 0:20:21.680 recognition field. Skipping over to the nineteen nineties, I mean, 0:20:21.760 --> 0:20:24.879 essentially we're having this field evolve over time. But in 0:20:24.920 --> 0:20:28.200 the nineties we started seeing the development of real speech 0:20:28.280 --> 0:20:31.920 enabled applications. So this is when we started getting those 0:20:32.640 --> 0:20:34.760 telephone systems where you would call in and get an 0:20:34.800 --> 0:20:39.960 automated response saying, say or press one, which is 0:20:40.119 --> 0:20:42.399 again going all the way back to the Audrey system 0:20:42.400 --> 0:20:44.360 in nineteen fifty two that Bell Labs did. Yeah, 0:20:44.520 --> 0:20:46.399 you've only got ten responses, and so it just has 0:20:46.400 --> 0:20:48.800 to figure out which one. Right, and then eventually it 0:20:48.800 --> 0:20:52.080 would get to things like, you know, say yes, or 0:20:52.760 --> 0:20:55.679 like, I can help you with that. What is your 0:20:55.720 --> 0:21:00.520 problem? Use a keyword. Yeah, not that keyword. Sorry, I 0:21:00.560 --> 0:21:04.879 don't understand. Can you restate that? Yeah. So by twenty 0:21:04.920 --> 0:21:09.040 ten we get the, we get Google's English Voice search system, 0:21:09.560 --> 0:21:14.400 which incorporates around two hundred and thirty billion words from 0:21:14.560 --> 0:21:19.400 actual user queries. Wow, have you all tried this thing? Oh yeah, 0:21:19.400 --> 0:21:22.160 I use it all the time. No, I do, because 0:21:22.200 --> 0:21:24.560 I've got an Android phone, so I actually do use 0:21:24.680 --> 0:21:28.280 voice search all the time. Sometimes I think it's really 0:21:28.400 --> 0:21:31.600 hilarious how accurate it is. Like, you know, it shouldn't 0:21:31.600 --> 0:21:35.280 recognize that term, but it does. 
I use it mostly 0:21:35.320 --> 0:21:39.080 for navigation purposes, So I'll pull up a map application 0:21:39.320 --> 0:21:42.119 and it's a Google one. So then I, you know, 0:21:43.280 --> 0:21:45.960 speak destination and I can say an address, or I 0:21:45.960 --> 0:21:49.040 can say a business name, or you know, if I 0:21:49.080 --> 0:21:51.040 have someone in my contact list, I can say their 0:21:51.160 --> 0:21:54.760 name and it pulls up the information, which leads us 0:21:54.840 --> 0:21:59.320 kind of into a second part of this speech recognition discussion. 0:21:59.520 --> 0:22:02.160 We've got the idea that speech and search are really 0:22:03.000 --> 0:22:07.120 tightly connected, actually to the point where advances in one 0:22:07.200 --> 0:22:10.600 field often mean that the other field benefits as a result. 0:22:11.080 --> 0:22:15.240 But now we're talking about not just recognizing words, but 0:22:15.680 --> 0:22:20.640 pulling some sort of meaning from them. Right, Well, what 0:22:20.840 --> 0:22:25.399 is the goal of input of an interface that takes 0:22:25.440 --> 0:22:29.080 input from a human and turns it into data. I mean, 0:22:29.880 --> 0:22:31.800 I don't know. We'll say I would argue that the 0:22:31.920 --> 0:22:36.080 ultimate goal of an input interface is to become invisible, 0:22:36.840 --> 0:22:40.320 to make things as easy and as natural and as 0:22:40.400 --> 0:22:44.560 intuitive for you as it possibly could be, so that 0:22:44.640 --> 0:22:48.000 you don't even recognize the tools you're using, right, Right, 0:22:48.040 --> 0:22:51.800 to give the computer the ability to answer your questions 0:22:51.840 --> 0:22:54.640 almost before you ask them, exactly and right now, you know, 0:22:54.920 --> 0:22:58.960 we're still using tools that we have to learn how 0:22:59.000 --> 0:23:03.000 to use. Right. 
So when you when you want to 0:23:03.040 --> 0:23:07.639 talk to the voice recognition program on your smartphone, you 0:23:07.760 --> 0:23:10.560 do have to be aware that it's only going to 0:23:10.640 --> 0:23:14.239 be listening to certain keywords, right. You have to you 0:23:14.240 --> 0:23:17.760 have to give it keywords and sort of specific commands 0:23:17.880 --> 0:23:21.439 that it can understand in order for it to help you. Sure, 0:23:21.480 --> 0:23:23.440 And in that sense, it's kind of like a program 0:23:23.480 --> 0:23:26.160 where you know, you have a certain number of buttons 0:23:26.200 --> 0:23:29.000 you can click on, or commands you can enter on 0:23:29.040 --> 0:23:32.399 a command line that are chosen from a list of 0:23:32.480 --> 0:23:36.320 pre selected commands, but you're just doing it with your voice, right. 0:23:36.400 --> 0:23:40.720 Anything outside of that would just be interpreted as an error. Sure, yeah, Yeah, 0:23:40.760 --> 0:23:42.400 you can say open and close, but if you say 0:23:42.400 --> 0:23:46.640 French fries, it goes qua. Yeah. Yeah. So let's say 0:23:46.640 --> 0:23:49.400 you had this and you're looking for something on Google, right, 0:23:49.480 --> 0:23:51.680 you're looking at Google Maps and you're using your voice, 0:23:51.800 --> 0:23:54.280 you could probably say French fries, though, right, should say 0:23:54.280 --> 0:23:57.040 like French fries near my house. Well, even there it 0:23:57.400 --> 0:24:00.480 might be able to understand those keywords. Right, You've given 0:24:00.520 --> 0:24:03.240 it something that it knows how to work with. But 0:24:03.359 --> 0:24:06.480 what if you've got a problem, like I'm trying to 0:24:06.520 --> 0:24:09.479 remember this meal I had that was real good in 0:24:09.600 --> 0:24:12.680 town and I don't know, and you're kind of describing it, 0:24:12.840 --> 0:24:15.479 but right, it can't do anything with that, right, Right, 0:24:15.520 --> 0:24:17.560