WEBVTT - I'm sorry, what did you say? 0:00:04.120 --> 0:00:07.160 Get in touch with technology with TechStuff from 0:00:07.200 --> 0:00:13.680 HowStuffWorks.com. Hey there, and welcome to TechStuff. 0:00:13.720 --> 0:00:16.279 I'm your host, Jonathan Strickland. I'm an executive producer with 0:00:16.280 --> 0:00:19.960 HowStuffWorks and I love all things tech, and listener 0:00:20.079 --> 0:00:22.320 Nate wrote in and asked that I do an episode 0:00:22.560 --> 0:00:27.720 about personal digital assistants or virtual assistants or voice helpers. 0:00:28.200 --> 0:00:30.160 This is hard because we don't really have a great 0:00:30.560 --> 0:00:33.839 term for these things, but I'm talking about applications like, 0:00:34.400 --> 0:00:36.559 and I apologize ahead of time if I activate your 0:00:36.560 --> 0:00:42.840 technology, Siri, Alexa, and Google Assistant. These sort of voice 0:00:42.840 --> 0:00:45.920 helpers that can respond to voice commands as well as 0:00:46.040 --> 0:00:48.000 other means of input in a way that makes them 0:00:48.000 --> 0:00:51.479 seem almost intelligent. Now, as it turns out, that's actually 0:00:51.479 --> 0:00:55.240 a pretty complicated history because it requires a discussion about 0:00:55.320 --> 0:00:59.840 a lot of different connected ideas that were all 0:01:00.040 --> 0:01:03.080 independent and then ultimately converged. We're talking about stuff like 0:01:03.320 --> 0:01:08.560 speech recognition, natural language processing, and technology that was meant 0:01:08.600 --> 0:01:12.200 to improve accessibility, and a whole lot more. So it 0:01:12.240 --> 0:01:15.280 makes talking about the services somewhat challenging because it's not 0:01:15.360 --> 0:01:18.360 like there was just one pathway that led to their development. 0:01:18.640 --> 0:01:22.960 They exist largely because of these independent but converging areas 0:01:22.959 --> 0:01:25.920 of innovation. 
Much of the work that made these services 0:01:25.959 --> 0:01:29.840 possible took place in events that were concurrent with each other, 0:01:30.000 --> 0:01:36.920 with different organizations all working towards similar but disconnected goals. 0:01:36.959 --> 0:01:40.040 So going by a strict timeline approach would be really hard, 0:01:40.120 --> 0:01:42.880 if not impossible, just because you have to jump around 0:01:42.920 --> 0:01:46.200 a lot to talk about different advances. So today I'm 0:01:46.240 --> 0:01:50.480 going to focus solely on speech recognition. This in itself 0:01:50.640 --> 0:01:53.640 is a huge topic, so it's more than enough for 0:01:53.680 --> 0:01:56.280 a single episode of TechStuff. In the next episode, 0:01:56.320 --> 0:02:00.000 I'm going to dive more into natural language processing, which 0:02:00.520 --> 0:02:03.800 has some crossover with speech recognition, but it is its 0:02:03.840 --> 0:02:06.200 own thing. And then after that we'll take a look 0:02:06.200 --> 0:02:09.640 at how voice assistants like Siri and Alexa popped up 0:02:09.680 --> 0:02:14.240 over time. First, the idea of creating a machine that 0:02:14.280 --> 0:02:18.920 could interpret speech is older than computers. If you listened 0:02:18.960 --> 0:02:21.440 to my episodes about the history of the turntable, you'll 0:02:21.480 --> 0:02:26.800 remember the phonautograph, designed by Édouard-Léon Scott de Martinville 0:02:27.120 --> 0:02:31.680 in eighteen fifty seven. The gadget had a small brush 0:02:31.760 --> 0:02:35.360 that was attached to a parchment diaphragm, and the bristles 0:02:35.400 --> 0:02:38.520 on the brush rested against a sheet of paper that 0:02:38.600 --> 0:02:41.520 itself was wrapped around a cylinder. On top of the 0:02:41.520 --> 0:02:44.520 sheet of paper was a layer of soot. 
So to 0:02:44.560 --> 0:02:47.400 operate the device, you would turn the cylinder, the brush 0:02:47.400 --> 0:02:50.679 would drag across the soot on the paper, and you 0:02:50.720 --> 0:02:54.120 would shout at the diaphragm. The vibrations of sound would 0:02:54.160 --> 0:02:56.959 cause the parchment diaphragm to vibrate. That would make the 0:02:57.000 --> 0:02:59.720 brush vibrate and move against the paper, and that would 0:02:59.720 --> 0:03:03.480 create a pattern corresponding to the vibrations that were made 0:03:03.560 --> 0:03:06.799 by the parchment diaphragm. The phonautograph was supposed to aid 0:03:06.840 --> 0:03:10.240 in the study of language and sound. The machine itself 0:03:10.320 --> 0:03:14.639 was not intended to interpret sound, but rather to facilitate interpretation. 0:03:14.680 --> 0:03:19.240 A human would take a look at these tracings, essentially, 0:03:19.520 --> 0:03:22.360 and be able to analyze sound, or at least that 0:03:22.400 --> 0:03:24.760 was the intent. It didn't quite work out that way, 0:03:24.760 --> 0:03:28.480 but that was the concept behind it. Now, let's set 0:03:28.480 --> 0:03:32.160 the Wayback Machine to the nineteen fifties. In nineteen 0:03:32.200 --> 0:03:37.280 fifty two, Bell Labs created the Audrey system, which was 0:03:37.320 --> 0:03:41.000 not a mean green mother from outer space, but rather 0:03:41.480 --> 0:03:45.960 the first documented speech recognizer system. It was an analog system, 0:03:46.000 --> 0:03:50.200 not a digital one. It was its own dedicated massive circuit, 0:03:50.880 --> 0:03:53.400 and it even had vacuum tubes in this thing, because 0:03:53.440 --> 0:03:56.960 this was before transistors were in widespread use. It could recognize strings of 0:03:57.000 --> 0:04:00.720 digits spoken by its creator with about nine and accuracy. 0:04:00.800 --> 0:04:03.360 If anyone else tried it, the accuracy dropped a bit. 
0:04:03.440 --> 0:04:07.200 This already shows that speech recognition is tough because not 0:04:07.320 --> 0:04:10.480 everyone says things exactly the same way. I know that's 0:04:10.480 --> 0:04:12.680 not a news flash, but it is important for the 0:04:12.720 --> 0:04:16.279 concept of speech recognition. You also had to pause 0:04:16.800 --> 0:04:21.160 between strings of numbers. You couldn't just rattle them off conversationally. 0:04:21.240 --> 0:04:23.440 You had to put pauses in there. But it was 0:04:23.480 --> 0:04:26.680 also an enormous piece of machinery. It took up a 0:04:26.800 --> 0:04:29.640 six-foot-high relay rack and it consumed a lot 0:04:29.640 --> 0:04:34.280 of electricity. Then Big Blue, also known as IBM, had 0:04:34.360 --> 0:04:37.880 scientists and engineers working on the possibility of designing technologies 0:04:37.880 --> 0:04:40.560 that could recognize speech. They were kind of working around 0:04:40.600 --> 0:04:44.520 the same time that Bell Labs was. Computer scientist Nathaniel Rochester, 0:04:44.600 --> 0:04:47.200 who designed an IBM computer called the seven oh one 0:04:47.480 --> 0:04:50.120 and also wrote the first assembler, headed up a group 0:04:50.160 --> 0:04:53.359 of engineers at IBM who were researching pattern recognition and 0:04:53.480 --> 0:04:57.440 information theory. That work, which was early research into fundamental 0:04:57.520 --> 0:05:01.039 building blocks for artificial intelligence, would also become important for 0:05:01.080 --> 0:05:05.240 speech recognition. In the late nineteen fifties, William C. Dersch, 0:05:05.360 --> 0:05:09.640 another IBM computer scientist, developed a computer system as part 0:05:09.720 --> 0:05:14.720 of IBM's Advanced Systems Development Division laboratory, and it incorporated 0:05:14.760 --> 0:05:18.679 basic elements of speech recognition. He unveiled the device, called 0:05:18.720 --> 0:05:22.359 the IBM Shoebox, in nineteen sixty two at the World's Fair. 
0:05:22.880 --> 0:05:26.480 Using a microphone, you could speak basic digits from zero 0:05:26.520 --> 0:05:30.680 to nine, and also six additional control words like plus 0:05:30.800 --> 0:05:34.279 or minus, and the Shoebox would recognize the words and 0:05:34.480 --> 0:05:39.960 perform the calculation. So essentially this was a basic voice controlled calculator. 0:05:40.279 --> 0:05:43.920 While the application was limited, this showed off a remarkable achievement. 0:05:44.120 --> 0:05:46.719 Finding a way to program a machine to accept speech 0:05:46.880 --> 0:05:50.120 as a command is a nontrivial problem. Throughout the 0:05:50.160 --> 0:05:53.880 nineteen sixties, computer scientists took a brute force sort of 0:05:53.920 --> 0:05:57.520 approach to solving speech recognition, which could work in very 0:05:57.640 --> 0:06:01.280 narrow applications such as the calculator approach, but was by 0:06:01.320 --> 0:06:04.839 its nature difficult to scale up. Even in the early 0:06:04.920 --> 0:06:09.200 nineteen seventies, the Speech Understanding Research Project from ARPA, 0:06:09.279 --> 0:06:12.000 the same organization that would help bring the Internet 0:06:12.040 --> 0:06:17.040 into being, produced a brute force template system called Harpy. While 0:06:17.120 --> 0:06:19.920 it was reliant upon brute force, Harpy, which came out 0:06:19.960 --> 0:06:24.200 of Carnegie Mellon research, could recognize about one thousand words. 0:06:24.600 --> 0:06:28.320 Harpy also made use of a process called beam search. 0:06:28.880 --> 0:06:31.560 This is a search strategy in which a search algorithm 0:06:31.640 --> 0:06:35.120 can consider multiple possible hits at a single time, rather 0:06:35.160 --> 0:06:38.720 than looking through a large data set for a specific 0:06:38.839 --> 0:06:42.560 perfect hit. Then the algorithm would determine the probability of 0:06:42.600 --> 0:06:44.880 each of the hits as being the right word. 
The 0:06:44.960 --> 0:06:47.599 number of potential hits is determined by a value called 0:06:47.640 --> 0:06:51.320 the beam width, a setting the speech recognition application designer 0:06:51.360 --> 0:06:54.000 can set. Beam search is a much more efficient way 0:06:54.000 --> 0:06:56.920 to suss out speech, and it's frequently used today, not 0:06:56.960 --> 0:07:00.000 just in speech recognition but also in natural language processing 0:07:00.000 --> 0:07:03.160 and other sequential models, but it gets super technical, so 0:07:03.200 --> 0:07:05.800 we're gonna leave it at that kind of high level approach. 0:07:06.600 --> 0:07:10.520 But these systems still mapped all words to a template, 0:07:10.720 --> 0:07:14.480 one template per word. They didn't break words up into sounds, 0:07:14.760 --> 0:07:18.080 but looked for a match against a database of established 0:07:18.160 --> 0:07:21.240 vocabulary words, which meant that if you did not pronounce 0:07:21.280 --> 0:07:23.640 the word the same way as it was represented in 0:07:23.680 --> 0:07:26.760 the database, you might not get a hit. You would 0:07:26.800 --> 0:07:29.520 have to get it close enough to that template for 0:07:29.640 --> 0:07:31.400 you to be able to get a hit. This is 0:07:31.440 --> 0:07:35.120 a big problem. People speak with accents or dialects, or 0:07:35.160 --> 0:07:39.200 they may have difficulty replicating certain sounds. The brute force 0:07:39.280 --> 0:07:42.120 approach often meant you'd have to say the same 0:07:42.120 --> 0:07:45.560 word a few times with clear enunciation and long pauses 0:07:45.600 --> 0:07:48.680 to get a hit. And again, it just didn't scale 0:07:48.800 --> 0:07:52.040 very well. It wasn't until the late nineteen seventies that 0:07:52.080 --> 0:07:55.360 computer scientists were able to find a different approach that 0:07:55.400 --> 0:07:59.280 would power more modern speech recognition systems. 
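
The beam search idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Harpy's actual implementation: the candidate words, the probabilities in `STEP_CANDIDATES` and `PAIR_P`, and the function name `beam_search` are all made up for the example. At each step the search keeps only the `beam_width` most probable partial guesses instead of every possible combination.

```python
import math

# Hypothetical per-step candidates: word -> acoustic probability (made-up numbers).
STEP_CANDIDATES = [
    {"turn": 0.6, "tern": 0.4},
    {"the": 0.7, "a": 0.3},
    {"volume": 0.5, "valium": 0.5},
]

# Hypothetical word-pair probabilities; unseen pairs fall back to DEFAULT_PAIR_P.
PAIR_P = {("the", "volume"): 0.9, ("the", "valium"): 0.1}
DEFAULT_PAIR_P = 0.5


def beam_search(step_candidates, beam_width):
    """Return up to beam_width (words, log_probability) hypotheses, best first."""
    beams = [([], 0.0)]  # start with one empty hypothesis
    for candidates in step_candidates:
        expanded = []
        for words, log_p in beams:
            previous = words[-1] if words else None
            for word, acoustic_p in candidates.items():
                pair_p = PAIR_P.get((previous, word), DEFAULT_PAIR_P)
                score = log_p + math.log(acoustic_p * pair_p)
                expanded.append((words + [word], score))
        # Keep only the beam_width most probable partial hypotheses.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams


best_words, best_log_p = beam_search(STEP_CANDIDATES, beam_width=2)[0]
print(best_words)
```

A wider beam explores more alternatives at a higher computing cost; a beam width of one reduces to a greedy search that commits to the single best guess at every step.
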
And let's go 0:07:59.360 --> 0:08:01.720 through some of the steps that are necessary, from the 0:08:01.760 --> 0:08:07.040 basic physical attributes of speech to the processing of the information. First, speech, 0:08:07.280 --> 0:08:11.800 like all sound, ultimately is a physical phenomenon. It is vibration. 0:08:12.160 --> 0:08:15.520 We produce these vibrations with vocal cords and our lips, teeth, 0:08:15.640 --> 0:08:18.680 and tongue according to the rules of whatever language we 0:08:18.720 --> 0:08:22.160 are speaking. These vibrations travel through a medium such as 0:08:22.160 --> 0:08:25.000 the air, and then they get picked up by something else, 0:08:25.080 --> 0:08:28.360 like someone else's ears or a microphone or whatever. But 0:08:28.440 --> 0:08:32.319 at this stage we're talking about physical vibrations, an analog 0:08:32.920 --> 0:08:37.840 form of input. Computers do not directly interpret physical vibrations. 0:08:37.840 --> 0:08:42.839 Computers process digital information, and speech is an analog phenomenon. 0:08:42.960 --> 0:08:45.320 So the first thing we need for a computer to 0:08:45.360 --> 0:08:49.000 recognize speech is some sort of analog to digital 0:08:49.040 --> 0:08:52.840 converter that can accept the analog information and then translate it 0:08:52.920 --> 0:08:56.400 into digital information. The ADC would typically sample 0:08:56.520 --> 0:09:00.400 speech by taking precise measurements of the sound at frequent 0:09:00.480 --> 0:09:04.439 intervals, often thousands of times per second, 0:09:04.480 --> 0:09:07.319 so you can almost think of it like snapshots, like 0:09:07.320 --> 0:09:11.480 pictures. The ADC is measuring quantifiable elements 0:09:11.559 --> 0:09:14.920 of the sound every time it takes a sample. That 0:09:15.000 --> 0:09:19.160 might include stuff like amplitude and frequency, or volume and pitch, 0:09:19.240 --> 0:09:22.880 if you're talking about how we perceive sound. 
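
The snapshot idea behind the analog to digital converter can be sketched as follows. This is a minimal illustration, not how any particular ADC chip works: the function name `sample_and_quantize`, the 440 Hz test tone, the 8,000 samples per second rate, and the 8 bit depth are all assumptions chosen for the example. The code measures a continuous signal at regular intervals and rounds each measurement to one of a fixed number of levels.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz, bits):
    """Sample signal(t) at regular intervals and map each value to an integer code.

    signal is assumed to return amplitudes between -1.0 and 1.0.
    """
    levels = 2 ** bits
    sample_count = round(duration_s * sample_rate_hz)
    codes = []
    for i in range(sample_count):
        t = i / sample_rate_hz                        # time of this snapshot
        value = max(-1.0, min(1.0, signal(t)))        # clip to the allowed range
        code = round((value + 1.0) / 2.0 * (levels - 1))  # scale to 0..levels-1
        codes.append(code)
    return codes


def tone(t):
    """A hypothetical 440 Hz test tone standing in for a voice."""
    return math.sin(2 * math.pi * 440 * t)


samples = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=8000, bits=8)
print(len(samples), min(samples), max(samples))
```

A higher sample rate captures faster changes in the sound, and more bits per sample capture finer differences in amplitude, at the cost of more data to process.
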
There's usually 0:09:23.000 --> 0:09:25.679 some sort of noise filter incorporated into this step as 0:09:25.720 --> 0:09:29.640 well to help remove any unwanted sounds from the signal. 0:09:30.080 --> 0:09:32.719 The system has to be able to recognize which signals 0:09:32.960 --> 0:09:36.200 represent a command and which ones are not important. This 0:09:36.280 --> 0:09:38.760 is why I can do stuff like send vocal commands 0:09:38.840 --> 0:09:42.200 to a voice assistant, even if there's another conversation going 0:09:42.240 --> 0:09:45.560 on nearby, or if I have the radio or television on. Now, 0:09:45.559 --> 0:09:47.960 I have a lot more to say about the technology 0:09:48.000 --> 0:09:50.880 that makes speech recognition possible, but before I get into that, 0:09:50.960 --> 0:10:01.360 let's take a quick break to thank our sponsor. So, 0:10:01.559 --> 0:10:05.840 a speech recognition system typically has a database of sound 0:10:05.880 --> 0:10:10.679 samples that will allow the recognition system to compare incoming 0:10:10.760 --> 0:10:15.480 signals against that database. The speech recognition system might have 0:10:15.600 --> 0:10:19.760 to put the incoming sound through a process called temporal alignment, 0:10:20.240 --> 0:10:22.240 which is a fancy way of saying the system might 0:10:22.240 --> 0:10:25.480 have to slow down or speed up the incoming sound. 0:10:25.960 --> 0:10:28.439 You can think of this as like making a recording 0:10:28.600 --> 0:10:31.800 and then almost immediately playing the recording back. Obviously, the 0:10:31.800 --> 0:10:34.920 speech recognition system can't change the speed at which you're speaking, 0:10:35.600 --> 0:10:37.959 though you might get a feature that prompts you to 0:10:38.880 --> 0:10:41.679 slow down or speed up, where the message may say, 0:10:41.679 --> 0:10:44.480 could you say that again, but slower, that kind of thing. 
0:10:44.520 --> 0:10:47.080 If you happen to be someone from the Northeastern United States, 0:10:47.080 --> 0:10:50.120 for example, you may frequently get these messages saying slow 0:10:50.160 --> 0:10:54.040 the heck down. Temporal alignment allows the speech recognition system 0:10:54.040 --> 0:10:56.400 to look for matches between the incoming sound and the 0:10:56.440 --> 0:10:59.839 samples in the system's memory. The system must also divide 0:11:00.000 --> 0:11:03.560 up the sounds in the incoming signal into segments 0:11:03.600 --> 0:11:08.040 that represent specific sounds in the native language, such as 0:11:08.080 --> 0:11:12.320 the sound or the hard T sound. It looks for 0:11:12.440 --> 0:11:17.000 matches in its memory that represent phonemes, and a phoneme 0:11:17.080 --> 0:11:20.360 is a basic sound native to a specific language, to 0:11:20.440 --> 0:11:23.760 a particular language, whichever one you're looking at. So, for example, 0:11:23.800 --> 0:11:28.280 the English language has about forty phonemes. Linguists actually get 0:11:28.280 --> 0:11:31.200 into some pretty vicious fights about exactly how many phonemes 0:11:31.280 --> 0:11:34.720 the English language has, but it's around forty. Some people argue 0:11:34.760 --> 0:11:37.840 that there are more phonemes; some say that 0:11:37.880 --> 0:11:41.199 some of the supposed additional phonemes are in fact repeats 0:11:41.200 --> 0:11:45.280 of existing ones. Other languages, though, will have a different number 0:11:45.280 --> 0:11:48.760 of phonemes in them. Some may have far more than English, 0:11:48.800 --> 0:11:52.079 some may have fewer than English. The system then has 0:11:52.120 --> 0:11:56.240 to analyze the phonemes in sequence. 
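
One classical way to handle the temporal alignment described above is dynamic time warping. The episode does not name a specific technique, so treat this as a swapped-in example rather than what any given product does. The function name `dtw_distance`, the word templates, and the number sequences standing in for sound features are all hypothetical. The idea is that a slow utterance can still match a stored sample, because each stored value is allowed to line up with several incoming values.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats (b is stretched)
                cost[i][j - 1],      # b advances, a repeats (a is stretched)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Hypothetical stored templates and a slowly spoken incoming utterance.
templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 2, 1, 2, 4]}
slow_utterance = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]

best_match = min(templates, key=lambda word: dtw_distance(templates[word], slow_utterance))
print(best_match)
```

Here the incoming sequence is the "yes" template spoken at half speed, so its warped distance to that template is zero even though the two sequences have different lengths.
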
So it's looking at 0:11:56.280 --> 0:11:59.480 these little markers that represent different sounds, and this is 0:11:59.520 --> 0:12:01.760 how the system can look for matches between a 0:12:01.760 --> 0:12:05.000 series of phonemes and the words that it can recognize. 0:12:05.000 --> 0:12:08.679 It can try and build words from these sounds. This 0:12:08.760 --> 0:12:12.040 is way harder than I'm making it sound. Speech recognition 0:12:12.040 --> 0:12:16.280 systems have complicated statistical models to help them determine what 0:12:16.440 --> 0:12:20.560 a word might be. Even a simple speech recognition system 0:12:20.600 --> 0:12:25.280 will have a complex statistical model to recognize individual words. 0:12:25.720 --> 0:12:30.120 More sophisticated systems might also look at contextual information surrounding 0:12:30.160 --> 0:12:33.720 the phonemes. In other words, a really sophisticated system isn't 0:12:33.760 --> 0:12:36.120 just looking for a match in phonemes to suss out 0:12:36.160 --> 0:12:39.240 what a single word is in a sentence. It's looking 0:12:39.240 --> 0:12:42.560 at the phonemes that came before and after to determine 0:12:42.600 --> 0:12:45.559 what those words were and to help increase the confidence 0:12:45.640 --> 0:12:48.600 level overall. So let me give an example. Let's say 0:12:48.640 --> 0:12:51.040 I have activated one of these voice assistants, and I've used 0:12:51.040 --> 0:12:53.720 whatever voice command activates it. I'm not going to do 0:12:53.760 --> 0:12:55.480 it here because some of you might be listening on 0:12:55.520 --> 0:12:58.800 those devices. And then I say, turn the volume up 0:12:58.920 --> 0:13:02.200 thirty percent. 
The speech recognition system begins to parse what 0:13:02.280 --> 0:13:05.440 I said by analyzing those sounds phoneme by phoneme, 0:13:05.640 --> 0:13:09.240 identifying them, analyzing them, trying to group them together to 0:13:09.320 --> 0:13:11.839 form words, and when it thinks it's found a word, 0:13:11.880 --> 0:13:15.760 it assigns a certain probability to that, and when it 0:13:15.800 --> 0:13:18.160 starts to analyze the phonemes that make up the 0:13:18.200 --> 0:13:21.160 word volume, it's also looking at the words that came 0:13:21.200 --> 0:13:24.360 before, turn the, and it's looking at the word that 0:13:24.400 --> 0:13:29.000 came after, up. That boosts the system's confidence overall that 0:13:29.080 --> 0:13:32.240 the keyword volume is in fact volume, and then it 0:13:32.280 --> 0:13:34.199 does what I told it to do. When I talk 0:13:34.240 --> 0:13:37.559 about confidence, I don't mean the system feels good about itself. 0:13:37.600 --> 0:13:41.440 I'm talking about probabilities. These systems largely work in the 0:13:41.440 --> 0:13:44.920 realm of probabilities. What is the probability that I said 0:13:45.000 --> 0:13:49.040 volume rather than some other word? For a speech recognition system 0:13:49.040 --> 0:13:51.560 to work, it needs to be able to assign a 0:13:51.640 --> 0:13:55.880 confidence level to words. The higher the level, the more certain, 0:13:56.160 --> 0:13:59.079 quote unquote, the system is that it got things correct. 0:13:59.559 --> 0:14:03.320 Typically, computer engineers will design systems that will only execute 0:14:03.320 --> 0:14:07.240 a command or return a result of some sort if 0:14:07.280 --> 0:14:10.160 the system has reached a certain threshold of confidence, and 0:14:10.200 --> 0:14:13.600 if it hasn't, you won't get a result. So, for example, 0:14:13.679 --> 0:14:17.000 and this isn't about speech recognition exactly, but it illustrates 0:14:17.040 --> 0:14:20.760 my point. 
IBM's Watson computer would not offer up 0:14:20.800 --> 0:14:24.760 an answer on Jeopardy unless it met a certain threshold 0:14:24.880 --> 0:14:26.800 of confidence in an answer, and I think it was 0:14:26.800 --> 0:14:30.600 about eight percent. So if it or eight percent certain 0:14:30.680 --> 0:14:32.520 that it had the right answer, it would buzz in. 0:14:32.600 --> 0:14:35.680 But if it was less than eight sure, it would 0:14:35.680 --> 0:14:38.880 not put forth that answer. There are two broad types 0:14:38.920 --> 0:14:43.120 of statistical models in speech recognition systems today. There are 0:14:43.120 --> 0:14:45.400 others that could be used, but there are two broad 0:14:45.480 --> 0:14:47.640 ones that tend to be used these days. They are 0:14:47.640 --> 0:14:52.040 the hidden Markov model and neural networks. The hidden Markov model, 0:14:52.080 --> 0:14:57.280 by the way, is overwhelmingly the most popular method of 0:14:58.080 --> 0:15:02.640 using a statistical model for speech recognition. It is 0:15:02.680 --> 0:15:05.200 the prevalent approach, and it works sort of how I 0:15:05.280 --> 0:15:07.640 just described. It looks at each phoneme and starts 0:15:07.640 --> 0:15:10.080 to build out a pathway. If you think of this 0:15:10.120 --> 0:15:13.120 as like an actual physical path that you're following, you 0:15:13.120 --> 0:15:15.640 would start off with the first phoneme that represents 0:15:15.680 --> 0:15:18.120 the beginning of the path, and the phoneme might 0:15:18.160 --> 0:15:22.000 eliminate other possible phonemes right away. By that, I 0:15:22.000 --> 0:15:26.560 mean it might be a sound that doesn't combine with 0:15:26.640 --> 0:15:29.080 certain other sounds within that language. There might be a 0:15:29.120 --> 0:15:33.800 phoneme that does not combine with other specific phonemes. 
0:15:33.880 --> 0:15:36.880 So imagine you have a path and originally it splits 0:15:36.920 --> 0:15:40.160 into tons of other pathways, but a couple of those 0:15:40.160 --> 0:15:42.960 pathways are blocked off with signs that say the pathway 0:15:43.000 --> 0:15:47.120 is closed. It's closed because those pathways represent phonemes 0:15:47.200 --> 0:15:51.320 that would never be paired with the initial one. You 0:15:51.360 --> 0:15:55.480 just don't get that sound in English. The closed paths 0:15:55.600 --> 0:15:58.640 would therefore be off limits, and only the open paths 0:15:58.640 --> 0:16:01.960 would be a possibility. Then the hidden Markov model would 0:16:01.960 --> 0:16:04.640 look at the next phoneme, the next step along 0:16:04.640 --> 0:16:08.080 this pathway. That phoneme determines which of the viable 0:16:08.120 --> 0:16:11.800 path options is actually the one to follow. All the 0:16:11.800 --> 0:16:14.640 other options would be discarded, and so on. It would 0:16:14.680 --> 0:16:17.320 go all the way down the list of phonemes 0:16:17.360 --> 0:16:19.840 until the model arrives at a conclusion of the most 0:16:19.880 --> 0:16:23.280 likely word that was spoken. It assigns a probability score 0:16:23.280 --> 0:16:26.720 to each phoneme, thinking, I'm pretty sure the sound 0:16:26.760 --> 0:16:30.600 that I heard, quote unquote, was this. That helps the 0:16:30.640 --> 0:16:33.280 system make an educated guess as to what word was 0:16:33.320 --> 0:16:36.320 actually spoken. Now, I've talked a lot about neural networks 0:16:36.320 --> 0:16:37.760 in the past. I'm just going to give them a 0:16:37.840 --> 0:16:42.800 quick cursory covering here, because they really aren't the dominant 0:16:43.080 --> 0:16:47.800 statistical model in speech recognition. 
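
The pathway picture of the hidden Markov model described above can be sketched with the Viterbi algorithm, the standard way to find the most likely path through such a model. Everything specific here is an assumption for illustration: the three phoneme states, the observation labels, and every probability are made up, and real systems use far larger models. A missing entry in `TRANS_P` plays the role of a closed pathway, since its probability is treated as zero.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] holds the most probable path that ends in state s so far.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Missing transitions count as probability 0.0: a closed pathway.
            prob, path = max(
                (
                    (best[prev][0] * trans_p[prev].get(s, 0.0) * emit_p[s][obs], best[prev][1])
                    for prev in states
                ),
                key=lambda item: item[0],
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model: three phonemes and three kinds of acoustic evidence.
STATES = ["k", "ae", "t"]
START_P = {"k": 0.8, "ae": 0.1, "t": 0.1}
TRANS_P = {
    "k": {"ae": 0.9, "t": 0.1},   # no "k" -> "k" entry: that pathway is closed
    "ae": {"t": 0.8, "k": 0.2},
    "t": {"ae": 1.0},
}
EMIT_P = {
    "k": {"K-like": 0.8, "A-like": 0.1, "T-like": 0.1},
    "ae": {"K-like": 0.1, "A-like": 0.8, "T-like": 0.1},
    "t": {"K-like": 0.1, "A-like": 0.1, "T-like": 0.8},
}

probability, phonemes = viterbi(["K-like", "A-like", "T-like"], STATES, START_P, TRANS_P, EMIT_P)
print(phonemes, probability)
```

The returned probability is the kind of confidence score the episode talks about: a system could compare it against a threshold before acting on the recognized word.
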
Neural networks have nodes, 0:16:48.080 --> 0:16:51.560 computer nodes or algorithms that act like a neuron, right, 0:16:51.680 --> 0:16:54.760 like a brain cell, and they execute operations 0:16:54.840 --> 0:16:58.440 on data. The neurons also assign a probability score to 0:16:58.640 --> 0:17:02.080 that execution of data, which shows the 0:17:02.080 --> 0:17:05.400 confidence of the system in the result, before they pass 0:17:05.440 --> 0:17:08.080 it on to another neuron in the network, which then 0:17:08.160 --> 0:17:10.879 executes another operation on the data and so on, and 0:17:10.960 --> 0:17:13.760 ultimately the network produces an end result of all those 0:17:13.800 --> 0:17:17.320 operations and judges the probability of whether or not that 0:17:17.440 --> 0:17:20.560 result is the right one, and again, if it meets 0:17:20.560 --> 0:17:24.560 a certain threshold, then it's considered the correct answer or 0:17:24.560 --> 0:17:27.119 the closest to correct that the system can manage. In 0:17:27.160 --> 0:17:30.840 any case, speech recognition systems have to be trained, and 0:17:30.840 --> 0:17:34.200 there are trillions of potential combinations of sounds that could 0:17:34.280 --> 0:17:37.640 represent different words. In the HowStuffWorks article How 0:17:37.720 --> 0:17:41.000 Speech Recognition Works, Ed Grabianowski, who is one of the 0:17:41.160 --> 0:17:44.120 powerhouses of the site, he's written some of the best 0:17:44.240 --> 0:17:47.320 articles on HowStuffWorks, gave a great example. He says, 0:17:47.600 --> 0:17:52.800 take the phrase recognize speech, right? The phonemes in 0:17:52.800 --> 0:17:55.520 that phrase happen to be pretty similar to a totally 0:17:55.600 --> 0:17:59.199 different phrase, which would be wreck a nice beach. So you have 0:17:59.280 --> 0:18:04.720 recognize speech or wreck a nice beach. 
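
One way a system can choose between two phrases that sound alike, such as the pair above, is to ask which word sequence is more probable in the language. The sketch below uses a tiny bigram (word pair) table; every probability in `BIGRAM_P`, the fallback `UNSEEN_P`, and the function names are hypothetical, and a real system would learn such numbers from large amounts of text and combine them with acoustic scores.

```python
import math

# Hypothetical bigram probabilities: P(word | previous word). "<s>" marks sentence start.
BIGRAM_P = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.30,
    ("<s>", "wreck"): 0.005,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}
UNSEEN_P = 1e-4  # assumed fallback for word pairs not in the table


def sentence_log_prob(words):
    """Sum of log bigram probabilities for a list of words."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(BIGRAM_P.get((previous, word), UNSEEN_P))
        previous = word
    return total


def pick_most_likely(candidates):
    """Return the candidate word sequence with the highest language score."""
    return max(candidates, key=sentence_log_prob)


candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
choice = pick_most_likely(candidates)
print(" ".join(choice))
```

Note that this simple score also favors shorter sentences, which is one reason practical systems normalize or weight the language score rather than using it on its own.
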
The speech recognition 0:18:04.720 --> 0:18:07.480 software has to be able to determine the difference, or 0:18:07.560 --> 0:18:09.200 else the next thing you know, you're gonna have terminators 0:18:09.280 --> 0:18:12.960 kicking sand in everyone's face, and that's no good. Alexander Waibel, 0:18:13.320 --> 0:18:17.360 who worked on that system called Harpy that I mentioned earlier, 0:18:18.000 --> 0:18:21.200 had another couple of examples. He said, you might say 0:18:21.320 --> 0:18:25.919 euthanasia and get the result youth in Asia. 0:18:26.080 --> 0:18:29.280 Or you might say give me a new display and 0:18:29.320 --> 0:18:32.040 you get the result, give me a nudist play. If 0:18:32.080 --> 0:18:36.040 you've ever used something like Google transcripts, where if you 0:18:36.040 --> 0:18:38.760 had Google Voice and you were reading the voicemails, 0:18:39.760 --> 0:18:43.159 you could get hilarious results because of this. The speech recognition, 0:18:43.480 --> 0:18:46.640 the speech to text feature, could end up spelling out 0:18:47.560 --> 0:18:52.080 truly ridiculous messages. I would get messages from my mother, 0:18:52.920 --> 0:18:56.000 and I only wish my mom would leave me messages 0:18:56.080 --> 0:18:59.600 the way that Google transcript thought she was leaving me messages, 0:18:59.640 --> 0:19:03.320 because they were the most crazy messages ever. But it's 0:19:03.359 --> 0:19:05.720 mostly because my mom has a Southern accent and so 0:19:05.800 --> 0:19:10.640 Google would often misinterpret what she was saying. So these 0:19:10.640 --> 0:19:14.520 systems have to undergo hours of training. John Garofolo, a 0:19:14.520 --> 0:19:17.800 computer scientist who was cited in that HowStuffWorks article, 0:19:18.119 --> 0:19:22.600 had this to say. 
"These statistical systems need lots of 0:19:22.640 --> 0:19:26.960 exemplary training data to reach their optimal performance, sometimes on 0:19:27.000 --> 0:19:30.840 the order of thousands of hours of human transcribed speech 0:19:30.880 --> 0:19:34.960 and hundreds of megabytes of text. These training data are 0:19:35.040 --> 0:19:38.919 used to create acoustic models of words, word lists, and 0:19:39.040 --> 0:19:43.200 multi word probability networks. There is some art into how 0:19:43.240 --> 0:19:47.680 one selects, compiles, and prepares this training data for digestion 0:19:47.800 --> 0:19:50.680 by the system, and how the system models are tuned 0:19:50.840 --> 0:19:54.400 to a particular application. These details can make the difference 0:19:54.400 --> 0:19:57.719 between a well performing system and a poorly performing system, 0:19:57.960 --> 0:20:01.639 even when using the same basic algorithm." Speech recognition 0:20:01.680 --> 0:20:05.240 also requires a decent amount of processing power. This was 0:20:05.280 --> 0:20:08.440 a limiting factor on speech recognition for a really long time. 0:20:08.880 --> 0:20:12.639 Systems were limited in their capabilities, which meant that for years, 0:20:12.680 --> 0:20:16.399 if you wanted to incorporate speech recognition in a computer system, 0:20:16.480 --> 0:20:19.320 then most of the computer's processing power would have 0:20:19.400 --> 0:20:22.119 to dedicate itself just to parsing speech. You couldn't do 0:20:22.240 --> 0:20:25.320 much else on that machine. But since Moore's law has held 0:20:25.400 --> 0:20:27.400 up so well for decades, we got to a point 0:20:27.400 --> 0:20:30.000 where the processing capabilities of machines reached a stage 0:20:30.200 --> 0:20:33.679 where this isn't as big a concern. And another development 0:20:33.960 --> 0:20:38.680 that Google really helped pioneer definitely changed things. 
I'll talk 0:20:38.720 --> 0:20:41.160 more about that in our next section, but first let's 0:20:41.160 --> 0:20:51.560 take another quick break to thank our sponsors. Okay, so 0:20:51.600 --> 0:20:54.800 advances in speech recognition in the late nineteen seventies paved 0:20:54.920 --> 0:20:58.280 the way for how most systems work these days, though 0:20:58.320 --> 0:21:01.480 of course the models have undergone multiple refinements and 0:21:01.520 --> 0:21:05.080 tweaking over time. The first speech recognition product to ever 0:21:05.200 --> 0:21:09.200 launch for consumers was a program called Dragon Dictate, which 0:21:09.240 --> 0:21:13.879 debuted in the early nineties. Dragon Dictate, the original version that is, because 0:21:13.920 --> 0:21:17.119 they still come out to this day, relied on discrete 0:21:17.160 --> 0:21:20.320 speech recognition. Now, I don't mean you had to be 0:21:20.400 --> 0:21:22.879 secretive and hush hush about it. It's not that kind 0:21:22.880 --> 0:21:25.879 of discreet. Rather, I mean you had to pronounce each 0:21:26.040 --> 0:21:28.879 word clearly, with a pause between words. You could not 0:21:29.080 --> 0:21:33.040 speak conversationally, or the dictation software could not interpret what 0:21:33.080 --> 0:21:43.080 you were saying, so using the software would sound like this. 0:21:44.840 --> 0:21:47.440 It was limited and it was primitive compared to today's 0:21:47.520 --> 0:21:51.080 speech recognition products, but it was a groundbreaking product in 0:21:51.119 --> 0:21:54.560 the early nineties. And it also cost somewhere between six 0:21:54.600 --> 0:21:59.320 thousand and nine thousand dollars. I saw differing accounts, but 0:21:59.359 --> 0:22:02.920 that would be between nine and fourteen grand in today's dollars, 0:22:02.960 --> 0:22:07.320 so a pretty expensive software package. 
Dragon still produces speech recognition 0:22:07.359 --> 0:22:09.760 technologies to this day, and of course they are much 0:22:09.800 --> 0:22:13.000 more adept at recognizing and transcribing speech than the original 0:22:13.080 --> 0:22:16.160 version was years ago. The software is also less expensive. 0:22:16.640 --> 0:22:19.399 One version I saw retails for less than a hundred dollars, 0:22:19.400 --> 0:22:23.000 so a nice, big, deep price cut. Advancements in model design 0:22:23.040 --> 0:22:27.800 and processor speed meant that speech recognition technology advanced rather quickly. 0:22:28.240 --> 0:22:33.280 BellSouth released VAL, V A L, the voice 0:22:33.280 --> 0:22:36.840 portal. VAL was an automated interactive system that could respond 0:22:36.880 --> 0:22:40.520 to questions over the phone. This was a basic implementation 0:22:40.560 --> 0:22:42.719 that would evolve over time to the systems you may 0:22:42.760 --> 0:22:45.800 have encountered when calling up automated menus where it's 0:22:46.160 --> 0:22:49.800 press three or say three and that kind of thing, 0:22:50.080 --> 0:22:53.600 or do you have any questions? You can say anything 0:22:53.640 --> 0:22:56.479 from check my balance to, you know, that kind of stuff. 0:22:57.320 --> 0:22:59.920 In two thousand five, DARPA, which is the same branch 0:23:00.000 --> 0:23:02.199 of the Department of Defense that used to be 0:23:02.240 --> 0:23:04.280 known as ARPA, so in other words, it's the same 0:23:04.680 --> 0:23:09.120 R and D arm that funded the creation of the Internet, 0:23:09.480 --> 0:23:11.760 funded a program called the 0:23:11.800 --> 0:23:17.359 Global Autonomous Language Exploitation project, or GALE. The purpose of 0:23:17.359 --> 0:23:20.639 this project was to advance research and development into automated 0:23:20.640 --> 0:23:24.480 translation between languages. 
So not only were computers supposed to 0:23:24.520 --> 0:23:27.560 be able to recognize speech, but also translate that speech 0:23:27.600 --> 0:23:31.320 from one language into another, which adds another layer of 0:23:31.359 --> 0:23:35.120 complexity on top, right? Well, according to SRI International, 0:23:35.680 --> 0:23:39.480 the system should be able to, quote, automatically take multilingual 0:23:39.600 --> 0:23:44.239 newscasts, text documents, and other forms of communication and 0:23:44.280 --> 0:23:47.679 make their information available to human queries, end quote. So 0:23:47.760 --> 0:23:50.840 it wouldn't just translate the information, which was already even more 0:23:50.880 --> 0:23:54.760 complicated than speech recognition. It could also index that information 0:23:54.800 --> 0:23:57.119 in a meaningful way so you could search for stuff. 0:23:57.960 --> 0:24:02.720 So layer upon layer of complexity for that project. Things 0:24:02.720 --> 0:24:06.080