Speaker 1: Get in touch with technology with TechStuff from HowStuffWorks.com. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with HowStuffWorks and love all things tech, and listener Nate wrote in and asked that I do an episode about personal digital assistants, or virtual assistants, or voice helpers. This is hard because we don't really have a great term for these things, but I'm talking about applications like, and I apologize ahead of time if I activate your technology, Siri, Alexa, and Google Assistant. These are voice helpers that can respond to voice commands, as well as other means of input, in a way that makes them seem almost intelligent. Now, as it turns out, that's actually a pretty complicated history, because it requires a discussion about a lot of different connected ideas that were all independent and then ultimately converged.
Speaker 1: We're talking about stuff like speech recognition, natural language processing, technology that was meant to improve accessibility, and a whole lot more. So it makes talking about these services somewhat challenging, because it's not like there was just one pathway that led to their development. They exist largely because of these independent but converging areas of innovation. Much of the work that made these services possible took place in events that were concurrent with each other, with different organizations all working toward similar but disconnected goals. So going by a strict timeline approach would be really hard, if not impossible, just because you'd have to jump around a lot to talk about different advances. So today I'm going to focus solely on speech recognition. This in itself is a huge topic, so it's more than enough for a single episode of TechStuff. In the next episode, I'm going to dive more into natural language processing, which has some crossover with speech recognition, but it is its own thing. And then after that we'll take a look at how voice assistants like Siri and Alexa popped up over time.
Speaker 1: First, the idea of creating a machine that could interpret speech is older than computers. If you listen to my episodes about the history of the turntable, you'll remember the phonautograph, designed by Édouard-Léon Scott de Martinville in eighteen fifty-seven. The gadget had a small brush that was attached to a parchment diaphragm, and the bristles on the brush rested against a sheet of paper that itself was wrapped around a cylinder. On top of the sheet of paper was a layer of soot. So to operate the device, you would turn the cylinder, the brush would drag across the soot on the paper, and you would shout at the diaphragm. The vibrations of sound would cause the parchment diaphragm to vibrate. That would make the brush vibrate and move against the paper, and that would create a pattern corresponding to the vibrations that were made by the parchment diaphragm. The phonautograph was supposed to aid in the study of language and sound. The machine itself was not intended to interpret sound, but rather to facilitate interpretation.
Speaker 1: A human would take a look at these tracings, essentially, and be able to analyze sound, or at least that was the intent. It didn't quite work out that way, but that was the concept behind it. Now, let's set the Wayback Machine to the nineteen fifties. In nineteen fifty-two, Bell Labs created the Audrey system, which was not a mean green mother from outer space, but rather the first documented speech recognizer system. It was an analog system, not a digital one. It was its own dedicated, massive circuit, and it even had vacuum tubes in this thing, because this was before transistors were in widespread use. It could recognize strings of digits spoken by its creator with about ninety percent accuracy. If anyone else tried it, the accuracy dropped a bit. This already shows that speech recognition is tough, because not everyone says things exactly the same way. I know that's not a news flash, but it is important for the concept of speech recognition. You also had to pause between strings of numbers. You couldn't just rattle them off conversationally. You had to put pauses in there. But it was also an enormous piece of machinery.
Speaker 1: It took up a six-foot-high relay rack and it consumed a lot of electricity. Then Big Blue, also known as IBM, had scientists and engineers working on the possibility of designing technologies that could recognize speech. They were working around the same time that Bell Labs was. Computer scientist Nathaniel Rochester, who designed an IBM computer called the seven oh one and also wrote the first assembler, headed up a group of engineers at IBM who were researching pattern recognition and information theory. That work, which was early research into fundamental building blocks for artificial intelligence, would also become important for speech recognition. In the late nineteen fifties, William C. Dersch, another IBM computer scientist, developed a computer system as part of IBM's Advanced Systems Development Division laboratory, and it incorporated basic elements of speech recognition. He unveiled the device, called the IBM Shoebox, in nineteen sixty-two at the World's Fair.
Speaker 1: Using a microphone, you could speak basic digits from zero to nine, plus six additional control words like "plus" or "minus," and the Shoebox would recognize the words and perform the calculation. So essentially this was a basic voice-controlled calculator. While the application was limited, this showed off a remarkable achievement. Finding a way to program a machine to accept speech as a command is a non-trivial problem. Throughout the nineteen sixties, computer scientists took a brute-force sort of approach to solving speech recognition, which could work in very narrow applications, such as the calculator approach, but was by its nature difficult to scale up. Even in the early nineteen seventies, the Speech Understanding Research project from ARPA, the same organization that would help bring the Internet into being, produced a brute-force, template-based system called Harpy. While it was reliant upon brute force, Harpy, which came out of Carnegie Mellon research, could recognize about one thousand words. Harpy also made use of a process called beam search.
Speaker 1: This is a search strategy in which a search algorithm can consider multiple possible hits at a single time, rather than looking through a large data set for one specific, perfect hit. Then the algorithm determines the probability of each of the hits being the right word. The number of potential hits is determined by a value called the beam width, a setting that the speech recognition application designer can set. Beam search is a much more efficient way to suss out speech, and it's frequently used today, not just in speech recognition but also in natural language processing and other sequential models. But it gets super technical, so we're going to leave it at that kind of high-level approach. Still, these systems mapped all words to a template, one template per word. They didn't break words up into sounds, but looked for a match against a database of established vocabulary words, which meant that if you did not pronounce the word the same way as it was represented in the database, you might not get a hit.
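The beam search strategy mentioned above can be sketched in a few lines. This is a minimal illustration, not Harpy's actual implementation: the phoneme labels and probabilities are made up for the example, and a real recognizer would score acoustic features rather than a hand-written table.

```python
import math

def beam_search(step_scores, beam_width=2):
    """Keep only the `beam_width` most probable partial hypotheses
    at each step, instead of exploring every possible path."""
    # Each hypothesis is (sequence_of_tokens, total_log_probability).
    beams = [([], 0.0)]
    for candidates in step_scores:
        expanded = []
        for seq, logp in beams:
            for token, prob in candidates.items():
                expanded.append((seq + [token], logp + math.log(prob)))
        # Prune: sort by score and keep only the top `beam_width`.
        expanded.sort(key=lambda h: h[1], reverse=True)
        beams = expanded[:beam_width]
    return beams

# Toy example: per-step candidate phonemes with invented probabilities.
steps = [
    {"r": 0.6, "w": 0.4},
    {"eh": 0.7, "ah": 0.3},
    {"k": 0.9, "g": 0.1},
]
best = beam_search(steps, beam_width=2)
print(best[0][0])  # the most probable token sequence
```

The beam width trades accuracy for speed: a wider beam keeps more hypotheses alive and is less likely to discard the correct one early, at the cost of more work per step.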
Speaker 1: You would have to get it close enough to that template to be able to get a hit. This is a big problem. People speak with accents or dialects, or they may have difficulty replicating certain sounds. The brute-force approach often meant you'd have to say the same word a few times, with clear enunciation and long pauses, to get a hit. And again, it just didn't scale very well. It wasn't until the late nineteen seventies that computer scientists were able to find a different approach that would power more modern speech recognition systems. So let's go through some of the steps that are necessary, from the basic physical attributes of speech to the processing of the information. First, speech, like all sound, ultimately is a physical phenomenon. It is vibration. We produce these vibrations with our vocal cords and our lips, teeth, and tongue, according to the rules of whatever language we are speaking. These vibrations travel through a medium such as the air, and then they get picked up by something else, like someone else's ears, or a microphone, or whatever.
Speaker 1: But at this stage we're talking about physical vibrations, an analog form of input. Computers do not directly interpret physical vibrations. Computers process digital information, and speech is an analog phenomenon. So the first thing we need for a computer to recognize speech is some sort of analog-to-digital converter that can accept the analog information and then translate it into digital information. The ADC would typically sample speech by taking precise measurements of the sound at frequent intervals, thousands of times per second, so you can almost think of it like snapshots, like pictures. The ADC is measuring quantifiable elements of the sound every time it takes a sample. That might include stuff like amplitude and frequency, or volume and pitch if you're talking about how we perceive sound. There's usually some sort of noise filter incorporated into this step as well, to help remove any unwanted sounds from the signal. The system has to be able to recognize which signals represent a command and which ones are not important.
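The sampling step described above can be illustrated with a toy analog-to-digital converter. This is a minimal sketch: the tone frequency, sample rate, and bit depth are just illustrative values, and a real ADC is hardware measuring a live signal rather than computing a sine function.

```python
import math

def sample_tone(freq_hz, duration_s, sample_rate=8000, bits=16):
    """Take periodic 'snapshots' of an analog tone and quantize each
    one to a signed integer -- a toy analog-to-digital converter."""
    max_amp = 2 ** (bits - 1) - 1          # e.g. 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                # time of this snapshot
        analog = math.sin(2 * math.pi * freq_hz * t)   # continuous value
        samples.append(round(analog * max_amp))        # quantized value
    return samples

# 10 ms of a 440 Hz tone sampled 8000 times per second -> 80 samples.
pcm = sample_tone(440, 0.010)
print(len(pcm), min(pcm), max(pcm))
```

The sample rate matters because it bounds the highest frequency that survives digitization; telephone-quality speech systems historically used rates around 8 kHz for exactly this reason.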
Speaker 1: This is why I can do stuff like send vocal commands to a voice assistant even if there's another conversation going on nearby, or if I have the radio or television on. Now, I have a lot more to say about the technology that makes speech recognition possible, but before I get into that, let's take a quick break to thank our sponsor.

Speaker 1: So, a speech recognition system typically has a database of sound samples that allows the recognition system to compare incoming signals against that database. The speech recognition system might have to put the incoming sound through a process called temporal alignment, which is a fancy way of saying the system might have to slow down or speed up the incoming sound. You can think of this as like making a recording and then almost immediately playing the recording back. Obviously, the speech recognition system can't change the speed at which you're speaking, though you might get a feature that prompts you to slow down or speed up. The message may say, "Could you say that again, but slower?" That kind of thing.
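Temporal alignment of this kind is classically done with dynamic time warping, which stretches or compresses one sequence so it lines up with another. Here's a minimal sketch, using one-dimensional made-up values in place of real acoustic features:

```python
def dtw_distance(a, b):
    """Dynamic time warping: align two sequences that may have been
    spoken at different speeds and return the total alignment cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local mismatch
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

template = [1, 2, 3, 2, 1]      # stored sample of a word
fast     = [1, 3, 1]            # same shape, spoken faster
other    = [5, 5, 5, 5, 5]      # a different word entirely
print(dtw_distance(template, fast), dtw_distance(template, other))
```

Even though `fast` has fewer samples than `template`, its warped distance is small, while the unrelated sequence scores much worse, which is exactly what lets a template matcher tolerate differences in speaking speed.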
Speaker 1: If you happen to be someone from the northeastern United States, for example, you may frequently get these messages saying, "slow the heck down." Temporal alignment allows the speech recognition system to look for matches between the incoming sound and the samples in the system's memory. The system must also divide up the sounds in the incoming signal into segments that represent specific sounds in the native language, such as a "th" sound or a hard "t" sound. It looks for matches in its memory that represent phonemes, and a phoneme is a basic sound native to a particular language, whichever one you're looking at. So, for example, the English language has about forty phonemes. Linguists actually get into some pretty vicious fights about exactly how many phonemes the English language has, but it's around forty. Some people argue that there are more phonemes; some say that some of the supposed additional phonemes are in fact repeats of existing ones. Other languages will have different numbers of phonemes. Some may have far more than English, some may have fewer.
Speaker 1: The system then has to analyze the phonemes in sequence. So it's looking at these little markers that represent different sounds, and this is how the system can look for matches between a series of phonemes and the words that it can recognize; it can try to build words from these sounds. This is way harder than I'm making it sound. Speech recognition systems have complicated statistical models to help them determine what a word might be. Even a simple speech recognition system will have a complex statistical model to recognize individual words. More sophisticated systems might also look at contextual information surrounding the phonemes. In other words, a really sophisticated system isn't just looking for a match in phonemes to suss out what a single word is in a sentence. It's looking at the phonemes that came before and after to determine what those words were, and to help increase the confidence level overall. So let me give an example. Let's say I've activated one of these voice assistants, and I've used whatever voice command activates it.
Speaker 1: I'm not going to do it here, because some of you might be listening on those devices. And then I say, "Turn the volume up thirty percent." The speech recognition system begins to parse what I said by analyzing those sounds phoneme by phoneme, identifying them, analyzing them, trying to group them together to form words. And when it thinks it's found a word, it assigns a certain probability to that. And when it starts to analyze the phonemes that make up the word "volume," it's also looking at the words that came before, "turn the," and it's looking at the word that came after, "up." That boosts the system's confidence overall that the keyword "volume" is in fact "volume," and then it does what I told it to do. When I talk about confidence, I don't mean the system feels good about itself. I'm talking about probabilities. These systems largely work in the realm of probabilities. What is the probability that I said "volume" rather than some other word? For a speech recognition system to work, it needs to be able to assign a confidence level to words.
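The context-boosting idea just described can be sketched with a toy language model. All of the acoustic and bigram probabilities below are invented for illustration; a real system would estimate them from training data and would consider far more hypotheses.

```python
def score_hypothesis(words, acoustic, bigram):
    """Combine per-word acoustic scores with contextual (bigram)
    probabilities, so ambiguous words are resolved by their neighbors."""
    p = 1.0
    prev = "<s>"                      # sentence-start marker
    for w in words:
        p *= acoustic.get(w, 0.0) * bigram.get((prev, w), 0.01)
        prev = w
    return p

# Hypothetical scores: "volume" and "volley" sound alike here, but the
# surrounding words "turn the ... up" strongly favor "volume".
acoustic = {"turn": 0.9, "the": 0.9, "volume": 0.5, "volley": 0.5, "up": 0.9}
bigram = {("<s>", "turn"): 0.5, ("turn", "the"): 0.6,
          ("the", "volume"): 0.4, ("the", "volley"): 0.05,
          ("volume", "up"): 0.5, ("volley", "up"): 0.05}

a = score_hypothesis(["turn", "the", "volume", "up"], acoustic, bigram)
b = score_hypothesis(["turn", "the", "volley", "up"], acoustic, bigram)
print(a > b)  # context picks "volume" despite identical acoustic scores
```

Note that the two candidate words have the same acoustic score; only the contextual probabilities separate them, which is the whole point of looking at the words before and after.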
Speaker 1: The higher the level, the more "certain," quote unquote, the system is that it got things correct. Typically, computer engineers will design systems that will only execute a command or return a result of some sort if the system has reached a certain threshold of confidence, and if it hasn't, you won't get a result. So, for example, and this isn't about speech recognition exactly, but it illustrates my point: IBM's Watson computer would not offer up an answer on Jeopardy unless it met a certain threshold of confidence in an answer, and I think it was about eighty percent. So if it were eighty percent certain that it had the right answer, it would buzz in. But if it was less than eighty percent sure, it would not put forth that answer. There are two broad types of statistical models in speech recognition systems today. There are others that could be used, but there are two broad ones that tend to be used these days. They are the hidden Markov model and neural networks. The hidden Markov model, by the way, is overwhelmingly the most popular method of using a statistical model for speech recognition.
Speaker 1: It is the prevalent approach, and it works sort of how I just described. It looks at each phoneme and starts to build out a pathway. If you think of this as like an actual physical path that you're following, you would start off with the first phoneme, which represents the beginning of the path, and that phoneme might eliminate other possible phonemes right away. By that, I mean it might be a sound that doesn't combine with certain other sounds within that language. There might be a phoneme that does not combine with other specific phonemes. So imagine you have a path, and originally it splits into tons of other pathways, but a couple of those pathways are blocked off with signs that say the pathway is closed. It's closed because those pathways represent phonemes that would never be paired with the initial one. You just don't get that sound combination in English. The closed paths would therefore be off limits, and only the open paths would be possibilities. Then the hidden Markov model would look at the next phoneme, the next step along this pathway.
Speaker 1: That phoneme determines which of the viable path options is actually the one to follow. All the other options would be discarded, and so on. It would go all the way down the list of phonemes until the model arrives at a conclusion of the most likely word that was spoken. It assigns a probability score to each phoneme, thinking, "I'm pretty sure the sound that I heard," quote unquote, "was this." That helps the system make an educated guess as to what word was actually spoken. Now, I've talked a lot about neural networks in the past. I'm just going to give them a quick, cursory covering here, because they really aren't the dominant statistical model in speech recognition. Neural networks have nodes, computer nodes or algorithms that act like a neuron, like a brain cell, and they execute operations on data.
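The hidden Markov model's path-following, with closed paths pruned away, is essentially what the Viterbi decoding algorithm does. Here's a minimal sketch over a made-up three-phoneme word model; the states, transition probabilities, and emission probabilities are all invented for illustration, and a zero transition probability plays the role of a "closed pathway."

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Hidden Markov model decoding: for each state, keep only the
    most probable path so far; transitions with probability zero act
    like closed pathways and are never extended."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        nxt = {}
        for s in states:
            # Pick the predecessor state that maximizes path probability.
            p, path = max(
                (best[prev][0] * trans_p[prev].get(s, 0.0), best[prev][1])
                for prev in states
            )
            nxt[s] = (p * emit_p[s].get(obs, 0.0), path + [s])
        best = nxt
    return max(best.values())   # (probability, most likely state path)

# Hypothetical model for a word like "cat": "k" can be followed by
# "ae" but by nothing else, so every other pathway stays closed.
states = ["k", "ae", "t"]
start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}
trans_p = {"k": {"ae": 1.0}, "ae": {"t": 1.0}, "t": {}}
emit_p = {"k":  {"o1": 0.9, "o2": 0.1, "o3": 0.1},
          "ae": {"o1": 0.1, "o2": 0.9, "o3": 0.1},
          "t":  {"o1": 0.1, "o2": 0.1, "o3": 0.9}}

prob, path = viterbi(["o1", "o2", "o3"], states, start_p, trans_p, emit_p)
print(path)  # the most likely phoneme sequence for these observations
```

Because only the best path into each state survives each step, the amount of work grows with the number of states rather than the number of possible paths, which is what makes this tractable compared with brute force.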
Speaker 1: The neurons also assign a probability score to that execution of data, which shows the confidence the system has in the result, before they pass it on to another neuron in the network, which then executes another operation on the data, and so on. Ultimately, the network produces an end result of all those operations and judges the probability of whether or not that result is the right one. And again, if it meets a certain threshold, then it's considered the correct answer, or the closest to correct that the system can manage. In any case, speech recognition systems have to be trained, and there are trillions of potential combinations of sounds that could represent different words. In the HowStuffWorks article "How Speech Recognition Works," Ed Grabianowski, who is one of the powerhouses of the site, he's written some of the best articles on HowStuffWorks, gave a great example. He says, take the phrase "recognize speech." The phonemes in that phrase happen to be pretty similar to a totally different phrase, which would be "wreck a nice beach." So you have "recognize speech" or "wreck a nice beach."
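The neuron-by-neuron scoring described above can be sketched as a tiny feed-forward network. This is only an illustration: the weights, biases, and input "acoustic features" are arbitrary made-up numbers, whereas a real acoustic model would learn its weights from hours of training data.

```python
import math

def neuron(inputs, weights, bias):
    """One node: a weighted sum of inputs squashed to a 0..1 score."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))     # sigmoid activation

def tiny_network(features):
    """Two hidden neurons feeding one output neuron; the final value is
    the network's confidence that the features match a target word."""
    h1 = neuron(features, [1.5, -2.0], 0.1)
    h2 = neuron(features, [-1.0, 2.5], -0.3)
    return neuron([h1, h2], [2.0, 2.0], -2.0)

confidence = tiny_network([0.8, 0.9])     # made-up acoustic features
print(confidence)
```

Whatever the inputs, the sigmoid keeps each score between zero and one, which is what lets the output be read as a confidence and compared against a decision threshold.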
Speaker 1: The speech recognition software has to be able to determine the difference, or else the next thing you know, you're going to have Terminators kicking sand in everyone's face, and that's no good. Alexander Waibel, who worked on that system called Harpy that I mentioned earlier, had another couple of examples. He said you might say "euthanasia" and get the result "youth in Asia." Or you might say "give me a new display" and get the result "give me a nudist play." If you've ever used something like Google's transcripts, where if you had Google Voice and you were reading the voicemails, you could get hilarious results because of this. The speech recognition, the speech-to-text feature, could end up spelling out truly ridiculous messages. I would get messages from my mother, and I only wish my mom would leave me messages the way that Google's transcripts thought she was leaving me messages, because they were the most crazy messages ever. But it's mostly because my mom has a Southern accent, and so Google would often misinterpret what she was saying. So these systems have to undergo hours of training.
John Garofolo, a 325 00:19:14,520 --> 00:19:17,800 Speaker 1: computer scientist who was cited in that How Stuff Works article, 326 00:19:18,119 --> 00:19:22,600 Speaker 1: had this to say. These statistical systems need lots of 327 00:19:22,640 --> 00:19:26,960 Speaker 1: exemplary training data to reach their optimal performance, sometimes on 328 00:19:27,000 --> 00:19:30,840 Speaker 1: the order of thousands of hours of human transcribed speech 329 00:19:30,880 --> 00:19:34,960 Speaker 1: and hundreds of megabytes of text. These training data are 330 00:19:35,040 --> 00:19:38,919 Speaker 1: used to create acoustic models of words, word lists, and 331 00:19:39,040 --> 00:19:43,200 Speaker 1: multi word probability networks. There is some art in how 332 00:19:43,240 --> 00:19:47,680 Speaker 1: one selects, compiles, and prepares this training data for digestion 333 00:19:47,800 --> 00:19:50,680 Speaker 1: by the system, and how the system models are tuned 334 00:19:50,840 --> 00:19:54,400 Speaker 1: to a particular application. These details can make the difference 335 00:19:54,400 --> 00:19:57,719 Speaker 1: between a well performing system and a poorly performing system, 336 00:19:57,960 --> 00:20:01,639 Speaker 1: even when using the same basic algorithm. Speech recognition 337 00:20:01,680 --> 00:20:05,240 Speaker 1: also requires a decent amount of processing power. This was 338 00:20:05,280 --> 00:20:08,440 Speaker 1: a limiting factor on speech recognition for a really long time. 339 00:20:08,880 --> 00:20:12,639 Speaker 1: Systems were limited in their capabilities, which meant that for years, 340 00:20:12,680 --> 00:20:16,399 Speaker 1: if you wanted to incorporate speech recognition in a computer system, 341 00:20:16,480 --> 00:20:19,320 Speaker 1: then most of the computer's processing power would have 342 00:20:19,400 --> 00:20:22,119 Speaker 1: to dedicate itself just to parsing speech. 
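One of the things Garofolo mentions, the "multi word probability networks," can be sketched very simply. Here is a toy bigram model in Python: it counts which words follow which in transcribed text and estimates the probability of the next word. The one-line corpus is a stand-in assumption for the thousands of hours of human-transcribed speech he describes.

```python
from collections import Counter, defaultdict

# Tiny stand-in for a large corpus of human-transcribed speech
corpus = "recognize speech with a speech recognition system that can recognize speech".split()

# Count word pairs to build a toy multi-word probability network (a bigram model)
pair_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    pair_counts[prev][nxt] += 1

def next_word_probability(prev, nxt):
    """Estimate P(next word | previous word) from the transcribed training data."""
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][nxt] / total if total else 0.0

# In this toy corpus, "recognize" is always followed by "speech"
p = next_word_probability("recognize", "speech")
```

A recognizer can use these probabilities to break ties between acoustically similar hypotheses: if "recognize" is almost always followed by "speech" in the training data, "wreck a nice beach" gets penalized.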
You couldn't do 343 00:20:22,240 --> 00:20:25,320 Speaker 1: much else on that machine. But since Moore's law held 344 00:20:25,400 --> 00:20:27,400 Speaker 1: up so well for decades, we got to a point 345 00:20:27,400 --> 00:20:30,000 Speaker 1: where the processing capabilities of machines reached a stage 346 00:20:30,200 --> 00:20:33,679 Speaker 1: where this isn't as big a concern. And another development 347 00:20:33,960 --> 00:20:38,680 Speaker 1: that Google really helped pioneer definitely changed things. I'll talk 348 00:20:38,720 --> 00:20:41,160 Speaker 1: more about that in our next section, but first let's 349 00:20:41,160 --> 00:20:51,560 Speaker 1: take another quick break to thank our sponsors. Okay, So, 350 00:20:51,600 --> 00:20:54,800 Speaker 1: advances in speech recognition in the late nineteen seventies paved 351 00:20:54,920 --> 00:20:58,280 Speaker 1: the way for how most systems work these days, though 352 00:20:58,320 --> 00:21:01,480 Speaker 1: of course the models have undergone multiple refinements and 353 00:21:01,520 --> 00:21:05,080 Speaker 1: tweaks over time. The first speech recognition product to ever 354 00:21:05,200 --> 00:21:09,200 Speaker 1: launch for consumers was a program called Dragon Dictate, which 355 00:21:09,240 --> 00:21:13,879 Speaker 1: debuted in the early nineteen nineties. Dragon Dictate, the original version that is, because 356 00:21:13,920 --> 00:21:17,119 Speaker 1: new versions still come out to this day, relied on discrete 357 00:21:17,160 --> 00:21:20,320 Speaker 1: speech recognition. Now, I don't mean you had to be 358 00:21:20,400 --> 00:21:22,879 Speaker 1: secretive and hush hush about it. It's not that kind 359 00:21:22,880 --> 00:21:25,879 Speaker 1: of discreet. Rather, I mean you had to pronounce each 360 00:21:26,040 --> 00:21:28,879 Speaker 1: word clearly, with a pause between words. 
You could not 361 00:21:29,080 --> 00:21:33,040 Speaker 1: speak conversationally, or else the dictation software could not interpret what 362 00:21:33,080 --> 00:21:43,080 Speaker 1: you were saying, so using the software would sound like this. 363 00:21:44,840 --> 00:21:47,440 Speaker 1: It was limited and it was primitive compared to today's 364 00:21:47,520 --> 00:21:51,080 Speaker 1: speech recognition products, but it was a groundbreaking product in 365 00:21:51,119 --> 00:21:54,560 Speaker 1: the early nineties. And it also cost somewhere between six 366 00:21:54,600 --> 00:21:59,320 Speaker 1: thousand and nine thousand dollars. I saw differing accounts, but 367 00:21:59,359 --> 00:22:02,920 Speaker 1: that would be between nine and fourteen grand in today's dollars, 368 00:22:02,960 --> 00:22:07,320 Speaker 1: so a pretty expensive software package. Dragon still produces speech recognition 369 00:22:07,359 --> 00:22:09,760 Speaker 1: technologies to this day, and of course they are much 370 00:22:09,800 --> 00:22:13,000 Speaker 1: more adept at recognizing and transcribing speech than the original 371 00:22:13,080 --> 00:22:16,160 Speaker 1: version was years ago. The software is also less expensive. 372 00:22:16,640 --> 00:22:19,399 Speaker 1: One version I saw retails for less than a hundred dollars, 373 00:22:19,400 --> 00:22:23,000 Speaker 1: so a nice, big, deep price cut. Advancements in model design 374 00:22:23,040 --> 00:22:27,800 Speaker 1: and processor speed meant that speech recognition technology advanced rather quickly. 375 00:22:28,240 --> 00:22:33,280 Speaker 1: Bell South released VAL, v a L, the Voice 376 00:22:33,280 --> 00:22:36,840 Speaker 1: Portal. VAL was an automated interactive system that could respond 377 00:22:36,880 --> 00:22:40,520 Speaker 1: to questions over the phone. 
This was a basic implementation 378 00:22:40,560 --> 00:22:42,719 Speaker 1: that would evolve over time into the systems you may 379 00:22:42,760 --> 00:22:45,800 Speaker 1: have encountered when calling up automated menus, where it's 380 00:22:46,160 --> 00:22:49,800 Speaker 1: press three or say three, that kind of thing, 381 00:22:50,080 --> 00:22:53,600 Speaker 1: or, do you have any questions? You can say anything 382 00:22:53,640 --> 00:22:56,479 Speaker 1: from check my balance to, you know, that kind of stuff. 383 00:22:57,320 --> 00:22:59,920 Speaker 1: In two thousand five, DARPA, which is the same branch 384 00:23:00,000 --> 00:23:02,199 Speaker 1: of the Department of Defense that used to be 385 00:23:02,240 --> 00:23:04,280 Speaker 1: known as ARPA, so in other words, it's the same 386 00:23:04,680 --> 00:23:09,120 Speaker 1: R and D arm that funded the creation of the Internet, 387 00:23:09,480 --> 00:23:11,760 Speaker 1: funded a program called the 388 00:23:11,800 --> 00:23:17,359 Speaker 1: Global Autonomous Language Exploitation Project, or GALE. The purpose of 389 00:23:17,359 --> 00:23:20,639 Speaker 1: this project was to advance research and development into automated 390 00:23:20,640 --> 00:23:24,480 Speaker 1: translation between languages. So not only were computers supposed to 391 00:23:24,520 --> 00:23:27,560 Speaker 1: be able to recognize speech, but also to translate that speech 392 00:23:27,600 --> 00:23:31,320 Speaker 1: from one language into another, which adds another layer of 393 00:23:31,359 --> 00:23:35,120 Speaker 1: complexity on top. 
Right. Well, according to SRI International, 394 00:23:35,680 --> 00:23:39,480 Speaker 1: the system should be able to quote automatically take multilingual 395 00:23:39,600 --> 00:23:44,239 Speaker 1: newscasts, text documents, and other forms of communication and 396 00:23:44,280 --> 00:23:47,679 Speaker 1: make their information available to human queries end quote. So 397 00:23:47,760 --> 00:23:50,840 Speaker 1: it wouldn't just translate the information, which was already even more 398 00:23:50,880 --> 00:23:54,760 Speaker 1: complicated than speech recognition. It could also index that information 399 00:23:54,800 --> 00:23:57,119 Speaker 1: in a meaningful way so you could search for stuff. 400 00:23:57,960 --> 00:24:02,720 Speaker 1: So layer upon layer of complexity for that project. Things 401 00:24:02,720 --> 00:24:06,080 Speaker 1: that helped push speech recognition as well as natural language 402 00:24:06,080 --> 00:24:10,680 Speaker 1: processing to new heights largely came from two competing companies, 403 00:24:11,040 --> 00:24:14,200 Speaker 1: Apple and Google. So let me explain that. In two 404 00:24:14,280 --> 00:24:18,000 Speaker 1: thousand seven, Apple introduced the iPhone, which was the first 405 00:24:18,040 --> 00:24:21,800 Speaker 1: truly successful consumer smartphone, especially here in the United States. 406 00:24:22,119 --> 00:24:25,959 Speaker 1: The smartphone introduced a new era and form of computing. 407 00:24:26,400 --> 00:24:31,720 Speaker 1: It created countless opportunities in numerous areas, including location based computing, 408 00:24:32,160 --> 00:24:36,280 Speaker 1: mobile interactions, and speech recognition. The computer was in a 409 00:24:36,440 --> 00:24:39,800 Speaker 1: phone form factor. Phones are designed for us to talk into, 410 00:24:39,880 --> 00:24:42,520 Speaker 1: so now you could walk around carrying a computer that 411 00:24:42,600 --> 00:24:45,080 Speaker 1: was designed to transmit your voice. 
It's only a matter 412 00:24:45,080 --> 00:24:47,480 Speaker 1: of time before someone figured out a way to leverage 413 00:24:47,520 --> 00:24:51,760 Speaker 1: that for speech recognition. Google, meanwhile, was pioneering an approach 414 00:24:51,800 --> 00:24:55,639 Speaker 1: in which it would perform all the processing functions necessary to 415 00:24:55,720 --> 00:24:58,680 Speaker 1: support speech recognition. It was doing it in the cloud. 416 00:24:59,240 --> 00:25:02,480 Speaker 1: So instead of having the device itself run 417 00:25:02,560 --> 00:25:05,919 Speaker 1: all that processing, the device would have a persistent 418 00:25:05,960 --> 00:25:10,520 Speaker 1: connection to a server on the Internet, and the server 419 00:25:10,640 --> 00:25:13,240 Speaker 1: would do the work. The device would just send the signal 420 00:25:13,280 --> 00:25:16,080 Speaker 1: to the server. The server would process and analyze the 421 00:25:16,119 --> 00:25:18,760 Speaker 1: signal and return the result back to the phone, and 422 00:25:18,800 --> 00:25:21,200 Speaker 1: the phone was just acting as a transmitter. It wasn't 423 00:25:21,280 --> 00:25:24,200 Speaker 1: really having to do any of that analysis itself. So 424 00:25:25,000 --> 00:25:29,600 Speaker 1: in two thousand and eight, Google launched the Google Voice 425 00:25:29,600 --> 00:25:33,040 Speaker 1: Search app for the iPhone that would do all of 426 00:25:33,400 --> 00:25:38,040 Speaker 1: this speech recognition processing. Right, you could speak into 427 00:25:38,080 --> 00:25:40,960 Speaker 1: it and have Google search the terms for you, for 428 00:25:41,080 --> 00:25:43,720 Speaker 1: whatever it was you were saying. 
But again, what was 429 00:25:43,760 --> 00:25:45,920 Speaker 1: really going on was that Google was sending those search 430 00:25:46,080 --> 00:25:50,040 Speaker 1: terms, or rather that speech signal, over to a server 431 00:25:50,160 --> 00:25:53,520 Speaker 1: that Google operated, and then sending the results back down 432 00:25:53,560 --> 00:25:55,840 Speaker 1: to the phone. But to the user it looked like 433 00:25:55,880 --> 00:25:58,240 Speaker 1: the phone itself was doing all the work. The truth 434 00:25:58,400 --> 00:26:02,600 Speaker 1: was it was simply a very basic application of true 435 00:26:02,680 --> 00:26:05,840 Speaker 1: cloud computing, and that created a new method of rolling 436 00:26:05,880 --> 00:26:09,320 Speaker 1: out speech recognition in apps and services. No longer did 437 00:26:09,320 --> 00:26:12,119 Speaker 1: you have to worry about creating a really powerful piece 438 00:26:12,160 --> 00:26:15,280 Speaker 1: of equipment. You could have that be on the back end. 439 00:26:15,840 --> 00:26:18,719 Speaker 1: The piece of equipment the user had could be 440 00:26:18,760 --> 00:26:23,480 Speaker 1: a relatively underpowered terminal, essentially. Meanwhile, it also meant that 441 00:26:23,520 --> 00:26:27,879 Speaker 1: Google could collect enormous samples of data, not necessarily to 442 00:26:27,960 --> 00:26:31,520 Speaker 1: market to people or to identify specific individuals, but rather 443 00:26:31,680 --> 00:26:34,840 Speaker 1: it could collect a lot of data for training its 444 00:26:34,920 --> 00:26:39,200 Speaker 1: speech recognition and natural language recognition models. 
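The phone-as-transmitter pattern described above can be sketched in a few lines of Python. This is a toy illustration, not Google's actual architecture: the "server" here is just a stand-in function with fake audio bytes, where a real deployment would be a network call to a service running the heavy acoustic and language models.

```python
# Stand-in for the heavy server-side work (acoustic model, language model).
# In a real system this would live behind an HTTP endpoint, not in-process.
def cloud_recognize(audio_bytes: bytes) -> str:
    fake_results = {b"\x01\x02": "recognize speech"}  # contrived audio-to-text mapping
    return fake_results.get(audio_bytes, "")

class Phone:
    """The thin client does no analysis; it just transmits and shows the result."""
    def __init__(self, recognizer):
        self.recognizer = recognizer  # in reality, a network call to the cloud

    def dictate(self, audio_bytes: bytes) -> str:
        return self.recognizer(audio_bytes)

phone = Phone(cloud_recognize)
text = phone.dictate(b"\x01\x02")
```

The design point is the split itself: the client stays cheap and simple, and every improvement to `cloud_recognize` reaches all users at once without updating the device.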
Google could build 445 00:26:39,200 --> 00:26:42,080 Speaker 1: out a much more robust model of human speech patterns 446 00:26:42,359 --> 00:26:46,840 Speaker 1: because they had thousands of real world uses going on 447 00:26:46,960 --> 00:26:49,520 Speaker 1: in real time. They could keep using that to build 448 00:26:49,520 --> 00:26:54,919 Speaker 1: out and bolster their models, and that improved Google's speech 449 00:26:54,920 --> 00:26:59,439 Speaker 1: recognition accuracy. Today, major speech recognition platforms typically have an 450 00:26:59,520 --> 00:27:02,800 Speaker 1: error rate below five percent, which is pretty darn impressive. 451 00:27:03,280 --> 00:27:07,880 Speaker 1: According to a comScore estimate, by twenty twenty, half of 452 00:27:07,920 --> 00:27:11,600 Speaker 1: all searches on the Internet will be voice searches. So 453 00:27:11,640 --> 00:27:15,720 Speaker 1: speech recognition, along with natural language processing, could lead to 454 00:27:15,760 --> 00:27:19,679 Speaker 1: a future of ambient computing, in which the environments we 455 00:27:19,880 --> 00:27:23,720 Speaker 1: move through are effectively computer interfaces, and we can access 456 00:27:23,760 --> 00:27:26,600 Speaker 1: them through voice commands and other ways of commanding, maybe 457 00:27:26,600 --> 00:27:29,480 Speaker 1: gesture commands, but that seems like it might be better 458 00:27:29,520 --> 00:27:32,360 Speaker 1: saved for our episode about voice assistants and where we're 459 00:27:32,359 --> 00:27:36,160 Speaker 1: headed with that technology. In our next episode, I'm going 460 00:27:36,200 --> 00:27:40,199 Speaker 1: to really explore natural language processing, how it works, and 461 00:27:40,240 --> 00:27:42,840 Speaker 1: how that field of research has evolved over the last 462 00:27:42,840 --> 00:27:46,119 Speaker 1: few decades. 
It's also really fascinating, and it does, in 463 00:27:46,200 --> 00:27:49,000 Speaker 1: fact, cross over quite a bit with speech recognition. But 464 00:27:49,280 --> 00:27:53,679 Speaker 1: natural language processing goes beyond speech. It also includes text, 465 00:27:54,320 --> 00:27:56,320 Speaker 1: and that will be our next episode. But if you 466 00:27:56,359 --> 00:27:58,760 Speaker 1: have a suggestion for a future topic I should cover 467 00:27:58,960 --> 00:28:01,520 Speaker 1: on tech Stuff, send me a message and let me know 468 00:28:01,560 --> 00:28:04,199 Speaker 1: about it. The email for the show is tech stuff 469 00:28:04,359 --> 00:28:07,440 Speaker 1: at how stuff works dot com, or you can drop 470 00:28:07,440 --> 00:28:09,240 Speaker 1: me a line on Facebook or Twitter. The handle for 471 00:28:09,320 --> 00:28:13,040 Speaker 1: both of those is tech Stuff H S W, and you 472 00:28:13,119 --> 00:28:15,800 Speaker 1: can also follow us on Instagram. I would love it 473 00:28:15,840 --> 00:28:19,200 Speaker 1: if you did, and I'll talk to you again really soon. 474 00:28:24,960 --> 00:28:27,399 Speaker 1: For more on this and thousands of other topics, visit 475 00:28:27,400 --> 00:28:38,520 Speaker 1: how stuff works dot com