Speaker 1: Welcome to TechStuff, a production of iHeartRadio's How Stuff Works. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio and I love all things tech. And guys, stick with me: I am fighting off a cold. You'll be able to hear it in my voice, I have no doubt. But you know, I wanted to get you guys a brand new episode, so we're gonna fight on, because the show must keep going. I think... I think that's the saying. Oh no, this cold medicine is good, though. All right. Anyway, I thought that we would do an episode about smart speakers, because I wanted to kind of start this whole episode off with an old man observation, you know, a get-off-my-lawn kind of thing. And this is from our resident old man, Old Man Strickland, that meaning me.

Speaker 1: So, when I was young, speakers were dumb. Now, I don't mean that speakers were useless, or that they were terrible, or that they were incapable of replicating certain frequencies or volumes of sound, or that they were limited in some other way, other than that they didn't, quote unquote, think. They didn't connect to any sort of computational engine in a meaningful way. You might have a set of speakers plugged into a computer, but that was just a one-way communications tool, right? It was just a way to provide an outlet for sound that your computer was generating, nothing more than that. But contrast that with today, when we have numerous smart speakers on the market. These speakers act as a user interface between us and the Internet at large, often facilitated by a virtual assistant of some kind. Now, with these speakers, we don't just listen to stuff like music and podcasts and the radio and, you know, other traditional audio content. We use them to find out information. We might link them to our calendars so that we can get reminders for upcoming appointments. We probably use them to ask about the weather report.
Speaker 1: I use mine at home for that all the time. Or, even more often than that, if you're at my house, you'll hear us use it to find out which foods are safe for us to feed to our dog. My doggie, Tibolt, absolutely loves our smart speaker because it frequently gives us permission to spoil him with a carrot or a piece of banana. But how do these smart speakers work? How are they able to respond to our requests? And what are their limitations? How safe are they? That's the sort of stuff we're going to be looking into in this episode of TechStuff, and we'll start off with the basics, which means we have to start off with how speakers work in general.

Speaker 1: Now, this is something that I've covered before on TechStuff, but I want to go over it again from a high level, because, well, I just find it fascinating that people figured out how to harness electricity to drive a motor so that it could in turn cause components to replicate a recorded or transmitted sound. And really, "motor" is being too generous; it's more that electricity drives an element to create vibrations that can replicate a sound that was made into another component. That whole thing just boggles my mind, that people are smart enough to figure that out. Okay. So to understand how speakers work, it first helps to understand how sound itself works. Sound is a physical phenomenon. Do do do do. Sound is all about vibrations, and typically we experience sound when we pick up on changes in air pressure that enter through our ear canal and then affect the tympanic membrane, or eardrum. So it's all about these changes of air pressure, all about air molecules transmitting vibrations from a source outward in a radiating pattern from that source. So let's think of someone knocking on a door, for example. You're inside a house, and someone's knocking on your door.
Speaker 1: When that person's hand hits the door, it causes the door to vibrate, and that vibration transmits to the surrounding air molecules on the other side of the door. They get pushed by that vibration, and then pulled when the wood vibrates back toward its original position. So the air molecules vibrate, those air molecules cause the next surrounding layer of air molecules to vibrate as well, and so on and so forth. It's like a cascade or domino effect. You get these little pockets of high and low air pressure that travel outward from that door. It spreads further as it goes out to, you know, any distance, and if you are close enough that you can still detect those changes in air pressure, you experience this by hearing the knocking on the door. Those vibrating air molecules lose a bit of energy as they move outward, right? As they vibrate to the next layer, you lose a bit of energy with each transmission, so the sound gets quieter the further away you are, because there are not as many air molecules vibrating; its amplitude has decreased. So if you are in hearing range, you can pick up on those changes of air pressure. They encounter the tympanic membrane in your ear canal, and those changes in pressure will cause a reaction in your middle and inner ear that will ultimately get picked up by your brain, which interprets it as sound.

Speaker 1: Now, the frequency at which those fluctuations occur relates to the pitch that we hear. So faster vibrations are higher pitches, higher frequencies, higher notes, if you think of a musical scale. We perceive the force of the changes as volume, so lower forces mean lower volume, right, and higher forces mean higher volume. The human ear can hear a pretty decent range of frequencies, from twenty hertz, which means twenty cycles or twenty waves per second past a given point of reference, up to twenty kilohertz; that's twenty thousand cycles or waves per second. So yeah, the cycle refers to the frequency of the sound wave. The lower the frequency, the lower the sound.
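To put those two relationships in code, here is a minimal Python sketch of synthesizing a tone digitally: the frequency argument sets the pitch and the amplitude argument sets the loudness. The specific notes and numbers are just illustrative.

```python
import numpy as np

SAMPLE_RATE = 44_100  # samples per second, a common audio rate

def tone(frequency_hz, amplitude, duration_s=1.0):
    """Return samples of a sine wave: frequency sets pitch, amplitude sets volume."""
    t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
    return amplitude * np.sin(2 * np.pi * frequency_hz * t)

low_quiet = tone(110.0, 0.2)    # 110 Hz: a low pitch, played quietly
high_loud = tone(1_760.0, 0.8)  # 1,760 Hz: a much higher pitch, played louder
# Human hearing covers roughly 20 Hz up to about 20,000 Hz (20 kHz).
```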
Speaker 1: All right, and then our brain has to make meaning of all this, right? It's not just that it's picking up on it. Our brain interprets this, and we experience it as a sound we have heard. So it either matches this perceived sound with one we've encountered before, and then we say, oh, I know what that is, that's someone knocking at the door, or it might be, holy cow, I've never heard that sound in my life, I have no idea what it is. If the sound is language, then our brains have to derive the meaning from the perceived sound. We've heard someone say words, such as you're hearing me say this, and then our brains have to take that collection of sounds and ask, what does that actually mean? What is the context? What is the intent? What is the message here? Otherwise it would just be, you know, random noises that I'm making with my mouth.

Speaker 1: All right, so we have a basic understanding of the physics of sound. Now to talk about speakers and microphones, and the reason I'm going to talk about both of them is that the devices complement one another. You can think of one as being the other in reverse. Plus, with smart speakers we have to talk about microphones anyway, because smart speakers have microphones as well as the speaker element. So you can think of this as one long process of taking the physical phenomenon of sound waves, transforming that physical phenomenon into an electrical signal, then taking the electrical signal and changing it back into something that can produce the sound waves that started the whole thing. So you're replicating the original sound waves with this end device, which in this case is a loudspeaker. So the microphone is the part of the process where you take the sound and you turn it into an electrical signal, and the speaker is where you take the electrical signal and you turn it back into actual sound. That's the simple way to put it.
Speaker 1: But what's actually happening? Well, let's talk about it on a physical level. Sound waves go into a microphone. So you've got these fluctuations in air pressure that encounter a microphone. I'm speaking into a microphone right now, so this is happening right now. Inside the microphone is a very thin diaphragm, typically made out of a very flexible plastic, and it's sort of like the skin of a drum. So as the changes in air pressure encounter the diaphragm, they cause the diaphragm to move back and forth. Well, attached to the diaphragm is a coil of conductive wire, and that coil wraps either around or near a permanent magnet. Magnets have magnetic fields; they have a north pole and a south pole, and there's a magnetic field that surrounds the magnet. And the electromagnetic effect means that if you move a coil of conductive wire through a magnetic field, it will produce a change in voltage in that coil, otherwise known as an electromotive force, and that means electrical current will flow through the coil. Now, if you have the end of that coil attached to a wire, a conductive wire for that current to flow through, you can send that current on to other components. For our purposes, the component in question would be an amplifier, and I'll get to explaining why that is in just a moment.
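For reference, the induced voltage being described is Faraday's law of induction. A minimal statement of it, with N the number of turns in the coil and Φ_B the magnetic flux passing through it:

\[ \mathcal{E} = -N \frac{d\Phi_B}{dt} \]

The faster the diaphragm, and with it the coil, moves, the faster the flux changes, and the larger the voltage the microphone generates.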
Speaker 1: But first, let's talk about loudspeakers, and the way a loudspeaker works is essentially the reverse of a microphone. You've got your permanent magnet, around or near which is a coil of conductive wire. The coil is connected to a diaphragm, one much larger and typically made out of stiffer material than the plastic you'd find in a microphone. This is the element inside a speaker that will vibrate, that will push air and pull air as it moves either outward or inward. The electrical signal comes from a source, such as the microphone we were just using a second ago; it comes into the loudspeaker and it flows through the coil. Now, when you have an electrical current flowing through a conductive coil, you generate a magnetic field, because of the laws of electromagnetism. You've got an electromagnetic field generated as a result. Now, that field will interact with the magnetic field of the permanent magnet. The permanent magnet always has a magnetic field; the coil only has one when electric current is flowing through it. And as I said, magnets have a north pole and a south pole, and we also know that when we bring two magnets together with their north poles facing, they'll push against each other, right, because like repels like. But if we turn one of those magnets around so that now it's a south pole facing a north pole, they attract one another; you know, opposites attract. So with this magnetic field being generated by the coil, it starts to interact with the magnetic field of the permanent magnet, and they start to push and pull against each other. Well, the coil is attached to that diaphragm, so it in turn drives the diaphragm to either push outward or pull inward. That causes air molecules to vibrate, just as it would with any other, you know, source of sound, and it emanates outward from the loudspeaker. So you get a representation of the same sound that was going into the microphone: the sound got converted into an electrical current, the electrical current was then passed through a coil next to a permanent magnet to create the same sort of movement, and that replicates the movement of the original diaphragm in the microphone and generates the sound. So you get a replication of the sound that was made in the other location. It's pretty cool, I think.
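The push and pull on the speaker side is the companion effect. For the idealized case of a straight wire of total length ℓ carrying current I at right angles to a uniform magnetic field of strength B, the force on the wire is

\[ F = B I \ell \]

so the force, and with it the diaphragm's motion, tracks the current flowing through the coil. That is why the cone traces out the same waveform the electrical signal carries.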
Speaker 1: Now, I did mention earlier that you would need an amplifier, and the reason you need an amplifier is that the electrical signal generated by a microphone is far too weak to drive a loudspeaker's diaphragm. You just wouldn't have the juice to do it. It would be much, much less powerful than what the speaker would need. So chances are the diaphragm would either not move at all, because it would just be too stiff and would resist the movement too much, or it would move so weakly as to generate little to no sound, so it wouldn't do you any good. So the signal from the microphone has to first pass through an amplifier, which, as the name implies, takes an incoming signal and increases the amplitude of that signal, the volume, in other words. So it doesn't affect pitch, but it does affect the signal strength and, consequently, the volume.
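As a rough illustration of what boosting the amplitude means in numbers, gain is commonly expressed in decibels. The voltage figures below are placeholders for the sake of the example, not measurements from any particular microphone or amplifier.

```python
import math

def voltage_gain_db(v_in, v_out):
    """Express a voltage gain on the decibel scale: 20 * log10(Vout / Vin)."""
    return 20 * math.log10(v_out / v_in)

# Boosting a roughly 10-millivolt microphone-level signal up to about 1 volt:
print(voltage_gain_db(0.010, 1.0))  # 40.0 dB of gain
```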
Speaker 1: I've done episodes about amplifiers, including explaining the difference between amplifiers that use vacuum tubes and ones that use transistors, so I'm not going to go into that here. Besides, it doesn't really factor into our conversation about smart speakers anyway; it's just important for making a microphone and speaker setup work. Now, over the years, engineers have paired microphones and speakers in lots of stuff. You've got telephones, you've got intercom systems, public address systems, handheld radios, all sorts of things. So that technology was well and truly mature before we ever got our first smart speaker. There wasn't much call to incorporate microphones into home speaker systems for many years. I mean, what would you actually use a microphone embedded in a speaker for, before smart speakers? Typically you would have your speakers, and I'm talking about, like, sound system speakers, hooked up to some other dumb (as in, not connected to a network) technology. So it might be a sound system or home entertainment setup with a television as the focal point, or maybe even, you know, a computer for the purposes of playing more dynamic sounds for, like, video games and things like that.

Speaker 1: But for a very long time, these were all thought of as one-way communications applications, right? Like, the sound was coming from a source and it would get to us through the speakers, but we weren't meant to send sound back through those same channels. The information was just coming to you; you weren't sending anything back. But that would all change in time. Now, one thing to keep in mind about smart speakers is that they are the product of several different technologies and lines of innovation and development that all converged together. The microphone and speaker technology is one of the oldest ones that we can point to, as far as the fundamental underlying technology is concerned; that's the stuff that's been around since the late nineteenth century. Now, there is one other we'll talk about that's even older, but I don't want to spoil things. I'll just mention there is an even older line of development that goes into smart speakers than the microphone and speaker stuff of the nineteenth century. Most of the other components, however, are much younger than that.

Speaker 1: One big one is speech or voice recognition. Creating computer systems that could detect noise was relatively simple, right? You could have a computer connected to microphones, and it could monitor the input from those microphones, and any incoming signal could be registered. Right, it could record an incoming signal that would indicate the microphone had detected a noise. That's child's play; that's easy to do. But teaching computers how to analyze those signals and decipher them, so that the computer could display in text or otherwise act upon that sound in a meaningful way, that was much more difficult. There was an IBM engineer named William C. Dersch of the Advanced Systems Development Division who created an early implementation of voice recognition. It was a very limited application, but it proved that the ability to interact with computers by voice was more than just science fiction. Within IBM, it was called the Shoebox.
Speaker 1: Dersch worked on this project in the early nineteen sixties, and what he produced was a machine that had a microphone attached to it. The machine could detect sixteen spoken words, which included the digits zero through nine plus some command indicators like plus, minus, total, subtotal. You get the idea. So you could speak a string of numbers and then commands to this device, then ask it to total everything, and it would do so. So it was more or less a basic calculator with some voice interpretation incorporated into it.
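Just to make that interaction concrete, here is a small sketch of the kind of spoken-calculator session being described, assuming the words have already been recognized and handed over as text. It's an illustration of the idea, not a model of how the Shoebox actually did its arithmetic.

```python
DIGITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

def shoebox_session(words):
    """Walk through recognized words, adding or subtracting digits until 'total'."""
    total, sign = 0, 1
    for word in words:
        if word in DIGITS:
            total += sign * DIGITS[word]
        elif word == "plus":
            sign = 1
        elif word == "minus":
            sign = -1
        elif word in ("total", "subtotal"):
            return total
    return total

print(shoebox_session(["seven", "plus", "three", "minus", "two", "total"]))  # -> 8
```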
Speaker 1: Now, there's a great newsreel piece about the Shoebox. There's a demonstration of it, and it came out in nineteen sixty-one, and I love that newsreel because it has that great music you would hear in the background of those old industrial and business films. Anyway, there's also a helpful chart that hangs in the background of that video, where Dersch is actually explaining how it works. You can see a little bit behind him of what is actually being analyzed. He broke the words down into phonemes and syllables, phonemes being the specific sounds that make up words. So, for example, the digit one is a single-syllable word with a vowel sound right at the front. But you also have the word eight; that's another single-syllable word with a vowel sound right at the front, but it's different from one phonetically in that eight also has a plosive, it has that hard T at the end. So the Shoebox was limited not just in what words it could recognize, but also in the types of voices it could recognize. Get someone who has a different dialect or manner of speech, and the machine might not be able to understand them, because they're not pronouncing the words the same way that Dersch did. This would be a big challenge in speech recognition moving forward, and it's also an example of where we find bias creeping into technology.

Speaker 1: And it's not necessarily a conscious thing, but if you have people designing a system and they're designing it based off their own, you know, speech patterns, their own pronunciations, their own dialects, then it may be that the system they create works really well for them and less well for anyone who isn't them. And the further away you are from their manner of speaking, the more frustration you will encounter as you try to interact with that technology. That's an example of bias, and in fact, if you read the histories of speech recognition and, as we'll get to later, natural language processing, you'll see a lot of people say it works great if you happen to be a white man, because the people who were designing it were primarily white men who were typically aiming for what is considered a non-accented American dialect, somewhere on, you know, the Eastern Seaboard side. But that meant that if you did have an accent or a dialect, or you had a different vernacular, it was harder for the systems to actually understand what you were saying. That's an example of bias.

Speaker 1: Well, the general strategy was again to break up speech into its constituent sound units, you know, those phonemes, and then to suss out which words were being spoken based on those phonemes. And that was done by digitizing the voice, transforming it from sound into data that represented stuff like the sound's frequency or pitch, and then matching up specific signal signatures with specific phonemes. So generally the idea was that the computer system would monitor incoming sound, convert the sound into digital data, and compare the data it had received with information stored in a database, in an effort to look for matches. The Shoebox database was just sixteen words in size.
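In code, that database-lookup idea amounts to template matching: store one signature per known word, then pick whichever stored word sits closest to the signature of the incoming sound. The feature numbers below are invented purely for illustration.

```python
import numpy as np

# Hypothetical stored signatures: one small feature vector per known word.
TEMPLATES = {
    "one":   np.array([0.2, 0.9, 0.1]),
    "eight": np.array([0.3, 0.8, 0.7]),
    "plus":  np.array([0.9, 0.2, 0.4]),
}

def recognize(features):
    """Return the stored word whose signature is nearest to the incoming features."""
    return min(TEMPLATES, key=lambda word: np.linalg.norm(TEMPLATES[word] - features))

print(recognize(np.array([0.25, 0.85, 0.65])))  # -> "eight"
```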
Speaker 1: Later databases would be much larger, but pretty quickly people realized this was not an efficient way of doing speech recognition, because the bigger the vocabulary, the more work-intensive it was to build out those databases. So it wasn't something that people thought would be sustainable for very large vocabularies. But the Shoebox marked the beginning of a serious effort to create machines that could accept audio cues as actual input, and as we'll see, that's one important component for these smart speaker systems. I've got a lot more to say, but before I get into the next part, let's take a quick break.

Speaker 1: Now, obviously we didn't jump right into full voice recognition right after IBM's Shoebox innovation. The challenges related to building automated speech recognition systems were numerous, even for just a single language, because, as I said, you can have accents and dialects, one voice can have a very different tonal quality from another, and people speak at different speeds. Teaching machines how to recognize speech when the phonemes and pacing of that speech aren't consistent from speaker to speaker, that's really hard. This kind of gets back to the same sort of challenges you have when you're teaching machines how to recognize images. You know, you teach a human what a coffee mug is, I always use this example, but you teach a human what a coffee mug is, and pretty soon they can extrapolate from that example and understand that coffee mugs can come in all different sizes and colors and, you know, different designs and textures. We get it. Like, you see a couple of coffee mugs, you understand. Machines, though, they aren't able to do that. Machines, you know, you have to give them lots and lots and lots of different examples before they can start to pick up on what things actually make a coffee mug. Same sort of thing with speech, right? So if you don't have consistency between speakers, it makes it very hard for machines to learn what people are saying.
Speaker 1: Now, it didn't take long for the tech industry at large to really dive into trying to solve this problem. In the early nineteen seventies, DARPA, that's the research and development division of the United States Department of Defense, got behind speech recognition in a big way. Now, remember, DARPA itself doesn't do research. The organization's purpose is to invite organizations to pitch projects that align with whatever DARPA's goals are, and DARPA would provide funding to the winning organizations to see those projects through to completion, if possible. So DARPA is really more of a vetting and funding organization. Anyway, DARPA created a five-year program called Speech Understanding Research, or SUR. The initial goal was pretty darn ambitious considering the capabilities of the technology at the time. The project director, Larry Roberts, wanted a system that would be capable of recognizing a vocabulary of ten thousand words with less than ten percent error. After holding a few meetings with some of the leading computer engineers of the day, Roberts adjusted that goal significantly. After that adjustment, the target was going to be a system capable of recognizing one thousand words, not ten thousand. The error levels still had to be less than ten percent, and the goal was for the system to be able to accept continuous speech, as opposed to very deliberate speech with pauses between each pair of words, which would not really be all that useful. One person who was skeptical about the potential success of this project was John R. Pierce of Bell Labs. He argued that any success would be limited so long as machines remained incapable of understanding the words, not just recognizing a word based on phonemes, but understanding what the word is. That is, Pierce felt that the machines needed some way to parse the language to get to the meaning of what was being said. That's an important idea that we will come back to in just a bit.
Speaker 1: Now, among the companies and organizations that landed contracts with DARPA were Carnegie Mellon University; BBN, which actually played a big part in developing ARPANET, the predecessor to the Internet; Lincoln Laboratory; and several more. And very smart people began to create systems intended to recognize speech in meaningful ways. The names of the programs were a lot of fun. There was HWIM, which stood for Hear What I Mean, as in hear, as in listen, hear what I mean. That one was from BBN. CMU introduced Hearsay, which was later designated Hearsay I, and then they came out with Hearsay II. They also would demonstrate another one called Harpy. Oh, and there was a professor at CMU named Dr. James Baker who would design a system called Dragon in nineteen seventy-five, which he would later leverage into a company with his wife, Dr. Janet M. Baker, in the nineteen eighties, and they had a very successful business with speech recognition software. Now, I'm not going to go into each of those programs in deep detail, but rather just mention that they all helped advance the cause of creating systems that can recognize speech. One of the big developments that came out of all that work was a shift to probabilistic models, which would also play a really important part in another phase of developing the smart speaker. So what do I mean when I say probabilistic? Well, as the name indicates, it all has to do with probabilities. Essentially, systems would analyze incoming phonemes and make guesses as to what was being said based on the probability of it being a given word or part of a word. The systems typically go with whatever word has the highest probability of being the correct one. Even with that approach, there are nuances to language that are difficult to account for with a machine.
Speaker 1: So, for example, you have homonyms, in which you have two words that sound the same but have very different meanings and potentially different spellings, like write, as in to write a sentence, or right, as in am I right or am I wrong? Or you could have a pair of words that sound like a single word and have confusion there, such as "a door." You can say "a door," meaning a single door, a door to go into a building, or you might say "adore," as in, I adore this podcast you're doing, Jonathan. That's sweet of you; thank you for saying that. So computer scientists were hard at work advancing both the capability of machines to make correct guesses at individual phonemes and then full words, as well as figuring out a way to teach machines to adjust guesses based on context. That requires a deeper understanding of the language within which you're working. If you're aware of certain idioms, you can make a good guess at a word or phrase even if you didn't get a clean pass at it, right? So, for example, the phrase "it's raining cats and dogs" just means it's raining a lot. And if a system included a database that indicated the phrase "cats and dogs" sometimes follows the phrase "it's raining," then the system is more likely to guess the correct sequence of words instead of guessing something that sounded similar but is wrong. For example, if it decided they must have said "it's raining bats and hogs," that would not make sense. So the systems estimate the probability that any given sequence of sounds within the database matches what the systems have just, quote unquote, heard.
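A toy version of that idea in code: the acoustic score alone can't reliably separate two similar-sounding phrases, but multiplying in a context prior, how likely each phrase is to follow "it's raining," settles the guess. All of the probabilities here are made up for the sake of the example.

```python
# How confident the recognizer is in each phrase from the sound alone (nearly a tie).
ACOUSTIC = {"cats and dogs": 0.48, "bats and hogs": 0.52}

# How likely each phrase is to follow the words "it's raining."
CONTEXT = {"cats and dogs": 0.30, "bats and hogs": 0.001}

def best_guess(candidates):
    """Pick the phrase with the highest combined (acoustic x context) probability."""
    return max(candidates, key=lambda phrase: ACOUSTIC[phrase] * CONTEXT[phrase])

print(best_guess(["cats and dogs", "bats and hogs"]))  # -> "cats and dogs"
```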
Speaker 1: Progress in this area was steady but slow, and I'd argue that it was also a reminder that concepts like Moore's law do not apply universally across technology. Rapid development in one particular domain of technology is not necessarily an indicator that the same sort of progress will be observed in all other areas of tech. We often get into the mistaken habit of believing that Moore's law applies to everything.

Speaker 1: All right. So a related concept to voice recognition is something called natural language processing, and this relates back to how we humans tend to process information compared to the way machines tend to do it. So we humans formulate ideas, we shape those ideas into words and sentences, and we communicate them in some way to other people through that language. It may be through speech, it may be through text, it may even be through a nonverbal or non-literary way, but we communicate those ideas. Machines typically accept input, they perform some process or sequence of processes on that input, and then they supply an output of some sort. Machines do this in machine language. That's a code that's far too difficult for humans to process easily. Binary is an example of machine language. Binary is represented as zeros and ones, which, when grouped together, can represent all sorts of stuff. But if you just looked at a big block of zeros and ones, it would mean nothing to you. It's not easy for humans to use, and machines in turn are not natively able to understand human language, so there's a language barrier there. Because of that, people created different programming languages. These languages provide layers of abstraction from the machine language. They make it easier to create programs, or directions that the computer should follow. So the person who's doing the programming is using a programming language that's easy for humans to use, and that then gets converted into machine language that the computers understand. But what if you could send commands to a computer using natural language, not even a programming language? You could just speak in plain vernacular, whether it's English or any other language, the way humans communicate with one another. What if a computer could extract meaning from a sentence, understand what it was you wanted the computer to do, and then respond appropriately?
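A crude sketch of what that could look like is below: map a plain-English request onto an action with simple keyword matching. Real assistants are far more sophisticated than this; the keywords and responses are invented for illustration.

```python
def handle(utterance):
    """Map a natural-language request to an action using naive keyword matching."""
    text = utterance.lower()
    if "weather" in text:
        return "look up today's forecast"
    if "remind" in text or "calendar" in text:
        return "add a calendar reminder"
    if "play" in text:
        return "start playing audio"
    return "sorry, I didn't catch that"

print(handle("What's the weather like today?"))  # -> look up today's forecast
print(handle("Remind me about the dentist"))     # -> add a calendar reminder
```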
Speaker 1: So imagine how much time you could save if you could just tell your computer what you wanted it to do and it took care of the rest. If you had a powerful enough computer system with strong enough AI, maybe you could even potentially do something like describe a game that you would love to be able to play, like, not a game that exists, a game in your head, and you could describe it to a computer and the computer could actually program that game. Well, we're definitely not anywhere close to that yet, but we have made enormous progress with natural language processing. Now, the history of natural language processing isn't exactly an extension of voice recognition; it's actually more like a parallel line of investigation. And that's because natural language processing doesn't require voice recognition. You can have an implementation in which you just write commands in natural language, you know, you type them out on a keyboard, and the machine then carries out those instructions. So much of the early work in natural language processing was in text-based communication rather than in speech. The history of natural language processing includes stuff like the Turing test, named after Alan Turing. So the most common interpretation of the Turing test these days is that you've got a scenario in which a person is alone in a room with a computer terminal. They can type whatever they like into the computer terminal, and someone or something is responding to them in real time. Now, it might be another person, or it might be a computer system that's responding to that person.
Speaker 1: You run a whole bunch of test subjects through this process, and if the computer system is able to fool a certain percentage of those test subjects, say thirty percent of them, into believing that it is in fact another human and not a computer, it is said to have passed the Turing test. And typically we use that to mean the machine has given off the appearance of possessing intelligence similar to the kind that we humans possess. That gets beyond our scope for this episode, but it helps point out that stuff like speech recognition and natural language processing are both closely related to the field of artificial intelligence; in fact, they really belong within the artificial intelligence domain. The Turing test was more of a hypothetical. It was a bit of a cheeky way of saying, hey, if you can't tell whether or not something is intelligent, it makes sense to treat it as if it actually is intelligent. After all, we assume that every human with whom we interact possesses some level of intelligence based on those interactions, so why should we not extend the same courtesy to machines?

Speaker 1: Now, natural language processing would prove to be another super challenging problem to solve in computer science. Early work was done in translation algorithms, and these were programs that attempted to take phrases written in one language and translate them automatically into a second language. At first, that seemed pretty straightforward, but you realize it's actually pretty tricky. For one thing, you can't just translate word for word and keep the same order from one language to another. The syntax, the rules that the language follows, can be different from language to language. In one language, you might use an infinitive, such as "to record," in the middle of a sentence, while another language might put all the infinitives at the end of a sentence. So in one language, I might say, I'm going to record a podcast in the studio right now, but in another language it might come out as, I'm going a podcast in the studio right now to record. It starts to sound like Yoda.
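Here is a tiny sketch of why word-for-word translation falls over. The target language in this example is German, where the infinitive normally moves to the end of the clause ("Ich will einen Podcast aufnehmen"); a naive dictionary lookup that keeps English word order produces something garbled. The lexicon is drastically simplified for the example.

```python
# Naive English -> German word-for-word lookup that keeps the English word order.
LEXICON = {"i": "ich", "want": "will", "to": "zu", "record": "aufnehmen",
           "a": "einen", "podcast": "Podcast"}

sentence = "i want to record a podcast"
print(" ".join(LEXICON[word] for word in sentence.split()))
# -> "ich will zu aufnehmen einen Podcast"
# Wrong word order (plus a spurious "zu"); natural German would be
# "Ich will einen Podcast aufnehmen."
```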
Speaker 1: So in one language, I might say "I'm going to record a podcast in the studio right now," but in another language it might come out as "I'm going a podcast in the studio right now to record." It starts to sound like Yoda. There was initial excitement around machine translation, but once computer scientists and linguists began to see the scope of this challenge, their excitement faded a bit. Also, there was a lot of other stuff going on in the nineteen sixties and seventies that was demanding a lot of attention, such as the space race. So for a while, this branch of computer science was given less attention than other branches, and by less attention, I really mean funding. Now, when we come back, we'll talk a bit more about the advances that were necessary to support natural language processing, and we'll move on to how this would be another important component in smart speakers. But first, let's take another quick break. Okay, so early enthusiasm for natural language processing created a bit of a hype cycle that ultimately crashed into the telephone pole of unmet expectations. That was a really bad metaphor. Anyway, natural language processing went through something similar to what we saw with virtual reality in the nineteen nineties. You know, people saw what was actually achievable, and then they compared that to what they thought they were going to get, and those two things didn't match up at all, and that really pulled the rug out from under funding for natural language processing, which meant, of course, that progress slowed way down. It kept going, but it was definitely on the back burner for a lot of projects. When interest renewed in the nineteen eighties, there had been a shift in thinking around natural language processing. Computer scientists were starting to look at statistical approaches similar to what was going on with speech recognition, building up probabilistic models in which a computer can start making what amounts to educated guesses at the meaning of a command or a phrase.
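To make that statistical idea a little more concrete, here is a minimal sketch, not drawn from the episode and not any production system: it counts how often words have shown up under each meaning in a handful of made-up labeled phrases, then uses those counts, Naive Bayes-style, to make an educated guess about a new phrase. The phrases, the categories, and the smoothing choice are all assumptions for illustration.

```python
# Toy "educated guess" at the meaning of a phrase from word counts.
from collections import Counter

# Hypothetical phrases labeled with the meaning they turned out to have.
labeled_phrases = [
    ("turn on the lights", "lighting"),
    ("turn off the lamp", "lighting"),
    ("what will the weather be", "weather"),
    ("is it going to rain", "weather"),
]

# Count how often each word appears under each meaning.
word_counts = {}
label_counts = Counter()
for phrase, label in labeled_phrases:
    label_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(phrase.split())

def guess_meaning(phrase):
    """Score each meaning by (roughly) how probable it makes the words we heard."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = label_counts[label] / len(labeled_phrases)  # prior for this meaning
        for word in phrase.split():
            # Add-one smoothing so an unseen word doesn't zero out a meaning entirely.
            score *= (counts[word] + 1) / (total + len(counts) + 1)
        scores[label] = score
    return max(scores, key=scores.get)

print(guess_meaning("turn the lights off"))    # expected: "lighting"
print(guess_meaning("will it rain tomorrow"))  # expected: "weather"
```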
Speaker 1: Machine learning became an important component on the back end of these systems, and later artificial neural networks became an important part as well. A neural network processes information in a way that's sort of analogous to how our brains do it. You have nodes, or neurons, that connect to other nodes, and each node affects incoming data in a certain way, performing some sort of operation on it, and the degree to which they do that in one way versus another is called the weight of that node. Computer scientists apply weights across the nodes in an effort to get a specific result in order to train these models. So you might feed a specific command into such a system, and you let it go through the computational process from the beginning of the neural network through to the end, and then you look at the result. If the result is correct, well, that just means the system is already working as you intended, which honestly is not likely to happen early on. But if it's not correct, then you start adjusting the weights on those nodes in order to affect the outcome. I almost think of it as like Plinko or pachinko, where you've got the little coin and you drop it down and it bounces on all the pegs, and sometimes you might think, all right, this time it's going to go right for that center slot, but it doesn't, and you think, well, maybe if I remove some of these pegs or shift these pegs over a little bit, I can drop it in that same spot and hit the center. It's kind of like that, except you're talking about data, not physical moving parts. So you have to do this a lot, like up to millions of times, in order to try and train a system so that it responds appropriately to commands. And once it's trained, you can then test new commands on the system to see if it can parse them and respond appropriately.
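Here is a minimal sketch of that train-check-adjust loop, again invented for illustration rather than taken from any real assistant: a single layer of weights maps bag-of-words features for a few toy commands to intents, the output is compared with the intended answer, and the weights are nudged over many repetitions until the guesses come out right. The example commands and intents are assumptions.

```python
# Tiny train/check/adjust loop: feed examples through, measure the error,
# nudge the weights, repeat many times.
import numpy as np

# Hypothetical toy "commands" and the intents they should map to.
examples = [
    ("play some music", "music"),
    ("play a song", "music"),
    ("what is the weather", "weather"),
    ("will it rain today", "weather"),
]
intents = ["music", "weather"]

# Bag-of-words vocabulary built from the toy data.
vocab = sorted({w for text, _ in examples for w in text.split()})

def featurize(text):
    words = text.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

X = np.stack([featurize(t) for t, _ in examples])                      # inputs
Y = np.array([[1.0 if i == intents.index(y) else 0.0
               for i in range(len(intents))] for _, y in examples])    # intended answers

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), len(intents)))  # the weights to be trained

for step in range(2000):                 # many, many repetitions
    scores = X @ W                       # run the examples through the layer
    error = scores - Y                   # how far off was the output?
    W -= 0.05 * (X.T @ error) / len(X)   # adjust the weights to shrink the error

# After training, try a phrase the system hasn't seen verbatim.
test = featurize("play music please")
print(intents[int(np.argmax(test @ W))])  # expected to print "music"
```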
Speaker 1: And in this way, the system quote unquote learns over time how to respond to commands. And then we have another component that's important with smart speakers, and that's speech generation. So it's one thing to have a machine either broadcast or play back a recording of speech. It's another thing for a machine to generate brand new speech. In computer science, we call it speech synthesis. Now, this is the really old technology I was alluding to at the beginning of this episode, speech synthesis. If you want to be really, you know, kind of technical about it, it actually predates every other technology I've mentioned up to this point, at least in its most rudimentary implementations. You have to go way back to the eighteenth century, the seventeen seventies, when a Russian smarty pants named Christian Kratzenstein was building a device that used acoustic resonators, these reeds that would vibrate, in an attempt to replicate basic vowel sounds. Now, even with such a working device, it would be really difficult to communicate anything meaningful unless you were, I guess, speaking whale like Dory in Finding Nemo. But it would be an early example of how people tried to create mechanical systems that could replicate speech or elements of speech. Another inventor, named Wolfgang von Kempelen, built an acoustic-mechanical speech machine, and that used reeds and tubes and a pressure chamber, and it was all meant to replicate various speech sounds. He had other elements to create sounds like plosives, those hard sounds that I mentioned earlier in the episode. So he had all these different elements that, working together, could create parts of the sounds that we humans make when we speak. He also built a supposed chess-playing machine, and it turned out that the chess-playing part was a hoax. So unfortunately, because that device was a hoax, a lot of people dismissed his other work, which was legitimate.
Speaker 1: So by fudging on one thing, he kind of cast doubt on everything he had ever done. Skipping ahead quite a bit, we get to Homer Dudley, which is a fantastic name. He unveiled the Voder, or Voice Operating Demonstrator, device at the New York World's Fair in nineteen thirty nine. It consisted of a complex series of controls, and it sort of reminds me of something like a musical instrument, kind of like a synthesizer, but with extra controlling units. Like, there was a wrist element, there was a pedal. There was a lot of stuff that made it very complex, and with a lot of practice, you could create specific sounds from this synthesizer. You could even create words or full sentences, though from what I understand, it was incredibly challenging to do. It had a very steep learning curve, but it demonstrated the possibility of electronic synthesized speech. Now, there was a lot of work done in this field by lots of different talented scientists and engineers, and someday I'll have to do a full episode on the history of speech synthesis. It's really fascinating, but it's far too big a topic to cover in its entirety in this episode. By the late nineteen sixties we had our first text-to-speech system, and by the late nineteen seventies and early nineteen eighties the state of the art had progressed quite a bit, and we were starting to get to a point where we could create very understandable computer voices. They weren't natural, they didn't sound like people, but you could understand what they were saying. And finally, something else that would enable smart speakers and virtual assistants was the pairing of improved network connectivity and cloud computing. That removes the need for the device that you're interacting with to do all the processing on its own. So, if you think about the history of computing, we used to have mainframes with dumb terminals attached to the mainframe, so the terminal wasn't doing any computing.
Speaker 1: It was just tapping into the mainframe computer, which was sending results back to the terminal. Then you get to the era of personal computers, where you had a device sitting on your desk that did all the computing and it didn't connect to anything else. Then we get up to networking and the Internet, where we suddenly had the capability of having really powerful computers, or grids of computers, that were able to take on the processing. You just send the request out to the Internet and you get the response back. That's the basis of cloud computing. So your command or message or whatever relays back to servers on the cloud that then process it and send the proper response to whatever device you're interacting with, and then you get the result. In the case of the smart speaker, it might be playing a specific song or giving you a weather report or whatever it might be. Now, if the speakers were doing some of that computation themselves, that would be an example of edge computing, where the processing takes place, at least in part, at the edge of a network, at those end points. But for now, most of the implementations we see send data back to the cloud to get the right response, so you have to have a persistent Internet connection. These devices are not useful without that connection. You do have some smart speakers that can connect to another device like a smartphone via Bluetooth, so you could do things that way, but without those connections, the smart speaker turns into, you know, just a dumb speaker, or sometimes just a paperweight.
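As a rough sketch of that thin-client arrangement: the speaker-side code does almost nothing except hand the request off and play back whatever comes back. The service logic and responses here are made up, and the network hop is simulated with a local function so the example is self-contained; in a real product this would be a call to the vendor's servers.

```python
# Thin client: capture the request, ship it off, play back the answer.

def cloud_service(request_text: str) -> str:
    """Stand-in for the heavy lifting done on remote servers (hypothetical logic)."""
    if "weather" in request_text:
        return "Expect sunshine and a high of 75."
    if "play" in request_text:
        return "Playing your requested song."
    return "Sorry, I didn't understand that."

def smart_speaker(request_text: str) -> None:
    # 1. The device captures the request (already transcribed here for brevity).
    # 2. It sends the request out over the network...
    response = cloud_service(request_text)  # pretend round trip to the cloud
    # 3. ...and simply plays back whatever the cloud returned.
    print(f"Speaker says: {response}")

smart_speaker("what's the weather today")
smart_speaker("play my favorite song")
```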
Speaker 1: Now, this collection of technologies and disciplines is what enabled Apple to introduce Siri in two thousand and eleven, and Siri is a virtual assistant. Siri's origins actually trace back to the Stanford Research Institute and a group of guys, Gruber, Adam Cheyer, and Dag Kittlaus, who had been working on the concept since the nineteen nineties. When Apple launched the iPhone in two thousand seven, they saw the iPhone as a potential platform for this virtual assistant that they had been building, and they thought, well, this is perfect, because the iPhone has a microphone, so the assistant can respond to voice commands, and it has a speaker, so it could communicate back to the user. It could do all sorts of stuff. We can tap into the interoperability of apps on the device. It's a perfect platform for us to deploy this. So they developed an app once the opportunity arose, because apps were not available for development immediately when Apple launched the iPhone, and once they did launch that app, within a month, less than a month, Steve Jobs was on the phone calling them up and offering to buy the technology, which of course they would agree to, and it would become an integrated component in Apple's iPhone line afterward. And that's where voice assistants kind of lived for a few years. They mostly lived on smartphones like the iPhone. But in November two thousand fourteen, Amazon introduced the Amazon Echo smart speaker, which was originally only available for Prime members, and it had its own virtual assistant named Alexa, and thus the smart speaker era officially began. Now, there are plenty of other smart speakers on the market these days. There are products from Google like Google Home. There are Sonos speakers that can connect to services like Amazon's Alexa or Google's Assistant, and we're probably going to see a ton more, both from companies that piggyback onto services from the big providers like Google and Amazon, and maybe some that are trying to make a go of it with their own branded virtual assistants and services. Smart speakers respond to commands after they quote unquote hear a wake up word or phrase.
Speaker 1: Now, I'm gonna make up a wake up phrase right now so that I don't set off anyone's smart speaker or smart watch or smartphone or smart car or whatever it might be. So this is just a fictional example of a wake up phrase. Let's say I have a smart speaker, and the wake up phrase for my smart speaker happens to be "Hey there, Genie." Well, my smart speaker has a microphone, so it can detect when I say that, but really it's constantly detecting all sounds in its environment. The microphone is always active. It has to be in order to be able to pick up on when I say the wake up phrase. So the microphone is always active on most smart speakers. There are some where you can program it so that it will only activate if you first touch the speaker and that wakes it up. There are some you can do that with, but for the most part, they're always listening. While the speaker can quote unquote hear everything, it's not listening to everything. In other words, it's not monitoring the specific things being said. At least that's what we've been told. And honestly, that makes a ton of sense from an operational standpoint. And the reason I say that is that the sheer amount of information that would be flooding in from all the microphones on all the smart devices from any one provider that happened to be deployed all over the world, that would be an astounding amount of data. And sifting through all that data to find stuff that's useful would take an enormous amount of effort and time and processing power. So while you could have all the microphones listening in all over the place, finding out who to listen to at what time would be a lot trickier and probably not worth the effort it would take to pull something like that off. So what these speakers and other devices are actually doing is looking for a signal that matches the one that represents the wake phrase.
Speaker 1: So when I say "Hey there, Genie," the microphone picks up my voice, which the mic then translates into an electrical signal, which gets digitized and compared against the digital fingerprint of the predesignated wake up phrase. And in this case, the two phrases match. It's like a fingerprint matching something that was left at a site. So that turns the speaker into an active listener rather than a passive one. It's ready to accept a command or a question and to respond to me. But if I didn't say "Hey there, Genie," then the speaker would remain in passive mode, because it wouldn't have a digital fingerprint that matches the one of the wake up phrase. Everything stays at the local level, and none of my sweet secret speech gets transmitted across the internet. It's all staying right there. At least that's what we've been told, and again, I don't have any reason to disbelieve this, but it is something to keep in mind. You are talking about devices that have microphones. Of course, if you have a smartphone, you've already got one of those, or a cell phone in general. You've got a device with a microphone on it near you pretty much all the time.
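Here is a minimal sketch of that local wake-phrase check, under the assumption that the device keeps a stored numeric fingerprint and compares each chunk of incoming audio against it with a simple similarity score. Real devices use trained acoustic models rather than a single vector, but the shape of the logic is the same: stay passive and keep everything local until the match is close enough.

```python
# Local wake-phrase check: compare incoming audio features against a stored
# fingerprint; only a close match switches the device to active listening.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stored fingerprint of the wake phrase "Hey there, Genie."
wake_fingerprint = rng.normal(size=128)

def similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def passive_listen(audio_features, threshold=0.8):
    """Runs locally on the device; nothing leaves the speaker here."""
    if similarity(audio_features, wake_fingerprint) >= threshold:
        print("Wake phrase detected: switching to active listening.")
        return True
    return False  # no match: the audio chunk is simply discarded

# Background chatter: random features, very unlikely to match.
print(passive_listen(rng.normal(size=128)))                            # False
# Someone says the wake phrase: features close to the stored fingerprint.
print(passive_listen(wake_fingerprint + 0.05 * rng.normal(size=128)))  # True
```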
Speaker 1: Now, once I do make a request with my smart speaker, the speaker then sends that request up to the cloud, where it gets processed and analyzed, and then a proper response is returned to me, whether that is playing a song, or giving me information I've asked for, or maybe even interacting with some other smart device in my home, such as adjusting the brightness of the smart lights in my house. Now, if the system is not sure about whatever it was I just said, it will probably return an error phrase. So maybe I'm too far away from the speaker, so it couldn't quote unquote hear me really well. Or maybe I've got a mouthful of peanut butter or something, as I'm wont to do. Then I'm going to get something like "I'm sorry, I don't know how to do that," or "I'm sorry, I didn't understand you," and then I'd have to repeat it. Now, smart speakers are pretty cool. However, they do represent another piece of technology that you have to network to other devices, including your own home network, and as such, that means they represent a potential vulnerability in a network. It doesn't mean they're automatically vulnerable, but it means that every time you are connecting something to your network, you're creating another potential attack vector for a hacker. Now, if everything is super strong, it doesn't really effectively change your safety in any meaningful way. But if one of those things that you connect to your network is less strong than the others, you're looking at a weakest-link situation, where a hacker with the right know-how and tools could potentially target that part of your network to get entry into everything else. And when you're talking about a smart speaker, you're talking about a device that has an active microphone on it. So potentially, if someone were able to compromise a smart speaker, they would be able to listen in on anything that was within range of that smart speaker's microphone. So that's why you have to at least be cognizant of that. Do your research, make sure the devices you're connecting to your network are rated well from a security standpoint, and when you're setting things up and you have to create passwords, create strong passwords that are not used anywhere else. The harder you make things, the more likely hackers will just pass you by, not because you're too tough to crack. Never get it into your head that you're too strong to be hacked. But rather, if there's someone who's weaker, then the hackers are going to go after that person instead. So just don't be the weak person.
Speaker 1: Practice really good security behaviors, and you're more likely to discourage attackers, and they'll go on to someone else, especially if you're talking about newbies who don't really know their way around and are just using tools that other people have designed. They get discouraged very quickly. They'll move on to someone else, because there's always another potential target. I'm curious about you guys, whether or not you have any smart speakers in your life, and if you find them useful. I find mine pretty useful. I use it for a very narrow range of things. I definitely don't use it to its full potential. I know that because once in a blue moon I'll just try something, and I'm amazed at what happens when I get a response. But for the most part, I'm asking about what I can feed my dog, or whether or not it can turn on the lights, and that's about it. Or occasionally playing a song. But I'm curious what you guys are using them for. Reach out to me on social networks. I'm on Facebook and I'm on Twitter, and the handle for both of those is Tech Stuff HSW. Also use those handles if you have suggestions for future episodes. If you've got, you know, an idea for either a company or a technology or a theme in tech you'd really like me to tackle, let me know there, and I'll talk to you again really soon. Tech Stuff is a production of I Heart Radio's How Stuff Works. For more podcasts from I Heart Radio, visit the I Heart Radio app, Apple Podcasts, or wherever you listen to your favorite shows.