WEBVTT - How Smart Speakers Work 0:00:04.240 --> 0:00:07.240 Welcome to Tech Stuff, a production of iHeartRadio's 0:00:07.320 --> 0:00:14.000 How Stuff Works. Hey there, and welcome to Tech Stuff. 0:00:14.040 --> 0:00:16.880 I'm your host, Jonathan Strickland. I'm an executive producer with 0:00:16.920 --> 0:00:19.960 iHeartRadio and I love all things tech, and guys, 0:00:19.960 --> 0:00:24.079 stick with me. I am fighting off a cold. You'll 0:00:24.120 --> 0:00:25.560 be able to hear it in my voice, I have 0:00:25.760 --> 0:00:28.720 no doubt. But you know, I wanted to get you 0:00:28.760 --> 0:00:32.040 guys a brand new episode. So we're gonna fight on 0:00:32.200 --> 0:00:37.200 because the show must keep going. I think, I think 0:00:37.200 --> 0:00:40.120 that's the saying. Oh no, this cold medicine is good, though. 0:00:40.200 --> 0:00:42.959 All right, anyway, I thought that we would do an 0:00:42.960 --> 0:00:47.760 episode about smart speakers because I wanted to kind of 0:00:47.760 --> 0:00:51.199 start this whole episode off with an old man observation, 0:00:51.240 --> 0:00:54.000 you know, get off my lawn kind of thing. And 0:00:54.080 --> 0:00:57.600 this is from our resident old man, old man Strickland. 0:00:57.840 --> 0:01:01.160 That meaning, meaning me. So, when I was young, speakers 0:01:01.160 --> 0:01:03.959 were dumb. Now I don't, I don't mean that speakers 0:01:03.960 --> 0:01:07.479 were useless, or that they were terrible, or that they 0:01:07.480 --> 0:01:11.679 were incapable of replicating certain frequencies or volumes of sound, 0:01:12.319 --> 0:01:15.440 or that they were limited in some other way other 0:01:15.480 --> 0:01:19.319 than they didn't, quote unquote, think. They didn't connect to 0:01:19.360 --> 0:01:22.800 any sort of computational engine in a meaningful way.
You 0:01:22.880 --> 0:01:25.160 might have a set of speakers plugged into a computer, 0:01:25.600 --> 0:01:27.840 but that was just a one way communications tool, right. 0:01:27.840 --> 0:01:29.920 It was just a way to provide an outlet for 0:01:30.120 --> 0:01:33.200 sound that your computer was generating, nothing more than that. 0:01:33.880 --> 0:01:37.200 But contrast that with today, when we have numerous smart 0:01:37.240 --> 0:01:40.440 speakers on the market. These speakers act as a user 0:01:40.480 --> 0:01:44.840 interface between us and the Internet at large, often facilitated 0:01:44.840 --> 0:01:49.120 by a virtual assistant of some kind. Now with these speakers, 0:01:49.200 --> 0:01:52.760 we don't just listen to stuff like music and podcasts 0:01:52.840 --> 0:01:56.640 and the radio and you know, other traditional audio content. 0:01:57.200 --> 0:02:00.520 We use them to find out information. We might link 0:02:00.640 --> 0:02:03.360 them to our calendars so that we can get reminders 0:02:03.360 --> 0:02:06.760 for upcoming appointments. We probably use them to ask about 0:02:06.800 --> 0:02:09.720 the weather report. I use mine at home for that 0:02:09.800 --> 0:02:12.640 all the time, or even more often than that, if 0:02:12.639 --> 0:02:15.200 you're at my house, you'll hear us use it to 0:02:15.240 --> 0:02:17.400 find out which foods are safe for us to feed 0:02:17.480 --> 0:02:21.080 to our dog. My doggie, Tibolt, absolutely loves our smart 0:02:21.120 --> 0:02:24.160 speaker because it frequently gives us permission to spoil him 0:02:24.160 --> 0:02:27.560 with a carrot or a piece of banana. But how 0:02:27.639 --> 0:02:31.200 do these smart speakers work, How are they able to 0:02:31.320 --> 0:02:35.680 respond to our requests? And what are their limitations? How 0:02:35.760 --> 0:02:38.320 safe are they? 
That's the sort of stuff we're gonna 0:02:38.320 --> 0:02:40.920 be looking into in this episode of Tech Stuff, and 0:02:40.960 --> 0:02:44.120 we'll start off with the basics, which means we have 0:02:44.160 --> 0:02:47.519 to start off with how speakers work in general. Now, 0:02:47.520 --> 0:02:49.840 this is something that I've covered before on Tech Stuff, 0:02:50.120 --> 0:02:51.919 but I want to go over it again from a 0:02:52.000 --> 0:02:55.400 high level because, well, I just find it fascinating that 0:02:55.480 --> 0:02:58.800 people figured out how to harness electricity to drive a 0:02:58.840 --> 0:03:02.760 motor so that it could in turn cause components to 0:03:02.800 --> 0:03:07.079 replicate a recorded or transmitted sound. And really, "motor" is being 0:03:07.120 --> 0:03:10.600 too generous, but to drive an element to create vibrations 0:03:10.639 --> 0:03:14.320 that could replicate a sound that was made into another component, 0:03:14.560 --> 0:03:17.080 that whole thing just boggles my mind, that people are 0:03:17.120 --> 0:03:20.920 smart enough to figure that out. Okay, so to understand 0:03:20.960 --> 0:03:24.000 how speakers work, it first helps to understand how sound 0:03:24.160 --> 0:03:28.800 itself works. Sound is a physical phenomenon. Do do do do. 0:03:29.320 --> 0:03:33.560 Sound is all about vibrations, and typically we experience sound 0:03:33.600 --> 0:03:36.280 when we pick up on changes in air pressure that 0:03:36.520 --> 0:03:40.119 enter through our ear canal and then affect the tympanic 0:03:40.160 --> 0:03:44.800 membrane or eardrum. So it's all about these changes 0:03:44.880 --> 0:03:48.720 of air pressure, all about air molecules transmitting 0:03:48.800 --> 0:03:53.520 vibrations from a source outward in a radiating pattern from 0:03:53.520 --> 0:03:56.760 that source. So let's think of someone knocking on 0:03:56.800 --> 0:03:59.520 a door.
For example, you're inside a house, someone's knocking 0:03:59.560 --> 0:04:02.960 on your door. When that person's hand hits the door, 0:04:03.280 --> 0:04:07.600 it causes the door to vibrate, and that vibration transmits 0:04:07.640 --> 0:04:10.440 to the surrounding air molecules on the other side of 0:04:10.440 --> 0:04:13.640 the door. They get pushed through that vibration and then 0:04:13.640 --> 0:04:18.440 pulled when the wood is vibrating back towards its 0:04:18.440 --> 0:04:23.359 original position. So the air molecules vibrate, those air molecules 0:04:23.400 --> 0:04:26.919 cause the next surrounding layer of air molecules to vibrate 0:04:26.960 --> 0:04:29.160 as well, and so on and so forth. It's like 0:04:29.200 --> 0:04:32.360 a cascade or domino effect. You get these little pockets 0:04:32.400 --> 0:04:35.599 of high and low air pressure that travel outward from 0:04:35.720 --> 0:04:40.919 that door. It spreads further as it goes towards, you know, 0:04:41.000 --> 0:04:45.760 any distance, and if you are close enough so that 0:04:46.360 --> 0:04:49.200 you can still detect those changes in air pressure, you 0:04:49.360 --> 0:04:52.320 experience this by hearing the knocking on the door. Those 0:04:52.360 --> 0:04:56.039 vibrating air molecules lose a bit of energy as they 0:04:56.080 --> 0:04:59.200 move outward. Right, as they vibrate to the next layer, 0:04:59.320 --> 0:05:01.159 you start to lose a bit of energy with 0:05:01.279 --> 0:05:05.560 each transmission of that. So the sound gets quieter the 0:05:05.640 --> 0:05:08.800 further away you are because there's not as many air 0:05:08.839 --> 0:05:13.680 molecules vibrating, its amplitude has decreased. So if you are 0:05:13.720 --> 0:05:16.120 in hearing range, you can pick up on those changes 0:05:16.120 --> 0:05:18.720 of air pressure. They encounter the tympanic membrane in your 0:05:18.720 --> 0:05:21.760 ear canal.
Those changes in pressure will cause a reaction 0:05:21.800 --> 0:05:25.919 in your middle and inner ear set that will ultimately 0:05:25.920 --> 0:05:29.720 get picked up by your brain that interprets it as sound. Now, 0:05:29.720 --> 0:05:34.400 the frequency at which those fluctuations occur relates to the 0:05:34.440 --> 0:05:40.880 pitch that we hear, so faster vibrations are higher pitches, 0:05:41.080 --> 0:05:44.760 higher frequencies, higher notes, if you think of a musical scale. 0:05:45.520 --> 0:05:50.200 We perceive the force of the changes as volume, so 0:05:50.839 --> 0:05:54.560 lower forces, lower volume, right, and higher forces, higher volume. 0:05:55.279 --> 0:05:58.039 The human ear can hear a pretty decent range of 0:05:58.080 --> 0:06:02.239 frequencies, from twenty hertz, which means twenty cycles or twenty 0:06:02.360 --> 0:06:06.880 waves per second past a given point of reference, to 0:06:07.000 --> 0:06:12.320 twenty kilohertz. That's twenty thousand cycles or waves per second. 0:06:12.800 --> 0:06:15.440 So yeah, the cycle refers to the frequency of the 0:06:15.480 --> 0:06:19.120 wavelength of sound. The lower the frequency, the lower the sound. 0:06:19.440 --> 0:06:21.560 All right, and then our brain has to make meaning 0:06:21.560 --> 0:06:23.880 of all this, right? It's not just that it's picking 0:06:23.960 --> 0:06:28.279 up on it. Our brain interprets this and we experience 0:06:28.360 --> 0:06:32.359 it as a sound we have heard. So it either 0:06:32.720 --> 0:06:36.840 matches this perceived sound with one we've encountered before, and 0:06:36.880 --> 0:06:39.840 then we say, oh, I know what that is. That's 0:06:39.880 --> 0:06:43.799 someone knocking at the door, or they might be Holy Cala, 0:06:44.000 --> 0:06:46.120 I've never heard that sound in my life. I have 0:06:46.200 --> 0:06:49.920 no idea what it is.
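To put numbers on the hearing range mentioned above: a frequency in hertz is a count of cycles per second, so the time one cycle takes is just one divided by the frequency. Here is a minimal sketch of that arithmetic; the function name is made up for illustration.

```python
def period_seconds(frequency_hz: float) -> float:
    """Return how long one cycle lasts, in seconds, for a frequency in hertz."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1.0 / frequency_hz


# The two ends of the range of human hearing given in the episode.
for label, hz in [("low end", 20.0), ("high end", 20_000.0)]:
    milliseconds = period_seconds(hz) * 1000
    print(f"{label}: {hz:.0f} Hz, one cycle lasts {milliseconds:.2f} ms")
```

So the lowest note we can hear completes a cycle in a twentieth of a second, and the highest does it a thousand times faster.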
If the sound is language, 0:06:50.000 --> 0:06:52.560 then our brains have to derive the meaning from the 0:06:52.640 --> 0:06:56.479 perceived sound. We've heard someone say words, such as you're 0:06:56.520 --> 0:07:00.920 hearing me say this. Then our brains have to take 0:07:01.160 --> 0:07:03.960 that collection of sounds and say, what does that actually mean? 0:07:04.040 --> 0:07:07.200 What is the context, what is the intent? 0:07:07.640 --> 0:07:10.440 What is the message here? Otherwise it would just be, 0:07:10.960 --> 0:07:14.760 you know, random noises that I'm making with my mouth. Alright, 0:07:14.800 --> 0:07:17.760 so we have a basic understanding behind the physics of sound. 0:07:17.760 --> 0:07:21.600 Now to talk about speakers and microphones, and the reason 0:07:21.760 --> 0:07:24.000 I'm going to talk about both of them is that 0:07:24.080 --> 0:07:26.720 the devices complement one another. You can think of one 0:07:26.760 --> 0:07:31.080 as being the other in reverse. Plus, with smart speakers, we 0:07:31.160 --> 0:07:34.160 have to talk about microphones anyway, because smart speakers have 0:07:34.400 --> 0:07:38.280 microphones as well as the speaker element. So you can 0:07:38.360 --> 0:07:41.520 think of this as one long process of taking the 0:07:41.520 --> 0:07:46.280 physical phenomenon of sound waves, transforming that physical phenomenon into 0:07:46.360 --> 0:07:49.800 an electrical signal, taking the electrical signal, and changing it 0:07:49.920 --> 0:07:52.920 back into something that can produce the sound waves that 0:07:53.040 --> 0:07:56.520 started the whole thing. So you're replicating the original sound 0:07:56.560 --> 0:08:00.480 waves with this end device, which in this case is 0:08:00.480 --> 0:08:03.120 a loudspeaker.
So the microphone is the part of the 0:08:03.160 --> 0:08:05.280 process where you take the sound and you turn it 0:08:05.320 --> 0:08:08.080 into an electrical signal, and the speaker's where you take 0:08:08.120 --> 0:08:10.600 the electrical signal and you turn it back into actual sound. 0:08:10.680 --> 0:08:14.640 That's the simple way. But what's actually happening? Well, let's 0:08:14.680 --> 0:08:18.520 talk about it on a physical level. Sound waves go into 0:08:18.560 --> 0:08:23.080 a microphone. So you've got these fluctuations in air pressure 0:08:23.200 --> 0:08:27.120 that encounter a microphone. I'm speaking into a microphone right now, 0:08:27.240 --> 0:08:30.480 so this is happening right now. Inside the microphone is 0:08:30.520 --> 0:08:33.679 a very thin diaphragm, typically made out of a very 0:08:33.720 --> 0:08:37.440 flexible plastic, and it's sort of like the skin of 0:08:37.440 --> 0:08:40.880 a drum. So as the changes in air pressure encounter 0:08:41.360 --> 0:08:45.520 the diaphragm, they cause the diaphragm to move back and forth. Well, 0:08:45.520 --> 0:08:49.319 attached to the diaphragm is a coil of conductive wire, 0:08:49.760 --> 0:08:53.640 and that coil wraps either around or near a permanent magnet. 0:08:54.040 --> 0:08:57.200 Magnets have magnetic fields. They have a north pole and 0:08:57.240 --> 0:09:00.760 a south pole, and there's a magnetic field that surrounds 0:09:01.320 --> 0:09:05.720 the magnet. And the electromagnetic effect means that if 0:09:05.720 --> 0:09:10.600 you move a coil of conductive wire through a magnetic field, 0:09:11.040 --> 0:09:14.280 it will produce a change in voltage in that coil, 0:09:14.600 --> 0:09:19.000 otherwise known as electromotive force, and that means electrical current 0:09:19.040 --> 0:09:22.760 will flow through the coil.
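For anyone who wants the textbook version of what's happening in that coil, the relationship is usually written as Faraday's law of induction. This is the standard physics formula rather than anything specific to a particular microphone, and the symbols are the conventional ones, not names from the episode: $\mathcal{E}$ is the electromotive force, $N$ is the number of turns in the coil, and $\Phi_B$ is the magnetic flux through one turn.

```latex
\mathcal{E} = -N \frac{d\Phi_B}{dt}
```

The faster the diaphragm moves the coil through the magnet's field, the faster the flux changes and the larger the voltage, which is why the electrical signal follows the motion that the sound causes.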
Now, if you have the 0:09:22.880 --> 0:09:26.240 end of that coil attached to a wire, a conductive 0:09:26.280 --> 0:09:30.120 wire for that current to flow through, you can send 0:09:30.160 --> 0:09:33.960 that current onto other components. So for our purposes, the 0:09:34.000 --> 0:09:37.360 component in question would be an amplifier, and I'll get 0:09:37.400 --> 0:09:40.480 to explaining why that is in just a moment, but 0:09:40.559 --> 0:09:43.160 first let's talk about loudspeakers, and the way a 0:09:43.160 --> 0:09:48.000 loudspeaker works is essentially the reverse of a microphone. You've 0:09:48.040 --> 0:09:51.440 got your permanent magnet, around or near which is a 0:09:51.480 --> 0:09:56.360 coil of conductive wire. The wire is connected to a diaphragm, 0:09:56.400 --> 0:09:59.600 one much larger and typically made out of stiffer material 0:09:59.800 --> 0:10:03.480 than the plastic you'd find in a microphone. This is 0:10:03.520 --> 0:10:06.520 the element inside a speaker that will vibrate, that will 0:10:06.559 --> 0:10:10.960 push air and pull air as it moves either outward 0:10:11.040 --> 0:10:14.800 or inward. The electrical signal comes from a source such 0:10:14.880 --> 0:10:17.440 as the microphone we were just using a second ago. 0:10:18.080 --> 0:10:22.439 That comes into the loudspeaker and it flows through the coil. Now, 0:10:22.480 --> 0:10:26.400 when you have an electrical current flowing through a conductive coil, 0:10:26.920 --> 0:10:31.079 you generate a magnetic field because of the laws of electromagnetism. 0:10:31.600 --> 0:10:35.920 You've got the electromagnetic field generated as a result. 0:10:36.280 --> 0:10:39.440 Now that field will interact with the magnetic field of 0:10:39.480 --> 0:10:42.360 the permanent magnet. The permanent magnet always has a 0:10:42.360 --> 0:10:46.040 magnetic field. The coil only has one when electric current 0:10:46.120 --> 0:10:48.760 is flowing through it.
And as I said, we have 0:10:48.840 --> 0:10:51.120 magnets to have a north pole and a south pole. 0:10:51.160 --> 0:10:54.240 And we also know that when we bring two magnets 0:10:54.240 --> 0:10:57.840 with their north poles together, they'll push against each other, 0:10:57.960 --> 0:11:02.240 right because like repels like, But if we turn one 0:11:02.240 --> 0:11:04.640 of those magnets around so that now it's a south 0:11:04.679 --> 0:11:08.560 pole and a north pole, they attract one another, you know, 0:11:08.600 --> 0:11:15.160 opposites attract. So by having the this magnetic field being 0:11:15.200 --> 0:11:21.360 generated by the coil, uh, it starts to generate interactions 0:11:21.400 --> 0:11:25.520 with the magnetic field of the permanent magnet, so they 0:11:25.600 --> 0:11:28.160 start to push and pull against each other. Well, the 0:11:28.240 --> 0:11:31.959 coil is attached to that diaphragm, so it in turn 0:11:32.160 --> 0:11:36.000 drives the diaphragm to either push outward or pull inward. 0:11:36.480 --> 0:11:40.760 That causes air molecules to vibrate, just as it would 0:11:41.120 --> 0:11:43.840 with any other you know, source of sound, and it 0:11:43.920 --> 0:11:48.599 emanates outward from the loudspeaker, so you get a representation 0:11:48.920 --> 0:11:51.839 of the same sound that was going into the microphone 0:11:52.679 --> 0:11:56.760 got converted into an electrical current. The electrical current then 0:11:57.080 --> 0:12:00.360 was passed through a coil and next to a permanent 0:12:00.360 --> 0:12:03.720 magnet to create the same sort of movement. It replicates 0:12:03.720 --> 0:12:07.240 the movement of the original diaphragm in the microphone and 0:12:07.320 --> 0:12:11.200 generates the sound. So you get the replication of the 0:12:11.240 --> 0:12:15.079 sound that was made in the other location. It's pretty cool. 
0:12:15.160 --> 0:12:18.400 I think. Now, I did mention earlier that you would 0:12:18.400 --> 0:12:21.480 need an amplifier. And the reason you need an amplifier 0:12:21.640 --> 0:12:24.920 is that the electrical signal generated by a microphone is 0:12:24.960 --> 0:12:28.440 far too weak to drive a loudspeaker's diaphragm. You just 0:12:29.160 --> 0:12:31.880 wouldn't have the juice to do it. It would be 0:12:32.000 --> 0:12:35.839 much, much less, uh, powerful than what the speaker would need. 0:12:36.120 --> 0:12:39.040 So chances are the diaphragm would either not move at 0:12:39.080 --> 0:12:40.920 all because it would just be too stiff, it would 0:12:41.240 --> 0:12:44.559 resist the movement too much, or it would move so 0:12:44.600 --> 0:12:47.360 weakly as to generate little to no sound, so it 0:12:47.360 --> 0:12:50.559 wouldn't do you any good. So the signal from the 0:12:50.600 --> 0:12:53.240 microphone has to first pass through an amplifier, which, as 0:12:53.280 --> 0:12:56.679 the name implies, takes an incoming signal and increases the 0:12:56.720 --> 0:13:00.960 amplitude of that signal, the volume, in other words. Uh, so 0:13:01.000 --> 0:13:03.480 it doesn't affect pitch, but it does affect the signal 0:13:03.559 --> 0:13:08.160 strength and consequently the volume. And I've done episodes about amplifiers, 0:13:08.240 --> 0:13:11.920 including explaining the difference between amplifiers that use vacuum tubes 0:13:11.960 --> 0:13:14.880 and ones that use transistors, so I'm not going to 0:13:15.000 --> 0:13:18.640 go into that here. Besides, it doesn't really factor into 0:13:18.679 --> 0:13:22.679 our conversation about smart speakers anyway. It's just important for 0:13:23.000 --> 0:13:26.080 it to work with a microphone and speaker setting. Now, 0:13:26.120 --> 0:13:29.600 over the years, engineers have paired microphones and speakers in 0:13:29.720 --> 0:13:33.440 lots of stuff.
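The point about amplifiers changing volume but not pitch can be shown with a few lines of code. This is only a sketch under stated assumptions: it treats the amplifier as an ideal one that multiplies every sample by a fixed gain, and the sample rate, the 440 Hz test tone, and all of the function names are made up for illustration.

```python
import math

SAMPLE_RATE = 8000  # samples per second; an arbitrary choice for this sketch


def sine_wave(frequency_hz, amplitude, n_samples):
    """Generate a sine wave as a list of samples."""
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE)
        for n in range(n_samples)
    ]


def amplify(signal, gain):
    """Idealized amplifier: multiply every sample by the same gain."""
    return [gain * sample for sample in signal]


def zero_crossings(signal):
    """Count sign changes, a rough stand-in for pitch."""
    return sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))


# A weak "microphone" signal and its amplified copy.
weak = sine_wave(frequency_hz=440, amplitude=0.001, n_samples=SAMPLE_RATE)
loud = amplify(weak, gain=1000)

print("peak before:", max(abs(s) for s in weak))
print("peak after: ", max(abs(s) for s in loud))
print("same zero crossings:", zero_crossings(weak) == zero_crossings(loud))
```

The peak grows by the gain, while the number of times the wave crosses zero in a second stays the same, so the tone gets louder without getting higher or lower.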
You've got telephones, you've got intercom systems, 0:13:33.480 --> 0:13:37.280 public address systems, handheld radios, all sorts of things, so 0:13:37.320 --> 0:13:41.160 that technology was well and truly mature before we ever 0:13:41.240 --> 0:13:45.040 got our first smart speaker. There wasn't much call to 0:13:45.160 --> 0:13:49.200 incorporate microphones into home speaker systems for many years. I mean, 0:13:49.760 --> 0:13:52.560 what would you actually use a microphone embedded in a 0:13:52.640 --> 0:13:56.320 speaker for before smart speakers? Typically you would have your 0:13:56.360 --> 0:13:59.280 speakers, like I'm talking about, like sound system speakers. 0:13:59.400 --> 0:14:01.800 You would have them hooked up to some other dumb, 0:14:02.080 --> 0:14:05.800 as in not connected to a network, technology. So it 0:14:05.880 --> 0:14:09.040 might be a sound system or home entertainment setup 0:14:09.040 --> 0:14:11.480 with a television as the focal point, or maybe even, 0:14:11.720 --> 0:14:14.079 you know, a computer for the purposes of playing more 0:14:14.160 --> 0:14:19.240 dynamic sounds for, like, video games and things like that. Um. 0:14:19.320 --> 0:14:21.800 But for a very long time, these were all thought 0:14:21.840 --> 0:14:25.320 of as one way communications applications, right? Like, the sound 0:14:25.400 --> 0:14:27.480 was coming from a source and it would get to 0:14:27.600 --> 0:14:30.800 us through the speakers, but we weren't meant to send 0:14:31.360 --> 0:14:34.480 sound back through those same channels. The information was just 0:14:34.560 --> 0:14:37.440 coming to you. You weren't sending anything back. But that 0:14:37.480 --> 0:14:40.040 would all change in time. Now.
One thing to keep 0:14:40.040 --> 0:14:42.680 in mind about smart speakers is that they are the 0:14:42.680 --> 0:14:46.360 product of several different technologies and lines of innovation and 0:14:46.400 --> 0:14:50.800 development that all converged together. The microphone and speaker technology 0:14:51.120 --> 0:14:53.160 is one of the oldest ones that we can point 0:14:53.200 --> 0:14:57.000 to as far as the fundamental underlying technology is concerned, 0:14:57.560 --> 0:15:00.440 the stuff that's been around since the late nineteenth century. 0:15:00.600 --> 0:15:03.440 Now there is one other we'll talk about that's even older. 0:15:03.720 --> 0:15:06.440 But I don't want to spoil things. I'll just mention 0:15:06.920 --> 0:15:10.560 there is an even older line of development that goes 0:15:10.600 --> 0:15:14.240 into smart speakers than the microphone speaker stuff of the 0:15:14.320 --> 0:15:18.040 nineteenth century. Most of the other components, however, are much 0:15:18.080 --> 0:15:23.239 younger than that. One big one is speech or voice recognition. 0:15:23.600 --> 0:15:28.040 Creating computer systems that could detect noise was relatively simple. Right? 0:15:28.120 --> 0:15:31.120 You could have a computer connected to microphones and they 0:15:31.120 --> 0:15:35.360 could monitor the input from those microphones and any incoming 0:15:35.400 --> 0:15:38.680 signal could be registered. Right, they could record an incoming 0:15:38.720 --> 0:15:42.080 signal that would indicate the microphone had detected a noise. 0:15:42.560 --> 0:15:46.080 That's child's play. That's easy to do. But teaching computers 0:15:46.080 --> 0:15:49.160 how to analyze those signals and decipher them so that 0:15:49.160 --> 0:15:53.440 the computer could display in text or otherwise act upon 0:15:54.000 --> 0:15:57.880 that sound in a meaningful way, that was much 0:15:57.880 --> 0:16:02.400 more difficult. There was an IBM engineer named William C.
0:16:02.680 --> 0:16:06.560 Dersch of the Advanced System Development Division who created an 0:16:06.640 --> 0:16:11.200 early implementation of voice recognition. It was a very limited application, 0:16:11.280 --> 0:16:14.240 but it proved that the ability to interact with computers 0:16:14.280 --> 0:16:18.800 by voice was more than just science fiction. Within IBM, 0:16:18.800 --> 0:16:23.080 it was called the Shoebox. Dersch worked on this project 0:16:23.200 --> 0:16:26.440 in the early nineteen sixties and what he produced was 0:16:26.480 --> 0:16:29.840 a machine that had a microphone attached to it. The 0:16:29.880 --> 0:16:34.680 machine could detect sixteen spoken words, which included the digits 0:16:34.800 --> 0:16:39.160 of zero to nine plus some command indicators like plus, 0:16:39.520 --> 0:16:43.360 minus, total, subtotal. You get the idea. So you 0:16:43.360 --> 0:16:46.680 could speak a string of numbers and then commands to 0:16:46.840 --> 0:16:49.920 this device, then ask it to total everything and it 0:16:49.960 --> 0:16:52.000 would do so. So it was more or less a 0:16:52.080 --> 0:16:58.000 basic calculator with some voice interpretation incorporated into it. Now 0:16:58.280 --> 0:17:02.000 there's a great newsreel piece about this Shoebox. There's a 0:17:02.040 --> 0:17:05.040 demonstration of it, and it came out in nineteen sixty-one, 0:17:05.480 --> 0:17:08.480 and I love that newsreel because it has that great 0:17:08.560 --> 0:17:10.520 music you would hear in the background of those old 0:17:10.560 --> 0:17:14.560 industrial and business films. Anyway, there's also a helpful chart 0:17:14.840 --> 0:17:19.159 that hangs in the background of that video where Dersch 0:17:19.320 --> 0:17:22.439 is actually explaining how it works.
You can see a 0:17:22.480 --> 0:17:25.919 little bit behind him what is actually being 0:17:25.960 --> 0:17:30.520 analyzed, and, uh, he broke the words down into phonemes 0:17:30.560 --> 0:17:36.720 and syllables, so phonemes being specific sounds that make up words. So, 0:17:36.760 --> 0:17:40.679 for example, the digit one is a single syllable word 0:17:40.960 --> 0:17:43.520 with a vowel sound right at the front. But you 0:17:43.600 --> 0:17:48.200 also have the word eight. That's another single syllable word 0:17:48.480 --> 0:17:51.040 with a vowel sound right at the front, but it's 0:17:51.359 --> 0:17:55.280 different from one phonetically in that eight also has a 0:17:55.359 --> 0:17:59.159 plosive and has that hard t at the end. So 0:17:59.200 --> 0:18:02.919 the Shoebox was limited not just in what words it 0:18:02.960 --> 0:18:07.720 could recognize, but also the types of voices it could recognize. 0:18:07.880 --> 0:18:10.760 Get someone who has a different dialect or manner of speech, 0:18:10.760 --> 0:18:12.800 and the machine might not be able to understand them 0:18:12.800 --> 0:18:16.119 because they're not pronouncing the words the same way that 0:18:16.280 --> 0:18:20.560 Dersch did. This would be a big challenge in speech 0:18:20.560 --> 0:18:24.240 recognition moving forward, and it's also an example of where 0:18:24.280 --> 0:18:28.480 we find bias creeping into technology.
And it's not necessarily 0:18:28.520 --> 0:18:32.359 a conscious thing, but if you have people designing a 0:18:32.400 --> 0:18:36.520 system and they're designing it based off their own, uh, 0:18:36.680 --> 0:18:41.280 you know, speech patterns, their own pronunciations, their own dialects, 0:18:41.800 --> 0:18:44.879 then it may be that the system they create works 0:18:44.960 --> 0:18:48.040 really well for them and less well for anyone who 0:18:48.160 --> 0:18:51.440 isn't them. And the further away you are from their 0:18:51.480 --> 0:18:56.200 manner of speaking, the more frustration you will encounter as 0:18:56.240 --> 0:18:59.719 you try to interact with that technology. That's an example 0:18:59.760 --> 0:19:03.200 of bias, and in fact, if you read the histories 0:19:03.320 --> 0:19:06.359 of speech recognition and, as we'll get to later, natural 0:19:06.640 --> 0:19:10.119 language processing, you'll see a lot of people say it 0:19:10.200 --> 0:19:13.119 works great if you happen to be a white man, 0:19:13.840 --> 0:19:17.880 because the manner of speech was being, or the people 0:19:17.920 --> 0:19:21.000 who were designing it were primarily white men who were, 0:19:21.760 --> 0:19:26.000 uh, typically aiming for a what is considered a 0:19:26.080 --> 0:19:31.840 non-accented American dialect somewhere in, you know, the Eastern 0:19:31.920 --> 0:19:35.439 seaboard side. But that meant that if you did have 0:19:35.520 --> 0:19:39.639 an accent or a dialect, or you had a different vernacular, 0:19:40.200 --> 0:19:43.240 that it was harder for the systems to actually understand 0:19:43.240 --> 0:19:46.399 what you were saying. That's an example of bias. Well.
0:19:46.760 --> 0:19:49.359 The general strategy was again to break up speech 0:19:49.400 --> 0:19:52.560 into constituent sound units, you know, those phonemes, and then 0:19:52.600 --> 0:19:55.879 to suss out which words were being spoken based on 0:19:55.920 --> 0:19:59.880 those phonemes, and that was done by digitizing the voice, 0:20:00.160 --> 0:20:04.159 transforming it from sound into data that represented stuff like 0:20:04.240 --> 0:20:08.320 the sound's frequency or pitch, and then matching up specific 0:20:08.359 --> 0:20:12.199 signal signatures with specific phonemes. So generally the 0:20:12.240 --> 0:20:14.919 idea was that the computer system would monitor incoming sound, 0:20:15.280 --> 0:20:18.919 convert the sound into digital data, compare the data it 0:20:19.000 --> 0:20:22.679 had received with information stored in a database, in an effort 0:20:22.720 --> 0:20:26.199 to look for matches. Uh, the Shoebox database was just 0:20:26.320 --> 0:20:29.280 sixteen words in size. Later ones would be much larger, 0:20:29.320 --> 0:20:33.399 but pretty quickly people realized this was not an efficient 0:20:33.480 --> 0:20:37.640 way of doing speech recognition because the bigger the vocabulary, 0:20:37.840 --> 0:20:40.040 the more work-intensive it was to build out 0:20:40.080 --> 0:20:43.520 those databases. So it wasn't something that people thought would 0:20:43.520 --> 0:20:48.560 be sustainable for very large vocabularies. But the Shoebox marked 0:20:48.560 --> 0:20:50.680 the beginning of a serious effort to create machines that 0:20:50.720 --> 0:20:53.720 could accept audio cues as actual input, and as we'll see, 0:20:54.080 --> 0:20:57.760 that's one important component for these smart speaker systems. I've 0:20:57.800 --> 0:20:59.560 got a lot more to say, but before I get 0:20:59.600 --> 0:21:09.760 into the next part, let's take a quick break.
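The compare-against-a-database strategy described above can be sketched in a few lines. To be clear, this is not how the Shoebox or any real recognizer was built; it is a toy nearest-match lookup, and everything in it is an assumption made up for illustration: the three-number "signatures" (roughly, how strongly the word opens with a vowel, how much hiss it has, and how hard the final stop is), the distance cutoff, and the function and variable names.

```python
import math

# Hypothetical signatures: invented feature values standing in for whatever
# measurements a real system would extract from the digitized sound.
TEMPLATES = {
    "one": (0.9, 0.1, 0.0),
    "eight": (0.8, 0.1, 0.9),
    "plus": (0.1, 0.9, 0.7),
}


def recognize(features, templates=TEMPLATES, max_distance=0.5):
    """Return the stored word whose signature is closest, or None if nothing is close."""
    best_word, best_distance = None, math.inf
    for word, signature in templates.items():
        distance = math.dist(features, signature)  # straight-line distance
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word if best_distance <= max_distance else None


# A measurement that starts with a vowel and ends on a hard stop.
print(recognize((0.85, 0.1, 0.85)))
# A measurement that looks nothing like any stored word.
print(recognize((5.0, 5.0, 5.0)))
```

Every new word means another hand-built entry in the table, and every new accent means the measurements drift away from the stored signatures, which is the scaling problem the episode describes.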
Now, 0:21:09.800 --> 0:21:13.480 obviously we didn't jump right into full voice recognition right 0:21:13.520 --> 0:21:17.520 after IBM's Shoebox innovation. The challenges related to building 0:21:17.560 --> 0:21:21.399 automated speech recognition systems were numerous, even for just a 0:21:21.520 --> 0:21:24.879 single language, because, as I said, you can have accents 0:21:24.960 --> 0:21:28.280 and dialects. One voice can have a very different tonal 0:21:28.400 --> 0:21:32.679 quality from another, people speak at different speeds. Teaching machines 0:21:32.720 --> 0:21:35.480 how to recognize speech when the phonemes and pacing of 0:21:35.520 --> 0:21:40.840 that speech aren't consistent from speaker to speaker, that's really hard. 0:21:41.320 --> 0:21:43.119 This kind of gets back to the same sort of 0:21:43.200 --> 0:21:46.680 challenges you have when you're teaching machines how to recognize images. 0:21:47.440 --> 0:21:51.080 You know, you teach a human what a coffee mug is. 0:21:51.119 --> 0:21:53.320 I always use this example, but you teach a human 0:21:53.359 --> 0:21:55.800 what a coffee mug is, and pretty soon they can 0:21:55.840 --> 0:22:00.000 extrapolate from that example and understand that coffee mugs can 0:22:00.000 --> 0:22:03.879 come in all different sizes and colors, and, you know, 0:22:04.240 --> 0:22:08.320 different designs and textures. We get it. Like, you 0:22:08.359 --> 0:22:11.640 see a couple of coffee mugs, you understand. Machines, though, 0:22:12.480 --> 0:22:15.280 they aren't able to do that. Machines, you know, you 0:22:15.320 --> 0:22:17.440 have to give them lots and lots and lots of 0:22:17.480 --> 0:22:20.479 different examples before they can start to pick up on 0:22:20.600 --> 0:22:24.960 what things actually make a coffee mug.
Same sort of 0:22:25.000 --> 0:22:28.639 thing with speech, right? So if you don't have consistency 0:22:28.760 --> 0:22:31.679 between speakers, it makes it very hard for machines to 0:22:31.800 --> 0:22:34.800 learn what people are saying. Now, it didn't take long 0:22:34.880 --> 0:22:37.399 for the tech industry at large to really dive into 0:22:37.400 --> 0:22:41.520 trying to solve this problem. In nineteen seventy-one, DARPA, that's the 0:22:41.640 --> 0:22:45.359 Research and Development division of the United States Department of Defense, 0:22:45.760 --> 0:22:48.800 got behind speech recognition in a big way. Now, remember, 0:22:49.280 --> 0:22:54.080 DARPA itself doesn't do research. The organization's purpose is 0:22:54.080 --> 0:22:58.280 to invite organizations to pitch projects that align with whatever 0:22:58.359 --> 0:23:01.879 DARPA's goals are, and DARPA would provide funding to 0:23:02.440 --> 0:23:07.000 the winning organizations to see these projects to completion if possible. 0:23:07.440 --> 0:23:09.840 So DARPA is really more of a vetting and funding 0:23:10.000 --> 0:23:15.400 organization. Anyway, in nineteen seventy-one, DARPA created a five year program 0:23:15.520 --> 0:23:20.160 called Speech Understanding Research, or SUR. The initial 0:23:20.240 --> 0:23:23.320 goal was pretty darn ambitious considering the capabilities of the 0:23:23.359 --> 0:23:27.240 technology at the time. The project director, Larry Roberts, wanted 0:23:27.240 --> 0:23:30.440 a system that would be capable of recognizing a vocabulary 0:23:30.560 --> 0:23:34.119 of ten thousand words with less than ten percent error. 0:23:34.560 --> 0:23:37.240 After holding a few meetings with some of the leading 0:23:37.320 --> 0:23:41.840 computer engineers of the day, Roberts adjusted that goal significantly. 0:23:42.560 --> 0:23:45.359 After that adjustment, the target was going to be a 0:23:45.400 --> 0:23:50.040 system capable of recognizing one thousand words, not ten thousand.
0:23:50.920 --> 0:23:53.359 The error levels still had to be less than ten percent, 0:23:53.840 --> 0:23:55.720 and the goal was for the system to be able 0:23:55.760 --> 0:24:02.359 to accept continuous speech, as opposed to very deliberate speech 0:24:03.080 --> 0:24:08.000 with pauses between each pair of words, which would not 0:24:08.119 --> 0:24:13.040 be really that useful. One person who was skeptical about 0:24:13.080 --> 0:24:16.760 the potential success of this project was John R. Pierce 0:24:16.960 --> 0:24:20.639 of Bell Labs. He argued that any success would be 0:24:20.720 --> 0:24:25.440 limited so long as machines remained incapable of understanding the words, 0:24:25.840 --> 0:24:28.720 not just recognizing a word based on phonemes, but 0:24:28.840 --> 0:24:31.359 understanding what the word is. That is, Pierce felt that 0:24:31.359 --> 0:24:34.080 the machines needed some way to parse the language to 0:24:34.119 --> 0:24:37.040 get to the meaning of what was being said. That's 0:24:37.080 --> 0:24:38.919