WEBVTT - How Smart Speakers Work

0:00:04.240 --> 0:00:07.240
<v Speaker 1>Welcome to TechStuff, a production of iHeartRadio's

0:00:07.320 --> 0:00:14.000
<v Speaker 1>HowStuffWorks. Hey there, and welcome to TechStuff.

0:00:14.040 --> 0:00:16.880
<v Speaker 1>I'm your host, Jonathan Strickland. I'm an executive producer with

0:00:16.920 --> 0:00:19.960
<v Speaker 1>iHeartRadio and I love all things tech. And guys,

0:00:19.960 --> 0:00:24.079
<v Speaker 1>stick with me. I am fighting off a cold. You'll

0:00:24.120 --> 0:00:25.560
<v Speaker 1>be able to hear it in my voice, I have

0:00:25.760 --> 0:00:28.720
<v Speaker 1>no doubt. But you know, I wanted to get you

0:00:28.760 --> 0:00:32.040
<v Speaker 1>guys a brand new episode. So we're gonna fight on

0:00:32.200 --> 0:00:37.200
<v Speaker 1>because the show must keep going. I think, I think

0:00:37.200 --> 0:00:40.120
<v Speaker 1>that's the saying. Oh no, this cold medicine is good, though.

0:00:40.200 --> 0:00:42.959
<v Speaker 1>All right, anyway, I thought that we would do an

0:00:42.960 --> 0:00:47.760
<v Speaker 1>episode about smart speakers because I wanted to kind of

0:00:47.760 --> 0:00:51.199
<v Speaker 1>start this whole episode off with an old-man observation,

0:00:51.240 --> 0:00:54.000
<v Speaker 1>you know, a "get off my lawn" kind of thing. And

0:00:54.080 --> 0:00:57.600
<v Speaker 1>this is from our resident old man, Old Man Strickland.

0:00:57.840 --> 0:01:01.160
<v Speaker 1>That meaning, meaning me. So, when I was young, speakers

0:01:01.160 --> 0:01:03.959
<v Speaker 1>were dumb. Now, I don't, I don't mean that speakers

0:01:03.960 --> 0:01:07.479
<v Speaker 1>were useless, or that they were terrible, or that they

0:01:07.480 --> 0:01:11.679
<v Speaker 1>were incapable of replicating certain frequencies or volumes of sound,

0:01:12.319 --> 0:01:15.440
<v Speaker 1>or that they were limited in some other way other

0:01:15.480 --> 0:01:19.319
<v Speaker 1>than they didn't, quote unquote, "think." They didn't connect to

0:01:19.360 --> 0:01:22.800
<v Speaker 1>any sort of computational engine in a meaningful way. You

0:01:22.880 --> 0:01:25.160
<v Speaker 1>might have a set of speakers plugged into a computer,

0:01:25.600 --> 0:01:27.840
<v Speaker 1>but that was just a one-way communications tool, right?

0:01:27.840 --> 0:01:29.920
<v Speaker 1>It was just a way to provide an outlet for

0:01:30.120 --> 0:01:33.200
<v Speaker 1>sound that your computer was generating, nothing more than that.

0:01:33.880 --> 0:01:37.200
<v Speaker 1>But contrast that with today, when we have numerous smart

0:01:37.240 --> 0:01:40.440
<v Speaker 1>speakers on the market. These speakers act as a user

0:01:40.480 --> 0:01:44.840
<v Speaker 1>interface between us and the Internet at large, often facilitated

0:01:44.840 --> 0:01:49.120
<v Speaker 1>by a virtual assistant of some kind. Now with these speakers,

0:01:49.200 --> 0:01:52.760
<v Speaker 1>we don't just listen to stuff like music and podcasts

0:01:52.840 --> 0:01:56.640
<v Speaker 1>and the radio and you know, other traditional audio content.

0:01:57.200 --> 0:02:00.520
<v Speaker 1>We use them to find out information. We might link

0:02:00.640 --> 0:02:03.360
<v Speaker 1>them to our calendars so that we can get reminders

0:02:03.360 --> 0:02:06.760
<v Speaker 1>for upcoming appointments. We probably use them to ask about

0:02:06.800 --> 0:02:09.720
<v Speaker 1>the weather report. I use mine at home for that

0:02:09.800 --> 0:02:12.640
<v Speaker 1>all the time, or even more often than that, if

0:02:12.639 --> 0:02:15.200
<v Speaker 1>you're at my house, you'll hear us use it to

0:02:15.240 --> 0:02:17.400
<v Speaker 1>find out which foods are safe for us to feed

0:02:17.480 --> 0:02:21.080
<v Speaker 1>to our dog. My doggie, Tybalt, absolutely loves our smart

0:02:21.120 --> 0:02:24.160
<v Speaker 1>speaker because it frequently gives us permission to spoil him

0:02:24.160 --> 0:02:27.560
<v Speaker 1>with a carrot or a piece of banana. But how

0:02:27.639 --> 0:02:31.200
<v Speaker 1>do these smart speakers work? How are they able to

0:02:31.320 --> 0:02:35.680
<v Speaker 1>respond to our requests? And what are their limitations? How

0:02:35.760 --> 0:02:38.320
<v Speaker 1>safe are they? That's the sort of stuff we're gonna

0:02:38.320 --> 0:02:40.920
<v Speaker 1>be looking into in this episode of TechStuff, and

0:02:40.960 --> 0:02:44.120
<v Speaker 1>we'll start off with the basics, which means we have

0:02:44.160 --> 0:02:47.519
<v Speaker 1>to start off with how speakers work in general. Now,

0:02:47.520 --> 0:02:49.840
<v Speaker 1>this is something that I've covered before on TechStuff,

0:02:50.120 --> 0:02:51.919
<v Speaker 1>but I want to go over it again from a

0:02:52.000 --> 0:02:55.400
<v Speaker 1>high level because well, I just find it fascinating that

0:02:55.480 --> 0:02:58.800
<v Speaker 1>people figured out how to harness electricity to drive a

0:02:58.840 --> 0:03:02.760
<v Speaker 1>motor so that it could in turn cause components to

0:03:02.800 --> 0:03:07.079
<v Speaker 1>replicate a recorded or transmitted sound. And really, "motor" is being

0:03:07.120 --> 0:03:10.600
<v Speaker 1>too generous, but to drive an element to create vibrations

0:03:10.639 --> 0:03:14.320
<v Speaker 1>that could replicate a sound that was made into another component,

0:03:14.560 --> 0:03:17.080
<v Speaker 1>that whole thing just boggles my mind that people are

0:03:17.120 --> 0:03:20.920
<v Speaker 1>smart enough to figure that out. Okay, So to understand

0:03:20.960 --> 0:03:24.000
<v Speaker 1>how speakers work, it first helps to understand how sound

0:03:24.160 --> 0:03:28.800
<v Speaker 1>itself works. Sound is a physical phenomenon. Do do do do?

0:03:29.320 --> 0:03:33.560
<v Speaker 1>Sound is all about vibrations, and typically we experience sound

0:03:33.600 --> 0:03:36.280
<v Speaker 1>when we pick up on changes in air pressure that

0:03:36.520 --> 0:03:40.119
<v Speaker 1>enter through our ear canal and then affect the tympanic

0:03:40.160 --> 0:03:44.800
<v Speaker 1>membrane or ear drum. So it's all about these changes

0:03:44.880 --> 0:03:48.720
<v Speaker 1>of air pressure, all about air molecules transmitting

0:03:48.800 --> 0:03:53.520
<v Speaker 1>vibrations from a source outward in a radiating pattern from

0:03:53.520 --> 0:03:56.760
<v Speaker 1>that source. So let's think of someone knocking on

0:03:56.800 --> 0:03:59.520
<v Speaker 1>a door. For example, you're inside a house, someone's knocking

0:03:59.560 --> 0:04:02.960
<v Speaker 1>on your door. When that person's hand hits the door,

0:04:03.280 --> 0:04:07.600
<v Speaker 1>it causes the door to vibrate, and that vibration transmits

0:04:07.640 --> 0:04:10.440
<v Speaker 1>to the surrounding air molecules on the other side of

0:04:10.440 --> 0:04:13.640
<v Speaker 1>the door. They get pushed through that vibration and then

0:04:13.640 --> 0:04:18.440
<v Speaker 1>pulled when the wood is vibrating back towards its

0:04:18.440 --> 0:04:23.359
<v Speaker 1>original position. So the air molecules vibrate, those air molecules

0:04:23.400 --> 0:04:26.919
<v Speaker 1>cause the next surrounding layer of air molecules to vibrate

0:04:26.960 --> 0:04:29.160
<v Speaker 1>as well, and so on and so forth. It's like

0:04:29.200 --> 0:04:32.360
<v Speaker 1>a cascade or domino effect. You get these little pockets

0:04:32.400 --> 0:04:35.599
<v Speaker 1>of high and low air pressure that travel outward from

0:04:35.720 --> 0:04:40.919
<v Speaker 1>that door. It spreads further as it goes towards you know,

0:04:41.000 --> 0:04:45.760
<v Speaker 1>any distance, and if you are close enough so that

0:04:46.360 --> 0:04:49.200
<v Speaker 1>you can still detect those changes in air pressure, you

0:04:49.360 --> 0:04:52.320
<v Speaker 1>experience this by hearing the knocking on the door. Those

0:04:52.360 --> 0:04:56.039
<v Speaker 1>vibrating air molecules lose a bit of energy as they

0:04:56.080 --> 0:04:59.200
<v Speaker 1>move outward. Right, as they vibrate to the next layer,

0:04:59.320 --> 0:05:01.159
<v Speaker 1>you start to lose a bit of energy with

0:05:01.279 --> 0:05:05.560
<v Speaker 1>each transmission of that. So the sound gets quieter the

0:05:05.640 --> 0:05:08.800
<v Speaker 1>further away you are because there's not as many air

0:05:08.839 --> 0:05:13.680
<v Speaker 1>molecules vibrating; its amplitude has decreased. So if you are

0:05:13.720 --> 0:05:16.120
<v Speaker 1>in hearing range, you can pick up on those changes

0:05:16.120 --> 0:05:18.720
<v Speaker 1>of air pressure. They encounter the tympanic membrane in your

0:05:18.720 --> 0:05:21.760
<v Speaker 1>ear canal. Those changes in pressure will cause a reaction

0:05:21.800 --> 0:05:25.919
<v Speaker 1>in your middle and inner ear set that will ultimately

0:05:25.920 --> 0:05:29.720
<v Speaker 1>get picked up by your brain that interprets it as sound. Now,

0:05:29.720 --> 0:05:34.400
<v Speaker 1>the frequency at which those fluctuations occur relates to the

0:05:34.440 --> 0:05:40.880
<v Speaker 1>pitch that we hear, so faster vibrations are higher pitches,

0:05:41.080 --> 0:05:44.760
<v Speaker 1>higher frequencies, higher notes, if you think of a musical scale.

0:05:45.520 --> 0:05:50.200
<v Speaker 1>We perceive the force of the changes as volume, so

0:05:50.839 --> 0:05:54.560
<v Speaker 1>lower forces, lower volume, right? And higher forces, higher volume.

0:05:55.279 --> 0:05:58.039
<v Speaker 1>The human ear can hear a pretty decent range of

0:05:58.080 --> 0:06:02.239
<v Speaker 1>frequencies from twenty hertz, which means twenty cycles or twenty

0:06:02.360 --> 0:06:06.880
<v Speaker 1>waves per second past a given point of reference, to

0:06:07.000 --> 0:06:12.320
<v Speaker 1>twenty kilohertz. That's twenty thousand cycles or waves per second.
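
NOTE
A minimal sketch of the arithmetic behind the hertz figures above, and of the
point made a moment earlier that frequency sets pitch while the strength of
the pressure change sets volume. It is an illustration only. The speed of
sound (343 m/s, air at roughly room temperature), the sample rate, and every
function name are assumptions introduced here, not values from the episode.
```python
import math
# Assumption: speed of sound in air at roughly 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0
# Range of human hearing quoted in the episode: 20 Hz to 20 kHz.
HEARING_RANGE_HZ = (20.0, 20_000.0)
def period_seconds(frequency_hz: float) -> float:
    """Time taken by one full cycle (one wave) at this frequency."""
    return 1.0 / frequency_hz
def wavelength_meters(frequency_hz: float) -> float:
    """Distance one cycle spans while travelling through air."""
    return SPEED_OF_SOUND_M_PER_S / frequency_hz
def is_audible(frequency_hz: float) -> bool:
    """True when the frequency falls inside the quoted hearing range."""
    low, high = HEARING_RANGE_HZ
    return low <= frequency_hz <= high
def sine_samples(frequency_hz: float, amplitude: float,
                 sample_rate_hz: int = 8000, duration_s: float = 0.01) -> list:
    """Samples of a pure tone: frequency is the pitch, amplitude the volume."""
    count = round(sample_rate_hz * duration_s)
    step = 2.0 * math.pi * frequency_hz / sample_rate_hz
    return [amplitude * math.sin(step * n) for n in range(count)]
```
Under these assumptions a 20 Hz tone has a period of 0.05 seconds and a
wavelength of about 17 meters, while a 20 kHz tone has a wavelength of about
1.7 centimeters. Doubling the amplitude passed to sine_samples makes the tone
louder without changing its pitch.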

0:06:12.800 --> 0:06:15.440
<v Speaker 1>So yeah, the cycle refers to the frequency of the

0:06:15.480 --> 0:06:19.120
<v Speaker 1>wavelength of sound. The lower the frequency, the lower the sound.

0:06:19.440 --> 0:06:21.560
<v Speaker 1>All right, and then our brain has to make meaning

0:06:21.560 --> 0:06:23.880
<v Speaker 1>of all this, right? It's not just that it's picking

0:06:23.960 --> 0:06:28.279
<v Speaker 1>up on it. Our brain interprets this and we experience

0:06:28.360 --> 0:06:32.359
<v Speaker 1>it as a sound we have heard. So it either

0:06:32.720 --> 0:06:36.840
<v Speaker 1>matches this perceived sound with one we've encountered before, and

0:06:36.880 --> 0:06:39.840
<v Speaker 1>then we say, oh, I know what that is. That's

0:06:39.880 --> 0:06:43.799
<v Speaker 1>someone knocking at the door, or it might be, holy cow,

0:06:44.000 --> 0:06:46.120
<v Speaker 1>I've never heard that sound in my life. I have

0:06:46.200 --> 0:06:49.920
<v Speaker 1>no idea what it is. If the sound is language,

0:06:50.000 --> 0:06:52.560
<v Speaker 1>then our brains have to derive the meaning from the

0:06:52.640 --> 0:06:56.479
<v Speaker 1>perceived sound. We've heard someone say words such as you're

0:06:56.520 --> 0:07:00.920
<v Speaker 1>hearing me say this. Then our brains have to take

0:07:01.160 --> 0:07:03.960
<v Speaker 1>that collection of sounds and say, what does that actually mean?

0:07:04.040 --> 0:07:07.200
<v Speaker 1>What is the context, what is the intent?

0:07:07.640 --> 0:07:10.440
<v Speaker 1>What is the message here? Otherwise it would just be

0:07:10.960 --> 0:07:14.760
<v Speaker 1>you know, random noises that I'm making with my mouth. Alright,

0:07:14.800 --> 0:07:17.760
<v Speaker 1>so we have a basic understanding behind the physics of sound.

0:07:17.760 --> 0:07:21.600
<v Speaker 1>Now to talk about speakers and microphones and the reason

0:07:21.760 --> 0:07:24.000
<v Speaker 1>I'm going to talk about both of them is that

0:07:24.080 --> 0:07:26.720
<v Speaker 1>the devices complement one another. You can think of one

0:07:26.760 --> 0:07:31.080
<v Speaker 1>as being the other in reverse. Plus, with smart speakers, we

0:07:31.160 --> 0:07:34.160
<v Speaker 1>have to talk about microphones anyway, because smart speakers have

0:07:34.400 --> 0:07:38.280
<v Speaker 1>microphones as well as the speaker element. So you can

0:07:38.360 --> 0:07:41.520
<v Speaker 1>think of this as one long process of taking the

0:07:41.520 --> 0:07:46.280
<v Speaker 1>physical phenomenon of sound waves, transforming that physical phenomenon into

0:07:46.360 --> 0:07:49.800
<v Speaker 1>an electrical signal, taking the electrical signal, and changing it

0:07:49.920 --> 0:07:52.920
<v Speaker 1>back into something that can produce the sound waves that

0:07:53.040 --> 0:07:56.520
<v Speaker 1>started the whole thing. So you're replicating the original sound

0:07:56.560 --> 0:08:00.480
<v Speaker 1>waves with this end device, which in this case is

0:08:00.480 --> 0:08:03.120
<v Speaker 1>a loudspeaker. So the microphone is the part of the

0:08:03.160 --> 0:08:05.280
<v Speaker 1>process where you take the sound and you turn it

0:08:05.320 --> 0:08:08.080
<v Speaker 1>into an electrical signal, and the speaker is where you take

0:08:08.120 --> 0:08:10.600
<v Speaker 1>the electrical signal and you turn it back into actual sound.

0:08:10.680 --> 0:08:14.640
<v Speaker 1>That's the simple way. But what's actually happening? Well, let's

0:08:14.680 --> 0:08:18.520
<v Speaker 1>talk about it on a physical level. Sound waves go into

0:08:18.560 --> 0:08:23.080
<v Speaker 1>a microphone. So you've got these fluctuations in air pressure

0:08:23.200 --> 0:08:27.120
<v Speaker 1>that encounter a microphone. I'm speaking into a microphone right now,

0:08:27.240 --> 0:08:30.480
<v Speaker 1>so this is happening right now. Inside the microphone is

0:08:30.520 --> 0:08:33.679
<v Speaker 1>a very thin diaphragm, typically made out of a very

0:08:33.720 --> 0:08:37.440
<v Speaker 1>flexible plastic, and it's sort of like the skin of

0:08:37.440 --> 0:08:40.880
<v Speaker 1>a drum. So as the changes in air pressure encounter

0:08:41.360 --> 0:08:45.520
<v Speaker 1>the diaphragm, they cause the diaphragm to move back and forth. Well,

0:08:45.520 --> 0:08:49.319
<v Speaker 1>attached to the diaphragm is a coil of conductive wire,

0:08:49.760 --> 0:08:53.640
<v Speaker 1>and that coil wraps either around or near a permanent magnet.

0:08:54.040 --> 0:08:57.200
<v Speaker 1>Magnets have magnetic fields. They have a north pole and

0:08:57.240 --> 0:09:00.760
<v Speaker 1>a south pole, and there's a magnetic field that surrounds

0:09:01.320 --> 0:09:05.720
<v Speaker 1>the magnet. And the electromagnetic effect means that if

0:09:05.720 --> 0:09:10.600
<v Speaker 1>you move a coil of conductive wire through a magnetic field,

0:09:11.040 --> 0:09:14.280
<v Speaker 1>it will produce a change in voltage in that coil,

0:09:14.600 --> 0:09:19.000
<v Speaker 1>otherwise known as electromotive force, and that means electrical current

0:09:19.040 --> 0:09:22.760
<v Speaker 1>will flow through the coil. Now, if you have the

0:09:22.880 --> 0:09:26.240
<v Speaker 1>end of that coil attached to a wire, a conductive

0:09:26.280 --> 0:09:30.120
<v Speaker 1>wire for that current to flow through, you can send

0:09:30.160 --> 0:09:33.960
<v Speaker 1>that current onto other components. So for our purposes, the

0:09:34.000 --> 0:09:37.360
<v Speaker 1>component in question would be an amplifier, and I'll get

0:09:37.400 --> 0:09:40.480
<v Speaker 1>to explaining why that is in just a moment, but

0:09:40.559 --> 0:09:43.160
<v Speaker 1>first let's talk about loudspeakers, and the way a

0:09:43.160 --> 0:09:48.000
<v Speaker 1>loudspeaker works is essentially the reverse of a microphone. You've

0:09:48.040 --> 0:09:51.440
<v Speaker 1>got your permanent magnet around or near which is a

0:09:51.480 --> 0:09:56.360
<v Speaker 1>coil of conductive wire. The wire is connected to a diaphragm,

0:09:56.400 --> 0:09:59.600
<v Speaker 1>one much larger and typically made out of stiffer material

0:09:59.800 --> 0:10:03.480
<v Speaker 1>than the plastic you'd find in a microphone. This is

0:10:03.520 --> 0:10:06.520
<v Speaker 1>the element inside a speaker that will vibrate, that will

0:10:06.559 --> 0:10:10.960
<v Speaker 1>push air and pull air as it moves either outward

0:10:11.040 --> 0:10:14.800
<v Speaker 1>or inward. The electrical signal comes from a source such

0:10:14.880 --> 0:10:17.440
<v Speaker 1>as the microphone we were just using a second ago

0:10:18.080 --> 0:10:22.439
<v Speaker 1>that comes into the loudspeaker and it flows through the coil. Now,

0:10:22.480 --> 0:10:26.400
<v Speaker 1>when you have an electrical current flowing through a conductive coil,

0:10:26.920 --> 0:10:31.079
<v Speaker 1>you generate a magnetic field because of the laws of electromagnetism.
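
NOTE
A minimal sketch of the two relationships a moving-coil microphone and a
loudspeaker rely on, using the standard idealized formulas for a wire moving
at right angles to a uniform magnetic field. The function names and the
sample numbers are assumptions for illustration, not figures from the episode.
```python
def induced_emf_volts(flux_density_t: float, wire_length_m: float,
                      coil_velocity_m_per_s: float) -> float:
    """Microphone side: voltage across a coil moving through a field.
    Idealized motional EMF, e = B * l * v.
    """
    return flux_density_t * wire_length_m * coil_velocity_m_per_s
def coil_force_newtons(flux_density_t: float, wire_length_m: float,
                       current_a: float) -> float:
    """Speaker side: force on a coil that carries current in a field.
    Idealized force, F = B * l * i. The sign follows the current, so
    reversing the current reverses the push on the diaphragm.
    """
    return flux_density_t * wire_length_m * current_a
# Hypothetical numbers: a 1 tesla magnet and 2 meters of wire in the coil.
microphone_voltage = induced_emf_volts(1.0, 2.0, 0.01)
push = coil_force_newtons(1.0, 2.0, 0.5)
pull = coil_force_newtons(1.0, 2.0, -0.5)
```
The same pair of formulas runs in both directions: motion in the field gives
a voltage (microphone), and current in the field gives a force (speaker).
With these made-up numbers the microphone voltage is only a few hundredths of
a volt, which fits the episode's later point that an amplifier is needed.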

0:10:31.600 --> 0:10:35.920
<v Speaker 1>You've got the electromagnetic field generated as a result.

0:10:36.280 --> 0:10:39.440
<v Speaker 1>Now that field will interact with the magnetic field of

0:10:39.480 --> 0:10:42.360
<v Speaker 1>the permanent magnet. The permanent magnet always has a

0:10:42.360 --> 0:10:46.040
<v Speaker 1>magnetic field. The coil only has one when electric current

0:10:46.120 --> 0:10:48.760
<v Speaker 1>is flowing through it. And as I said, we have

0:10:48.840 --> 0:10:51.120
<v Speaker 1>magnets that have a north pole and a south pole.

0:10:51.160 --> 0:10:54.240
<v Speaker 1>And we also know that when we bring two magnets

0:10:54.240 --> 0:10:57.840
<v Speaker 1>with their north poles together, they'll push against each other,

0:10:57.960 --> 0:11:02.240
<v Speaker 1>right, because like repels like. But if we turn one

0:11:02.240 --> 0:11:04.640
<v Speaker 1>of those magnets around so that now it's a south

0:11:04.679 --> 0:11:08.560
<v Speaker 1>pole and a north pole, they attract one another, you know,

0:11:08.600 --> 0:11:15.160
<v Speaker 1>opposites attract. So by having this magnetic field being

0:11:15.200 --> 0:11:21.360
<v Speaker 1>generated by the coil, uh, it starts to generate interactions

0:11:21.400 --> 0:11:25.520
<v Speaker 1>with the magnetic field of the permanent magnet, so they

0:11:25.600 --> 0:11:28.160
<v Speaker 1>start to push and pull against each other. Well, the

0:11:28.240 --> 0:11:31.959
<v Speaker 1>coil is attached to that diaphragm, so it in turn

0:11:32.160 --> 0:11:36.000
<v Speaker 1>drives the diaphragm to either push outward or pull inward.

0:11:36.480 --> 0:11:40.760
<v Speaker 1>That causes air molecules to vibrate, just as it would

0:11:41.120 --> 0:11:43.840
<v Speaker 1>with any other you know, source of sound, and it

0:11:43.920 --> 0:11:48.599
<v Speaker 1>emanates outward from the loudspeaker, so you get a representation

0:11:48.920 --> 0:11:51.839
<v Speaker 1>of the same sound that was going into the microphone.

0:11:52.679 --> 0:11:56.760
<v Speaker 1>It got converted into an electrical current. The electrical current then

0:11:57.080 --> 0:12:00.360
<v Speaker 1>was passed through a coil and next to a permanent

0:12:00.360 --> 0:12:03.720
<v Speaker 1>magnet to create the same sort of movement. It replicates

0:12:03.720 --> 0:12:07.240
<v Speaker 1>the movement of the original diaphragm in the microphone and

0:12:07.320 --> 0:12:11.200
<v Speaker 1>generates the sound. So you get the replication of the

0:12:11.240 --> 0:12:15.079
<v Speaker 1>sound that was made in the other location. It's pretty cool.

0:12:15.160 --> 0:12:18.400
<v Speaker 1>I think. Now, I did mention earlier that you would

0:12:18.400 --> 0:12:21.480
<v Speaker 1>need an amplifier. And the reason you need an amplifier

0:12:21.640 --> 0:12:24.920
<v Speaker 1>is that the electrical signal generated by a microphone is

0:12:24.960 --> 0:12:28.440
<v Speaker 1>far too weak to drive a loudspeaker's diaphragm. You just

0:12:29.160 --> 0:12:31.880
<v Speaker 1>wouldn't have the juice to do it. It would be

0:12:32.000 --> 0:12:35.839
<v Speaker 1>much, much less, uh, powerful than what the speaker would need.

0:12:36.120 --> 0:12:39.040
<v Speaker 1>So chances are the diaphragm would either not move at

0:12:39.080 --> 0:12:40.920
<v Speaker 1>all because it would just be too stiff, it would

0:12:41.240 --> 0:12:44.559
<v Speaker 1>resist the movement too much, or it would move so

0:12:44.600 --> 0:12:47.360
<v Speaker 1>weakly as to generate little to no sound, so it

0:12:47.360 --> 0:12:50.559
<v Speaker 1>wouldn't do you any good. So the signal from the

0:12:50.600 --> 0:12:53.240
<v Speaker 1>microphone has to first pass through an amplifier, which, as

0:12:53.280 --> 0:12:56.679
<v Speaker 1>the name implies, takes an incoming signal and increases the

0:12:56.720 --> 0:13:00.960
<v Speaker 1>amplitude of that signal, the volume, in other words. Uh, so,

0:13:01.000 --> 0:13:03.480
<v Speaker 1>it doesn't affect pitch, but it does affect the signal

0:13:03.559 --> 0:13:08.160
<v Speaker 1>strength and consequently the volume. And I've done episodes about amplifiers,

0:13:08.240 --> 0:13:11.920
<v Speaker 1>including explaining the difference between amplifiers that use vacuum tubes

0:13:11.960 --> 0:13:14.880
<v Speaker 1>and ones that use transistors, so I'm not going to

0:13:15.000 --> 0:13:18.640
<v Speaker 1>go into that here. Besides, it doesn't really factor into

0:13:18.679 --> 0:13:22.679
<v Speaker 1>our conversation about smart speakers anyway. It's just important for

0:13:23.000 --> 0:13:26.080
<v Speaker 1>it to work with a microphone and speaker setting. Now,

0:13:26.120 --> 0:13:29.600
<v Speaker 1>over the years, engineers have paired microphones and speakers in

0:13:29.720 --> 0:13:33.440
<v Speaker 1>lots of stuff. You've got telephones, you've got intercom systems,

0:13:33.480 --> 0:13:37.280
<v Speaker 1>public address systems, handheld radios, all sorts of things, so

0:13:37.320 --> 0:13:41.160
<v Speaker 1>that technology was well and truly mature before we ever

0:13:41.240 --> 0:13:45.040
<v Speaker 1>got our first smart speaker. There wasn't much call to

0:13:45.160 --> 0:13:49.200
<v Speaker 1>incorporate microphones into home speaker systems for many years. I mean,

0:13:49.760 --> 0:13:52.560
<v Speaker 1>what would you actually use a microphone embedded in a

0:13:52.640 --> 0:13:56.320
<v Speaker 1>speaker for? Before smart speakers, typically you would have your

0:13:56.360 --> 0:13:59.280
<v Speaker 1>speakers like I'm talking about, like sound system speakers.

0:13:59.400 --> 0:14:01.800
<v Speaker 1>You would have them hooked up to some other dumb,

0:14:02.080 --> 0:14:05.800
<v Speaker 1>as in not connected to a network, technology. So it

0:14:05.880 --> 0:14:09.040
<v Speaker 1>might be a sound system or home entertainment set up

0:14:09.040 --> 0:14:11.480
<v Speaker 1>with a television as the focal point, or maybe even

0:14:11.720 --> 0:14:14.079
<v Speaker 1>you know, a computer for the purposes of playing more

0:14:14.160 --> 0:14:19.240
<v Speaker 1>dynamic sounds for, like, video games and things like that. Um.

0:14:19.320 --> 0:14:21.800
<v Speaker 1>But for a very long time, these were all thought

0:14:21.840 --> 0:14:25.320
<v Speaker 1>of as one-way communications applications, right? Like, the sound

0:14:25.400 --> 0:14:27.480
<v Speaker 1>was coming from a source and it would get to

0:14:27.600 --> 0:14:30.800
<v Speaker 1>us through the speakers, but we weren't meant to send

0:14:31.360 --> 0:14:34.480
<v Speaker 1>sound back through those same channels. The information was just

0:14:34.560 --> 0:14:37.440
<v Speaker 1>coming to you. You weren't sending anything back, But that

0:14:37.480 --> 0:14:40.040
<v Speaker 1>would all change in time. Now. One thing to keep

0:14:40.040 --> 0:14:42.680
<v Speaker 1>in mind about smart speakers is that they are the

0:14:42.680 --> 0:14:46.360
<v Speaker 1>product of several different technologies and lines of innovation and

0:14:46.400 --> 0:14:50.800
<v Speaker 1>development that all converged together. The microphone and speaker technology

0:14:51.120 --> 0:14:53.160
<v Speaker 1>is one of the oldest ones that we can point

0:14:53.200 --> 0:14:57.000
<v Speaker 1>to as far as the fundamental underlying technology is concerned,

0:14:57.560 --> 0:15:00.440
<v Speaker 1>the stuff that's been around since the late nineteenth century.

0:15:00.600 --> 0:15:03.440
<v Speaker 1>Now there is one other we'll talk about that's even older.

0:15:03.720 --> 0:15:06.440
<v Speaker 1>But I don't want to spoil things. I'll just mention

0:15:06.920 --> 0:15:10.560
<v Speaker 1>there is an even older line of development that goes

0:15:10.600 --> 0:15:14.240
<v Speaker 1>into smart speakers than the microphone speaker stuff of the

0:15:14.320 --> 0:15:18.040
<v Speaker 1>nineteenth century. Most of the other components, however, are much

0:15:18.080 --> 0:15:23.239
<v Speaker 1>younger than that. One big one is speech or voice recognition.

0:15:23.600 --> 0:15:28.040
<v Speaker 1>Creating computer systems that could detect noise was relatively simple, right?

0:15:28.120 --> 0:15:31.120
<v Speaker 1>You could have a computer connected to microphones and they

0:15:31.120 --> 0:15:35.360
<v Speaker 1>could monitor the input from those microphones and any incoming

0:15:35.400 --> 0:15:38.680
<v Speaker 1>signal could be registered, right? They could record an incoming

0:15:38.720 --> 0:15:42.080
<v Speaker 1>signal that would indicate the microphone had detected a noise.

0:15:42.560 --> 0:15:46.080
<v Speaker 1>That's child's play. That's easy to do. But teaching computers

0:15:46.080 --> 0:15:49.160
<v Speaker 1>how to analyze those signals and decipher them so that

0:15:49.160 --> 0:15:53.440
<v Speaker 1>the computer could display in text or otherwise act upon

0:15:54.000 --> 0:15:57.880
<v Speaker 1>that sound in a meaningful way, that was much

0:15:57.880 --> 0:16:02.400
<v Speaker 1>more difficult. There was an IBM engineer named William C.

0:16:02.680 --> 0:16:06.560
<v Speaker 1>Dersch of the Advanced Systems Development Division who created an

0:16:06.640 --> 0:16:11.200
<v Speaker 1>early implementation of voice recognition. It was a very limited application,

0:16:11.280 --> 0:16:14.240
<v Speaker 1>but it proved that the ability to interact with computers

0:16:14.280 --> 0:16:18.800
<v Speaker 1>by voice was more than just science fiction. Within IBM,

0:16:18.800 --> 0:16:23.080
<v Speaker 1>it was called the Shoebox. Dersch worked on this project

0:16:23.200 --> 0:16:26.440
<v Speaker 1>in the early nineteen sixties and what he produced was

0:16:26.480 --> 0:16:29.840
<v Speaker 1>a machine that had a microphone attached to it. The

0:16:29.880 --> 0:16:34.680
<v Speaker 1>machine could detect sixteen spoken words, which included the digits

0:16:34.800 --> 0:16:39.160
<v Speaker 1>of zero to nine plus some command indicators like plus

0:16:39.520 --> 0:16:43.360
<v Speaker 1>minus, total, subtotal. You get the idea. So you

0:16:43.360 --> 0:16:46.680
<v Speaker 1>could speak a string of numbers and then commands to

0:16:46.840 --> 0:16:49.920
<v Speaker 1>this device, then ask it to total everything and it

0:16:49.960 --> 0:16:52.000
<v Speaker 1>would do so. So it was more or less a

0:16:52.080 --> 0:16:58.000
<v Speaker 1>basic calculator with some voice interpretation incorporated into it. Now

0:16:58.280 --> 0:17:02.000
<v Speaker 1>there's a great newsreel piece about this Shoebox. There's a

0:17:02.040 --> 0:17:05.040
<v Speaker 1>demonstration of it, and it came out in nineteen one,

0:17:05.480 --> 0:17:08.480
<v Speaker 1>and I love that newsreel because it has that great

0:17:08.560 --> 0:17:10.520
<v Speaker 1>music you would hear in the background of those old

0:17:10.560 --> 0:17:14.560
<v Speaker 1>industrial and business films. Anyway, there's also a helpful chart

0:17:14.840 --> 0:17:19.159
<v Speaker 1>that hangs in the background of that video where Dersch

0:17:19.320 --> 0:17:22.439
<v Speaker 1>is actually explaining how it works. You can see a

0:17:22.480 --> 0:17:25.919
<v Speaker 1>little bit behind him what is actually being

0:17:25.960 --> 0:17:30.520
<v Speaker 1>analyzed, and, uh, he broke the words down into phonemes

0:17:30.560 --> 0:17:36.720
<v Speaker 1>and syllables, so phonemes being specific sounds that make up words. So,

0:17:36.760 --> 0:17:40.679
<v Speaker 1>for example, the digit one is a single syllable word

0:17:40.960 --> 0:17:43.520
<v Speaker 1>with a vowel sound right at the front. But you

0:17:43.600 --> 0:17:48.200
<v Speaker 1>also have the word eight that's another single syllable word

0:17:48.480 --> 0:17:51.040
<v Speaker 1>with a vowel sound right at the front, but it's

0:17:51.359 --> 0:17:55.280
<v Speaker 1>different from one phonetically in that eight also has a

0:17:55.359 --> 0:17:59.159
<v Speaker 1>plosive and has that hard t at the end. So

0:17:59.200 --> 0:18:02.919
<v Speaker 1>the Shoebox was limited not just in what words it

0:18:02.960 --> 0:18:07.720
<v Speaker 1>could recognize, but also the types of voices it could recognize.
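
NOTE
A minimal sketch of the calculator behavior described above, starting from
words that have already been recognized. It does not model the recognition
step or the real machine's circuitry. The episode names the ten digits and
the commands plus, minus, total and subtotal; how the real Shoebox handled
multi-digit numbers is not stated, so chaining spoken digits into one number
is an assumption, as are the function and variable names.
```python
DIGITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
COMMANDS = {"plus", "minus", "total", "subtotal"}
def run_shoebox(words):
    """Return the values reported at each 'subtotal' or 'total' command."""
    reported = []
    running = 0      # running sum since the last 'total'
    sign = 1         # +1 after 'plus', -1 after 'minus'
    pending = None   # number currently being spoken, digit by digit
    for word in words:
        if word in DIGITS:
            # Assumption: consecutive digits build one multi-digit number.
            pending = DIGITS[word] if pending is None else pending * 10 + DIGITS[word]
            continue
        if word not in COMMANDS:
            raise ValueError(f"word outside the vocabulary: {word!r}")
        if pending is not None:
            running += sign * pending
            pending = None
        if word == "plus":
            sign = 1
        elif word == "minus":
            sign = -1
        elif word == "subtotal":
            reported.append(running)
            sign = 1
        else:  # "total" reports the sum and starts over
            reported.append(running)
            running = 0
            sign = 1
    return reported
```
For example, run_shoebox("five plus three minus two total".split()) returns
[6]. Any word outside the small vocabulary raises an error, which mirrors the
limitation the episode describes: the machine only handled the words it was
built for.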

0:18:07.880 --> 0:18:10.760
<v Speaker 1>Get someone who has a different dialect or manner of speech,

0:18:10.760 --> 0:18:12.800
<v Speaker 1>and the machine might not be able to understand them

0:18:12.800 --> 0:18:16.119
<v Speaker 1>because they're not pronouncing the words the same way that

0:18:16.280 --> 0:18:20.560
<v Speaker 1>Dersch did. This would be a big challenge in speech

0:18:20.560 --> 0:18:24.240
<v Speaker 1>recognition moving forward, and it's also an example of where

0:18:24.280 --> 0:18:28.480
<v Speaker 1>we find bias creeping into technology. And it's not necessarily

0:18:28.520 --> 0:18:32.359
<v Speaker 1>a conscious thing, but if you have people designing a

0:18:32.400 --> 0:18:36.520
<v Speaker 1>system and they're designing it based off their own uh,

0:18:36.680 --> 0:18:41.280
<v Speaker 1>you know, speech patterns, their own pronunciations, their own dialects,

0:18:41.800 --> 0:18:44.879
<v Speaker 1>then it may be that the system they create works

0:18:44.960 --> 0:18:48.040
<v Speaker 1>really well for them and less well for anyone who

0:18:48.160 --> 0:18:51.440
<v Speaker 1>isn't them, And the further away you are from their

0:18:51.480 --> 0:18:56.200
<v Speaker 1>manner of speaking, the more frustration you will encounter as

0:18:56.240 --> 0:18:59.719
<v Speaker 1>you try to interact with that technology. That's an example

0:18:59.760 --> 0:19:03.200
<v Speaker 1>of bias. And in fact, if you read the histories

0:19:03.320 --> 0:19:06.359
<v Speaker 1>of speech recognition and, as we'll get to later, natural

0:19:06.640 --> 0:19:10.119
<v Speaker 1>language processing, you'll see a lot of people say it

0:19:10.200 --> 0:19:13.119
<v Speaker 1>works great if you happen to be a white man,

0:19:13.840 --> 0:19:17.880
<v Speaker 1>because the manner of speech was being or the people

0:19:17.920 --> 0:19:21.000
<v Speaker 1>who were designing it were primarily white men who were

0:19:21.760 --> 0:19:26.000
<v Speaker 1>uh typically aiming for a a what is considered a

0:19:26.080 --> 0:19:31.840
<v Speaker 1>non accented American dialect somewhere in you know, the Eastern

0:19:31.920 --> 0:19:35.439
<v Speaker 1>seaboard side. But that meant that if you did have

0:19:35.520 --> 0:19:39.639
<v Speaker 1>an accent or a dialect, or you had a different vernacular,

0:19:40.200 --> 0:19:43.240
<v Speaker 1>that it was harder for the systems to actually understand

0:19:43.240 --> 0:19:46.399
<v Speaker 1>what you were saying. That's an example of bias. Well,

0:19:46.760 --> 0:19:49.359
<v Speaker 1>the general strategy was again to break up speech into

0:19:49.400 --> 0:19:52.560
<v Speaker 1>constituent sound units, you know, those phonemes, and then

0:19:52.600 --> 0:19:55.879
<v Speaker 1>to suss out which words were being spoken based on

0:19:55.920 --> 0:19:59.880
<v Speaker 1>those phonemes, and that was done by digitizing the voice, transforming

0:20:00.160 --> 0:20:04.159
<v Speaker 1>it from sound into data that represented stuff like

0:20:04.240 --> 0:20:08.320
<v Speaker 1>the sound's frequency or pitch, and then matching up specific

0:20:08.359 --> 0:20:12.199
<v Speaker 1>signal signatures with specific phonemes. So generally the

0:20:12.240 --> 0:20:14.919
<v Speaker 1>idea was that the computer system would monitor incoming sound,

0:20:15.280 --> 0:20:18.919
<v Speaker 1>convert the sound into digital data, compare that data it

0:20:19.000 --> 0:20:22.679
<v Speaker 1>had received with information stored in a database, in an effort

0:20:22.720 --> 0:20:26.199
<v Speaker 1>to look for matches. Uh. The Shoebox database was just

0:20:26.320 --> 0:20:29.280
<v Speaker 1>sixteen words in size. Later ones would be much larger,

0:20:29.320 --> 0:20:33.399
<v Speaker 1>but pretty quickly people realized this was not an efficient

0:20:33.480 --> 0:20:37.640
<v Speaker 1>way of doing speech recognition because the bigger the vocabulary,

0:20:37.840 --> 0:20:40.040
<v Speaker 1>the more work-intensive it was to build out

0:20:40.080 --> 0:20:43.520
<v Speaker 1>those databases. So it wasn't something that people thought would

0:20:43.520 --> 0:20:48.560
<v Speaker 1>be sustainable for very large vocabularies. But the Shoebox marked

0:20:48.560 --> 0:20:50.680
<v Speaker 1>the beginning of a serious effort to create machines that

0:20:50.720 --> 0:20:53.720
<v Speaker 1>could accept audio cues as actual input, and as we'll see,

0:20:54.080 --> 0:20:57.760
<v Speaker 1>that's one important component for these smart speaker systems. I've

0:20:57.800 --> 0:20:59.560
<v Speaker 1>got a lot more to say, but before I get

0:20:59.600 --> 0:21:09.760
<v Speaker 1>into the next part, let's take a quick break. Now,

0:21:09.800 --> 0:21:13.480
<v Speaker 1>obviously we didn't jump right into full voice recognition right

0:21:13.520 --> 0:21:17.520
<v Speaker 1>after IBM's Shoebox innovation. The challenges related to building

0:21:17.560 --> 0:21:21.399
<v Speaker 1>automated speech recognition systems were numerous, even for just a

0:21:21.520 --> 0:21:24.879
<v Speaker 1>single language, because, as I said, you can have accents

0:21:24.960 --> 0:21:28.280
<v Speaker 1>and dialects. One voice can have a very different tonal

0:21:28.400 --> 0:21:32.679
<v Speaker 1>quality from another, people speak at different speeds. Teaching machines

0:21:32.720 --> 0:21:35.480
<v Speaker 1>how to recognize speech when the phonemes and pacing of

0:21:35.520 --> 0:21:40.840
<v Speaker 1>that speech aren't consistent from speaker to speaker, that's really hard.

0:21:41.320 --> 0:21:43.119
<v Speaker 1>This kind of gets back to the same sort of

0:21:43.200 --> 0:21:46.680
<v Speaker 1>challenges you have when you're teaching machines how to recognize images.

0:21:47.440 --> 0:21:51.080
<v Speaker 1>You know, you teach a human what a coffee mug is.

0:21:51.119 --> 0:21:53.320
<v Speaker 1>I always use this example, but you teach a human

0:21:53.359 --> 0:21:55.800
<v Speaker 1>what a coffee mug is, and pretty soon they can

0:21:55.840 --> 0:22:00.000
<v Speaker 1>extrapolate from that example and understand that coffee mugs can

0:22:00.000 --> 0:22:03.879
<v Speaker 1>come in all different sizes and colors, and you know

0:22:04.240 --> 0:22:08.320
<v Speaker 1>different designs and textures. We get it. Like you you

0:22:08.359 --> 0:22:11.640
<v Speaker 1>see a couple of coffee mugs, you understand. Machines, though,

0:22:12.480 --> 0:22:15.280
<v Speaker 1>they aren't able to do that. Machines, you know, you

0:22:15.320 --> 0:22:17.440
<v Speaker 1>have to give them lots and lots and lots of

0:22:17.480 --> 0:22:20.479
<v Speaker 1>different examples before they can start to pick up on

0:22:20.600 --> 0:22:24.960
<v Speaker 1>what things actually make a coffee mug. Same sort of

0:22:25.000 --> 0:22:28.639
<v Speaker 1>thing with speech, right, So if you don't have consistency

0:22:28.760 --> 0:22:31.679
<v Speaker 1>between speakers, it makes it very hard for machines to

0:22:31.800 --> 0:22:34.800
<v Speaker 1>learn what people are saying. Now, it didn't take long

0:22:34.880 --> 0:22:37.399
<v Speaker 1>for the tech industry at large to really dive into

0:22:37.400 --> 0:22:41.520
<v Speaker 1>trying to solve this problem. In nineteen seventy one, DARPA, that's the

0:22:41.640 --> 0:22:45.359
<v Speaker 1>Research and Development division of the United States Department of Defense,

0:22:45.760 --> 0:22:48.800
<v Speaker 1>got behind speech recognition in a big way. Now, remember

0:22:49.280 --> 0:22:54.080
<v Speaker 1>DARPA itself doesn't do research. The organization's purpose is

0:22:54.080 --> 0:22:58.280
<v Speaker 1>to invite organizations to pitch projects that align with whatever

0:22:58.359 --> 0:23:01.879
<v Speaker 1>DARPA's goals are, and DARPA would provide funding to

0:23:02.440 --> 0:23:07.000
<v Speaker 1>the winning organizations to see these projects to completion if possible.

0:23:07.440 --> 0:23:09.840
<v Speaker 1>So DARPA is really more of a vetting and funding

0:23:10.000 --> 0:23:15.400
<v Speaker 1>organization. Anyway, in nineteen seventy one, DARPA created a five year program

0:23:15.520 --> 0:23:20.160
<v Speaker 1>called Speech Understanding Research, or SUR. The initial

0:23:20.240 --> 0:23:23.320
<v Speaker 1>goal was pretty darn ambitious considering the capabilities of the

0:23:23.359 --> 0:23:27.240
<v Speaker 1>technology at the time. The project director, Larry Roberts, wanted

0:23:27.240 --> 0:23:30.440
<v Speaker 1>a system that would be capable of recognizing a vocabulary

0:23:30.560 --> 0:23:34.119
<v Speaker 1>of ten thousand words with less than ten percent error.

0:23:34.560 --> 0:23:37.240
<v Speaker 1>After holding a few meetings with some of the leading

0:23:37.320 --> 0:23:41.840
<v Speaker 1>computer engineers of the day, Roberts adjusted that goal significantly.

0:23:42.560 --> 0:23:45.359
<v Speaker 1>After that adjustment, the target was going to be a

0:23:45.400 --> 0:23:50.040
<v Speaker 1>system capable of recognizing one thousand words, not ten thousand.

0:23:50.920 --> 0:23:53.359
<v Speaker 1>Error levels still had to be less than ten percent,

0:23:53.840 --> 0:23:55.720
<v Speaker 1>and the goal was for the system to be able

0:23:55.760 --> 0:24:02.359
<v Speaker 1>to accept continuous speech, as opposed to very deliberate speech

0:24:03.080 --> 0:24:08.000
<v Speaker 1>with pauses between each pair of words. That would not

0:24:08.119 --> 0:24:13.040
<v Speaker 1>be really that useful. One person who was skeptical about

0:24:13.080 --> 0:24:16.760
<v Speaker 1>the potential success of this project was John R. Pierce

0:24:16.960 --> 0:24:20.639
<v Speaker 1>of Bell Labs. He argued that any success would be

0:24:20.720 --> 0:24:25.440
<v Speaker 1>limited so long as machines remained incapable of understanding the words,

0:24:25.840 --> 0:24:28.720
<v Speaker 1>not just recognizing a word based on phonemes, but

0:24:28.840 --> 0:24:31.359
<v Speaker 1>understanding what the word is. That is, Pierce felt that

0:24:31.359 --> 0:24:34.080
<v Speaker 1>the machines needed some way to parse the language to

0:24:34.119 --> 0:24:37.040
<v Speaker 1>get to the meaning of what was being said. That's

0:24:37.080 --> 0:24:38.919
<v Speaker 1>an important idea that we will come back to in

0:24:38.960 --> 0:24:41.919
<v Speaker 1>just a bit. Now, among the companies and organizations that

0:24:42.040 --> 0:24:46.600
<v Speaker 1>landed contracts with DARPA were Carnegie Mellon University, BBN,

0:24:46.600 --> 0:24:49.080
<v Speaker 1>which actually played a big part in developing ARPANET,

0:24:49.080 --> 0:24:53.240
<v Speaker 1>the predecessor to the Internet, Lincoln Laboratory, and several more

0:24:53.720 --> 0:24:56.840
<v Speaker 1>and very smart people began to create systems intended to

0:24:56.880 --> 0:25:00.520
<v Speaker 1>recognize speech in meaningful ways. The names of the programs

0:25:00.520 --> 0:25:02.800
<v Speaker 1>were a lot of fun. There was HWIM,

0:25:03.119 --> 0:25:06.280
<v Speaker 1>that stood for Hear What I Mean, as in

0:25:06.440 --> 0:25:09.040
<v Speaker 1>hear, as in listen, Hear What I Mean. That one

0:25:09.160 --> 0:25:14.320
<v Speaker 1>was from BBN. CMU introduced Hearsay, which was later designated

0:25:14.320 --> 0:25:17.080
<v Speaker 1>as Hearsay one, and then they came out with Hearsay two.

0:25:17.560 --> 0:25:22.800
<v Speaker 1>They also would demonstrate another one called Harpy. Oh, and

0:25:22.840 --> 0:25:25.679
<v Speaker 1>there was a professor at CMU named Dr James Baker

0:25:25.840 --> 0:25:29.439
<v Speaker 1>who would design a system called Dragon in nineteen seventy

0:25:29.520 --> 0:25:32.040
<v Speaker 1>five that he would later leverage into a company with

0:25:32.119 --> 0:25:35.480
<v Speaker 1>his wife, Dr Janet M. Baker in the nineteen eighties,

0:25:35.520 --> 0:25:40.000
<v Speaker 1>and they had a very successful business with speech recognition software. Now,

0:25:40.040 --> 0:25:42.399
<v Speaker 1>I'm not going to go into each of those programs

0:25:42.400 --> 0:25:45.240
<v Speaker 1>in deep detail, but rather just mention that they all

0:25:45.280 --> 0:25:48.879
<v Speaker 1>helped advance the cause of creating systems that can recognize speech.

0:25:49.440 --> 0:25:51.480
<v Speaker 1>One of the big developments that came out of all

0:25:51.520 --> 0:25:55.280
<v Speaker 1>that work was a shift to probabilistic models, which would

0:25:55.320 --> 0:25:58.080
<v Speaker 1>also play a really important part in another phase of

0:25:58.160 --> 0:26:00.680
<v Speaker 1>developing the smart speaker. So what do I mean when

0:26:00.720 --> 0:26:04.520
<v Speaker 1>I say probabilistic? Well, as the name indicates, it all

0:26:04.520 --> 0:26:08.760
<v Speaker 1>has to do with probabilities. Essentially, systems would analyze incoming

0:26:08.760 --> 0:26:12.399
<v Speaker 1>phonemes and make guesses as to what was being said

0:26:12.680 --> 0:26:15.520
<v Speaker 1>based on the probability of it being a given word

0:26:15.720 --> 0:26:19.159
<v Speaker 1>or part of a word. The systems typically go with

0:26:19.240 --> 0:26:22.920
<v Speaker 1>whatever word has the highest probability of being the correct one.
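
NOTE
A minimal sketch of the "highest probability wins" step described here,
assuming the recognizer has already assigned a probability to each candidate
word for one stretch of audio. The function name pick_word and the scores
are hypothetical, chosen only for illustration.
```python
def pick_word(candidates):
    """Return the candidate word with the highest probability."""
    return max(candidates, key=candidates.get)
# Hypothetical probabilities for three similar-sounding candidates.
scores = {"cats": 0.62, "bats": 0.30, "hats": 0.08}
best = pick_word(scores)
```
Real recognizers score whole sequences of sounds and words rather than one
isolated word, so this is only the smallest possible case of the idea.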

0:26:23.640 --> 0:26:26.840
<v Speaker 1>Even with that approach, there are nuances to language that

0:26:26.880 --> 0:26:29.840
<v Speaker 1>are difficult to account for with a machine. So, for example,

0:26:29.840 --> 0:26:32.280
<v Speaker 1>you have homonyms, in which you have two words that

0:26:32.440 --> 0:26:35.720
<v Speaker 1>sound the same but have very different meanings and potentially

0:26:35.760 --> 0:26:39.919
<v Speaker 1>spellings like right as in to write a sentence, or

0:26:40.080 --> 0:26:43.040
<v Speaker 1>right as in am I right? Or am I wrong?

0:26:43.600 --> 0:26:46.439
<v Speaker 1>Or you could have a pair of words that sound

0:26:46.480 --> 0:26:49.560
<v Speaker 1>like a single word and have confusion there, such as

0:26:49.880 --> 0:26:52.720
<v Speaker 1>a door. You can say a door, you mean, you're

0:26:52.720 --> 0:26:55.840
<v Speaker 1>meaning a single door, a door to go into a building,

0:26:56.040 --> 0:26:58.520
<v Speaker 1>or you might say adore, as in, I adore

0:26:58.960 --> 0:27:02.040
<v Speaker 1>this podcast you're doing, Jonathan. That's sweet of you, Thank

0:27:02.040 --> 0:27:06.399
<v Speaker 1>you for saying that. So computer scientists were hard at

0:27:06.400 --> 0:27:10.080
<v Speaker 1>work advancing both the capability of machines to make correct

0:27:10.200 --> 0:27:13.720
<v Speaker 1>guesses at individual phonemes and then full words, as

0:27:13.720 --> 0:27:15.920
<v Speaker 1>well as figuring out a way to teach machines to

0:27:16.000 --> 0:27:20.959
<v Speaker 1>adjust guesses based on context. That requires a deeper understanding

0:27:21.000 --> 0:27:24.520
<v Speaker 1>of the language within which you're working. If you're aware

0:27:24.560 --> 0:27:27.439
<v Speaker 1>of certain idioms, you can make a good guess at

0:27:27.480 --> 0:27:29.320
<v Speaker 1>a word or phrase even if you didn't get a

0:27:29.320 --> 0:27:33.440
<v Speaker 1>clean pass at it, right? So, for example, the phrase

0:27:33.760 --> 0:27:37.200
<v Speaker 1>it's raining cats and dogs just means it's raining a lot.

0:27:37.520 --> 0:27:40.359
<v Speaker 1>And if a system included a database that indicated the

0:27:40.480 --> 0:27:44.760
<v Speaker 1>phrase cats and dogs sometimes follows the phrase it's raining,

0:27:45.320 --> 0:27:47.640
<v Speaker 1>then the system is more likely to guess the correct

0:27:48.280 --> 0:27:52.760
<v Speaker 1>sequence of words instead of guessing something that sounded similar

0:27:52.840 --> 0:27:55.560
<v Speaker 1>but it's wrong. For example, if it said, oh, they

0:27:55.600 --> 0:27:59.800
<v Speaker 1>must have said it's raining bats and hogs, that would

0:27:59.800 --> 0:28:04.760
<v Speaker 1>not make sense. So the systems estimate the probability that

0:28:04.840 --> 0:28:08.359
<v Speaker 1>any given sequence of sounds within the database matches what

0:28:08.480 --> 0:28:12.120
<v Speaker 1>the systems have just, quote unquote, heard. Progress in this

0:28:12.160 --> 0:28:15.040
<v Speaker 1>area was steady, but slow, and I'd argue that it

0:28:15.080 --> 0:28:17.959
<v Speaker 1>was also a reminder that concepts like Moore's law do

0:28:18.040 --> 0:28:22.760
<v Speaker 1>not apply universally across technology. Rapid development in one particular

0:28:22.800 --> 0:28:26.000
<v Speaker 1>domain of technology is not necessarily an indicator that the

0:28:26.040 --> 0:28:28.760
<v Speaker 1>same sort of progress will be observed in all other

0:28:28.840 --> 0:28:33.919
<v Speaker 1>areas of tech. We often get into the mistaken habit

0:28:34.000 --> 0:28:37.200
<v Speaker 1>of believing that Moore's law applies to everything. Alright. So

0:28:37.640 --> 0:28:42.120
<v Speaker 1>a related concept to voice recognition is something called natural

0:28:42.240 --> 0:28:45.480
<v Speaker 1>language processing, and this relates back to how we humans

0:28:45.480 --> 0:28:49.000
<v Speaker 1>tend to process information compared to the way machines tend

0:28:49.040 --> 0:28:52.200
<v Speaker 1>to do it. So we humans formulate ideas, we shape

0:28:52.200 --> 0:28:55.680
<v Speaker 1>those ideas into words and sentences. We communicate them in

0:28:55.760 --> 0:28:59.239
<v Speaker 1>some way to other people through that language. It may

0:28:59.280 --> 0:29:02.160
<v Speaker 1>be through speech, it may be through text. It may even

0:29:02.160 --> 0:29:06.280
<v Speaker 1>be through a nonverbal or non literary way, but we

0:29:06.400 --> 0:29:11.479
<v Speaker 1>communicate those ideas. Machines typically accept input, they perform some

0:29:11.560 --> 0:29:15.400
<v Speaker 1>process or sequence of processes on that input, and then

0:29:15.400 --> 0:29:18.960
<v Speaker 1>they supply an output of some sort. Machines do this

0:29:19.040 --> 0:29:22.480
<v Speaker 1>in machine language. That's a code that's far too difficult

0:29:22.480 --> 0:29:26.200
<v Speaker 1>for humans to process easily. Binary is an example of

0:29:26.280 --> 0:29:30.520
<v Speaker 1>machine language. Binary is represented as zeros and ones, which

0:29:30.520 --> 0:29:33.120
<v Speaker 1>when grouped together can represent all sorts of stuff. But

0:29:33.160 --> 0:29:35.600
<v Speaker 1>if you just looked at a big block of zeros

0:29:35.600 --> 0:29:38.280
<v Speaker 1>and ones, it would mean nothing to you. It's not

0:29:38.360 --> 0:29:41.520
<v Speaker 1>easy for humans to use, and then machines in turn

0:29:41.600 --> 0:29:44.960
<v Speaker 1>are not natively able to understand human language, so there's

0:29:44.960 --> 0:29:49.720
<v Speaker 1>a language barrier there. Because of that, people created different

0:29:49.760 --> 0:29:54.480
<v Speaker 1>programming languages. These languages provide layers of abstraction from the

0:29:54.560 --> 0:29:57.960
<v Speaker 1>machine language. They make it easier to create programs or

0:29:58.240 --> 0:30:01.560
<v Speaker 1>directions that the computer should follow. So the person

0:30:01.680 --> 0:30:04.640
<v Speaker 1>who's doing the programming is using a programming language that's

0:30:04.640 --> 0:30:08.040
<v Speaker 1>easy for humans to use that then gets converted into

0:30:08.240 --> 0:30:11.520
<v Speaker 1>machine language that the computers understand. But what if you

0:30:11.560 --> 0:30:14.800
<v Speaker 1>could send commands to a computer using natural language, not

0:30:14.880 --> 0:30:20.320
<v Speaker 1>even programming language. You could just speak in plain vernacular,

0:30:20.400 --> 0:30:23.960
<v Speaker 1>whether it's English or any other language, the way humans

0:30:24.000 --> 0:30:27.320
<v Speaker 1>communicate with one another. What if a computer could extract

0:30:27.400 --> 0:30:30.760
<v Speaker 1>meaning from a sentence, understand what it was you wanted

0:30:30.800 --> 0:30:34.280
<v Speaker 1>the computer to do, and then respond appropriately. So imagine

0:30:34.280 --> 0:30:35.960
<v Speaker 1>how much time you could save if you could just

0:30:36.040 --> 0:30:38.640
<v Speaker 1>tell your computer what you wanted it to do, and

0:30:38.680 --> 0:30:41.440
<v Speaker 1>it took care of the rest. If you had a

0:30:41.480 --> 0:30:46.280
<v Speaker 1>powerful enough computer system with strong enough AI, maybe you

0:30:46.280 --> 0:30:49.480
<v Speaker 1>could even potentially do something like describe a game that

0:30:49.560 --> 0:30:52.240
<v Speaker 1>you would love to be able to play, like not

0:30:52.240 --> 0:30:54.400
<v Speaker 1>not a game that exists, a game in your head,

0:30:54.880 --> 0:30:56.760
<v Speaker 1>and you could describe it to a computer and the

0:30:56.800 --> 0:31:00.480
<v Speaker 1>computer could actually program that game. Well, we're we're definitely

0:31:00.520 --> 0:31:03.480
<v Speaker 1>not anywhere close to that yet, but we made enormous

0:31:03.520 --> 0:31:07.240
<v Speaker 1>progress with natural language processing. Now, the history of natural

0:31:07.320 --> 0:31:11.760
<v Speaker 1>language processing isn't exactly an extension of voice recognition. It's

0:31:11.760 --> 0:31:16.200
<v Speaker 1>actually more like a parallel line of investigation. And that's

0:31:16.200 --> 0:31:20.760
<v Speaker 1>because natural language processing doesn't require voice recognition. You can

0:31:20.840 --> 0:31:24.080
<v Speaker 1>have an implementation in which you just write commands in

0:31:24.200 --> 0:31:26.440
<v Speaker 1>natural language, you know, you type them out on a

0:31:26.520 --> 0:31:30.760
<v Speaker 1>keyboard and the machine then carries out those those instructions.

0:31:30.800 --> 0:31:33.400
<v Speaker 1>So much of the early work in natural language processing

0:31:33.440 --> 0:31:37.400
<v Speaker 1>was in text based communication rather than in speech. The

0:31:37.480 --> 0:31:41.240
<v Speaker 1>history of natural language processing includes stuff like the Turing test,

0:31:41.480 --> 0:31:44.840
<v Speaker 1>named after Alan Turing. So the most common interpretation of

0:31:44.880 --> 0:31:47.560
<v Speaker 1>the Turing test these days is that you've got a

0:31:47.600 --> 0:31:50.280
<v Speaker 1>scenario in which a person is alone in a room

0:31:50.320 --> 0:31:53.080
<v Speaker 1>with a computer terminal, they can type whatever they like

0:31:53.280 --> 0:31:57.520
<v Speaker 1>into the computer terminal, and someone or something is responding

0:31:57.560 --> 0:32:00.719
<v Speaker 1>to them in real time. Now it might be another person,

0:32:01.280 --> 0:32:04.120
<v Speaker 1>or it might be a computer system that's responding to

0:32:04.360 --> 0:32:08.959
<v Speaker 1>that person. You run a whole bunch of test subjects

0:32:08.960 --> 0:32:12.040
<v Speaker 1>through this process, and if the computer system is able

0:32:12.080 --> 0:32:15.280
<v Speaker 1>to fool a certain percentage of those test subjects, like

0:32:15.440 --> 0:32:18.720
<v Speaker 1>say thirty percent of them, that it is in fact

0:32:18.760 --> 0:32:21.560
<v Speaker 1>another human and not a computer, it is said to

0:32:21.640 --> 0:32:24.920
<v Speaker 1>have passed the Turing test, And typically we use that

0:32:24.960 --> 0:32:27.400
<v Speaker 1>to mean the machine has given off the appearance of

0:32:27.440 --> 0:32:31.680
<v Speaker 1>possessing intelligence similar to the one that we humans possess.

0:32:32.400 --> 0:32:35.520
<v Speaker 1>That gets beyond our scope for this episode, but it

0:32:35.560 --> 0:32:38.760
<v Speaker 1>helps point out that stuff like speech recognition and natural

0:32:38.840 --> 0:32:42.280
<v Speaker 1>language processing are both closely related to the field of

0:32:42.360 --> 0:32:45.720
<v Speaker 1>artificial intelligence. In fact, they really belong within the artificial

0:32:45.760 --> 0:32:50.320
<v Speaker 1>intelligence domain. The Turing test was more of a hypothetical.

0:32:50.560 --> 0:32:52.800
<v Speaker 1>It was a bit of a cheeky way of saying, Hey,

0:32:53.000 --> 0:32:55.680
<v Speaker 1>if you can't tell whether or not something is intelligent,

0:32:56.000 --> 0:32:58.560
<v Speaker 1>it makes sense to treat it as if it actually

0:32:58.840 --> 0:33:02.520
<v Speaker 1>is intelligent. After all, we assume that every human with

0:33:02.560 --> 0:33:06.440
<v Speaker 1>whom we interact possesses some level of intelligence based on

0:33:06.520 --> 0:33:09.640
<v Speaker 1>those interactions, so why should we not extend the same

0:33:09.680 --> 0:33:14.480
<v Speaker 1>courtesy to machines. Now, natural language processing would prove to

0:33:14.480 --> 0:33:18.000
<v Speaker 1>be another super challenging problem to solve in computer science.

0:33:18.280 --> 0:33:21.880
<v Speaker 1>Early work was done in translation algorithms, and these were

0:33:21.880 --> 0:33:24.800
<v Speaker 1>programs that attempted to take phrases written in one language

0:33:24.840 --> 0:33:29.120
<v Speaker 1>and translate those automatically into a second language. At first,

0:33:29.160 --> 0:33:33.960
<v Speaker 1>that seemed pretty straightforward, but you realize that's also pretty tricky, really.

0:33:34.200 --> 0:33:37.000
<v Speaker 1>For one thing, you can't just translate word for word

0:33:37.160 --> 0:33:40.080
<v Speaker 1>and keep the same order from one language to another.

0:33:40.520 --> 0:33:44.800
<v Speaker 1>The syntax, or the rules that the language follows, uh,

0:33:44.840 --> 0:33:47.719
<v Speaker 1>they could be different from language to language. In one language,

0:33:47.720 --> 0:33:50.960
<v Speaker 1>you might use an infinitive such as to record, in

0:33:51.000 --> 0:33:53.760
<v Speaker 1>the middle of a sentence, while another language might put

0:33:53.800 --> 0:33:56.400
<v Speaker 1>all the infinitives at the end of a sentence. So

0:33:56.960 --> 0:33:59.240
<v Speaker 1>in one language, I might say I'm going to record

0:33:59.280 --> 0:34:02.320
<v Speaker 1>a podcast in the studio right now, but in another

0:34:02.400 --> 0:34:05.080
<v Speaker 1>language it might come out as I'm going a podcast

0:34:05.080 --> 0:34:08.000
<v Speaker 1>in the studio right now to record. It starts to

0:34:08.000 --> 0:34:12.879
<v Speaker 1>sound like Yoda. There was initial excitement around machine translation,

0:34:13.160 --> 0:34:16.400
<v Speaker 1>but once computer scientists and linguists began to see the

0:34:16.440 --> 0:34:20.320
<v Speaker 1>scope of this challenge, their excitement faded a bit. Also,

0:34:20.440 --> 0:34:22.400
<v Speaker 1>there was a lot of other stuff going on in

0:34:22.440 --> 0:34:25.200
<v Speaker 1>the nineteen sixties and seventies that was demanding a lot

0:34:25.200 --> 0:34:28.520
<v Speaker 1>of attention, such as the Space race. So for a while,

0:34:28.880 --> 0:34:32.799
<v Speaker 1>this branch of computer science was given less attention than

0:34:32.840 --> 0:34:37.160
<v Speaker 1>other branches, and by less attention, I really mean funding. Now,

0:34:37.160 --> 0:34:39.359
<v Speaker 1>when we come back, we'll talk a bit more about

0:34:39.400 --> 0:34:42.799
<v Speaker 1>the advances that were necessary to support natural language processing,

0:34:43.000 --> 0:34:44.880
<v Speaker 1>and we'll move on to how this would be another

0:34:44.960 --> 0:34:48.880
<v Speaker 1>important component in smart speakers. But first, let's take another

0:34:49.000 --> 0:35:00.960
<v Speaker 1>quick break. Okay, so early enthusiasm for natural language

0:35:00.960 --> 0:35:04.520
<v Speaker 1>processing created a bit of a hype cycle that ultimately

0:35:04.640 --> 0:35:10.160
<v Speaker 1>crashed into the telephone pole of unmet expectations. That was

0:35:10.200 --> 0:35:15.280
<v Speaker 1>a really bad metaphor. Anyway, natural language processing went through

0:35:15.520 --> 0:35:18.520
<v Speaker 1>something similar to what we saw with virtual reality in

0:35:18.520 --> 0:35:22.920
<v Speaker 1>the nineteen nineties. You know, people saw what was actually achievable,

0:35:23.480 --> 0:35:26.440
<v Speaker 1>and then they compared that to what they thought they

0:35:26.440 --> 0:35:29.120
<v Speaker 1>were going to get, and those two things didn't match

0:35:29.200 --> 0:35:31.919
<v Speaker 1>up at all, and that really pulled the rug out

0:35:31.960 --> 0:35:35.440
<v Speaker 1>of funding for natural language processing, which meant, of course,

0:35:35.480 --> 0:35:40.040
<v Speaker 1>that progress slowed way down. It kept going, but it

0:35:40.239 --> 0:35:43.239
<v Speaker 1>was definitely on the back burner for a lot of projects.

0:35:43.680 --> 0:35:46.799
<v Speaker 1>When interest renewed in the nineteen eighties, there had been

0:35:46.800 --> 0:35:51.440
<v Speaker 1>a shift in thinking around natural language processing. Computer scientists

0:35:51.480 --> 0:35:54.640
<v Speaker 1>were starting to look at statistical approaches similar to what

0:35:54.719 --> 0:35:58.920
<v Speaker 1>was going on with speech recognition, building up probabilistic models

0:35:58.960 --> 0:36:01.520
<v Speaker 1>in which a computer can start making what amounts to

0:36:01.840 --> 0:36:05.880
<v Speaker 1>educated guesses at the meaning of a command or a phrase.

0:36:06.480 --> 0:36:10.080
<v Speaker 1>Machine learning became an important component on the back end

0:36:10.120 --> 0:36:13.839
<v Speaker 1>of these systems, and later artificial neural networks became an

0:36:13.880 --> 0:36:17.719
<v Speaker 1>important part as well. A neural network processes information in

0:36:17.760 --> 0:36:20.400
<v Speaker 1>a way that's sort of analogous to how our brains

0:36:20.480 --> 0:36:24.719
<v Speaker 1>do it. You have nodes or neurons that connect to

0:36:24.800 --> 0:36:29.000
<v Speaker 1>other nodes, and each node affects incoming data in a

0:36:29.040 --> 0:36:32.279
<v Speaker 1>certain way, performing some sort of operation on it, and

0:36:32.400 --> 0:36:34.960
<v Speaker 1>the degree to which they do that in one way

0:36:35.080 --> 0:36:39.120
<v Speaker 1>versus another is called the weight of that node. Computer

0:36:39.160 --> 0:36:42.799
<v Speaker 1>scientists apply weights across the nodes in an effort to

0:36:42.880 --> 0:36:46.840
<v Speaker 1>get a specific result in order to train these models.
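
NOTE
A rough sketch of the node-and-weight idea described here, assuming a single
node that computes a weighted sum of its inputs. The weights are adjusted with
the delta rule (a simple error-driven update), which is a stand-in chosen for
illustration and not the training method of any particular product. The names
node_output and train, and the example data, are hypothetical.
```python
def node_output(inputs, weights):
    """Weighted sum of the inputs: the operation this one node performs."""
    return sum(x * w for x, w in zip(inputs, weights))
def train(examples, weights, rate=0.1, rounds=200):
    """Nudge each weight in the direction that shrinks the error (delta rule)."""
    for _ in range(rounds):
        for inputs, target in examples:
            error = target - node_output(inputs, weights)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    return weights
# Hypothetical training data whose targets equal 2*a + 1*b.
examples = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)]
weights = train(examples, [0.0, 0.0])
```
Starting from weights of zero, repeated small corrections move the weights
toward values that reproduce the targets. Real networks stack many such nodes
and need far more examples.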

0:36:46.880 --> 0:36:50.640
<v Speaker 1>So you might feed a specific command into such a system,

0:36:50.680 --> 0:36:53.560
<v Speaker 1>and you let it go through the computational process from

0:36:53.600 --> 0:36:56.080
<v Speaker 1>the beginning of the neural network through to the end,

0:36:56.560 --> 0:36:58.520
<v Speaker 1>and then you look at the result, and if the

0:36:58.560 --> 0:37:01.520
<v Speaker 1>result is correct, well, that just means the system is

0:37:01.520 --> 0:37:04.719
<v Speaker 1>already working as you intended it, which honestly is not

0:37:04.880 --> 0:37:08.480
<v Speaker 1>likely to happen early on. But if it's not correct,

0:37:08.760 --> 0:37:12.400
<v Speaker 1>then you start adjusting the weights on those nodes in

0:37:12.520 --> 0:37:15.200
<v Speaker 1>order to affect the outcome. I almost think of it

0:37:15.239 --> 0:37:18.440
<v Speaker 1>as like Plinko or pachinko, where you've got the little

0:37:18.520 --> 0:37:20.920
<v Speaker 1>coin and you drop it down and it bounces on

0:37:20.920 --> 0:37:24.080
<v Speaker 1>all the pegs and sometimes you're like you might think,

0:37:24.080 --> 0:37:25.520
<v Speaker 1>all right, well, this time it's going to go right

0:37:25.520 --> 0:37:28.600
<v Speaker 1>for that center slot, but it doesn't, and you think, well,

0:37:28.600 --> 0:37:31.080
<v Speaker 1>maybe if I remove some of these pegs or I

0:37:31.239 --> 0:37:33.839
<v Speaker 1>shift these pegs over a little bit, I can drop

0:37:33.880 --> 0:37:35.880
<v Speaker 1>it in that same spot and hit the center.

0:37:36.120 --> 0:37:38.160
<v Speaker 1>It's kind of like that, except you're talking about data,

0:37:38.280 --> 0:37:42.120
<v Speaker 1>not physical moving parts. So you have to do this

0:37:42.160 --> 0:37:46.960
<v Speaker 1>a lot, like up to like millions of times in

0:37:47.080 --> 0:37:49.680
<v Speaker 1>order to try and train a system so that it responds

0:37:49.680 --> 0:37:53.399
<v Speaker 1>appropriately to commands. And once it's trained, you can then

0:37:53.480 --> 0:37:55.799
<v Speaker 1>test new commands on the system to see if it

0:37:55.800 --> 0:37:58.640
<v Speaker 1>can parse them and respond appropriately. And in this way,

0:37:58.840 --> 0:38:03.240
<v Speaker 1>the system quote unquote learns over time how to respond

0:38:03.440 --> 0:38:07.040
<v Speaker 1>to commands. And then we have another component that's important

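NOTE
Editor's sketch, not part of the spoken audio. The "check the result, then adjust the weights" loop described above can be illustrated with a toy example.
This assumes a single node with two weights learning a logical OR, which is far simpler than a real speech model.
Every name here (LEARNING_RATE, predict, train, OR_SAMPLES) is hypothetical and chosen only for illustration.
```python
# Hypothetical toy example: one "node" with two weights learns a logical OR.
LEARNING_RATE = 0.1
def predict(weights, bias, inputs):
    # Weighted sum of the inputs followed by a hard threshold.
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0
def train(samples, epochs=50):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # A correct result (error == 0) leaves the weights alone;
            # a wrong one nudges each weight toward the right answer.
            weights = [w + LEARNING_RATE * error * x for w, x in zip(weights, inputs)]
            bias += LEARNING_RATE * error
    return weights, bias
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(OR_SAMPLES)
results = [predict(weights, bias, inputs) for inputs, _ in OR_SAMPLES]
```
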
0:38:07.040 --> 0:38:10.839
<v Speaker 1>with smart speakers, and that's speech generation. So it's one

0:38:10.840 --> 0:38:14.600
<v Speaker 1>thing to have a machine either broadcast or play back

0:38:14.640 --> 0:38:18.080
<v Speaker 1>a recording of speech. It's another thing for a machine

0:38:18.120 --> 0:38:22.319
<v Speaker 1>to generate brand new speech. In computer science, we call

0:38:22.360 --> 0:38:26.919
<v Speaker 1>it speech synthesis. Now, this is the really old technology

0:38:26.960 --> 0:38:29.279
<v Speaker 1>I was alluding to at the beginning of this episode,

0:38:29.520 --> 0:38:32.960
<v Speaker 1>speech synthesis. If you want to be really, you know,

0:38:33.040 --> 0:38:36.479
<v Speaker 1>kind of technical about it, it actually predates every other

0:38:36.520 --> 0:38:39.759
<v Speaker 1>technology I've mentioned up to this point, at least in

0:38:40.000 --> 0:38:43.880
<v Speaker 1>its most rudimentary implementations. You have to go way back

0:38:43.960 --> 0:38:47.360
<v Speaker 1>to the eighteenth century. The seventeen seventies is when a

0:38:47.440 --> 0:38:52.480
<v Speaker 1>Russian smarty pants named Christian Kratzenstein was building a device

0:38:52.560 --> 0:38:56.800
<v Speaker 1>that used acoustic resonators, these reeds that would vibrate,

0:38:57.160 --> 0:39:01.879
<v Speaker 1>and it was in an attempt to replicate basic vowel sounds. Now,

0:39:01.920 --> 0:39:04.399
<v Speaker 1>even with such a working device, it would be really

0:39:04.400 --> 0:39:07.960
<v Speaker 1>difficult to communicate anything meaningful unless you were, I guess,

0:39:08.000 --> 0:39:11.399
<v Speaker 1>speaking whale like Dory in Finding Nemo. But it would

0:39:11.400 --> 0:39:13.640
<v Speaker 1>be an early example of how people tried to create

0:39:13.680 --> 0:39:18.080
<v Speaker 1>mechanical systems that could replicate speech or elements of speech.

0:39:18.440 --> 0:39:23.600
<v Speaker 1>Another inventor named Wolfgang von Kempelen built an acoustic mechanical

0:39:23.719 --> 0:39:28.080
<v Speaker 1>speech machine that used reeds and tubes and a

0:39:28.160 --> 0:39:31.520
<v Speaker 1>pressure chamber, and it was all meant to replicate various

0:39:31.600 --> 0:39:35.640
<v Speaker 1>speech sounds. He had other elements to create sounds like plosives,

0:39:35.640 --> 0:39:39.880
<v Speaker 1>those hard sounds that I mentioned earlier in the episode.

0:39:40.520 --> 0:39:43.080
<v Speaker 1>So he had all these different elements that, working together,

0:39:43.160 --> 0:39:47.640
<v Speaker 1>could create parts of the sounds that we humans make

0:39:47.680 --> 0:39:51.480
<v Speaker 1>when we speak. He also built a supposed chess playing machine,

0:39:51.920 --> 0:39:53.799
<v Speaker 1>and it turned out that the chess playing part was

0:39:53.840 --> 0:39:58.680
<v Speaker 1>a hoax. So unfortunately, because that device was a hoax,

0:39:58.840 --> 0:40:03.360
<v Speaker 1>a lot of people dismiss his other work, which was legitimate.

0:40:03.880 --> 0:40:07.840
<v Speaker 1>So by fudging on one thing, he kind of cast

0:40:08.040 --> 0:40:12.120
<v Speaker 1>doubt on everything he had ever done. Skipping ahead quite

0:40:12.160 --> 0:40:15.720
<v Speaker 1>a bit, we get to Homer Dudley, which is a

0:40:15.760 --> 0:40:21.640
<v Speaker 1>fantastic name. He unveiled the Voder, or Voice Operating Demonstrator,

0:40:21.880 --> 0:40:25.480
<v Speaker 1>device at the New York World's Fair in nineteen thirty nine.

0:40:25.960 --> 0:40:29.880
<v Speaker 1>It consisted of a complex series of controls and it

0:40:29.960 --> 0:40:32.720
<v Speaker 1>sort of reminds me of something like a musical instrument,

0:40:32.800 --> 0:40:36.840
<v Speaker 1>kind of like a synthesizer, but with extra controlling units.

0:40:36.880 --> 0:40:39.400
<v Speaker 1>Like there was like a wrist element, there was a pedal.

0:40:39.719 --> 0:40:43.080
<v Speaker 1>There's a lot of stuff that made it very complex,

0:40:43.440 --> 0:40:47.239
<v Speaker 1>and with a lot of practice, you could create specific

0:40:47.320 --> 0:40:51.360
<v Speaker 1>sounds from this synthesizer. You could even create words or

0:40:51.440 --> 0:40:55.120
<v Speaker 1>full sentences, though from what I understand, it was incredibly

0:40:55.200 --> 0:40:57.760
<v Speaker 1>challenging to do. It was a very high learning curve,

0:40:58.120 --> 0:41:02.040
<v Speaker 1>but it demonstrated the possibility of electronic synthesized speech. Now,

0:41:02.080 --> 0:41:06.120
<v Speaker 1>there was a lot of work done in this field

0:41:07.120 --> 0:41:12.040
<v Speaker 1>by lots of different talented scientists and engineers, and someday

0:41:12.080 --> 0:41:14.320
<v Speaker 1>I'll have to do a full episode on the history

0:41:14.360 --> 0:41:17.839
<v Speaker 1>of speech synthesis. It's really fascinating, but it's far too

0:41:17.920 --> 0:41:21.279
<v Speaker 1>big a topic to cover in its entirety in this episode.

0:41:21.640 --> 0:41:24.480
<v Speaker 1>By the late nineteen sixties we had our first text

0:41:24.640 --> 0:41:27.920
<v Speaker 1>to speech system, and by the late nineteen seventies and

0:41:28.000 --> 0:41:31.280
<v Speaker 1>early nineteen eighties, the state of the art had progressed

0:41:31.360 --> 0:41:33.160
<v Speaker 1>quite a bit and we were starting to get to

0:41:33.200 --> 0:41:38.360
<v Speaker 1>a point where we could create very understandable computer voices.

0:41:38.400 --> 0:41:41.680
<v Speaker 1>They weren't natural, they didn't sound like people, but you

0:41:41.719 --> 0:41:45.439
<v Speaker 1>could understand what they were saying. And finally, something else

0:41:45.440 --> 0:41:48.840
<v Speaker 1>that would enable smart speakers and virtual assistants was the

0:41:48.880 --> 0:41:53.240
<v Speaker 1>pairing of improved network connectivity and cloud computing. That removes

0:41:53.239 --> 0:41:56.319
<v Speaker 1>the need for the device that you're interacting with to

0:41:56.400 --> 0:41:59.440
<v Speaker 1>do all the processing on its own. So, if you

0:41:59.440 --> 0:42:01.799
<v Speaker 1>think about or the history of computing, we used to

0:42:01.880 --> 0:42:05.160
<v Speaker 1>do mainframes with dumb terminals that attached to the mainframe,

0:42:05.440 --> 0:42:08.120
<v Speaker 1>so the terminal wasn't doing any computing. It was just

0:42:08.200 --> 0:42:11.520
<v Speaker 1>tapping into the mainframe computer, which was sending results back

0:42:11.520 --> 0:42:13.880
<v Speaker 1>to the terminal. Then you get to the era of

0:42:13.960 --> 0:42:17.640
<v Speaker 1>personal computers, where you had a device sitting on your

0:42:17.680 --> 0:42:20.560
<v Speaker 1>desk that did all the computing and it didn't connect

0:42:20.560 --> 0:42:23.920
<v Speaker 1>to anything else. Then we get up to networking and

0:42:23.960 --> 0:42:27.640
<v Speaker 1>the Internet, where we suddenly had the capability of having

0:42:27.840 --> 0:42:31.720
<v Speaker 1>really powerful computers or grids of computers that were able

0:42:31.760 --> 0:42:35.200
<v Speaker 1>to take on processing power. Uh, and you just you

0:42:35.239 --> 0:42:38.719
<v Speaker 1>send the request out to the Internet and you get

0:42:38.719 --> 0:42:42.080
<v Speaker 1>the response back. That's the basis of cloud computing. So

0:42:43.200 --> 0:42:47.000
<v Speaker 1>your your command or message or whatever relays back to

0:42:47.160 --> 0:42:50.759
<v Speaker 1>servers on the cloud that then process it and send

0:42:50.760 --> 0:42:54.759
<v Speaker 1>the proper response to whatever device you're interacting with, and

0:42:54.800 --> 0:42:57.120
<v Speaker 1>then you get the result. So with the case of

0:42:57.120 --> 0:42:59.840
<v Speaker 1>the smart speaker, it might be playing a specific

0:43:00.000 --> 0:43:02.360
<v Speaker 1>song or giving you a weather report or whatever it

0:43:02.440 --> 0:43:05.279
<v Speaker 1>might be. Now, if the speakers were doing some of

0:43:05.320 --> 0:43:09.759
<v Speaker 1>that computation themselves, that would be an example of edge computing,

0:43:10.160 --> 0:43:13.440
<v Speaker 1>where the processing takes place at least in part, at

0:43:13.480 --> 0:43:16.799
<v Speaker 1>the edge of a network at those end points. But

0:43:16.960 --> 0:43:20.240
<v Speaker 1>for now, most of the implementations we see send data

0:43:20.280 --> 0:43:22.120
<v Speaker 1>back to the cloud to get the right response, so

0:43:22.160 --> 0:43:25.240
<v Speaker 1>you have to have a persistent Internet connection. These devices

0:43:25.280 --> 0:43:28.480
<v Speaker 1>are not useful without that connection. You do have some

0:43:28.600 --> 0:43:32.360
<v Speaker 1>smart speakers that can connect to another device like a

0:43:32.360 --> 0:43:36.320
<v Speaker 1>smartphone via Bluetooth, so you could do things that way,

0:43:36.760 --> 0:43:40.920
<v Speaker 1>but without those connections, the smart speaker turns into, you know,

0:43:41.040 --> 0:43:44.320
<v Speaker 1>just a dumb speaker, or sometimes just a paperweight. Now,

0:43:44.840 --> 0:43:48.040
<v Speaker 1>this collection of technologies and disciplines are what enabled Apple

0:43:48.440 --> 0:43:52.360
<v Speaker 1>to introduce Siri in two thousand and eleven, and Siri

0:43:52.440 --> 0:43:56.160
<v Speaker 1>is a virtual assistant. Siri's origins actually trace back to

0:43:56.520 --> 0:44:00.480
<v Speaker 1>the Stanford Research Institute and a group of guys, Gruber,

0:44:00.560 --> 0:44:04.279
<v Speaker 1>Adam Cheyer, and Dag Kittlaus, who had been working on

0:44:04.320 --> 0:44:08.240
<v Speaker 1>the concept since the nineteen nineties, and when Apple launched

0:44:08.239 --> 0:44:11.000
<v Speaker 1>the iPhone in two thousand seven, they saw the iPhone

0:44:11.040 --> 0:44:14.680
<v Speaker 1>as a potential platform for this virtual assistant that they

0:44:14.719 --> 0:44:17.520
<v Speaker 1>had been building, and they thought, well, this is perfect

0:44:17.560 --> 0:44:20.879
<v Speaker 1>because the iPhone has a microphone, so the assistant can

0:44:20.920 --> 0:44:23.719
<v Speaker 1>respond to voice commands; it has a speaker, so it could

0:44:23.719 --> 0:44:26.200
<v Speaker 1>communicate back to the user, it could do all sorts

0:44:26.239 --> 0:44:30.480
<v Speaker 1>of stuff. We can tap into the interoperability of apps

0:44:30.520 --> 0:44:33.160
<v Speaker 1>on the device. It's a perfect platform for us to

0:44:33.239 --> 0:44:36.839
<v Speaker 1>deploy this. So they developed an app once the opportunity

0:44:36.880 --> 0:44:41.279
<v Speaker 1>arose because apps were not available for development immediately when

0:44:41.320 --> 0:44:45.840
<v Speaker 1>Apple launched the iPhone, and once they did launch that app,

0:44:46.719 --> 0:44:50.240
<v Speaker 1>uh within a month, less than a month, Steve Jobs

0:44:50.280 --> 0:44:52.040
<v Speaker 1>was on the phone calling them up and offering to

0:44:52.120 --> 0:44:54.560
<v Speaker 1>buy the technology, which of course they would agree to

0:44:54.880 --> 0:44:57.920
<v Speaker 1>and it would become an integrated component in Apple's iPhone

0:44:57.960 --> 0:45:02.480
<v Speaker 1>line afterward. And that's where voice assistants kind of lived

0:45:02.640 --> 0:45:05.560
<v Speaker 1>for a few years. They mostly lived on smartphones like

0:45:05.640 --> 0:45:10.080
<v Speaker 1>the iPhone. But in November two thousand fourteen, Amazon introduced

0:45:10.160 --> 0:45:14.400
<v Speaker 1>the Amazon Echo smart speaker, which was originally only available

0:45:14.440 --> 0:45:17.600
<v Speaker 1>for Prime members, and it had its own virtual assistant

0:45:17.760 --> 0:45:22.879
<v Speaker 1>named Alexa, and thus the smart speaker era officially began. Now,

0:45:22.920 --> 0:45:25.480
<v Speaker 1>there are plenty of other smart speakers that are on

0:45:25.520 --> 0:45:28.600
<v Speaker 1>the market these days. There are products from Google like

0:45:28.719 --> 0:45:31.879
<v Speaker 1>Google Home. Uh, there are Sonos speakers that can

0:45:31.920 --> 0:45:35.920
<v Speaker 1>connect to services like Amazon's Alexa or Google's Assistant, and

0:45:36.000 --> 0:45:38.799
<v Speaker 1>we're probably going to see a ton more, both from

0:45:38.880 --> 0:45:42.120
<v Speaker 1>companies that piggyback onto services from the big providers like

0:45:42.160 --> 0:45:45.160
<v Speaker 1>Google and Amazon, and maybe some that are trying to

0:45:45.200 --> 0:45:47.360
<v Speaker 1>make a go of it with their own branded virtual

0:45:47.400 --> 0:45:52.040
<v Speaker 1>assistants and services. Smart speakers respond to commands after they

0:45:52.120 --> 0:45:55.880
<v Speaker 1>quote unquote hear a wake up word or phrase. Now,

0:45:56.120 --> 0:45:59.040
<v Speaker 1>I'm gonna make up a wake up phrase right now

0:45:59.239 --> 0:46:02.440
<v Speaker 1>so that I don't set off anyone's smart speaker or

0:46:02.480 --> 0:46:05.520
<v Speaker 1>smart watch or smartphone or smart car or whatever it

0:46:05.600 --> 0:46:08.520
<v Speaker 1>might be. So this is just a fictional example of

0:46:08.560 --> 0:46:11.719
<v Speaker 1>a wake up phrase. So let's say I have a

0:46:11.719 --> 0:46:15.040
<v Speaker 1>smart speaker and the wake up phrase for my smart

0:46:15.080 --> 0:46:18.680
<v Speaker 1>speaker happens to be hey there, Genie. Well, my smart

0:46:18.719 --> 0:46:21.120
<v Speaker 1>speaker has a microphone, so it can detect when I

0:46:21.160 --> 0:46:27.480
<v Speaker 1>say that, but really it's constantly detecting all sounds in

0:46:27.680 --> 0:46:31.000
<v Speaker 1>its environment. The microphone is always active. It has to

0:46:31.040 --> 0:46:33.239
<v Speaker 1>be in order to be able to pick up on

0:46:33.360 --> 0:46:38.160
<v Speaker 1>when I say the wake up phrase. So the microphone

0:46:38.200 --> 0:46:41.480
<v Speaker 1>is always active on most smart speakers. There are some where you

0:46:41.520 --> 0:46:44.160
<v Speaker 1>can program it so that it will only activate if

0:46:44.160 --> 0:46:47.480
<v Speaker 1>you first touch the speaker and that wakes it up.

0:46:47.840 --> 0:46:49.680
<v Speaker 1>There's some that you can do that with, But for

0:46:49.719 --> 0:46:53.359
<v Speaker 1>the most part, they're always listening. While the speaker can

0:46:53.480 --> 0:46:57.960
<v Speaker 1>quote unquote hear everything, it's not listening to everything. In

0:46:57.960 --> 0:47:01.200
<v Speaker 1>other words, it's not monitoring the specific things

0:47:01.200 --> 0:47:03.640
<v Speaker 1>being said. At least that's what we've been told. And honestly,

0:47:04.200 --> 0:47:07.320
<v Speaker 1>that makes a ton of sense from an operational standpoint.

0:47:07.520 --> 0:47:10.080
<v Speaker 1>And the reason I say that is that the sheer

0:47:10.120 --> 0:47:13.359
<v Speaker 1>amount of information that would be flooding in from all

0:47:13.440 --> 0:47:16.919
<v Speaker 1>the microphones on all the smart devices from any one

0:47:17.080 --> 0:47:20.160
<v Speaker 1>provider that happened to be deployed all over the world,

0:47:20.360 --> 0:47:23.880
<v Speaker 1>that would be an astounding amount of data. And sifting

0:47:23.920 --> 0:47:27.000
<v Speaker 1>through all that data to find stuff that's useful would

0:47:27.040 --> 0:47:29.840
<v Speaker 1>take an enormous amount of effort and time and and

0:47:29.960 --> 0:47:33.719
<v Speaker 1>processing power. So while you could have all the microphones

0:47:33.760 --> 0:47:37.120
<v Speaker 1>listening in all over the place, finding out who to

0:47:37.200 --> 0:47:39.680
<v Speaker 1>listen to at what time would be a lot trickier

0:47:39.719 --> 0:47:41.840
<v Speaker 1>and probably not worth the effort it would take to

0:47:41.880 --> 0:47:46.440
<v Speaker 1>pull something like that off. So what these speakers and

0:47:46.560 --> 0:47:50.240
<v Speaker 1>other devices are actually doing is looking for a signal

0:47:50.480 --> 0:47:53.640
<v Speaker 1>that matches the one that represents the wake phrase. So

0:47:53.719 --> 0:47:57.799
<v Speaker 1>when I say, hey there, Genie, the microphone picks up

0:47:57.800 --> 0:48:00.920
<v Speaker 1>my voice, which the mic then translates into an

0:48:00.920 --> 0:48:04.640
<v Speaker 1>electrical signal which gets digitized and compared against the digital

0:48:04.640 --> 0:48:09.000
<v Speaker 1>fingerprint of the predesignated wake up phrase. And in this case,

0:48:09.480 --> 0:48:13.880
<v Speaker 1>the two phrases match. It's like a fingerprint matching something

0:48:13.920 --> 0:48:16.719
<v Speaker 1>that was left at a site. So that turns the

0:48:16.760 --> 0:48:20.080
<v Speaker 1>speaker into an active listener rather than a passive one.

0:48:20.120 --> 0:48:23.600
<v Speaker 1>It's ready to accept a command or a question and

0:48:23.680 --> 0:48:27.200
<v Speaker 1>to respond to me. But if I didn't say, hey,

0:48:27.280 --> 0:48:30.920
<v Speaker 1>there, Genie, then the speaker would remain in passive mode

0:48:31.360 --> 0:48:35.239
<v Speaker 1>because it wouldn't have a digital fingerprint that matches the

0:48:35.320 --> 0:48:38.400
<v Speaker 1>one of the wake up phrase. Everything stays at the

0:48:38.440 --> 0:48:41.840
<v Speaker 1>local level, and none of my sweet secret speech gets

0:48:41.880 --> 0:48:45.080
<v Speaker 1>transmitted across the internet. It's all staying right there.

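NOTE
Editor's sketch, not part of the spoken audio. The wake-phrase gating described above can be illustrated roughly as follows.
This is a simplification under stated assumptions: plain text strings stand in for the digitized audio "fingerprint", and only the one request after the wake phrase leaves the device.
The names WAKE_PHRASE, normalize, and handle are hypothetical, and real devices do not compare text like this.
```python
# Hypothetical sketch: strings stand in for the audio "fingerprint".
WAKE_PHRASE = "hey there genie"  # the made-up wake phrase from the episode
def normalize(utterance):
    # Lowercase and drop punctuation so the comparison is on words only.
    kept = "".join(ch for ch in utterance.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())
def handle(utterances):
    # Returns only what would be sent to the cloud; everything else stays local.
    sent_to_cloud = []
    awake = False
    for utterance in utterances:
        text = normalize(utterance)
        if awake:
            sent_to_cloud.append(text)  # the request right after the wake phrase
            awake = False               # drop back to passive mode
        elif text == WAKE_PHRASE:
            awake = True                # fingerprint matched: start active listening
    return sent_to_cloud
heard = ["pass the peanut butter", "Hey there, Genie", "play a song", "my secret plans"]
uploaded = handle(heard)
```
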
0:48:45.560 --> 0:48:47.960
<v Speaker 1>At least that's what we've been told. And again I

0:48:48.000 --> 0:48:50.719
<v Speaker 1>don't have any reason to disbelieve this, but it is

0:48:50.760 --> 0:48:53.800
<v Speaker 1>something to keep in mind. You are talking about devices

0:48:53.840 --> 0:48:55.960
<v Speaker 1>that have microphones. Of course, if you have a smartphone,

0:48:56.000 --> 0:48:57.719
<v Speaker 1>you've already got one of those or a cell phone.

0:48:57.719 --> 0:49:00.600
<v Speaker 1>In general, you've got a device with a microphone on

0:49:00.680 --> 0:49:04.040
<v Speaker 1>it near you pretty much all the time. Now,

0:49:04.680 --> 0:49:07.360
<v Speaker 1>once I do make a request with my smart speaker,

0:49:07.560 --> 0:49:09.880
<v Speaker 1>the speaker then sends that request up to the cloud

0:49:10.000 --> 0:49:14.120
<v Speaker 1>where it gets processed, it's analyzed, uh, and then a

0:49:14.160 --> 0:49:18.080
<v Speaker 1>proper response is returned to me, whether that is playing

0:49:18.080 --> 0:49:20.480
<v Speaker 1>a song or giving me information I've asked for, or

0:49:20.520 --> 0:49:23.160
<v Speaker 1>maybe even interacting with some other smart device in my home,

0:49:23.280 --> 0:49:26.880
<v Speaker 1>such as adjusting the brightness of my smart lights in

0:49:26.960 --> 0:49:30.160
<v Speaker 1>my house. Now, if the system is not sure about

0:49:30.200 --> 0:49:34.240
<v Speaker 1>whatever it was I just said, it will probably return

0:49:34.400 --> 0:49:37.520
<v Speaker 1>an error phrase. So maybe maybe I'm too far away

0:49:37.600 --> 0:49:40.759
<v Speaker 1>from the speaker, so it's it couldn't quote unquote hear

0:49:40.840 --> 0:49:43.719
<v Speaker 1>me really well. Or maybe I've got a mouthful of

0:49:43.760 --> 0:49:46.600
<v Speaker 1>peanut butter or something, as I am wont to do. Then

0:49:46.640 --> 0:49:48.600
<v Speaker 1>I'm going to get something like I'm sorry, I don't

0:49:48.640 --> 0:49:50.480
<v Speaker 1>know how to do that, or I'm sorry I didn't

0:49:50.520 --> 0:49:53.359
<v Speaker 1>understand you, and then I'd have to repeat it. Now,

0:49:53.400 --> 0:49:57.040
<v Speaker 1>smart speakers are pretty cool. However, they do represent another

0:49:57.080 --> 0:50:02.040
<v Speaker 1>piece of technology that you have to network to other devices,

0:50:02.200 --> 0:50:06.120
<v Speaker 1>including your own home network, and as such that means

0:50:06.200 --> 0:50:10.600
<v Speaker 1>that they represent a potential vulnerability in a network. It

0:50:10.640 --> 0:50:14.520
<v Speaker 1>doesn't mean they're automatically vulnerable, but it means that every

0:50:14.520 --> 0:50:18.319
<v Speaker 1>time you are connecting something to your network, then you're

0:50:18.400 --> 0:50:23.960
<v Speaker 1>creating another potential attack vector for a hacker. Right now,

0:50:24.000 --> 0:50:28.879
<v Speaker 1>if everything is super strong, it it doesn't really effectively

0:50:29.040 --> 0:50:32.799
<v Speaker 1>change your safety in any meaningful way. But if one

0:50:32.840 --> 0:50:35.880
<v Speaker 1>of those things that you connect to your network is

0:50:36.000 --> 0:50:38.640
<v Speaker 1>less strong than the others, you're looking at the weakest

0:50:38.680 --> 0:50:41.520
<v Speaker 1>link situation where a hacker with the right know-how

0:50:41.560 --> 0:50:45.280
<v Speaker 1>and tools could potentially target that part of your network

0:50:45.600 --> 0:50:49.320
<v Speaker 1>to get entry into everything else. And when you're talking

0:50:49.360 --> 0:50:53.120
<v Speaker 1>about a smart speaker, you're talking about a device that has

0:50:53.400 --> 0:50:57.440
<v Speaker 1>an active microphone on it. So potentially, if someone were

0:50:57.480 --> 0:50:59.879
<v Speaker 1>able to compromise a smart speaker, they would be able

0:50:59.920 --> 0:51:03.120
<v Speaker 1>to listen in on anything that was within range of that

0:51:03.200 --> 0:51:07.920
<v Speaker 1>smart speaker's microphone. So that's why you have to at

0:51:07.960 --> 0:51:11.759
<v Speaker 1>least be cognizant of that, do your research, make sure

0:51:11.800 --> 0:51:15.000
<v Speaker 1>the devices you're connecting to your network are rated well

0:51:15.040 --> 0:51:18.640
<v Speaker 1>from a security standpoint. When you're setting things up

0:51:18.680 --> 0:51:22.239
<v Speaker 1>and you have to create passwords, create strong passwords that

0:51:22.320 --> 0:51:26.200
<v Speaker 1>are not used anywhere else. The harder you make things

0:51:26.440 --> 0:51:30.160
<v Speaker 1>the more likely hackers will just pass you by, not

0:51:30.239 --> 0:51:33.919
<v Speaker 1>because you're too tough to crack. Never get it into

0:51:33.960 --> 0:51:37.200
<v Speaker 1>your head that you're too strong to to be hacked,

0:51:37.440 --> 0:51:41.160
<v Speaker 1>but rather, if there's someone who's weaker, then the hackers

0:51:41.160 --> 0:51:43.600
<v Speaker 1>are going to go after that person instead. So just

0:51:43.680 --> 0:51:48.640
<v Speaker 1>don't be the weak person. Practice really good security behaviors,

0:51:48.840 --> 0:51:53.279
<v Speaker 1>and you're more likely to discourage attackers and they'll they'll

0:51:53.320 --> 0:51:57.359
<v Speaker 1>go on to someone else. Um, especially if you're talking

0:51:57.360 --> 0:51:59.960
<v Speaker 1>about newbies who don't really know their way around, they're

0:52:00.080 --> 0:52:02.919
<v Speaker 1>just using tools that other people have designed. They get

0:52:02.920 --> 0:52:05.319
<v Speaker 1>discouraged very quickly. They'll move on to someone else because

0:52:05.360 --> 0:52:09.359
<v Speaker 1>there's always another potential target. I'm curious about you guys,

0:52:09.360 --> 0:52:12.720
<v Speaker 1>whether or not you have any smart speakers in your life,

0:52:13.200 --> 0:52:15.919
<v Speaker 1>and uh if you find them useful. I find mine

0:52:15.960 --> 0:52:20.160
<v Speaker 1>pretty useful. I use it for a very narrow range

0:52:20.320 --> 0:52:23.480
<v Speaker 1>of things. I don't tend to use it. I definitely

0:52:23.520 --> 0:52:25.680
<v Speaker 1>don't use it to its full potential. I know that

0:52:26.239 --> 0:52:29.080
<v Speaker 1>because once in a blue moon I'll just try something

0:52:29.120 --> 0:52:31.719
<v Speaker 1>and I'm amazed at what happens when when I get

0:52:31.719 --> 0:52:34.759
<v Speaker 1>a response. But for the most part, I'm asking about

0:52:34.800 --> 0:52:38.080
<v Speaker 1>the weather, what I can feed my dog, whether or not

0:52:38.120 --> 0:52:41.040
<v Speaker 1>it can turn on the lights and uh and and

0:52:41.760 --> 0:52:45.319
<v Speaker 1>that's about it. Or occasionally playing a song. Um, but

0:52:45.480 --> 0:52:47.759
<v Speaker 1>I'm curious what you guys are using them for. Reach

0:52:47.800 --> 0:52:50.480
<v Speaker 1>out to me on social networks on Facebook and I'm

0:52:50.520 --> 0:52:52.480
<v Speaker 1>on Twitter, and the handle for both of those is

0:52:52.520 --> 0:52:56.880
<v Speaker 1>TechStuffHSW. Also use those handles

0:52:56.880 --> 0:52:59.320
<v Speaker 1>if you have suggestions for future episodes. If you've got,

0:52:59.440 --> 0:53:02.120
<v Speaker 1>you know, an idea for either a company or a

0:53:02.160 --> 0:53:05.640
<v Speaker 1>technology or a theme in tech you'd really like me

0:53:05.719 --> 0:53:08.759
<v Speaker 1>to tackle, let me know there and I'll talk to

0:53:08.760 --> 0:53:16.279
<v Speaker 1>you again really soon. Tech Stuff is a production of

0:53:16.280 --> 0:53:19.319
<v Speaker 1>I Heart Radio's How Stuff Works. For more podcasts from

0:53:19.360 --> 0:53:23.120
<v Speaker 1>I Heart Radio, visit the I Heart Radio app, Apple Podcasts,

0:53:23.280 --> 0:53:25.240
<v Speaker 1>or wherever you listen to your favorite shows.