Speaker 1: Welcome to TechStuff, a production of iHeartRadio's How Stuff Works. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio and I love all things tech. And guys, stick with me: I am fighting off a cold. You'll be able to hear it in my voice, I have no doubt. But you know, I wanted to get you guys a brand new episode, so we're gonna fight on, because the show must keep going. I think... I think that's the saying. Oh no, this cold medicine is good, though. All right. Anyway, I thought that we would do an episode about smart speakers, because I wanted to kind of start this whole episode off with an old man observation, you know, a get-off-my-lawn kind of thing. And this is from our resident old man, Old Man Strickland, that meaning me.

Speaker 1: So, when I was young, speakers were dumb. Now, I don't mean that speakers were useless, or that they were terrible, or that they were incapable of replicating certain frequencies or volumes of sound, or that they were limited in some other way, other than that they didn't, quote unquote, think. They didn't connect to any sort of computational engine in a meaningful way. You might have a set of speakers plugged into a computer, but that was just a one-way communications tool, right? It was just a way to provide an outlet for sound that your computer was generating, nothing more than that. But contrast that with today, when we have numerous smart speakers on the market. These speakers act as a user interface between us and the Internet at large, often facilitated by a virtual assistant of some kind. Now, with these speakers, we don't just listen to stuff like music and podcasts and the radio and, you know, other traditional audio content. We use them to find out information. We might link them to our calendars so that we can get reminders for upcoming appointments. We probably use them to ask about the weather report.
Speaker 1: I use mine at home for that all the time. Or, even more often than that, if you're at my house, you'll hear us use it to find out which foods are safe for us to feed to our dog. My doggie, Tibolt, absolutely loves our smart speaker because it frequently gives us permission to spoil him with a carrot or a piece of banana. But how do these smart speakers work? How are they able to respond to our requests? And what are their limitations? How safe are they? That's the sort of stuff we're going to be looking into in this episode of TechStuff, and we'll start off with the basics, which means we have to start off with how speakers work in general.

Speaker 1: Now, this is something that I've covered before on TechStuff, but I want to go over it again from a high level, because, well, I just find it fascinating that people figured out how to harness electricity to drive a motor so that it could in turn cause components to replicate a recorded or transmitted sound. And really, "motor" is being too generous; it's more that electricity drives an element to create vibrations that can replicate a sound that was made into another component. That whole thing just boggles my mind, that people are smart enough to figure that out. Okay. So to understand how speakers work, it first helps to understand how sound itself works. Sound is a physical phenomenon. Do do do do. Sound is all about vibrations, and typically we experience sound when we pick up on changes in air pressure that enter through our ear canal and then affect the tympanic membrane, or eardrum. So it's all about these changes of air pressure, all about air molecules transmitting vibrations from a source outward in a radiating pattern from that source. So let's think of someone knocking on a door, for example. You're inside a house, and someone's knocking on your door.
Speaker 1: When that person's hand hits the door, it causes the door to vibrate, and that vibration transmits to the surrounding air molecules on the other side of the door. They get pushed by that vibration, and then pulled when the wood vibrates back toward its original position. So the air molecules vibrate, those air molecules cause the next surrounding layer of air molecules to vibrate as well, and so on and so forth. It's like a cascade or domino effect. You get these little pockets of high and low air pressure that travel outward from that door. It spreads further as it goes out to, you know, any distance, and if you are close enough that you can still detect those changes in air pressure, you experience this by hearing the knocking on the door. Those vibrating air molecules lose a bit of energy as they move outward, right? As they vibrate to the next layer, you lose a bit of energy with each transmission, so the sound gets quieter the further away you are, because there are not as many air molecules vibrating; its amplitude has decreased. So if you are in hearing range, you can pick up on those changes of air pressure. They encounter the tympanic membrane in your ear canal, and those changes in pressure will cause a reaction in your middle and inner ear that will ultimately get picked up by your brain, which interprets it as sound.

Speaker 1: Now, the frequency at which those fluctuations occur relates to the pitch that we hear. So faster vibrations are higher pitches, higher frequencies, higher notes, if you think of a musical scale. We perceive the force of the changes as volume, so lower forces mean lower volume, right, and higher forces mean higher volume. The human ear can hear a pretty decent range of frequencies, from twenty hertz, which means twenty cycles or twenty waves per second past a given point of reference, up to twenty kilohertz; that's twenty thousand cycles or waves per second. So yeah, the cycle refers to the frequency of the sound wave. The lower the frequency, the lower the sound.
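To put those two relationships in code, here is a minimal Python sketch of synthesizing a tone digitally: the frequency argument sets the pitch and the amplitude argument sets the loudness. The specific notes and numbers are just illustrative.

```python
import numpy as np

SAMPLE_RATE = 44_100  # samples per second, a common audio rate

def tone(frequency_hz, amplitude, duration_s=1.0):
    """Return samples of a sine wave: frequency sets pitch, amplitude sets volume."""
    t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
    return amplitude * np.sin(2 * np.pi * frequency_hz * t)

low_quiet = tone(110.0, 0.2)    # 110 Hz: a low pitch, played quietly
high_loud = tone(1_760.0, 0.8)  # 1,760 Hz: a much higher pitch, played louder
# Human hearing covers roughly 20 Hz up to about 20,000 Hz (20 kHz).
```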
Speaker 1: All right, and then our brain has to make meaning of all this, right? It's not just that it's picking up on it. Our brain interprets this, and we experience it as a sound we have heard. So it either matches this perceived sound with one we've encountered before, and then we say, oh, I know what that is, that's someone knocking at the door, or it might be, holy cow, I've never heard that sound in my life, I have no idea what it is. If the sound is language, then our brains have to derive the meaning from the perceived sound. We've heard someone say words, such as you're hearing me say this, and then our brains have to take that collection of sounds and ask, what does that actually mean? What is the context? What is the intent? What is the message here? Otherwise it would just be, you know, random noises that I'm making with my mouth.

Speaker 1: All right, so we have a basic understanding of the physics of sound. Now to talk about speakers and microphones, and the reason I'm going to talk about both of them is that the devices complement one another. You can think of one as being the other in reverse. Plus, with smart speakers we have to talk about microphones anyway, because smart speakers have microphones as well as the speaker element. So you can think of this as one long process of taking the physical phenomenon of sound waves, transforming that physical phenomenon into an electrical signal, then taking the electrical signal and changing it back into something that can produce the sound waves that started the whole thing. So you're replicating the original sound waves with this end device, which in this case is a loudspeaker. So the microphone is the part of the process where you take the sound and you turn it into an electrical signal, and the speaker is where you take the electrical signal and you turn it back into actual sound. That's the simple way to put it.
Speaker 1: But what's actually happening? Well, let's talk about it on a physical level. Sound waves go into a microphone. So you've got these fluctuations in air pressure that encounter a microphone. I'm speaking into a microphone right now, so this is happening right now. Inside the microphone is a very thin diaphragm, typically made out of a very flexible plastic, and it's sort of like the skin of a drum. So as the changes in air pressure encounter the diaphragm, they cause the diaphragm to move back and forth. Well, attached to the diaphragm is a coil of conductive wire, and that coil wraps either around or near a permanent magnet. Magnets have magnetic fields; they have a north pole and a south pole, and there's a magnetic field that surrounds the magnet. And the electromagnetic effect means that if you move a coil of conductive wire through a magnetic field, it will produce a change in voltage in that coil, otherwise known as an electromotive force, and that means electrical current will flow through the coil. Now, if you have the end of that coil attached to a wire, a conductive wire for that current to flow through, you can send that current on to other components. For our purposes, the component in question would be an amplifier, and I'll get to explaining why that is in just a moment.
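For reference, the induced voltage being described is Faraday's law of induction. A minimal statement of it, with N the number of turns in the coil and Φ_B the magnetic flux passing through it:

\[ \mathcal{E} = -N \frac{d\Phi_B}{dt} \]

The faster the diaphragm, and with it the coil, moves, the faster the flux changes, and the larger the voltage the microphone generates.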
Speaker 1: But first, let's talk about loudspeakers, and the way a loudspeaker works is essentially the reverse of a microphone. You've got your permanent magnet, around or near which is a coil of conductive wire. The coil is connected to a diaphragm, one much larger and typically made out of stiffer material than the plastic you'd find in a microphone. This is the element inside a speaker that will vibrate, that will push air and pull air as it moves either outward or inward. The electrical signal comes from a source, such as the microphone we were just using a second ago; it comes into the loudspeaker and it flows through the coil. Now, when you have an electrical current flowing through a conductive coil, you generate a magnetic field, because of the laws of electromagnetism. You've got an electromagnetic field generated as a result. Now, that field will interact with the magnetic field of the permanent magnet. The permanent magnet always has a magnetic field; the coil only has one when electric current is flowing through it. And as I said, magnets have a north pole and a south pole, and we also know that when we bring two magnets together with their north poles facing, they'll push against each other, right, because like repels like. But if we turn one of those magnets around so that now it's a south pole facing a north pole, they attract one another; you know, opposites attract. So with this magnetic field being generated by the coil, it starts to interact with the magnetic field of the permanent magnet, and they start to push and pull against each other. Well, the coil is attached to that diaphragm, so it in turn drives the diaphragm to either push outward or pull inward. That causes air molecules to vibrate, just as it would with any other, you know, source of sound, and it emanates outward from the loudspeaker. So you get a representation of the same sound that was going into the microphone: the sound got converted into an electrical current, the electrical current was then passed through a coil next to a permanent magnet to create the same sort of movement, and that replicates the movement of the original diaphragm in the microphone and generates the sound. So you get a replication of the sound that was made in the other location. It's pretty cool, I think.
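The push and pull on the speaker side is the companion effect. For the idealized case of a straight wire of total length ℓ carrying current I at right angles to a uniform magnetic field of strength B, the force on the wire is

\[ F = B I \ell \]

so the force, and with it the diaphragm's motion, tracks the current flowing through the coil. That is why the cone traces out the same waveform the electrical signal carries.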
Speaker 1: Now, I did mention earlier that you would need an amplifier, and the reason you need an amplifier is that the electrical signal generated by a microphone is far too weak to drive a loudspeaker's diaphragm. You just wouldn't have the juice to do it. It would be much, much less powerful than what the speaker would need. So chances are the diaphragm would either not move at all, because it would just be too stiff and would resist the movement too much, or it would move so weakly as to generate little to no sound, so it wouldn't do you any good. So the signal from the microphone has to first pass through an amplifier, which, as the name implies, takes an incoming signal and increases the amplitude of that signal, the volume, in other words. So it doesn't affect pitch, but it does affect the signal strength and, consequently, the volume.
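As a rough illustration of what boosting the amplitude means in numbers, gain is commonly expressed in decibels. The voltage figures below are placeholders for the sake of the example, not measurements from any particular microphone or amplifier.

```python
import math

def voltage_gain_db(v_in, v_out):
    """Express a voltage gain on the decibel scale: 20 * log10(Vout / Vin)."""
    return 20 * math.log10(v_out / v_in)

# Boosting a roughly 10-millivolt microphone-level signal up to about 1 volt:
print(voltage_gain_db(0.010, 1.0))  # 40.0 dB of gain
```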
Speaker 1: I've done episodes about amplifiers, including explaining the difference between amplifiers that use vacuum tubes and ones that use transistors, so I'm not going to go into that here. Besides, it doesn't really factor into our conversation about smart speakers anyway; it's just important for making a microphone and speaker setup work. Now, over the years, engineers have paired microphones and speakers in lots of stuff. You've got telephones, you've got intercom systems, public address systems, handheld radios, all sorts of things. So that technology was well and truly mature before we ever got our first smart speaker. There wasn't much call to incorporate microphones into home speaker systems for many years. I mean, what would you actually use a microphone embedded in a speaker for, before smart speakers? Typically you would have your speakers, and I'm talking about, like, sound system speakers, hooked up to some other dumb (as in, not connected to a network) technology. So it might be a sound system or home entertainment setup with a television as the focal point, or maybe even, you know, a computer for the purposes of playing more dynamic sounds for, like, video games and things like that.

Speaker 1: But for a very long time, these were all thought of as one-way communications applications, right? Like, the sound was coming from a source and it would get to us through the speakers, but we weren't meant to send sound back through those same channels. The information was just coming to you; you weren't sending anything back. But that would all change in time. Now, one thing to keep in mind about smart speakers is that they are the product of several different technologies and lines of innovation and development that all converged together. The microphone and speaker technology is one of the oldest ones that we can point to, as far as the fundamental underlying technology is concerned; that's the stuff that's been around since the late nineteenth century. Now, there is one other we'll talk about that's even older, but I don't want to spoil things. I'll just mention there is an even older line of development that goes into smart speakers than the microphone and speaker stuff of the nineteenth century. Most of the other components, however, are much younger than that.

Speaker 1: One big one is speech or voice recognition. Creating computer systems that could detect noise was relatively simple, right? You could have a computer connected to microphones, and it could monitor the input from those microphones, and any incoming signal could be registered. Right, it could record an incoming signal that would indicate the microphone had detected a noise. That's child's play; that's easy to do. But teaching computers how to analyze those signals and decipher them, so that the computer could display in text or otherwise act upon that sound in a meaningful way, that was much more difficult. There was an IBM engineer named William C. Dersch of the Advanced Systems Development Division who created an early implementation of voice recognition. It was a very limited application, but it proved that the ability to interact with computers by voice was more than just science fiction. Within IBM, it was called the Shoebox.
Speaker 1: Dersch worked on this project in the early nineteen sixties, and what he produced was a machine that had a microphone attached to it. The machine could detect sixteen spoken words, which included the digits zero through nine plus some command indicators like plus, minus, total, subtotal. You get the idea. So you could speak a string of numbers and then commands to this device, then ask it to total everything, and it would do so. So it was more or less a basic calculator with some voice interpretation incorporated into it.
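Just to make that interaction concrete, here is a small sketch of the kind of spoken-calculator session being described, assuming the words have already been recognized and handed over as text. It's an illustration of the idea, not a model of how the Shoebox actually did its arithmetic.

```python
DIGITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

def shoebox_session(words):
    """Walk through recognized words, adding or subtracting digits until 'total'."""
    total, sign = 0, 1
    for word in words:
        if word in DIGITS:
            total += sign * DIGITS[word]
        elif word == "plus":
            sign = 1
        elif word == "minus":
            sign = -1
        elif word in ("total", "subtotal"):
            return total
    return total

print(shoebox_session(["seven", "plus", "three", "minus", "two", "total"]))  # -> 8
```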
Speaker 1: Now, there's a great newsreel piece about the Shoebox. There's a demonstration of it, and it came out in nineteen sixty-one, and I love that newsreel because it has that great music you would hear in the background of those old industrial and business films. Anyway, there's also a helpful chart that hangs in the background of that video, where Dersch is actually explaining how it works. You can see a little bit behind him of what is actually being analyzed. He broke the words down into phonemes and syllables, phonemes being the specific sounds that make up words. So, for example, the digit one is a single-syllable word with a vowel sound right at the front. But you also have the word eight; that's another single-syllable word with a vowel sound right at the front, but it's different from one phonetically in that eight also has a plosive, it has that hard T at the end. So the Shoebox was limited not just in what words it could recognize, but also in the types of voices it could recognize. Get someone who has a different dialect or manner of speech, and the machine might not be able to understand them, because they're not pronouncing the words the same way that Dersch did. This would be a big challenge in speech recognition moving forward, and it's also an example of where we find bias creeping into technology.

Speaker 1: And it's not necessarily a conscious thing, but if you have people designing a system and they're designing it based off their own, you know, speech patterns, their own pronunciations, their own dialects, then it may be that the system they create works really well for them and less well for anyone who isn't them. And the further away you are from their manner of speaking, the more frustration you will encounter as you try to interact with that technology. That's an example of bias, and in fact, if you read the histories of speech recognition and, as we'll get to later, natural language processing, you'll see a lot of people say it works great if you happen to be a white man, because the people who were designing it were primarily white men who were typically aiming for what is considered a non-accented American dialect, somewhere on, you know, the Eastern Seaboard side. But that meant that if you did have an accent or a dialect, or you had a different vernacular, it was harder for the systems to actually understand what you were saying. That's an example of bias.

Speaker 1: Well, the general strategy was again to break up speech into its constituent sound units, you know, those phonemes, and then to suss out which words were being spoken based on those phonemes. And that was done by digitizing the voice, transforming it from sound into data that represented stuff like the sound's frequency or pitch, and then matching up specific signal signatures with specific phonemes. So generally the idea was that the computer system would monitor incoming sound, convert the sound into digital data, and compare the data it had received with information stored in a database, in an effort to look for matches. The Shoebox database was just sixteen words in size.
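In code, that database-lookup idea amounts to template matching: store one signature per known word, then pick whichever stored word sits closest to the signature of the incoming sound. The feature numbers below are invented purely for illustration.

```python
import numpy as np

# Hypothetical stored signatures: one small feature vector per known word.
TEMPLATES = {
    "one":   np.array([0.2, 0.9, 0.1]),
    "eight": np.array([0.3, 0.8, 0.7]),
    "plus":  np.array([0.9, 0.2, 0.4]),
}

def recognize(features):
    """Return the stored word whose signature is nearest to the incoming features."""
    return min(TEMPLATES, key=lambda word: np.linalg.norm(TEMPLATES[word] - features))

print(recognize(np.array([0.25, 0.85, 0.65])))  # -> "eight"
```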
Speaker 1: Later databases would be much larger, but pretty quickly people realized this was not an efficient way of doing speech recognition, because the bigger the vocabulary, the more work-intensive it was to build out those databases. So it wasn't something that people thought would be sustainable for very large vocabularies. But the Shoebox marked the beginning of a serious effort to create machines that could accept audio cues as actual input, and as we'll see, that's one important component for these smart speaker systems. I've got a lot more to say, but before I get into the next part, let's take a quick break.

Speaker 1: Now, obviously we didn't jump right into full voice recognition right after IBM's Shoebox innovation. The challenges related to building automated speech recognition systems were numerous, even for just a single language, because, as I said, you can have accents and dialects, one voice can have a very different tonal quality from another, and people speak at different speeds. Teaching machines how to recognize speech when the phonemes and pacing of that speech aren't consistent from speaker to speaker, that's really hard. This kind of gets back to the same sort of challenges you have when you're teaching machines how to recognize images. You know, you teach a human what a coffee mug is, I always use this example, but you teach a human what a coffee mug is, and pretty soon they can extrapolate from that example and understand that coffee mugs can come in all different sizes and colors and, you know, different designs and textures. We get it. Like, you see a couple of coffee mugs, you understand. Machines, though, they aren't able to do that. Machines, you know, you have to give them lots and lots and lots of different examples before they can start to pick up on what things actually make a coffee mug. Same sort of thing with speech, right? So if you don't have consistency between speakers, it makes it very hard for machines to learn what people are saying.
Speaker 1: Now, it didn't take long for the tech industry at large to really dive into trying to solve this problem. In the early nineteen seventies, DARPA, that's the research and development division of the United States Department of Defense, got behind speech recognition in a big way. Now, remember, DARPA itself doesn't do research. The organization's purpose is to invite organizations to pitch projects that align with whatever DARPA's goals are, and DARPA would provide funding to the winning organizations to see those projects through to completion, if possible. So DARPA is really more of a vetting and funding organization. Anyway, DARPA created a five-year program called Speech Understanding Research, or SUR. The initial goal was pretty darn ambitious considering the capabilities of the technology at the time. The project director, Larry Roberts, wanted a system that would be capable of recognizing a vocabulary of ten thousand words with less than ten percent error. After holding a few meetings with some of the leading computer engineers of the day, Roberts adjusted that goal significantly. After that adjustment, the target was going to be a system capable of recognizing one thousand words, not ten thousand. The error levels still had to be less than ten percent, and the goal was for the system to be able to accept continuous speech, as opposed to very deliberate speech with pauses between each pair of words, which would not really be all that useful. One person who was skeptical about the potential success of this project was John R. Pierce of Bell Labs. He argued that any success would be limited so long as machines remained incapable of understanding the words, not just recognizing a word based on phonemes, but understanding what the word is. That is, Pierce felt that the machines needed some way to parse the language to get to the meaning of what was being said. That's an important idea that we will come back to in just a bit.
Speaker 1: Now, among the companies and organizations that landed contracts with DARPA were Carnegie Mellon University; BBN, which actually played a big part in developing ARPANET, the predecessor to the Internet; Lincoln Laboratory; and several more. And very smart people began to create systems intended to recognize speech in meaningful ways. The names of the programs were a lot of fun. There was HWIM, which stood for Hear What I Mean, as in hear, as in listen, hear what I mean. That one was from BBN. CMU introduced Hearsay, which was later designated Hearsay I, and then they came out with Hearsay II. They also would demonstrate another one called Harpy. Oh, and there was a professor at CMU named Dr. James Baker who would design a system called Dragon in nineteen seventy-five, which he would later leverage into a company with his wife, Dr. Janet M. Baker, in the nineteen eighties, and they had a very successful business with speech recognition software. Now, I'm not going to go into each of those programs in deep detail, but rather just mention that they all helped advance the cause of creating systems that can recognize speech. One of the big developments that came out of all that work was a shift to probabilistic models, which would also play a really important part in another phase of developing the smart speaker. So what do I mean when I say probabilistic? Well, as the name indicates, it all has to do with probabilities. Essentially, systems would analyze incoming phonemes and make guesses as to what was being said based on the probability of it being a given word or part of a word. The systems typically go with whatever word has the highest probability of being the correct one. Even with that approach, there are nuances to language that are difficult to account for with a machine.
Speaker 1: So, for example, you have homonyms, in which you have two words that sound the same but have very different meanings and potentially different spellings, like write, as in to write a sentence, or right, as in am I right or am I wrong? Or you could have a pair of words that sound like a single word and have confusion there, such as "a door." You can say "a door," meaning a single door, a door to go into a building, or you might say "adore," as in, I adore this podcast you're doing, Jonathan. That's sweet of you; thank you for saying that. So computer scientists were hard at work advancing both the capability of machines to make correct guesses at individual phonemes and then full words, as well as figuring out a way to teach machines to adjust guesses based on context. That requires a deeper understanding of the language within which you're working. If you're aware of certain idioms, you can make a good guess at a word or phrase even if you didn't get a clean pass at it, right? So, for example, the phrase "it's raining cats and dogs" just means it's raining a lot. And if a system included a database that indicated the phrase "cats and dogs" sometimes follows the phrase "it's raining," then the system is more likely to guess the correct sequence of words instead of guessing something that sounded similar but is wrong. For example, if it decided they must have said "it's raining bats and hogs," that would not make sense. So the systems estimate the probability that any given sequence of sounds within the database matches what the systems have just, quote unquote, heard.
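A toy version of that idea in code: the acoustic score alone can't reliably separate two similar-sounding phrases, but multiplying in a context prior, how likely each phrase is to follow "it's raining," settles the guess. All of the probabilities here are made up for the sake of the example.

```python
# How confident the recognizer is in each phrase from the sound alone (nearly a tie).
ACOUSTIC = {"cats and dogs": 0.48, "bats and hogs": 0.52}

# How likely each phrase is to follow the words "it's raining."
CONTEXT = {"cats and dogs": 0.30, "bats and hogs": 0.001}

def best_guess(candidates):
    """Pick the phrase with the highest combined (acoustic x context) probability."""
    return max(candidates, key=lambda phrase: ACOUSTIC[phrase] * CONTEXT[phrase])

print(best_guess(["cats and dogs", "bats and hogs"]))  # -> "cats and dogs"
```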
Speaker 1: Progress in this area was steady but slow, and I'd argue that it was also a reminder that concepts like Moore's law do not apply universally across technology. Rapid development in one particular domain of technology is not necessarily an indicator that the same sort of progress will be observed in all other areas of tech. We often get into the mistaken habit of believing that Moore's law applies to everything.

Speaker 1: All right. So a related concept to voice recognition is something called natural language processing, and this relates back to how we humans tend to process information compared to the way machines tend to do it. So we humans formulate ideas, we shape those ideas into words and sentences, and we communicate them in some way to other people through that language. It may be through speech, it may be through text, it may even be through a nonverbal or non-literary way, but we communicate those ideas. Machines typically accept input, they perform some process or sequence of processes on that input, and then they supply an output of some sort. Machines do this in machine language. That's a code that's far too difficult for humans to process easily. Binary is an example of machine language. Binary is represented as zeros and ones, which, when grouped together, can represent all sorts of stuff. But if you just looked at a big block of zeros and ones, it would mean nothing to you. It's not easy for humans to use, and machines in turn are not natively able to understand human language, so there's a language barrier there. Because of that, people created different programming languages. These languages provide layers of abstraction from the machine language. They make it easier to create programs, or directions that the computer should follow. So the person who's doing the programming is using a programming language that's easy for humans to use, and that then gets converted into machine language that the computers understand. But what if you could send commands to a computer using natural language, not even a programming language? You could just speak in plain vernacular, whether it's English or any other language, the way humans communicate with one another. What if a computer could extract meaning from a sentence, understand what it was you wanted the computer to do, and then respond appropriately?
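A crude sketch of what that could look like is below: map a plain-English request onto an action with simple keyword matching. Real assistants are far more sophisticated than this; the keywords and responses are invented for illustration.

```python
def handle(utterance):
    """Map a natural-language request to an action using naive keyword matching."""
    text = utterance.lower()
    if "weather" in text:
        return "look up today's forecast"
    if "remind" in text or "calendar" in text:
        return "add a calendar reminder"
    if "play" in text:
        return "start playing audio"
    return "sorry, I didn't catch that"

print(handle("What's the weather like today?"))  # -> look up today's forecast
print(handle("Remind me about the dentist"))     # -> add a calendar reminder
```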
Speaker 1: So imagine how much time you could save if you could just tell your computer what you wanted it to do and it took care of the rest. If you had a powerful enough computer system with strong enough AI, maybe you could even potentially do something like describe a game that you would love to be able to play, like, not a game that exists, a game in your head, and you could describe it to a computer and the computer could actually program that game. Well, we're definitely not anywhere close to that yet, but we have made enormous progress with natural language processing. Now, the history of natural language processing isn't exactly an extension of voice recognition; it's actually more like a parallel line of investigation. And that's because natural language processing doesn't require voice recognition. You can have an implementation in which you just write commands in natural language, you know, you type them out on a keyboard, and the machine then carries out those instructions. So much of the early work in natural language processing was in text-based communication rather than in speech. The history of natural language processing includes stuff like the Turing test, named after Alan Turing. So the most common interpretation of the Turing test these days is that you've got a scenario in which a person is alone in a room with a computer terminal. They can type whatever they like into the computer terminal, and someone or something is responding to them in real time. Now, it might be another person, or it might be a computer system that's responding to that person.
Speaker 1: You run a whole bunch of test subjects through this process, and if the computer system is able to fool a certain percentage of those test subjects, say thirty percent of them, into believing that it is in fact another human and not a computer, it is said to have passed the Turing test. And typically we use that to mean the machine has given off the appearance of possessing intelligence similar to the kind that we humans possess. That gets beyond our scope for this episode, but it helps point out that stuff like speech recognition and natural language processing are both closely related to the field of artificial intelligence; in fact, they really belong within the artificial intelligence domain. The Turing test was more of a hypothetical. It was a bit of a cheeky way of saying, hey, if you can't tell whether or not something is intelligent, it makes sense to treat it as if it actually is intelligent. After all, we assume that every human with whom we interact possesses some level of intelligence based on those interactions, so why should we not extend the same courtesy to machines?

Speaker 1: Now, natural language processing would prove to be another super challenging problem to solve in computer science. Early work was done in translation algorithms, and these were programs that attempted to take phrases written in one language and translate them automatically into a second language. At first, that seemed pretty straightforward, but you realize it's actually pretty tricky. For one thing, you can't just translate word for word and keep the same order from one language to another. The syntax, the rules that the language follows, can be different from language to language. In one language, you might use an infinitive, such as "to record," in the middle of a sentence, while another language might put all the infinitives at the end of a sentence. So in one language, I might say, I'm going to record a podcast in the studio right now, but in another language it might come out as, I'm going a podcast in the studio right now to record. It starts to sound like Yoda.
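Here is a tiny sketch of why word-for-word translation falls over. The target language in this example is German, where the infinitive normally moves to the end of the clause ("Ich will einen Podcast aufnehmen"); a naive dictionary lookup that keeps English word order produces something garbled. The lexicon is drastically simplified for the example.

```python
# Naive English -> German word-for-word lookup that keeps the English word order.
LEXICON = {"i": "ich", "want": "will", "to": "zu", "record": "aufnehmen",
           "a": "einen", "podcast": "Podcast"}

sentence = "i want to record a podcast"
print(" ".join(LEXICON[word] for word in sentence.split()))
# -> "ich will zu aufnehmen einen Podcast"
# Wrong word order (plus a spurious "zu"); natural German would be
# "Ich will einen Podcast aufnehmen."
```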
Speaker 1: So in one language, I might say "I'm going to record a podcast in the studio right now," but in another language it might come out as "I'm going a podcast in the studio right now to record." It starts to sound like Yoda. There was initial excitement around machine translation, but once computer scientists and linguists began to see the scope of this challenge, their excitement faded a bit. Also, there was a lot of other stuff going on in the nineteen sixties and seventies that was demanding a lot of attention, such as the space race. So for a while, this branch of computer science was given less attention than other branches, and by less attention, I really mean funding. Now, when we come back, we'll talk a bit more about the advances that were necessary to support natural language processing, and we'll move on to how this would be another important component in smart speakers. But first, let's take another quick break. Okay, so early enthusiasm for natural language processing created a bit of a hype cycle that ultimately crashed into the telephone pole of unmet expectations. That was a really bad metaphor. Anyway, natural language processing went through something similar to what we saw with virtual reality in the nineteen nineties. You know, people saw what was actually achievable, and then they compared that to what they thought they were going to get, and those two things didn't match up at all, and that really pulled the rug out from under funding for natural language processing, which meant, of course, that progress slowed way down. It kept going, but it was definitely on the back burner for a lot of projects. When interest renewed in the nineteen eighties, there had been a shift in thinking around natural language processing. Computer scientists were starting to look at statistical approaches similar to what was going on with speech recognition, building up probabilistic models in which a computer can start making what amounts to educated guesses at the meaning of a command or a phrase.
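To make that statistical idea a little more concrete, here is a minimal sketch, not drawn from the episode and not any production system: it counts how often words have shown up under each meaning in a handful of made-up labeled phrases, then uses those counts, Naive Bayes-style, to make an educated guess about a new phrase. The phrases, the categories, and the smoothing choice are all assumptions for illustration.

```python
# Toy "educated guess" at the meaning of a phrase from word counts.
from collections import Counter

# Hypothetical phrases labeled with the meaning they turned out to have.
labeled_phrases = [
    ("turn on the lights", "lighting"),
    ("turn off the lamp", "lighting"),
    ("what will the weather be", "weather"),
    ("is it going to rain", "weather"),
]

# Count how often each word appears under each meaning.
word_counts = {}
label_counts = Counter()
for phrase, label in labeled_phrases:
    label_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(phrase.split())

def guess_meaning(phrase):
    """Score each meaning by (roughly) how probable it makes the words we heard."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = label_counts[label] / len(labeled_phrases)  # prior for this meaning
        for word in phrase.split():
            # Add-one smoothing so an unseen word doesn't zero out a meaning entirely.
            score *= (counts[word] + 1) / (total + len(counts) + 1)
        scores[label] = score
    return max(scores, key=scores.get)

print(guess_meaning("turn the lights off"))    # expected: "lighting"
print(guess_meaning("will it rain tomorrow"))  # expected: "weather"
```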
Speaker 1: Machine learning became an important component on the back end of these systems, and later artificial neural networks became an important part as well. A neural network processes information in a way that's sort of analogous to how our brains do it. You have nodes, or neurons, that connect to other nodes, and each node affects incoming data in a certain way, performing some sort of operation on it, and the degree to which they do that in one way versus another is called the weight of that node. Computer scientists apply weights across the nodes in an effort to get a specific result in order to train these models. So you might feed a specific command into such a system, and you let it go through the computational process from the beginning of the neural network through to the end, and then you look at the result. If the result is correct, well, that just means the system is already working as you intended, which honestly is not likely to happen early on. But if it's not correct, then you start adjusting the weights on those nodes in order to affect the outcome. I almost think of it as like Plinko or pachinko, where you've got the little coin and you drop it down and it bounces on all the pegs, and sometimes you might think, all right, this time it's going to go right for that center slot, but it doesn't, and you think, well, maybe if I remove some of these pegs or shift these pegs over a little bit, I can drop it in that same spot and hit the center. It's kind of like that, except you're talking about data, not physical moving parts. So you have to do this a lot, like up to millions of times, in order to try and train a system so that it responds appropriately to commands. And once it's trained, you can then test new commands on the system to see if it can parse them and respond appropriately.
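Here is a minimal sketch of that train-check-adjust loop, again invented for illustration rather than taken from any real assistant: a single layer of weights maps bag-of-words features for a few toy commands to intents, the output is compared with the intended answer, and the weights are nudged over many repetitions until the guesses come out right. The example commands and intents are assumptions.

```python
# Tiny train/check/adjust loop: feed examples through, measure the error,
# nudge the weights, repeat many times.
import numpy as np

# Hypothetical toy "commands" and the intents they should map to.
examples = [
    ("play some music", "music"),
    ("play a song", "music"),
    ("what is the weather", "weather"),
    ("will it rain today", "weather"),
]
intents = ["music", "weather"]

# Bag-of-words vocabulary built from the toy data.
vocab = sorted({w for text, _ in examples for w in text.split()})

def featurize(text):
    words = text.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

X = np.stack([featurize(t) for t, _ in examples])                      # inputs
Y = np.array([[1.0 if i == intents.index(y) else 0.0
               for i in range(len(intents))] for _, y in examples])    # intended answers

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), len(intents)))  # the weights to be trained

for step in range(2000):                 # many, many repetitions
    scores = X @ W                       # run the examples through the layer
    error = scores - Y                   # how far off was the output?
    W -= 0.05 * (X.T @ error) / len(X)   # adjust the weights to shrink the error

# After training, try a phrase the system hasn't seen verbatim.
test = featurize("play music please")
print(intents[int(np.argmax(test @ W))])  # expected to print "music"
```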
Speaker 1: And in this way, the system quote unquote learns over time how to respond to commands. And then we have another component that's important with smart speakers, and that's speech generation. So it's one thing to have a machine either broadcast or play back a recording of speech. It's another thing for a machine to generate brand new speech. In computer science, we call it speech synthesis. Now, this is the really old technology I was alluding to at the beginning of this episode, speech synthesis. If you want to be really, you know, kind of technical about it, it actually predates every other technology I've mentioned up to this point, at least in its most rudimentary implementations. You have to go way back to the eighteenth century, the seventeen seventies, when a Russian smarty pants named Christian Kratzenstein was building a device that used acoustic resonators, these reeds that would vibrate, in an attempt to replicate basic vowel sounds. Now, even with such a working device, it would be really difficult to communicate anything meaningful unless you were, I guess, speaking whale like Dory in Finding Nemo. But it would be an early example of how people tried to create mechanical systems that could replicate speech or elements of speech. Another inventor, named Wolfgang von Kempelen, built an acoustic-mechanical speech machine, and that used reeds and tubes and a pressure chamber, and it was all meant to replicate various speech sounds. He had other elements to create sounds like plosives, those hard sounds that I mentioned earlier in the episode. So he had all these different elements that, working together, could create parts of the sounds that we humans make when we speak. He also built a supposed chess-playing machine, and it turned out that the chess-playing part was a hoax. So unfortunately, because that device was a hoax, a lot of people dismissed his other work, which was legitimate.
Speaker 1: So by fudging on one thing, he kind of cast doubt on everything he had ever done. Skipping ahead quite a bit, we get to Homer Dudley, which is a fantastic name. He unveiled the Voder, or Voice Operating Demonstrator, device at the New York World's Fair in nineteen thirty nine. It consisted of a complex series of controls, and it sort of reminds me of something like a musical instrument, kind of like a synthesizer, but with extra controlling units. Like, there was a wrist element, there was a pedal. There was a lot of stuff that made it very complex, and with a lot of practice, you could create specific sounds from this synthesizer. You could even create words or full sentences, though from what I understand, it was incredibly challenging to do. It had a very steep learning curve, but it demonstrated the possibility of electronic synthesized speech. Now, there was a lot of work done in this field by lots of different talented scientists and engineers, and someday I'll have to do a full episode on the history of speech synthesis. It's really fascinating, but it's far too big a topic to cover in its entirety in this episode. By the late nineteen sixties we had our first text-to-speech system, and by the late nineteen seventies and early nineteen eighties the state of the art had progressed quite a bit, and we were starting to get to a point where we could create very understandable computer voices. They weren't natural, they didn't sound like people, but you could understand what they were saying. And finally, something else that would enable smart speakers and virtual assistants was the pairing of improved network connectivity and cloud computing. That removes the need for the device that you're interacting with to do all the processing on its own. So, if you think about the history of computing, we used to have mainframes with dumb terminals attached to the mainframe, so the terminal wasn't doing any computing.
Speaker 1: It was just tapping into the mainframe computer, which was sending results back to the terminal. Then you get to the era of personal computers, where you had a device sitting on your desk that did all the computing and it didn't connect to anything else. Then we get up to networking and the Internet, where we suddenly had the capability of having really powerful computers, or grids of computers, that were able to take on the processing. You just send the request out to the Internet and you get the response back. That's the basis of cloud computing. So your command or message or whatever relays back to servers on the cloud that then process it and send the proper response to whatever device you're interacting with, and then you get the result. In the case of the smart speaker, it might be playing a specific song or giving you a weather report or whatever it might be. Now, if the speakers were doing some of that computation themselves, that would be an example of edge computing, where the processing takes place, at least in part, at the edge of a network, at those end points. But for now, most of the implementations we see send data back to the cloud to get the right response, so you have to have a persistent Internet connection. These devices are not useful without that connection. You do have some smart speakers that can connect to another device like a smartphone via Bluetooth, so you could do things that way, but without those connections, the smart speaker turns into, you know, just a dumb speaker, or sometimes just a paperweight.
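As a rough sketch of that thin-client arrangement: the speaker-side code does almost nothing except hand the request off and play back whatever comes back. The service logic and responses here are made up, and the network hop is simulated with a local function so the example is self-contained; in a real product this would be a call to the vendor's servers.

```python
# Thin client: capture the request, ship it off, play back the answer.

def cloud_service(request_text: str) -> str:
    """Stand-in for the heavy lifting done on remote servers (hypothetical logic)."""
    if "weather" in request_text:
        return "Expect sunshine and a high of 75."
    if "play" in request_text:
        return "Playing your requested song."
    return "Sorry, I didn't understand that."

def smart_speaker(request_text: str) -> None:
    # 1. The device captures the request (already transcribed here for brevity).
    # 2. It sends the request out over the network...
    response = cloud_service(request_text)  # pretend round trip to the cloud
    # 3. ...and simply plays back whatever the cloud returned.
    print(f"Speaker says: {response}")

smart_speaker("what's the weather today")
smart_speaker("play my favorite song")
```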
Speaker 1: Now, this collection of technologies and disciplines is what enabled Apple to introduce Siri in two thousand and eleven, and Siri is a virtual assistant. Siri's origins actually trace back to the Stanford Research Institute and a group of guys, Gruber, Adam Cheyer, and Dag Kittlaus, who had been working on the concept since the nineteen nineties. When Apple launched the iPhone in two thousand seven, they saw the iPhone as a potential platform for this virtual assistant that they had been building, and they thought, well, this is perfect, because the iPhone has a microphone, so the assistant can respond to voice commands, and it has a speaker, so it could communicate back to the user. It could do all sorts of stuff. We can tap into the interoperability of apps on the device. It's a perfect platform for us to deploy this. So they developed an app once the opportunity arose, because apps were not available for development immediately when Apple launched the iPhone, and once they did launch that app, within a month, less than a month, Steve Jobs was on the phone calling them up and offering to buy the technology, which of course they would agree to, and it would become an integrated component in Apple's iPhone line afterward. And that's where voice assistants kind of lived for a few years. They mostly lived on smartphones like the iPhone. But in November two thousand fourteen, Amazon introduced the Amazon Echo smart speaker, which was originally only available for Prime members, and it had its own virtual assistant named Alexa, and thus the smart speaker era officially began. Now, there are plenty of other smart speakers on the market these days. There are products from Google like Google Home. There are Sonos speakers that can connect to services like Amazon's Alexa or Google's Assistant, and we're probably going to see a ton more, both from companies that piggyback onto services from the big providers like Google and Amazon, and maybe some that are trying to make a go of it with their own branded virtual assistants and services. Smart speakers respond to commands after they quote unquote hear a wake up word or phrase.
Speaker 1: Now, I'm gonna make up a wake up phrase right now so that I don't set off anyone's smart speaker or smart watch or smartphone or smart car or whatever it might be. So this is just a fictional example of a wake up phrase. Let's say I have a smart speaker, and the wake up phrase for my smart speaker happens to be "Hey there, Genie." Well, my smart speaker has a microphone, so it can detect when I say that, but really it's constantly detecting all sounds in its environment. The microphone is always active. It has to be in order to be able to pick up on when I say the wake up phrase. So the microphone is always active on most smart speakers. There are some where you can program it so that it will only activate if you first touch the speaker and that wakes it up. There are some you can do that with, but for the most part, they're always listening. While the speaker can quote unquote hear everything, it's not listening to everything. In other words, it's not monitoring the specific things being said. At least that's what we've been told. And honestly, that makes a ton of sense from an operational standpoint. And the reason I say that is that the sheer amount of information that would be flooding in from all the microphones on all the smart devices from any one provider that happened to be deployed all over the world, that would be an astounding amount of data. And sifting through all that data to find stuff that's useful would take an enormous amount of effort and time and processing power. So while you could have all the microphones listening in all over the place, finding out who to listen to at what time would be a lot trickier and probably not worth the effort it would take to pull something like that off. So what these speakers and other devices are actually doing is looking for a signal that matches the one that represents the wake phrase.
Speaker 1: So when I say "Hey there, Genie," the microphone picks up my voice, which the mic then translates into an electrical signal, which gets digitized and compared against the digital fingerprint of the predesignated wake up phrase. And in this case, the two phrases match. It's like a fingerprint matching something that was left at a site. So that turns the speaker into an active listener rather than a passive one. It's ready to accept a command or a question and to respond to me. But if I didn't say "Hey there, Genie," then the speaker would remain in passive mode, because it wouldn't have a digital fingerprint that matches the one of the wake up phrase. Everything stays at the local level, and none of my sweet secret speech gets transmitted across the internet. It's all staying right there. At least that's what we've been told, and again, I don't have any reason to disbelieve this, but it is something to keep in mind. You are talking about devices that have microphones. Of course, if you have a smartphone, you've already got one of those, or a cell phone in general. You've got a device with a microphone on it near you pretty much all the time.
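Here is a minimal sketch of that local wake-phrase check, under the assumption that the device keeps a stored numeric fingerprint and compares each chunk of incoming audio against it with a simple similarity score. Real devices use trained acoustic models rather than a single vector, but the shape of the logic is the same: stay passive and keep everything local until the match is close enough.

```python
# Local wake-phrase check: compare incoming audio features against a stored
# fingerprint; only a close match switches the device to active listening.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stored fingerprint of the wake phrase "Hey there, Genie."
wake_fingerprint = rng.normal(size=128)

def similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def passive_listen(audio_features, threshold=0.8):
    """Runs locally on the device; nothing leaves the speaker here."""
    if similarity(audio_features, wake_fingerprint) >= threshold:
        print("Wake phrase detected: switching to active listening.")
        return True
    return False  # no match: the audio chunk is simply discarded

# Background chatter: random features, very unlikely to match.
print(passive_listen(rng.normal(size=128)))                            # False
# Someone says the wake phrase: features close to the stored fingerprint.
print(passive_listen(wake_fingerprint + 0.05 * rng.normal(size=128)))  # True
```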
Speaker 1: Now, once I do make a request with my smart speaker, the speaker then sends that request up to the cloud, where it gets processed and analyzed, and then a proper response is returned to me, whether that is playing a song, or giving me information I've asked for, or maybe even interacting with some other smart device in my home, such as adjusting the brightness of the smart lights in my house. Now, if the system is not sure about whatever it was I just said, it will probably return an error phrase. So maybe I'm too far away from the speaker, so it couldn't quote unquote hear me really well. Or maybe I've got a mouthful of peanut butter or something, as I'm wont to do. Then I'm going to get something like "I'm sorry, I don't know how to do that," or "I'm sorry, I didn't understand you," and then I'd have to repeat it. Now, smart speakers are pretty cool. However, they do represent another piece of technology that you have to network to other devices, including your own home network, and as such, that means they represent a potential vulnerability in a network. It doesn't mean they're automatically vulnerable, but it means that every time you are connecting something to your network, you're creating another potential attack vector for a hacker. Now, if everything is super strong, it doesn't really effectively change your safety in any meaningful way. But if one of those things that you connect to your network is less strong than the others, you're looking at a weakest-link situation, where a hacker with the right know-how and tools could potentially target that part of your network to get entry into everything else. And when you're talking about a smart speaker, you're talking about a device that has an active microphone on it. So potentially, if someone were able to compromise a smart speaker, they would be able to listen in on anything that was within range of that smart speaker's microphone. So that's why you have to at least be cognizant of that. Do your research, make sure the devices you're connecting to your network are rated well from a security standpoint, and when you're setting things up and you have to create passwords, create strong passwords that are not used anywhere else. The harder you make things, the more likely hackers will just pass you by, not because you're too tough to crack. Never get it into your head that you're too strong to be hacked. But rather, if there's someone who's weaker, then the hackers are going to go after that person instead. So just don't be the weak person.
Speaker 1: Practice really good security behaviors, and you're more likely to discourage attackers, and they'll go on to someone else, especially if you're talking about newbies who don't really know their way around and are just using tools that other people have designed. They get discouraged very quickly. They'll move on to someone else, because there's always another potential target. I'm curious about you guys, whether or not you have any smart speakers in your life, and if you find them useful. I find mine pretty useful. I use it for a very narrow range of things. I definitely don't use it to its full potential. I know that because once in a blue moon I'll just try something, and I'm amazed at what happens when I get a response. But for the most part, I'm asking about what I can feed my dog, or whether or not it can turn on the lights, and that's about it. Or occasionally playing a song. But I'm curious what you guys are using them for. Reach out to me on social networks. I'm on Facebook and I'm on Twitter, and the handle for both of those is Tech Stuff HSW. Also use those handles if you have suggestions for future episodes. If you've got, you know, an idea for either a company or a technology or a theme in tech you'd really like me to tackle, let me know there, and I'll talk to you again really soon. Tech Stuff is a production of I Heart Radio's How Stuff Works. For more podcasts from I Heart Radio, visit the I Heart Radio app, Apple Podcasts, or wherever you listen to your favorite shows.