Speaker 1: Get in touch with technology with TechStuff from HowStuffWorks.com. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with HowStuffWorks and love all things tech, and listener Nate wrote in and asked that I do an episode about personal digital assistants, or virtual assistants, or voice helpers. This is hard because we don't really have a great term for these things, but I'm talking about applications like, and I apologize ahead of time if I activate your technology, Siri, Alexa, and Google Assistant. These are voice helpers that can respond to voice commands, as well as other means of input, in a way that makes them seem almost intelligent. Now, as it turns out, that's actually a pretty complicated history, because it requires a discussion about a lot of different connected ideas that were all independent and then ultimately converged.
Speaker 1: We're talking about stuff like speech recognition, natural language processing, technology that was meant to improve accessibility, and a whole lot more. So it makes talking about these services somewhat challenging, because it's not like there was just one pathway that led to their development. They exist largely because of these independent but converging areas of innovation. Much of the work that made these services possible took place in events that were concurrent with each other, with different organizations all working toward similar but disconnected goals. So going by a strict timeline approach would be really hard, if not impossible, just because you'd have to jump around a lot to talk about different advances. So today I'm going to focus solely on speech recognition. This in itself is a huge topic, so it's more than enough for a single episode of TechStuff. In the next episode, I'm going to dive more into natural language processing, which has some crossover with speech recognition, but it is its own thing. And then after that we'll take a look at how voice assistants like Siri and Alexa popped up over time.
Speaker 1: First, the idea of creating a machine that could interpret speech is older than computers. If you listen to my episodes about the history of the turntable, you'll remember the phonautograph, designed by Édouard-Léon Scott de Martinville in eighteen fifty-seven. The gadget had a small brush that was attached to a parchment diaphragm, and the bristles on the brush rested against a sheet of paper that itself was wrapped around a cylinder. On top of the sheet of paper was a layer of soot. So to operate the device, you would turn the cylinder, the brush would drag across the soot on the paper, and you would shout at the diaphragm. The vibrations of sound would cause the parchment diaphragm to vibrate. That would make the brush vibrate and move against the paper, and that would create a pattern corresponding to the vibrations that were made by the parchment diaphragm. The phonautograph was supposed to aid in the study of language and sound. The machine itself was not intended to interpret sound, but rather to facilitate interpretation.
Speaker 1: A human would take a look at these tracings, essentially, and be able to analyze sound, or at least that was the intent. It didn't quite work out that way, but that was the concept behind it. Now, let's set the Wayback Machine to the nineteen fifties. In nineteen fifty-two, Bell Labs created the Audrey system, which was not a mean green mother from outer space, but rather the first documented speech recognizer system. It was an analog system, not a digital one. It was its own dedicated, massive circuit, and it even had vacuum tubes in this thing, because this was before transistors were in widespread use. It could recognize strings of digits spoken by its creator with about ninety percent accuracy. If anyone else tried it, the accuracy dropped a bit. This already shows that speech recognition is tough, because not everyone says things exactly the same way. I know that's not a news flash, but it is important for the concept of speech recognition. You also had to pause between strings of numbers. You couldn't just rattle them off conversationally. You had to put pauses in there. But it was also an enormous piece of machinery.
Speaker 1: It took up a six-foot-high relay rack and it consumed a lot of electricity. Then Big Blue, also known as IBM, had scientists and engineers working on the possibility of designing technologies that could recognize speech. They were working around the same time that Bell Labs was. Computer scientist Nathaniel Rochester, who designed an IBM computer called the seven oh one and also wrote the first assembler, headed up a group of engineers at IBM who were researching pattern recognition and information theory. That work, which was early research into fundamental building blocks for artificial intelligence, would also become important for speech recognition. In the late nineteen fifties, William C. Dersch, another IBM computer scientist, developed a computer system as part of IBM's Advanced Systems Development Division laboratory, and it incorporated basic elements of speech recognition. He unveiled the device, called the IBM Shoebox, in nineteen sixty-two at the World's Fair.
Speaker 1: Using a microphone, you could speak basic digits from zero to nine, plus six additional control words like "plus" or "minus," and the Shoebox would recognize the words and perform the calculation. So essentially this was a basic voice-controlled calculator. While the application was limited, this showed off a remarkable achievement. Finding a way to program a machine to accept speech as a command is a non-trivial problem. Throughout the nineteen sixties, computer scientists took a brute-force sort of approach to solving speech recognition, which could work in very narrow applications, such as the calculator approach, but was by its nature difficult to scale up. Even in the early nineteen seventies, the Speech Understanding Research project from ARPA, the same organization that would help bring the Internet into being, produced a brute-force, template-based system called Harpy. While it was reliant upon brute force, Harpy, which came out of Carnegie Mellon research, could recognize about one thousand words. Harpy also made use of a process called beam search.
Speaker 1: This is a search strategy in which a search algorithm can consider multiple possible hits at a single time, rather than looking through a large data set for one specific, perfect hit. Then the algorithm determines the probability of each of the hits being the right word. The number of potential hits is determined by a value called the beam width, a setting that the speech recognition application designer can set. Beam search is a much more efficient way to suss out speech, and it's frequently used today, not just in speech recognition but also in natural language processing and other sequential models. But it gets super technical, so we're going to leave it at that kind of high-level approach. Still, these systems mapped all words to a template, one template per word. They didn't break words up into sounds, but looked for a match against a database of established vocabulary words, which meant that if you did not pronounce the word the same way as it was represented in the database, you might not get a hit.
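The beam search strategy mentioned above can be sketched in a few lines. This is a minimal illustration, not Harpy's actual implementation: the phoneme labels and probabilities are made up for the example, and a real recognizer would score acoustic features rather than a hand-written table.

```python
import math

def beam_search(step_scores, beam_width=2):
    """Keep only the `beam_width` most probable partial hypotheses
    at each step, instead of exploring every possible path."""
    # Each hypothesis is (sequence_of_tokens, total_log_probability).
    beams = [([], 0.0)]
    for candidates in step_scores:
        expanded = []
        for seq, logp in beams:
            for token, prob in candidates.items():
                expanded.append((seq + [token], logp + math.log(prob)))
        # Prune: sort by score and keep only the top `beam_width`.
        expanded.sort(key=lambda h: h[1], reverse=True)
        beams = expanded[:beam_width]
    return beams

# Toy example: per-step candidate phonemes with invented probabilities.
steps = [
    {"r": 0.6, "w": 0.4},
    {"eh": 0.7, "ah": 0.3},
    {"k": 0.9, "g": 0.1},
]
best = beam_search(steps, beam_width=2)
print(best[0][0])  # the most probable token sequence
```

The beam width trades accuracy for speed: a wider beam keeps more hypotheses alive and is less likely to discard the correct one early, at the cost of more work per step.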
Speaker 1: You would have to get it close enough to that template to be able to get a hit. This is a big problem. People speak with accents or dialects, or they may have difficulty replicating certain sounds. The brute-force approach often meant you'd have to say the same word a few times, with clear enunciation and long pauses, to get a hit. And again, it just didn't scale very well. It wasn't until the late nineteen seventies that computer scientists were able to find a different approach that would power more modern speech recognition systems. So let's go through some of the steps that are necessary, from the basic physical attributes of speech to the processing of the information. First, speech, like all sound, ultimately is a physical phenomenon. It is vibration. We produce these vibrations with our vocal cords and our lips, teeth, and tongue, according to the rules of whatever language we are speaking. These vibrations travel through a medium such as the air, and then they get picked up by something else, like someone else's ears, or a microphone, or whatever.
Speaker 1: But at this stage we're talking about physical vibrations, an analog form of input. Computers do not directly interpret physical vibrations. Computers process digital information, and speech is an analog phenomenon. So the first thing we need for a computer to recognize speech is some sort of analog-to-digital converter that can accept the analog information and then translate it into digital information. The ADC would typically sample speech by taking precise measurements of the sound at frequent intervals, thousands of times per second, so you can almost think of it like snapshots, like pictures. The ADC is measuring quantifiable elements of the sound every time it takes a sample. That might include stuff like amplitude and frequency, or volume and pitch if you're talking about how we perceive sound. There's usually some sort of noise filter incorporated into this step as well, to help remove any unwanted sounds from the signal. The system has to be able to recognize which signals represent a command and which ones are not important.
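The sampling step described above can be illustrated with a toy analog-to-digital converter. This is a minimal sketch: the tone frequency, sample rate, and bit depth are just illustrative values, and a real ADC is hardware measuring a live signal rather than computing a sine function.

```python
import math

def sample_tone(freq_hz, duration_s, sample_rate=8000, bits=16):
    """Take periodic 'snapshots' of an analog tone and quantize each
    one to a signed integer -- a toy analog-to-digital converter."""
    max_amp = 2 ** (bits - 1) - 1          # e.g. 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                # time of this snapshot
        analog = math.sin(2 * math.pi * freq_hz * t)   # continuous value
        samples.append(round(analog * max_amp))        # quantized value
    return samples

# 10 ms of a 440 Hz tone sampled 8000 times per second -> 80 samples.
pcm = sample_tone(440, 0.010)
print(len(pcm), min(pcm), max(pcm))
```

The sample rate matters because it bounds the highest frequency that survives digitization; telephone-quality speech systems historically used rates around 8 kHz for exactly this reason.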
Speaker 1: This is why I can do stuff like send vocal commands to a voice assistant even if there's another conversation going on nearby, or if I have the radio or television on. Now, I have a lot more to say about the technology that makes speech recognition possible, but before I get into that, let's take a quick break to thank our sponsor.

Speaker 1: So, a speech recognition system typically has a database of sound samples that allows the recognition system to compare incoming signals against that database. The speech recognition system might have to put the incoming sound through a process called temporal alignment, which is a fancy way of saying the system might have to slow down or speed up the incoming sound. You can think of this as like making a recording and then almost immediately playing the recording back. Obviously, the speech recognition system can't change the speed at which you're speaking, though you might get a feature that prompts you to slow down or speed up. The message may say, "Could you say that again, but slower?" That kind of thing.
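Temporal alignment of this kind is classically done with dynamic time warping, which stretches or compresses one sequence so it lines up with another. Here's a minimal sketch, using one-dimensional made-up values in place of real acoustic features:

```python
def dtw_distance(a, b):
    """Dynamic time warping: align two sequences that may have been
    spoken at different speeds and return the total alignment cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local mismatch
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

template = [1, 2, 3, 2, 1]      # stored sample of a word
fast     = [1, 3, 1]            # same shape, spoken faster
other    = [5, 5, 5, 5, 5]      # a different word entirely
print(dtw_distance(template, fast), dtw_distance(template, other))
```

Even though `fast` has fewer samples than `template`, its warped distance is small, while the unrelated sequence scores much worse, which is exactly what lets a template matcher tolerate differences in speaking speed.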
Speaker 1: If you happen to be someone from the northeastern United States, for example, you may frequently get these messages saying, "slow the heck down." Temporal alignment allows the speech recognition system to look for matches between the incoming sound and the samples in the system's memory. The system must also divide up the sounds in the incoming signal into segments that represent specific sounds in the native language, such as a "th" sound or a hard "t" sound. It looks for matches in its memory that represent phonemes, and a phoneme is a basic sound native to a particular language, whichever one you're looking at. So, for example, the English language has about forty phonemes. Linguists actually get into some pretty vicious fights about exactly how many phonemes the English language has, but it's around forty. Some people argue that there are more phonemes; some say that some of the supposed additional phonemes are in fact repeats of existing ones. Other languages will have different numbers of phonemes. Some may have far more than English, some may have fewer.
Speaker 1: The system then has to analyze the phonemes in sequence. So it's looking at these little markers that represent different sounds, and this is how the system can look for matches between a series of phonemes and the words that it can recognize; it can try to build words from these sounds. This is way harder than I'm making it sound. Speech recognition systems have complicated statistical models to help them determine what a word might be. Even a simple speech recognition system will have a complex statistical model to recognize individual words. More sophisticated systems might also look at contextual information surrounding the phonemes. In other words, a really sophisticated system isn't just looking for a match in phonemes to suss out what a single word is in a sentence. It's looking at the phonemes that came before and after to determine what those words were, and to help increase the confidence level overall. So let me give an example. Let's say I've activated one of these voice assistants, and I've used whatever voice command activates it.
Speaker 1: I'm not going to do it here, because some of you might be listening on those devices. And then I say, "Turn the volume up thirty percent." The speech recognition system begins to parse what I said by analyzing those sounds phoneme by phoneme, identifying them, analyzing them, trying to group them together to form words. And when it thinks it's found a word, it assigns a certain probability to that. And when it starts to analyze the phonemes that make up the word "volume," it's also looking at the words that came before, "turn the," and it's looking at the word that came after, "up." That boosts the system's confidence overall that the keyword "volume" is in fact "volume," and then it does what I told it to do. When I talk about confidence, I don't mean the system feels good about itself. I'm talking about probabilities. These systems largely work in the realm of probabilities. What is the probability that I said "volume" rather than some other word? For a speech recognition system to work, it needs to be able to assign a confidence level to words.
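The context-boosting idea just described can be sketched with a toy language model. All of the acoustic and bigram probabilities below are invented for illustration; a real system would estimate them from training data and would consider far more hypotheses.

```python
def score_hypothesis(words, acoustic, bigram):
    """Combine per-word acoustic scores with contextual (bigram)
    probabilities, so ambiguous words are resolved by their neighbors."""
    p = 1.0
    prev = "<s>"                      # sentence-start marker
    for w in words:
        p *= acoustic.get(w, 0.0) * bigram.get((prev, w), 0.01)
        prev = w
    return p

# Hypothetical scores: "volume" and "volley" sound alike here, but the
# surrounding words "turn the ... up" strongly favor "volume".
acoustic = {"turn": 0.9, "the": 0.9, "volume": 0.5, "volley": 0.5, "up": 0.9}
bigram = {("<s>", "turn"): 0.5, ("turn", "the"): 0.6,
          ("the", "volume"): 0.4, ("the", "volley"): 0.05,
          ("volume", "up"): 0.5, ("volley", "up"): 0.05}

a = score_hypothesis(["turn", "the", "volume", "up"], acoustic, bigram)
b = score_hypothesis(["turn", "the", "volley", "up"], acoustic, bigram)
print(a > b)  # context picks "volume" despite identical acoustic scores
```

Note that the two candidate words have the same acoustic score; only the contextual probabilities separate them, which is the whole point of looking at the words before and after.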
Speaker 1: The higher the level, the more "certain," quote unquote, the system is that it got things correct. Typically, computer engineers will design systems that will only execute a command or return a result of some sort if the system has reached a certain threshold of confidence, and if it hasn't, you won't get a result. So, for example, and this isn't about speech recognition exactly, but it illustrates my point: IBM's Watson computer would not offer up an answer on Jeopardy unless it met a certain threshold of confidence in an answer, and I think it was about eighty percent. So if it were eighty percent certain that it had the right answer, it would buzz in. But if it was less than eighty percent sure, it would not put forth that answer. There are two broad types of statistical models in speech recognition systems today. There are others that could be used, but there are two broad ones that tend to be used these days. They are the hidden Markov model and neural networks. The hidden Markov model, by the way, is overwhelmingly the most popular method of using a statistical model for speech recognition.
Speaker 1: It is the prevalent approach, and it works sort of how I just described. It looks at each phoneme and starts to build out a pathway. If you think of this as like an actual physical path that you're following, you would start off with the first phoneme, which represents the beginning of the path, and that phoneme might eliminate other possible phonemes right away. By that, I mean it might be a sound that doesn't combine with certain other sounds within that language. There might be a phoneme that does not combine with other specific phonemes. So imagine you have a path, and originally it splits into tons of other pathways, but a couple of those pathways are blocked off with signs that say the pathway is closed. It's closed because those pathways represent phonemes that would never be paired with the initial one. You just don't get that sound combination in English. The closed paths would therefore be off limits, and only the open paths would be possibilities. Then the hidden Markov model would look at the next phoneme, the next step along this pathway.
Speaker 1: That phoneme determines which of the viable path options is actually the one to follow. All the other options would be discarded, and so on. It would go all the way down the list of phonemes until the model arrives at a conclusion of the most likely word that was spoken. It assigns a probability score to each phoneme, thinking, "I'm pretty sure the sound that I heard," quote unquote, "was this." That helps the system make an educated guess as to what word was actually spoken. Now, I've talked a lot about neural networks in the past. I'm just going to give them a quick, cursory covering here, because they really aren't the dominant statistical model in speech recognition. Neural networks have nodes, computer nodes or algorithms that act like a neuron, like a brain cell, and they execute operations on data.
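The hidden Markov model's path-following, with closed paths pruned away, is essentially what the Viterbi decoding algorithm does. Here's a minimal sketch over a made-up three-phoneme word model; the states, transition probabilities, and emission probabilities are all invented for illustration, and a zero transition probability plays the role of a "closed pathway."

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Hidden Markov model decoding: for each state, keep only the
    most probable path so far; transitions with probability zero act
    like closed pathways and are never extended."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        nxt = {}
        for s in states:
            # Pick the predecessor state that maximizes path probability.
            p, path = max(
                (best[prev][0] * trans_p[prev].get(s, 0.0), best[prev][1])
                for prev in states
            )
            nxt[s] = (p * emit_p[s].get(obs, 0.0), path + [s])
        best = nxt
    return max(best.values())   # (probability, most likely state path)

# Hypothetical model for a word like "cat": "k" can be followed by
# "ae" but by nothing else, so every other pathway stays closed.
states = ["k", "ae", "t"]
start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}
trans_p = {"k": {"ae": 1.0}, "ae": {"t": 1.0}, "t": {}}
emit_p = {"k":  {"o1": 0.9, "o2": 0.1, "o3": 0.1},
          "ae": {"o1": 0.1, "o2": 0.9, "o3": 0.1},
          "t":  {"o1": 0.1, "o2": 0.1, "o3": 0.9}}

prob, path = viterbi(["o1", "o2", "o3"], states, start_p, trans_p, emit_p)
print(path)  # the most likely phoneme sequence for these observations
```

Because only the best path into each state survives each step, the amount of work grows with the number of states rather than the number of possible paths, which is what makes this tractable compared with brute force.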
Speaker 1: The neurons also assign a probability score to that execution of data, which shows the confidence the system has in the result, before they pass it on to another neuron in the network, which then executes another operation on the data, and so on. Ultimately, the network produces an end result of all those operations and judges the probability of whether or not that result is the right one. And again, if it meets a certain threshold, then it's considered the correct answer, or the closest to correct that the system can manage. In any case, speech recognition systems have to be trained, and there are trillions of potential combinations of sounds that could represent different words. In the HowStuffWorks article "How Speech Recognition Works," Ed Grabianowski, who is one of the powerhouses of the site, he's written some of the best articles on HowStuffWorks, gave a great example. He says, take the phrase "recognize speech." The phonemes in that phrase happen to be pretty similar to a totally different phrase, which would be "wreck a nice beach." So you have "recognize speech" or "wreck a nice beach."
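The neuron-by-neuron scoring described above can be sketched as a tiny feed-forward network. This is only an illustration: the weights, biases, and input "acoustic features" are arbitrary made-up numbers, whereas a real acoustic model would learn its weights from hours of training data.

```python
import math

def neuron(inputs, weights, bias):
    """One node: a weighted sum of inputs squashed to a 0..1 score."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))     # sigmoid activation

def tiny_network(features):
    """Two hidden neurons feeding one output neuron; the final value is
    the network's confidence that the features match a target word."""
    h1 = neuron(features, [1.5, -2.0], 0.1)
    h2 = neuron(features, [-1.0, 2.5], -0.3)
    return neuron([h1, h2], [2.0, 2.0], -2.0)

confidence = tiny_network([0.8, 0.9])     # made-up acoustic features
print(confidence)
```

Whatever the inputs, the sigmoid keeps each score between zero and one, which is what lets the output be read as a confidence and compared against a decision threshold.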
Speaker 1: The speech recognition software has to be able to determine the difference, or else the next thing you know, you're going to have Terminators kicking sand in everyone's face, and that's no good. Alexander Waibel, who worked on that system called Harpy that I mentioned earlier, had another couple of examples. He said you might say "euthanasia" and get the result "youth in Asia." Or you might say "give me a new display" and get the result "give me a nudist play." If you've ever used something like Google's transcripts, where if you had Google Voice and you were reading the voicemails, you could get hilarious results because of this. The speech recognition, the speech-to-text feature, could end up spelling out truly ridiculous messages. I would get messages from my mother, and I only wish my mom would leave me messages the way that Google's transcripts thought she was leaving me messages, because they were the most crazy messages ever. But it's mostly because my mom has a Southern accent, and so Google would often misinterpret what she was saying. So these systems have to undergo hours of training.
John Garofolo, a 325 00:19:14,520 --> 00:19:17,800 Speaker 1: computer scientist who was cited in that How Stuff Works article, 326 00:19:18,119 --> 00:19:22,600 Speaker 1: had this to say. These statistical systems need lots of 327 00:19:22,640 --> 00:19:26,960 Speaker 1: exemplary training data to reach their optimal performance, sometimes on 328 00:19:27,000 --> 00:19:30,840 Speaker 1: the order of thousands of hours of human transcribed speech 329 00:19:30,880 --> 00:19:34,960 Speaker 1: and hundreds of megabytes of text. These training data are 330 00:19:35,040 --> 00:19:38,919 Speaker 1: used to create acoustic models of words, word lists, and 331 00:19:39,040 --> 00:19:43,200 Speaker 1: multi word probability networks. There is some art in how 332 00:19:43,240 --> 00:19:47,680 Speaker 1: one selects, compiles, and prepares this training data for digestion 333 00:19:47,800 --> 00:19:50,680 Speaker 1: by the system, and how the system models are tuned 334 00:19:50,840 --> 00:19:54,400 Speaker 1: to a particular application. These details can make the difference 335 00:19:54,400 --> 00:19:57,719 Speaker 1: between a well performing system and a poorly performing system, 336 00:19:57,960 --> 00:20:01,639 Speaker 1: even when using the same basic algorithm. Speech recognition 337 00:20:01,680 --> 00:20:05,240 Speaker 1: also requires a decent amount of processing power. This was 338 00:20:05,280 --> 00:20:08,440 Speaker 1: a limiting factor on speech recognition for a really long time. 339 00:20:08,880 --> 00:20:12,639 Speaker 1: Systems were limited in their capabilities, which meant that for years, 340 00:20:12,680 --> 00:20:16,399 Speaker 1: if you wanted to incorporate speech recognition in a computer system, 341 00:20:16,480 --> 00:20:19,320 Speaker 1: then most of the computer's processing power would have 342 00:20:19,400 --> 00:20:22,119 Speaker 1: to dedicate itself just to parsing speech. 
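One of the things Garofolo mentions, the "multi word probability networks," can be sketched very simply. Here is a toy bigram model in Python: it counts which words follow which in transcribed text and estimates the probability of the next word. The one-line corpus is a stand-in assumption for the thousands of hours of human-transcribed speech he describes.

```python
from collections import Counter, defaultdict

# Tiny stand-in for a large corpus of human-transcribed speech
corpus = "recognize speech with a speech recognition system that can recognize speech".split()

# Count word pairs to build a toy multi-word probability network (a bigram model)
pair_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    pair_counts[prev][nxt] += 1

def next_word_probability(prev, nxt):
    """Estimate P(next word | previous word) from the transcribed training data."""
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][nxt] / total if total else 0.0

# In this toy corpus, "recognize" is always followed by "speech"
p = next_word_probability("recognize", "speech")
```

A recognizer can use these probabilities to break ties between acoustically similar hypotheses: if "recognize" is almost always followed by "speech" in the training data, "wreck a nice beach" gets penalized.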
You couldn't do 343 00:20:22,240 --> 00:20:25,320 Speaker 1: much else on that machine. But since Moore's law held 344 00:20:25,400 --> 00:20:27,400 Speaker 1: up so well for decades, we got to a point 345 00:20:27,400 --> 00:20:30,000 Speaker 1: where the processing capabilities of machines reached a stage 346 00:20:30,200 --> 00:20:33,679 Speaker 1: where this isn't as big a concern. And another development 347 00:20:33,960 --> 00:20:38,680 Speaker 1: that Google really helped pioneer definitely changed things. I'll talk 348 00:20:38,720 --> 00:20:41,160 Speaker 1: more about that in our next section, but first let's 349 00:20:41,160 --> 00:20:51,560 Speaker 1: take another quick break to thank our sponsors. Okay, So, 350 00:20:51,600 --> 00:20:54,800 Speaker 1: advances in speech recognition in the late nineteen seventies paved 351 00:20:54,920 --> 00:20:58,280 Speaker 1: the way for how most systems work these days, though 352 00:20:58,320 --> 00:21:01,480 Speaker 1: of course the models have undergone multiple refinements and 353 00:21:01,520 --> 00:21:05,080 Speaker 1: tweaks over time. The first speech recognition product to ever 354 00:21:05,200 --> 00:21:09,200 Speaker 1: launch for consumers was a program called Dragon Dictate, which 355 00:21:09,240 --> 00:21:13,879 Speaker 1: debuted in the early nineteen nineties. Dragon Dictate, the original version that is, because 356 00:21:13,920 --> 00:21:17,119 Speaker 1: new versions still come out to this day, relied on discrete 357 00:21:17,160 --> 00:21:20,320 Speaker 1: speech recognition. Now, I don't mean you had to be 358 00:21:20,400 --> 00:21:22,879 Speaker 1: secretive and hush hush about it. It's not that kind 359 00:21:22,880 --> 00:21:25,879 Speaker 1: of discreet. Rather, I mean you had to pronounce each 360 00:21:26,040 --> 00:21:28,879 Speaker 1: word clearly, with a pause between words. 
You could not 361 00:21:29,080 --> 00:21:33,040 Speaker 1: speak conversationally, or else the dictation software could not interpret what 362 00:21:33,080 --> 00:21:43,080 Speaker 1: you were saying, so using the software would sound like this. 363 00:21:44,840 --> 00:21:47,440 Speaker 1: It was limited and it was primitive compared to today's 364 00:21:47,520 --> 00:21:51,080 Speaker 1: speech recognition products, but it was a groundbreaking product in 365 00:21:51,119 --> 00:21:54,560 Speaker 1: the early nineties. And it also cost somewhere between six 366 00:21:54,600 --> 00:21:59,320 Speaker 1: thousand and nine thousand dollars. I saw differing accounts, but 367 00:21:59,359 --> 00:22:02,920 Speaker 1: that would be between nine and fourteen grand in today's dollars, 368 00:22:02,960 --> 00:22:07,320 Speaker 1: so a pretty expensive software package. Dragon still produces speech recognition 369 00:22:07,359 --> 00:22:09,760 Speaker 1: technologies to this day, and of course they are much 370 00:22:09,800 --> 00:22:13,000 Speaker 1: more adept at recognizing and transcribing speech than the original 371 00:22:13,080 --> 00:22:16,160 Speaker 1: version was years ago. The software is also less expensive. 372 00:22:16,640 --> 00:22:19,399 Speaker 1: One version I saw retails for less than a hundred dollars, 373 00:22:19,400 --> 00:22:23,000 Speaker 1: so a nice, big, deep price cut. Advancements in model design 374 00:22:23,040 --> 00:22:27,800 Speaker 1: and processor speed meant that speech recognition technology advanced rather quickly. 375 00:22:28,240 --> 00:22:33,280 Speaker 1: Bell South released VAL, v a L, the Voice 376 00:22:33,280 --> 00:22:36,840 Speaker 1: Portal. VAL was an automated interactive system that could respond 377 00:22:36,880 --> 00:22:40,520 Speaker 1: to questions over the phone. 
This was a basic implementation 378 00:22:40,560 --> 00:22:42,719 Speaker 1: that would evolve over time into the systems you may 379 00:22:42,760 --> 00:22:45,800 Speaker 1: have encountered when calling up automated menus, where it's 380 00:22:46,160 --> 00:22:49,800 Speaker 1: press three or say three, that kind of thing, 381 00:22:50,080 --> 00:22:53,600 Speaker 1: or, do you have any questions? You can say anything 382 00:22:53,640 --> 00:22:56,479 Speaker 1: from check my balance to, you know, that kind of stuff. 383 00:22:57,320 --> 00:22:59,920 Speaker 1: In two thousand five, DARPA, which is the same branch 384 00:23:00,000 --> 00:23:02,199 Speaker 1: of the Department of Defense that used to be 385 00:23:02,240 --> 00:23:04,280 Speaker 1: known as ARPA, so in other words, it's the same 386 00:23:04,680 --> 00:23:09,120 Speaker 1: R and D arm that funded the creation of the Internet, 387 00:23:09,480 --> 00:23:11,760 Speaker 1: funded a program called the 388 00:23:11,800 --> 00:23:17,359 Speaker 1: Global Autonomous Language Exploitation Project, or GALE. The purpose of 389 00:23:17,359 --> 00:23:20,639 Speaker 1: this project was to advance research and development into automated 390 00:23:20,640 --> 00:23:24,480 Speaker 1: translation between languages. So not only were computers supposed to 391 00:23:24,520 --> 00:23:27,560 Speaker 1: be able to recognize speech, but also to translate that speech 392 00:23:27,600 --> 00:23:31,320 Speaker 1: from one language into another, which adds another layer of 393 00:23:31,359 --> 00:23:35,120 Speaker 1: complexity on top. 
Right. Well, according to SRI International, 394 00:23:35,680 --> 00:23:39,480 Speaker 1: the system should be able to quote automatically take multilingual 395 00:23:39,600 --> 00:23:44,239 Speaker 1: newscasts, text documents, and other forms of communication and 396 00:23:44,280 --> 00:23:47,679 Speaker 1: make their information available to human queries end quote. So 397 00:23:47,760 --> 00:23:50,840 Speaker 1: it wouldn't just translate the information, which was already even more 398 00:23:50,880 --> 00:23:54,760 Speaker 1: complicated than speech recognition. It could also index that information 399 00:23:54,800 --> 00:23:57,119 Speaker 1: in a meaningful way so you could search for stuff. 400 00:23:57,960 --> 00:24:02,720 Speaker 1: So layer upon layer of complexity for that project. Things 401 00:24:02,720 --> 00:24:06,080 Speaker 1: that helped push speech recognition as well as natural language 402 00:24:06,080 --> 00:24:10,680 Speaker 1: processing to new heights largely came from two competing companies, 403 00:24:11,040 --> 00:24:14,200 Speaker 1: Apple and Google. So let me explain that. In two 404 00:24:14,280 --> 00:24:18,000 Speaker 1: thousand seven, Apple introduced the iPhone, which was the first 405 00:24:18,040 --> 00:24:21,800 Speaker 1: truly successful consumer smartphone, especially here in the United States. 406 00:24:22,119 --> 00:24:25,959 Speaker 1: The smartphone introduced a new era and form of computing. 407 00:24:26,400 --> 00:24:31,720 Speaker 1: It created countless opportunities in numerous areas, including location based computing, 408 00:24:32,160 --> 00:24:36,280 Speaker 1: mobile interactions, and speech recognition. The computer was in a 409 00:24:36,440 --> 00:24:39,800 Speaker 1: phone form factor. Phones are designed for us to talk into, 410 00:24:39,880 --> 00:24:42,520 Speaker 1: so now you could walk around carrying a computer that 411 00:24:42,600 --> 00:24:45,080 Speaker 1: was designed to transmit your voice. 
It's only a matter 412 00:24:45,080 --> 00:24:47,480 Speaker 1: of time before someone figured out a way to leverage 413 00:24:47,520 --> 00:24:51,760 Speaker 1: that for speech recognition. Google, meanwhile, was pioneering an approach 414 00:24:51,800 --> 00:24:55,639 Speaker 1: in which it would perform all the processing functions necessary to 415 00:24:55,720 --> 00:24:58,680 Speaker 1: support speech recognition. It was doing it in the cloud. 416 00:24:59,240 --> 00:25:02,480 Speaker 1: So instead of having the device itself run 417 00:25:02,560 --> 00:25:05,919 Speaker 1: all that processing, the device would have a persistent 418 00:25:05,960 --> 00:25:10,520 Speaker 1: connection to a server on the Internet, and the server 419 00:25:10,640 --> 00:25:13,240 Speaker 1: would do the work. The device would just send the signal 420 00:25:13,280 --> 00:25:16,080 Speaker 1: to the server. The server would process and analyze the 421 00:25:16,119 --> 00:25:18,760 Speaker 1: signal and return the result back to the phone, and 422 00:25:18,800 --> 00:25:21,200 Speaker 1: the phone was just acting as a transmitter. It wasn't 423 00:25:21,280 --> 00:25:24,200 Speaker 1: really having to do any of that analysis itself. So 424 00:25:25,000 --> 00:25:29,600 Speaker 1: in two thousand and eight, Google launched the Google Voice 425 00:25:29,600 --> 00:25:33,040 Speaker 1: Search app for the iPhone that would do all of 426 00:25:33,400 --> 00:25:38,040 Speaker 1: this speech recognition processing. Right, you could speak into 427 00:25:38,080 --> 00:25:40,960 Speaker 1: it and have Google search the terms for you, for 428 00:25:41,080 --> 00:25:43,720 Speaker 1: whatever it was you were saying. 
But again, what was 429 00:25:43,760 --> 00:25:45,920 Speaker 1: really going on was that Google was sending those search 430 00:25:46,080 --> 00:25:50,040 Speaker 1: terms, or rather that speech signal, over to a server 431 00:25:50,160 --> 00:25:53,520 Speaker 1: that Google operated, and then sending the results back down 432 00:25:53,560 --> 00:25:55,840 Speaker 1: to the phone. But to the user it looked like 433 00:25:55,880 --> 00:25:58,240 Speaker 1: the phone itself was doing all the work. The truth 434 00:25:58,400 --> 00:26:02,600 Speaker 1: was it was simply a very basic application of true 435 00:26:02,680 --> 00:26:05,840 Speaker 1: cloud computing, and that created a new method of rolling 436 00:26:05,880 --> 00:26:09,320 Speaker 1: out speech recognition in apps and services. No longer did 437 00:26:09,320 --> 00:26:12,119 Speaker 1: you have to worry about creating a really powerful piece 438 00:26:12,160 --> 00:26:15,280 Speaker 1: of equipment. You could have that be on the back end. 439 00:26:15,840 --> 00:26:18,719 Speaker 1: The piece of equipment the user had could be 440 00:26:18,760 --> 00:26:23,480 Speaker 1: a relatively underpowered terminal, essentially. Meanwhile, it also meant that 441 00:26:23,520 --> 00:26:27,879 Speaker 1: Google could collect enormous samples of data, not necessarily to 442 00:26:27,960 --> 00:26:31,520 Speaker 1: market to people or to identify specific individuals, but rather 443 00:26:31,680 --> 00:26:34,840 Speaker 1: it could collect a lot of data for training its 444 00:26:34,920 --> 00:26:39,200 Speaker 1: speech recognition and natural language recognition models. 
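The phone-as-transmitter pattern described above can be sketched in a few lines of Python. This is a toy illustration, not Google's actual architecture: the "server" here is just a stand-in function with fake audio bytes, where a real deployment would be a network call to a service running the heavy acoustic and language models.

```python
# Stand-in for the heavy server-side work (acoustic model, language model).
# In a real system this would live behind an HTTP endpoint, not in-process.
def cloud_recognize(audio_bytes: bytes) -> str:
    fake_results = {b"\x01\x02": "recognize speech"}  # contrived audio-to-text mapping
    return fake_results.get(audio_bytes, "")

class Phone:
    """The thin client does no analysis; it just transmits and shows the result."""
    def __init__(self, recognizer):
        self.recognizer = recognizer  # in reality, a network call to the cloud

    def dictate(self, audio_bytes: bytes) -> str:
        return self.recognizer(audio_bytes)

phone = Phone(cloud_recognize)
text = phone.dictate(b"\x01\x02")
```

The design point is the split itself: the client stays cheap and simple, and every improvement to `cloud_recognize` reaches all users at once without updating the device.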
Google could build 445 00:26:39,200 --> 00:26:42,080 Speaker 1: out a much more robust model of human speech patterns 446 00:26:42,359 --> 00:26:46,840 Speaker 1: because they had thousands of real world uses going on 447 00:26:46,960 --> 00:26:49,520 Speaker 1: in real time. They could keep using that to build 448 00:26:49,520 --> 00:26:54,919 Speaker 1: out and bolster their models, and that improved Google's speech 449 00:26:54,920 --> 00:26:59,439 Speaker 1: recognition accuracy. Today, major speech recognition platforms typically have an 450 00:26:59,520 --> 00:27:02,800 Speaker 1: error rate below five percent, which is pretty darn impressive. 451 00:27:03,280 --> 00:27:07,880 Speaker 1: According to a comScore estimate, by twenty twenty, half of 452 00:27:07,920 --> 00:27:11,600 Speaker 1: all searches on the Internet will be voice searches. So 453 00:27:11,640 --> 00:27:15,720 Speaker 1: speech recognition, along with natural language processing, could lead to 454 00:27:15,760 --> 00:27:19,679 Speaker 1: a future of ambient computing, in which the environments we 455 00:27:19,880 --> 00:27:23,720 Speaker 1: move through are effectively computer interfaces, and we can access 456 00:27:23,760 --> 00:27:26,600 Speaker 1: them through voice commands and other ways of commanding, maybe 457 00:27:26,600 --> 00:27:29,480 Speaker 1: gesture commands, but that seems like it might be better 458 00:27:29,520 --> 00:27:32,360 Speaker 1: saved for our episode about voice assistants and where we're 459 00:27:32,359 --> 00:27:36,160 Speaker 1: headed with that technology. In our next episode, I'm going 460 00:27:36,200 --> 00:27:40,199 Speaker 1: to really explore natural language processing, how it works, and 461 00:27:40,240 --> 00:27:42,840 Speaker 1: how that field of research has evolved over the last 462 00:27:42,840 --> 00:27:46,119 Speaker 1: few decades. 
It's also really fascinating, and it does, in 463 00:27:46,200 --> 00:27:49,000 Speaker 1: fact, cross over quite a bit with speech recognition. But 464 00:27:49,280 --> 00:27:53,679 Speaker 1: natural language processing goes beyond speech. It also includes text, 465 00:27:54,320 --> 00:27:56,320 Speaker 1: and that will be our next episode. But if you 466 00:27:56,359 --> 00:27:58,760 Speaker 1: have a suggestion for a future topic I should cover 467 00:27:58,960 --> 00:28:01,520 Speaker 1: on tech Stuff, send me a message and let me know 468 00:28:01,560 --> 00:28:04,199 Speaker 1: about it. The email for the show is tech stuff 469 00:28:04,359 --> 00:28:07,440 Speaker 1: at how stuff works dot com, or you can drop 470 00:28:07,440 --> 00:28:09,240 Speaker 1: me a line on Facebook or Twitter. The handle for 471 00:28:09,320 --> 00:28:13,040 Speaker 1: both of those is tech Stuff H S W, and you 472 00:28:13,119 --> 00:28:15,800 Speaker 1: can also follow us on Instagram. I would love it 473 00:28:15,840 --> 00:28:19,200 Speaker 1: if you did, and I'll talk to you again really soon. 474 00:28:24,960 --> 00:28:27,399 Speaker 1: For more on this and thousands of other topics, visit 475 00:28:27,400 --> 00:28:38,520 Speaker 1: how stuff works dot com