Get in touch with technology with TechStuff from HowStuffWorks.com.

Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with HowStuffWorks and I love all things tech, and this is the second episode about natural language processing, or NLP, and also natural language understanding, NLU. The two are related, and together they describe the technologies and processes we use to give machines the ability to interpret and respond to language the way we use it. So not just understanding our input, but also generating output that still follows the rules of various languages. It's all about getting machines to conform to us rather than the other way around.

If you have not listened to the episode immediately before this one, you should do that, because I'm about to pick up where I left off, which was just after DARPA pulled the plug on its Speech Understanding Research project. The research under that DARPA project had shown that NLP was an even more challenging problem than had previously been anticipated. Even the simplest approaches were creating enormous demands, both on the work programmers had to do to build a system out and on the processing the system would have to rely upon in order to interpret language.

Work in the late nineteen seventies ranged into psychology. NLP researchers felt a system needed to be able to identify a user's needs and goals in order to function properly. It had to understand not just the surface-level meaning of a phrase, but the underlying meaning of linguistic expressions as well. Only then could you have a computer system that could collaborate with a human being in a seamless way. So, in other words, what they were saying is that you could translate or interpret stuff word by word, but unless you have an understanding of what the person is actually trying to accomplish, chances are the results you get back are not going to be as relevant as they could be.
And so that was where the psychology was starting to take form. By the early nineteen eighties, which marks the third phase of NLP development according to the researcher Karen Spärck Jones, who I talked about in the last episode, researchers were coming to terms with the idea that a scalable NLP system relying upon the old methods of building lexicons and syntax rules just was not practical. It required far too much work on the front end when designing a system to make a general-purpose NLP application. The problem was just way too big to take that approach. Even with relatively narrow implementations, like designing a system that would parse technical documents, where you'd think, all right, the language used in technical documents is a subset of the language you would encounter in the quote unquote real world, the old methods were proving to require far too much investment in time, money, and effort on the design front.

Spärck Jones identifies the key focus during this phase as being on grammar and logic. During this phase, researchers developed several different grammar types. Now, grammars are sets of rules for analyzing and formalizing language. I would love to go into more detail about the different grammars that were developed during this phase or adopted for computational models, but honestly, it gets really, really heavy really quickly. It gets extremely technical, though not on the technological side but more on the linguistic side. Suffice it to say that a lot of research and debate centered on one question: what is the best way to arrive at the meaning of language? How do we get to that? How can you ascertain what is meant by what was spoken or what was written? The grammars were meant to direct NLP models to analyze language in different ways that were computationally viable and that wouldn't require the laborious process of programming everything in a word-for-word style.
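To make the idea of a grammar concrete, here's a tiny sketch in Python of a grammar as a set of rewrite rules, plus a recognizer that checks whether a sentence can be derived from those rules. The rules and vocabulary are made up for illustration, and this is far simpler than any of the research grammars of that era.

```python
# A toy grammar: each symbol rewrites to one or more sequences of symbols.
# Anything not listed as a symbol is treated as a literal word.
GRAMMAR = {
    "S":   [["NP", "VP"]],           # a sentence is a noun phrase + verb phrase
    "NP":  [["Det", "N"], ["N"]],    # "the dog" or just "dogs"
    "VP":  [["V", "NP"], ["V"]],     # "chased the cat" or just "barked"
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["cat"], ["dogs"]],
    "V":   [["chased"], ["barked"], ["saw"]],
}

def derives(symbol, words, start):
    """Return every index j such that `symbol` can produce words[start:j]."""
    if symbol not in GRAMMAR:                      # terminal: must match the word
        if start < len(words) and words[start] == symbol:
            return [start + 1]
        return []
    ends = []
    for production in GRAMMAR[symbol]:
        positions = [start]
        for part in production:                    # match each part in sequence
            positions = [j for p in positions for j in derives(part, words, p)]
        ends.extend(positions)
    return ends

def accepts(sentence):
    words = sentence.lower().split()
    return len(words) in derives("S", words, 0)

print(accepts("the dog chased a cat"))   # True: derivable from the rules
print(accepts("chased dog the"))         # False: no rule produces this order
```

The point is just that "grammatical" becomes a mechanical, checkable property. You can also see why this approach didn't scale: covering real language this way means hand-writing an enormous number of rules.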
Another big area of focus at this time was generation, meaning creating models that would allow machines to generate natural language responses to users, including responses that were extended, long examples of discourse, not just a quick message. While machines wouldn't be able to think, they would be able to put together a more sophisticated response than chatbots like ELIZA, which I mentioned in the last episode, could manage. So the idea being: how can we make a machine that can communicate results to a person in a way that just makes sense, almost as if a normal human being were chatting with you? But as we understand it, that's very difficult to do on an extended basis. You can do it for responses to individual queries, but when you start trying to create something that can carry on an actual conversation, that's where things start to break down.

In the nineties, work in NLP focused on representing words as mathematical vectors. Many words are related to one another. For example, hotel and motel are related. They don't mean exactly the same thing, but they mean very similar things. Then you have a term like bed and breakfast. A bed and breakfast is similar again to a hotel or a motel. It's a different thing, but it's related. So these words have similarities. They also have differences between them, but they're all more similar to each other than if I used a different word, like hospital. A bed and breakfast is more like a hotel or a motel than it is like a hospital. So, in other words, we can group words together into vector spaces and calculate the quote unquote distances between vectors, and that determines degrees of similarity. This is very helpful for both translation and natural language processing. There are ways to do this that even take context into account, and this relates back to what Warren Weaver was suggesting in that memorandum I talked about. There's a model called skip-gram, which is essentially what he was talking about.
This model takes a window of words surrounding each word in a sentence to determine context, so it's not looking at things on a pure word-to-word basis. Let's say that I write a phrase: I'm going to the bank to make a withdrawal. Now, the word bank can actually refer to a couple of different things, right? It could be a financial institution, which is obviously what I do mean when I say that sentence, but it could also mean the area right next to a river, the bank of a river. The skip-gram model would take each word in that sentence and pair it with a few other words that are close by to determine the meaning of the phrase. So, looking at "I'm going to the bank to make a withdrawal," for bank it might generate pairs like bank-to, bank-the, bank-to again, and bank-make. By looking at these pairings, the system can figure out from context that the bank I'm talking about is probably a financial institution. I'm probably not making a withdrawal from a riverbank. So it's a way for machine systems to figure out the meaning of a phrase through contextual cues, using this windowed approach. And again, Warren Weaver had proposed such a thing way back when.

The vector approach would become more important as computer scientists made advances in neural networks. That approach also made machine translation much more effective, because it no longer looked for word-for-word matches, but rather matched meaning based on vectors and probabilities. That's really important, because once you determine the meaning of a phrase in one language, then you can look for a phrase in another language that most closely resembles the meaning of the original. This is the art of translation. A real translator, someone who translates from one language to another, is probably not doing so word for word. Rather, they're doing meaning for meaning, to make certain that the intent of what is being communicated gets through, not just the vocabulary.
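Here's a small Python sketch of both of those ideas: first, generating the windowed target-and-context pairs that a skip-gram style model trains on, and second, measuring similarity between word vectors with cosine similarity. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions learned from huge amounts of text.

```python
import math

def skipgram_pairs(sentence, window=2):
    """Generate (target, context) pairs the way a skip-gram model does:
    each word is paired with the words inside a small window around it."""
    words = sentence.lower().split()
    pairs = []
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

for pair in skipgram_pairs("i'm going to the bank to make a withdrawal"):
    if pair[0] == "bank":
        print(pair)   # ('bank', 'to'), ('bank', 'the'), ('bank', 'to'), ('bank', 'make')

# Once training turns words into vectors, similarity is just geometry.
# These tiny vectors are made up purely to show the arithmetic.
vectors = {
    "hotel":    [0.9, 0.8, 0.1],
    "motel":    [0.8, 0.9, 0.1],
    "hospital": [0.1, 0.3, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["hotel"], vectors["motel"]))     # close to 1.0: very similar
print(cosine(vectors["hotel"], vectors["hospital"]))  # much lower: less similar
```

The "distance" between words I mentioned is just this kind of comparison: hotel and motel end up near each other in the vector space, and hospital ends up farther away.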
The nineteen nineties, which Spärck Jones identifies as the fourth phase of NLP development, and the final phase in her report, saw a more concentrated focus on lexicons over syntax. It also saw more practical applications of natural language processing, as well as leveraging the World Wide Web to help train natural language processing models. There was a rich source of natural language on the World Wide Web, pretty much every permutation you could imagine, from people who are very careful in the way they construct sentences and paragraphs to people who are much more cavalier in the way they use language, whether purposefully or otherwise.

Now, that report from Spärck Jones is dated October two thousand one, so that's where her work stops for that particular report. But nearly two decades have passed since that time. So what has changed? Well, I would argue we are now in a new phase of NLP development, one marked largely by the rise of a few key technologies. One of those is cloud computing. Cloud computing has removed the necessity to build complex capabilities into end machines like a smartphone or a computer terminal. An organization can create a cloud infrastructure which consists of powerful machines and databases. Those machines could be real, or they could be virtual; virtual machines are hosted on real hardware, but they're running virtual implementations of various operating systems. These machines provide the processing power and they house the systems that are necessary to parse language and respond appropriately, so you can think of them as the brains of natural language processing. They all exist on very powerful computers in data centers. The widespread availability of the Internet, and the fact that it's pretty easy to stay connected in many parts of the world, make this possible.
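As a rough sketch of that division of labor, here's what the thin client side can look like in Python: the device just ships the raw query off to a cloud service and displays whatever comes back. The endpoint URL and response format here are hypothetical stand-ins; every vendor runs its own private equivalent.

```python
import json
import urllib.request

def ask_assistant(query):
    """Send a query to a (hypothetical) cloud NLP service and return its answer.
    All the heavy parsing and interpretation happens server-side; the device
    only records the question and displays the response."""
    payload = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        "https://assistant.example.com/v1/interpret",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["answer"]

# Would print the cloud's reply if the endpoint above actually existed:
# print(ask_assistant("what's the weather like today?"))
```

The whole "intelligence" of the assistant lives behind that one network call, which is why a cheap, low-power speaker can seem so capable.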
So the end user feels like the capabilities are actually housed on whatever device he or she is using, like a smartphone or a computer, but in reality all the work is taking place potentially thousands of miles away in a data center, and the results are just being sent to you. The queries are being sent to the data center and the responses are being sent back to your device.

Another big development that has helped significantly is the pairing of artificial neural networks with deep learning. A neural network processes information in a way similar to how our brains do it. Every node in a neural network represents a neuron, and it executes an operation upon data and then hands off this data, which has now been altered, transformed by this operation, to another layer of neurons within the network, which do further processing, and so on and so forth. The system as a whole can evaluate calculations and assign confidence levels to them. Deep learning passes information through numerous layers to transform data and, in the context of natural language processing, extract meaning from that information. There's a small sketch of that layered hand-off coming up in a moment. Now, I've got a bit more to say about natural language processing in general, and then after that I'm going to transition to talk about recent implementations like Siri, Alexa, Google Assistant, and Cortana. But first, let's take a quick break and thank our sponsor.

In two thousand sixteen, Google announced a system that could analyze syntax and recognize the various elements of a sentence, including verbs, nouns, adjectives, and other components. The system's name is sort of a snapshot of the zeitgeist of two thousand sixteen. It was called, and I'm not making this up, Parsey McParseface. It really was. This is a parser, a piece of software that is meant to analyze inputs and determine what the relationships are between the various components within the input.
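Before getting back to Google's parser, here is that promised sketch of the layered hand-off, in Python. Each "neuron" computes a weighted sum of its inputs, squashes it through an activation function, and passes the result on as input to the next layer. All the weights here are invented for illustration; a real network learns them from training data.

```python
import math

def layer(inputs, weights, biases):
    """One layer of a tiny neural network: every neuron takes a weighted sum
    of all the inputs, adds a bias, and applies a sigmoid squashing function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-total)))   # sigmoid activation
    return outputs

x = [0.5, -1.2, 0.3]                                  # some input features

# Hidden layer: two neurons, each with one made-up weight per input.
hidden = layer(x, [[0.4, 0.1, -0.6], [0.7, -0.2, 0.5]], [0.0, 0.1])

# Output layer: one neuron reading the hidden layer's two outputs.
output = layer(hidden, [[1.2, -0.8]], [0.05])
print(output)   # a confidence-like score between 0 and 1 for this input
```

Stack many such layers, with many more neurons each, and you have "deep" learning: each layer transforms the data a little further, and the final layer's outputs act as the confidence levels I described.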
So Parsey McParseface parses out the meaning of a phrase by looking at the relationships between all the different components. It was designed specifically for English-language inputs. In that same announcement, Google unveiled an open-source neural network framework called SyntaxNet. SyntaxNet tags every word in an input with a part-of-speech tag, and the tag describes the purpose of that word: what purpose does it serve within the sentence, within the context of that input? So, for example, it might be the subject of the sentence, or it could be an object of the sentence, or it might be the action, the root, that the user wishes to perform upon the object. If it identifies a verb, that tends to be the root of the command. The system also determines the syntactic relationships between all the words, so not just what each word's purpose is, but how that word relates to all the other words within the input, and then it creates a dependency tree, which illustrates which words depend upon others.

SyntaxNet also makes use of beam search. That's the strategy I talked about in the speech recognition episode a couple of podcasts ago, and here it helps eliminate ambiguity. As sentence length increases, the number of possible interpretations of that sentence also increases dramatically. The more complicated a sentence is, the easier it is to misinterpret what that sentence means, especially if you're looking at it from the perspective of a machine. So how does the computer know which interpretation is the right one? SyntaxNet takes a sentence and starts to parse it, beginning with a left-to-right approach for English, so it starts at the beginning of the sentence and works its way through. Essentially, it creates a hypothesis as to how the words relate to each other. But as it goes along, it detects possible alternate interpretations, so it starts to assign a probability score to each interpretation: essentially, how sure it is that this one is on the right track.
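Here's a toy Python sketch of that beam idea. The words, the candidate part-of-speech tags, and their scores are all made up; the point is just the mechanic of scoring partial interpretations and keeping the best few at each step instead of committing to one interpretation immediately.

```python
# Each word comes with a few candidate tags and made-up probabilities.
CANDIDATES = [
    ("bank",     [("NOUN", 0.7), ("VERB", 0.3)]),
    ("deposits", [("VERB", 0.6), ("NOUN", 0.4)]),
    ("money",    [("NOUN", 0.9), ("VERB", 0.1)]),
]

def beam_search(candidates, beam_width=2):
    """Keep only the `beam_width` highest-scoring partial interpretations
    as we move through the sentence left to right."""
    beam = [([], 1.0)]                      # (tag sequence so far, probability)
    for word, options in candidates:
        extended = [
            (tags + [(word, tag)], score * p)
            for tags, score in beam
            for tag, p in options
        ]
        extended.sort(key=lambda h: h[1], reverse=True)
        beam = extended[:beam_width]        # discard everything but the k best
    return beam

for tags, score in beam_search(CANDIDATES):
    print(round(score, 3), tags)
```

Without the beam, the number of full interpretations multiplies with every word; with it, the parser carries only a handful of live hypotheses forward at any moment.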
As it parses, the system keeps multiple possible answers; it doesn't toss them aside immediately. It says: all right, right now I'm pretty sure answer A is correct, but I'm going to hold on to B and C just in case. Now, if one interpretation has a particularly low score and there are several other potential interpretations with higher scores, the system will discard the low scorer, with the assumption that it just can't be the right answer, that it just doesn't make sense.

On well-formed text, that is, not informal text but something that has been written in a very formal way, Parsey McParseface does a pretty good job. In fact, a really good job: it has an accuracy rating approaching the level of a human linguist who is trained in parsing sentences. Humans who have that kind of training average in the mid-to-high nineties, percentage-wise, so Parsey McParseface is right behind them. But the key phrase there is well-formed text. If you present Parsey McParseface with more loosey-goosey language, such as what you might find on your average Internet website (which I know was redundant), Parsey McParseface has a much more modest accuracy rating of around ninety percent. That's still impressive, but it's a significant drop.

Now, these sorts of tools have been used in various Google products for a while, not just Google Assistant, which is the one people tend to think about because it's the one we interact with when we are speaking to Google, but also in stuff like Gmail. If you've used Gmail and you've noticed that sometimes you get automated responses popping up that you can choose as an option, so instead of writing an email you just select "sounds good" or "I'll see you then" or whatever it may be, then you have seen this technology at work, or at least you've seen the product of its work.
Those automated responses are the result of a natural language understanding system that's parsing that email, identifying whatever the salient points are in the message, and then generating what are hopefully logical responses to it, so you can just choose one instead of taking the time to actually type something in.

One of the key elements in natural language understanding is creating machines that can communicate with us and explain how they arrived at a certain result. This falls under the concept of transparency, which is really important when we're talking about artificial intelligence. There's a real fear that AI and neural networks are careening toward a black box scenario, and a black box describes any system where the workings of the system are hidden from our view. We cannot see how something works, and so we can only make guesses as to what's going on. I know a lot of gearheads who are exasperated with the way vehicle manufacturers are building more of their cars, trucks, and other vehicles with systems that aren't easily accessible or modifiable. They consider those cars to be black boxes. It makes it much harder to work on a vehicle if you don't have the proprietary tools and knowledge that are specific to that system. Now take that concept and apply it to AI, and it gets pretty scary pretty fast, particularly since we're relying on AI to do some important stuff, like drive cars, make stock option deals, or help with healthcare issues.

And so one area of work focuses on giving machines the capability to explain themselves, not just to provide an answer, but to explain why they came up with that answer. So imagine a chess-playing computer. It's playing a game of chess and it makes a move. Then imagine being able to ask the computer, why did you make that move, and the computer could actually answer the question, explaining the logic behind the move it made. Now extend that concept to all sorts of different AI applications.
If an AI stock trader suddenly buys up a ton of stocks, you might want to know exactly what prompted that decision. Why did it make that purchase? And you can easily imagine situations in which you'd want to know why a machine behaved the way it did. Why did an autonomous car choose a particular route? Why did a healthcare program suggest a particular diagnosis? Without getting those answers, we're just putting our faith in machines blindly, and giving a computer the ability to generate meaningful and, equally important, relevant explanations would be extremely helpful.

So what are some of the uses of natural language processing technology? Well, one fairly simple application is in spelling and grammar checking software. If you've used a word processing program over the last few years, or the last couple of decades, chances are you're familiar with automatic, real-time spell check and grammar check features. This is possible because of the work that has been done in natural language processing. Spell check needs to take into consideration not only whether a word is spelled correctly, whether it matches a word in the computer's lexicon, but also whether it's the right word for that instance. In English, we have a lot of homonyms. Those are words that sound the same but have different meanings. Now, you can have homonyms that are spelled exactly the same way, and those really aren't a problem for a spell checker, because the reader can pick up on what meaning you intended through context. Though if you're using natural language processing to do a translation, then the NLP system needs to be able to determine which meaning the original author intended. In my earlier example about making a withdrawal at the bank, there's a homonym, you know, two versions of bank, and they mean two different things. I could also talk about bank in the sense of a verb, as in banking off of something, but you get the point.
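As a toy illustration of resolving that kind of ambiguity through context, here's a Python sketch that scores each sense of "bank" by counting overlaps between the sentence and a small hand-made list of clue words per sense. Real systems learn these associations from data rather than from hard-coded lists like these.

```python
# Hand-made clue words for each sense, purely for illustration.
SENSES = {
    "financial institution": {"withdrawal", "deposit", "money", "account", "loan"},
    "riverbank":             {"river", "water", "shore", "fishing", "mud"},
}

def disambiguate(sentence):
    """Pick the sense of 'bank' whose clue words overlap most with the sentence."""
    context = set(sentence.lower().replace(".", "").split())
    scores = {sense: len(context & clues) for sense, clues in SENSES.items()}
    return max(scores, key=scores.get), scores

print(disambiguate("I'm going to the bank to make a withdrawal."))
# ('financial institution', {'financial institution': 1, 'riverbank': 0})

print(disambiguate("We went fishing down by the river bank."))
# ('riverbank', {'financial institution': 0, 'riverbank': 2})
```

It's crude, but it captures the same intuition as the windowed skip-gram approach from earlier: the words around an ambiguous word are what tell you which meaning is in play.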
There are also homonyms that sound the same but are spelled differently, and they have different meanings as well. So, for example, the dreaded to, as in T-O, too, as in T-O-O, and two, as in T-W-O, combo. Those are three words with three different applications and three different spellings. A good spell check algorithm will be able to determine whether you've used the correct one in any instance. So if you type "that's two sweet" when you mean "that's too sweet," using the number two in word form, the spell checker will give you the old heads up and say: I think you meant T-O-O, not T-W-O. Fun fact: I typed that sentence into Google Docs and it said you're totes fine, bra. It didn't notice it at all.

Grammar checkers have to be able to analyze sentence structure and word choice and compare them against the grammar programmed into the system. This might also help determine whether the word you used was the correct one. So, for example, affect versus effect. Affect is a verb; you affect something. Effect is usually a noun; it's typically the result of some action. So I could affect a drum, which is a dumb thing to say, and the effect might be that the sound I played hurt your ears. Now, if you spell the word correctly and the spell checker is only comparing the words you type against a lexicon to see if there's a match, you might not get an indication that anything is wrong, because the computer system is saying, well, that word is spelled correctly. It doesn't realize it's the wrong word. But if it has a way of checking grammar, it can also make sure you're using the right word in the right context.

Search engines such as Google use natural language processing to determine what it is you're looking for, right? So when you type in a search and you hit the search button, you might get a little notification that says, maybe you meant this other thing, or maybe you need to search for this terminology.
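Those "did you mean?" suggestions rest on ideas like edit distance. Here's a bare-bones Python sketch that suggests the closest word from a tiny invented vocabulary; real search engines layer query logs, phonetics, and context on top of this basic mechanic.

```python
def edit_distance(a, b):
    """Count the single-character insertions, deletions, and substitutions
    needed to turn word `a` into word `b` (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

VOCAB = ["franklin", "benjamin", "kite", "lightning", "electricity"]

def did_you_mean(word):
    """Suggest the nearest vocabulary word if it's within two edits."""
    best = min(VOCAB, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= 2 else word

print(did_you_mean("frankling"))   # franklin
print(did_you_mean("lightening"))  # lightning
```

A misspelling is usually only an edit or two away from the word the searcher meant, which is why this simple measure catches so many typos.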
That kind of suggestion is a useful feature, since not everyone thinks of search the same way. I could tell a dozen people to go on Google and pull up information about Benjamin Franklin and the story about the kite, and those folks might go and perform their searches in twelve different ways. But the search engine's job is to return the best results based on the query, which means it needs to suss out what the searcher is actually looking for. So even if the twelve people all type twelve different ways of looking up this information about Benjamin Franklin and the kite story, it should respond with the most relevant results. Maybe people get slightly different search results based upon the query, but they should be more or less the same. And it can also look out for you: it can give you suggestions for search terms should you use an incorrect spelling, or approximate a spelling, or something like that.

One of the areas of opportunity for natural language processing applications in the near future is handling the massive amounts of information in big data applications. So, for example, a lawyer might want to search historical legal results using natural language to look for precedents that might help his or her case in the courtroom. A pharmaceuticals company might need to search information about clinical trials, doctors' notes, patient testimonials, and related information. And the amount of information represented by big data is truly astounding. It's enormous. It's way too much for any human to sort through. So developing a method for computers to parse a query and return relevant results is highly desirable. For a computer to "understand," in air quotes, that context and be able to give you results based upon your questions would be incredibly valuable for lots of different industries.

And we started off talking about machine translation at the early stages of natural language processing. That's still a big area of research. Now you can get real-time translation tools.
You can use devices to translate from one language to another in real settings, including written language like signs. You can just hold a camera up and get an English translation of a sign that's written in another language, and of course vice versa. That tends to be marketed as a tool for travelers, but it really shows the amazing progress we've made in natural language processing since the old days of the word-for-word machine translation models that were made back during the Cold War. Now, we've still got a long way to go with natural language processing. We've seen some incredible improvements over the past few years, but machines still don't actually understand what we're saying or what we're writing, not on a conscious level anyway. Instead, they are able to refer back to rules, either explicitly stated, as in the older NLP models, or arrived at through deep learning. Now I'm going to take a quick break, but when we come back, I'll talk a bit about the history of the voice assistants we all know and love. But first, here's another word from our sponsors.

All right. So now we understand a bit about the technologies that make voice assistants possible, specifically speech recognition and natural language processing. There's obviously a lot more to the system than that. The system still has to process our requests or commands and return a result using more traditional computational processes. So while the interpretation side leans on speech recognition and natural language processing, there's still a lot of regular computation work that has to happen for a personal assistant, digital assistant, voice assistant, whatever you want to call them, to be able to respond to you. So let's take a quick stroll through the history of the major voice assistants out there, and I'm going to cover these in the order they were introduced to the public, more or less. I'm only focusing on the really big ones here.
There are lots of small ones out there, but I'm looking at the ones everyone has heard about. So that means the first one we get to talk about is Apple's Siri. Apple unveiled Siri on October fourth, two thousand eleven. And to be fair, Siri existed before this; it was not an Apple creation. Siri was actually an app produced by an independent developer company called Siri Incorporated, but Apple gobbled up that company in two thousand ten and brought it in house. Apple had previously relied upon another speech recognition program called VoiceOver, which had been used in Mac products and in all iPhones since the iPhone 3GS. Siri would become available starting with the iPhone 4S.

In this announcement, Apple pointed out that earlier implementations of voice commands required users to learn the syntax of the system. You had to follow a very specific set of rules in order to get anything to work. So you would give a command defined by the system; for example, you might say "call Mom" or "play Once in a Lifetime." You had to take this very structured approach to whatever it was you wanted to do. But that requires the user to adhere to rules created by the architects of the system, right? So Siri was meant to be different. It was meant to understand what you wanted on your terms, not based off a strict set of rules. Apple said that Siri would be able to interpret what you meant and would return relevant information to you in response. In the unveiling, they said that Siri is, quote, "your intelligent assistant that helps you get things done just by asking," end quote. During that demonstration, they showed off how Siri could parse different phrases that had the same underlying meaning. The example they gave originally was "What's the weather today?", and then they asked that same question five or six different times.
Scott Forstall, a vice president over at Apple, showed off how you could get the same weather information by asking for it in these different ways. Then they showed off how Siri could interoperate with other apps, such as Apple's Maps feature, or, through a partnership they had with Yelp. Siri could take a request, parse it, interpret it, send the appropriate request to the appropriate destination, and then serve up the response. The destination could be a web search, or it could be an action within a compatible app. You get the idea. So that was Siri.

Next: on July ninth, two thousand twelve, Google released Android Jelly Bean, a.k.a. Android 4.1, and one of the features included in that operating system update, at least for certain hardware upon release, was an offshoot of Google Search called Google Now. This feature would serve up predictive cards containing information that the system had flagged as potentially useful to you based off your activity. So let's say you spend a lot of time searching for stuff like baseball scores. Google Now would start serving you up cards that would give you scores from previous games before you could even search for them. You would just look at Google Now, scroll through, and see what the latest results were. Then you could actually scroll through the different cards, all of which were slowly dialing you in as a person, which was kind of creepy. And it relied a lot on natural language processing and your activities. Now, Google Now was not a voice assistant. This was sort of a one-way relationship: Google was analyzing information based on your activity and then serving up information to you that might be useful. But over time the company would phase out Google Now, and it gradually evolved into Google Assistant. There was also Google Voice Search, which allows you to do things like search by voice, so that also became incorporated into this. Google Assistant is a lot like Siri.
It responds to voice, and it can respond to anaphora, meaning it can keep track of subject matter and respond to follow-up questions that don't contain an explicit reference to the subject. So, for example, you could ask Google Assistant, "What is the weather going to be like in Atlanta?" And then, after you get a response, you might say, "What about in Seattle?" Now, you have not explicitly said, "What is the weather in Seattle?" You just said, "What about in Seattle?" However, Google Assistant can infer that you are still talking about the weather, only now within the context of a different location. Google Assistant debuted in May two thousand sixteen, so in a way this particular entry in our timeline spans two other debuts, because you had Google Now on one side and then Google Assistant later, but I figured it was important to acknowledge how Google Assistant grew out of the older Google Now feature.

On April second, two thousand fourteen, Microsoft introduced its own voice assistant at the Build developer conference. Microsoft's entry is named Cortana, after the AI character from the Halo series of video games. Microsoft integrated Cortana to work with Windows 10, Xbox One, Windows Mobile, and a few other platforms as well, including apps that were meant for other operating systems like iOS and Android. Cortana's US voice is that of Jen Taylor; she actually is the voice actress who provided the voice for the character of Cortana in the Halo games, which is kind of fun. And like Siri, Cortana can interface with apps as well as perform web searches.

In November two thousand fourteen, Amazon got into the game with Alexa and the Amazon Echo. Through the Amazon Echo, Alexa can serve not just as a voice assistant that can retrieve information and play streaming media and that kind of thing, but also as an interface in home automation applications. And, to be fair, so can Google Assistant through devices like Google Home. So you can use Alexa to interface directly with systems in your home.
If they are compatible, that is. And, not surprisingly, Alexa can interface with Amazon's ordering system, allowing users to order products from Amazon directly by speaking to Alexa. No shock there. According to Amazon, the developers were inspired by the Star Trek series of shows, in which characters would speak out loud to computer systems to call for information or send commands to make various stuff happen. Amazon also released a developer kit to allow independent developers to create what are called Alexa skills. There's an old episode of TechStuff where I interviewed some folks from Amazon to talk about this process. But essentially, developers submit skills to Amazon, which can then publish those skills and allow anyone who has an Alexa-enabled device to activate and make use of them. Individuals can even build their own personalized skills using a tool called Blueprints, which Amazon introduced in April two thousand eighteen.

Now, there are other examples I could point to. There's Samsung's Bixby, which it introduced in March two thousand seventeen. There's SoundHound's virtual assistant, called Hound, which launched in March two thousand sixteen. But these are the ones I hear about most frequently, so they're the ones I wanted to cover. And they all work on a similar principle. The implementations are all particular to their specific brands, but they work under similar foundational principles of natural language processing, speech recognition, et cetera. It's all about converging technologies that took decades of hard work to make possible.

Now, I want to thank listener Nate, who was the one who set me on this trail by asking about speech recognition, natural language processing, and these voice assistants. It was really interesting to dive into, very, very cool, fascinating stuff. Thanks a lot, Nate. If any of you out there have suggestions for future episodes of TechStuff, maybe it's a technology or a company or a person in tech, or maybe there's someone I should interview or have on as a guest host, send me a message.
The email address is techstuff@howstuffworks.com, or drop me a line on Facebook or Twitter. The handle at both of those is TechStuffHSW. Don't forget to follow us on Instagram, and I'll talk to you again really soon.

For more on this and thousands of other topics, visit howstuffworks.com.