Get in touch with technology with TechStuff from HowStuffWorks.com.

Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with HowStuffWorks and I love all things tech, and this is the second episode about natural language processing, or NLP, and also natural language understanding, NLU. The two are related, and together they describe the technologies and processes we use to give machines the ability to interpret and respond to language the way we use it. So not just understanding our input, but also generating output that still follows the rules of various languages. It's all about getting machines to conform to us rather than the other way around.

If you have not listened to the episode immediately before this one, you should do that, because I'm about to pick up where I left off, which was just after DARPA pulled the plug on its Speech Understanding Research project. The research under that DARPA project had shown that NLP was an even more challenging problem than had previously been anticipated. Even the simplest approaches were creating enormous demands, both on the work programmers had to do to build a system out and on the processing the system would have to rely upon in order to interpret language.

Work in the late nineteen seventies ranged into psychology. NLP researchers felt a system needed to be able to identify a user's needs and goals in order to function properly. It had to understand not just the surface-level meaning of a phrase, but the underlying meaning of linguistic expressions as well. Only then could you have a computer system that could collaborate with a human being in a seamless way. So, in other words, what they were saying is that you could translate or interpret stuff word by word, but unless you have an understanding of what the person is actually trying to accomplish, chances are the results you get back are not going to be as relevant as they could be.
And so that was where the psychology was starting to take form. By the early nineteen eighties, which marks the third phase of NLP development according to the researcher Karen Spärck Jones, who I talked about in the last episode, researchers were coming to terms with the idea that a scalable NLP system relying upon the old methods of building lexicons and syntax rules just was not practical. It required far too much work on the front end when designing a system to make a general-purpose NLP application. The problem was just way too big to take that approach. Even with relatively narrow implementations, like designing a system that would parse technical documents, where you'd think, all right, the language used in technical documents is a subset of the language you would encounter in the quote unquote real world, the old methods were proving to require far too much investment in time, money, and effort on the design front.

Spärck Jones identifies the key focus during this phase as being on grammar and logic. During this phase, researchers developed several different grammar types. Now, grammars are sets of rules for analyzing and formalizing language. I would love to go into more detail about the different grammars that were developed during this phase or adopted for computational models, but honestly, it gets really, really heavy really quickly. It gets extremely technical, though not on the technological side but more on the linguistic side. Suffice it to say that a lot of research and debate centered on one question: what is the best way to arrive at the meaning of language? How do we get to that? How can you ascertain what is meant by what was spoken or what was written? The grammars were meant to direct NLP models to analyze language in different ways that were computationally viable and that wouldn't require the laborious process of programming everything in a word-for-word style.
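To make the idea of a grammar concrete, here's a tiny sketch in Python of a grammar as a set of rewrite rules, plus a recognizer that checks whether a sentence can be derived from those rules. The rules and vocabulary are made up for illustration, and this is far simpler than any of the research grammars of that era.

```python
# A toy grammar: each symbol rewrites to one or more sequences of symbols.
# Anything not listed as a symbol is treated as a literal word.
GRAMMAR = {
    "S":   [["NP", "VP"]],           # a sentence is a noun phrase + verb phrase
    "NP":  [["Det", "N"], ["N"]],    # "the dog" or just "dogs"
    "VP":  [["V", "NP"], ["V"]],     # "chased the cat" or just "barked"
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["cat"], ["dogs"]],
    "V":   [["chased"], ["barked"], ["saw"]],
}

def derives(symbol, words, start):
    """Return every index j such that `symbol` can produce words[start:j]."""
    if symbol not in GRAMMAR:                      # terminal: must match the word
        if start < len(words) and words[start] == symbol:
            return [start + 1]
        return []
    ends = []
    for production in GRAMMAR[symbol]:
        positions = [start]
        for part in production:                    # match each part in sequence
            positions = [j for p in positions for j in derives(part, words, p)]
        ends.extend(positions)
    return ends

def accepts(sentence):
    words = sentence.lower().split()
    return len(words) in derives("S", words, 0)

print(accepts("the dog chased a cat"))   # True: derivable from the rules
print(accepts("chased dog the"))         # False: no rule produces this order
```

The point is just that "grammatical" becomes a mechanical, checkable property. You can also see why this approach didn't scale: covering real language this way means hand-writing an enormous number of rules.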
Another big area of focus at this time was generation, meaning creating models that would allow machines to generate natural language responses to users, including responses that were extended, long examples of discourse, not just a quick message. While machines wouldn't be able to think, they would be able to put together a more sophisticated response than chatbots like ELIZA, which I mentioned in the last episode, could manage. So the idea being: how can we make a machine that can communicate results to a person in a way that just makes sense, almost as if a normal human being were chatting with you? But as we understand it, that's very difficult to do on an extended basis. You can do it for responses to individual queries, but when you start trying to create something that can carry on an actual conversation, that's where things start to break down.

In the nineties, work in NLP focused on representing words as mathematical vectors. Many words are related to one another. For example, hotel and motel are related. They don't mean exactly the same thing, but they mean very similar things. Then you have a term like bed and breakfast. A bed and breakfast is similar again to a hotel or a motel. It's a different thing, but it's related. So these words have similarities. They also have differences between them, but they're all more similar to each other than if I used a different word, like hospital. A bed and breakfast is more like a hotel or a motel than it is like a hospital. So, in other words, we can group words together into vector spaces and calculate the quote unquote distances between vectors, and that determines degrees of similarity. This is very helpful for both translation and natural language processing. There are ways to do this that even take context into account, and this relates back to what Warren Weaver was suggesting in that memorandum I talked about. There's a model called skip-gram, which is essentially what he was talking about.
This model takes a window of words surrounding each word in a sentence to determine context, so it's not looking at things on a pure word-to-word basis. Let's say that I write a phrase: I'm going to the bank to make a withdrawal. Now, the word bank can actually refer to a couple of different things, right? It could be a financial institution, which is obviously what I do mean when I say that sentence, but it could also mean the area right next to a river, the bank of a river. The skip-gram model would take each word in that sentence and pair it with a few other words that are close by to determine the meaning of the phrase. So, looking at "I'm going to the bank to make a withdrawal," for bank it might generate pairs like bank-to, bank-the, bank-to again, and bank-make. By looking at these pairings, the system can figure out from context that the bank I'm talking about is probably a financial institution. I'm probably not making a withdrawal from a riverbank. So it's a way for machine systems to figure out the meaning of a phrase through contextual cues, using this windowed approach. And again, Warren Weaver had proposed such a thing way back when.

The vector approach would become more important as computer scientists made advances in neural networks. That approach also made machine translation much more effective, because it no longer looked for word-for-word matches, but rather matched meaning based on vectors and probabilities. That's really important, because once you determine the meaning of a phrase in one language, then you can look for a phrase in another language that most closely resembles the meaning of the original. This is the art of translation. A real translator, someone who translates from one language to another, is probably not doing so word for word. Rather, they're doing meaning for meaning, to make certain that the intent of what is being communicated gets through, not just the vocabulary.
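Here's a small Python sketch of both of those ideas: first, generating the windowed target-and-context pairs that a skip-gram style model trains on, and second, measuring similarity between word vectors with cosine similarity. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions learned from huge amounts of text.

```python
import math

def skipgram_pairs(sentence, window=2):
    """Generate (target, context) pairs the way a skip-gram model does:
    each word is paired with the words inside a small window around it."""
    words = sentence.lower().split()
    pairs = []
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

for pair in skipgram_pairs("i'm going to the bank to make a withdrawal"):
    if pair[0] == "bank":
        print(pair)   # ('bank', 'to'), ('bank', 'the'), ('bank', 'to'), ('bank', 'make')

# Once training turns words into vectors, similarity is just geometry.
# These tiny vectors are made up purely to show the arithmetic.
vectors = {
    "hotel":    [0.9, 0.8, 0.1],
    "motel":    [0.8, 0.9, 0.1],
    "hospital": [0.1, 0.3, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["hotel"], vectors["motel"]))     # close to 1.0: very similar
print(cosine(vectors["hotel"], vectors["hospital"]))  # much lower: less similar
```

The "distance" between words I mentioned is just this kind of comparison: hotel and motel end up near each other in the vector space, and hospital ends up farther away.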
The nineteen nineties, which Spärck Jones identifies as the fourth phase of NLP development, and the final phase in her report, saw a more concentrated focus on lexicons over syntax. It also saw more practical applications of natural language processing, as well as leveraging the World Wide Web to help train natural language processing models. There was a rich source of natural language on the World Wide Web, pretty much every permutation you could imagine, from people who are very careful in the way they construct sentences and paragraphs to people who are much more cavalier in the way they use language, whether purposefully or otherwise.

Now, that report from Spärck Jones is dated October two thousand one, so that's where her work stops for that particular report. But nearly two decades have passed since that time. So what has changed? Well, I would argue we are now in a new phase of NLP development, one marked largely by the rise of a few key technologies. One of those is cloud computing. Cloud computing has removed the necessity to build complex capabilities into end machines like a smartphone or a computer terminal. An organization can create a cloud infrastructure which consists of powerful machines and databases. Those machines could be real, or they could be virtual; virtual machines are hosted on real hardware, but they're running virtual implementations of various operating systems. These machines provide the processing power and they house the systems that are necessary to parse language and respond appropriately, so you can think of them as the brains of natural language processing. They all exist on very powerful computers in data centers. The widespread availability of the Internet, and the fact that it's pretty easy to stay connected in many parts of the world, make this possible.
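As a rough sketch of that division of labor, here's what the thin client side can look like in Python: the device just ships the raw query off to a cloud service and displays whatever comes back. The endpoint URL and response format here are hypothetical stand-ins; every vendor runs its own private equivalent.

```python
import json
import urllib.request

def ask_assistant(query):
    """Send a query to a (hypothetical) cloud NLP service and return its answer.
    All the heavy parsing and interpretation happens server-side; the device
    only records the question and displays the response."""
    payload = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        "https://assistant.example.com/v1/interpret",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["answer"]

# Would print the cloud's reply if the endpoint above actually existed:
# print(ask_assistant("what's the weather like today?"))
```

The whole "intelligence" of the assistant lives behind that one network call, which is why a cheap, low-power speaker can seem so capable.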
So the end user feels like the capabilities are actually housed on whatever device he or she is using, like a smartphone or a computer, but in reality all the work is taking place potentially thousands of miles away in a data center, and the results are just being sent to you. The queries are being sent to the data center and the responses are being sent back to your device.

Another big development that has helped significantly is the pairing of artificial neural networks with deep learning. A neural network processes information in a way similar to how our brains do it. Every node in a neural network represents a neuron, and it executes an operation upon data and then hands off this data, which has now been altered, transformed by this operation, to another layer of neurons within the network, which do further processing, and so on and so forth. The system as a whole can evaluate calculations and assign confidence levels to them. Deep learning passes information through numerous layers to transform data and, in the context of natural language processing, extract meaning from that information. There's a small sketch of that layered hand-off coming up in a moment. Now, I've got a bit more to say about natural language processing in general, and then after that I'm going to transition to talk about recent implementations like Siri, Alexa, Google Assistant, and Cortana. But first, let's take a quick break and thank our sponsor.

In two thousand sixteen, Google announced a system that could analyze syntax and recognize the various elements of a sentence, including verbs, nouns, adjectives, and other components. The system's name is sort of a snapshot of the zeitgeist of two thousand sixteen. It was called, and I'm not making this up, Parsey McParseface. It really was. This is a parser, a piece of software that is meant to analyze inputs and determine what the relationships are between the various components within the input.
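Before getting back to Google's parser, here is that promised sketch of the layered hand-off, in Python. Each "neuron" computes a weighted sum of its inputs, squashes it through an activation function, and passes the result on as input to the next layer. All the weights here are invented for illustration; a real network learns them from training data.

```python
import math

def layer(inputs, weights, biases):
    """One layer of a tiny neural network: every neuron takes a weighted sum
    of all the inputs, adds a bias, and applies a sigmoid squashing function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-total)))   # sigmoid activation
    return outputs

x = [0.5, -1.2, 0.3]                                  # some input features

# Hidden layer: two neurons, each with one made-up weight per input.
hidden = layer(x, [[0.4, 0.1, -0.6], [0.7, -0.2, 0.5]], [0.0, 0.1])

# Output layer: one neuron reading the hidden layer's two outputs.
output = layer(hidden, [[1.2, -0.8]], [0.05])
print(output)   # a confidence-like score between 0 and 1 for this input
```

Stack many such layers, with many more neurons each, and you have "deep" learning: each layer transforms the data a little further, and the final layer's outputs act as the confidence levels I described.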
So Parsey McParseface parses out the meaning of a phrase by looking at the relationships between all the different components. It was designed specifically for English-language inputs. In that same announcement, Google unveiled an open-source neural network framework called SyntaxNet. SyntaxNet tags every word in an input with a part-of-speech tag, and the tag describes the purpose of that word: what purpose does it serve within the sentence, within the context of that input? So, for example, it might be the subject of the sentence, or it could be an object of the sentence, or it might be the action, the root, that the user wishes to perform upon the object. If it identifies a verb, that tends to be the root of the command. The system also determines the syntactic relationships between all the words, so not just what each word's purpose is, but how that word relates to all the other words within the input, and then it creates a dependency tree, which illustrates which words depend upon others.

SyntaxNet also makes use of beam search. That's the strategy I talked about in the speech recognition episode a couple of podcasts ago, and here it helps eliminate ambiguity. As sentence length increases, the number of possible interpretations of that sentence also increases dramatically. The more complicated a sentence is, the easier it is to misinterpret what that sentence means, especially if you're looking at it from the perspective of a machine. So how does the computer know which interpretation is the right one? SyntaxNet takes a sentence and starts to parse it, beginning with a left-to-right approach for English, so it starts at the beginning of the sentence and works its way through. Essentially, it creates a hypothesis as to how the words relate to each other. But as it goes along, it detects possible alternate interpretations, so it starts to assign a probability score to each interpretation: essentially, how sure it is that this one is on the right track.
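Here's a toy Python sketch of that beam idea. The words, the candidate part-of-speech tags, and their scores are all made up; the point is just the mechanic of scoring partial interpretations and keeping the best few at each step instead of committing to one interpretation immediately.

```python
# Each word comes with a few candidate tags and made-up probabilities.
CANDIDATES = [
    ("bank",     [("NOUN", 0.7), ("VERB", 0.3)]),
    ("deposits", [("VERB", 0.6), ("NOUN", 0.4)]),
    ("money",    [("NOUN", 0.9), ("VERB", 0.1)]),
]

def beam_search(candidates, beam_width=2):
    """Keep only the `beam_width` highest-scoring partial interpretations
    as we move through the sentence left to right."""
    beam = [([], 1.0)]                      # (tag sequence so far, probability)
    for word, options in candidates:
        extended = [
            (tags + [(word, tag)], score * p)
            for tags, score in beam
            for tag, p in options
        ]
        extended.sort(key=lambda h: h[1], reverse=True)
        beam = extended[:beam_width]        # discard everything but the k best
    return beam

for tags, score in beam_search(CANDIDATES):
    print(round(score, 3), tags)
```

Without the beam, the number of full interpretations multiplies with every word; with it, the parser carries only a handful of live hypotheses forward at any moment.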
As it parses, the system keeps multiple possible answers; it doesn't toss them aside immediately. It says: all right, right now I'm pretty sure answer A is correct, but I'm going to hold on to B and C just in case. Now, if one interpretation has a particularly low score and there are several other potential interpretations with higher scores, the system will discard the low scorer, with the assumption that it just can't be the right answer, that it just doesn't make sense.

On well-formed text, that is, not informal text but something that has been written in a very formal way, Parsey McParseface does a pretty good job. In fact, a really good job: it has an accuracy rating approaching the level of a human linguist who is trained in parsing sentences. Humans who have that kind of training average in the mid-to-high nineties, percentage-wise, so Parsey McParseface is right behind them. But the key phrase there is well-formed text. If you present Parsey McParseface with more loosey-goosey language, such as what you might find on your average Internet website (which I know was redundant), Parsey McParseface has a much more modest accuracy rating of around ninety percent. That's still impressive, but it's a significant drop.

Now, these sorts of tools have been used in various Google products for a while, not just Google Assistant, which is the one people tend to think about because it's the one we interact with when we are speaking to Google, but also in stuff like Gmail. If you've used Gmail and you've noticed that sometimes you get automated responses popping up that you can choose as an option, so instead of writing an email you just select "sounds good" or "I'll see you then" or whatever it may be, then you have seen this technology at work, or at least you've seen the product of its work.
Those automated responses are the result of a natural language understanding system that's parsing that email, identifying whatever the salient points are in the message, and then generating what are hopefully logical responses to it, so you can just choose one instead of taking the time to actually type something in.

One of the key elements in natural language understanding is creating machines that can communicate with us and explain how they arrived at a certain result. This falls under the concept of transparency, which is really important when we're talking about artificial intelligence. There's a real fear that AI and neural networks are careening toward a black box scenario, and a black box describes any system where the workings of the system are hidden from our view. We cannot see how something works, and so we can only make guesses as to what's going on. I know a lot of gearheads who are exasperated with the way vehicle manufacturers are building more of their cars, trucks, and other vehicles with systems that aren't easily accessible or modifiable. They consider those cars to be black boxes. It makes it much harder to work on a vehicle if you don't have the proprietary tools and knowledge that are specific to that system. Now take that concept and apply it to AI, and it gets pretty scary pretty fast, particularly since we're relying on AI to do some important stuff, like drive cars, make stock option deals, or help with healthcare issues.

And so one area of work focuses on giving machines the capability to explain themselves, not just to provide an answer, but to explain why they came up with that answer. So imagine a chess-playing computer. It's playing a game of chess and it makes a move. Then imagine being able to ask the computer, why did you make that move, and the computer could actually answer the question, explaining the logic behind the move it made. Now extend that concept to all sorts of different AI applications.
If an AI stock trader suddenly buys up a ton of stocks, you might want to know exactly what prompted that decision. Why did it make that purchase? And you can easily imagine situations in which you'd want to know why a machine behaved the way it did. Why did an autonomous car choose a particular route? Why did a healthcare program suggest a particular diagnosis? Without getting those answers, we're just putting our faith in machines blindly, and giving a computer the ability to generate meaningful and, equally important, relevant explanations would be extremely helpful.

So what are some of the uses of natural language processing technology? Well, one fairly simple application is in spelling and grammar checking software. If you've used a word processing program over the last few years, or the last couple of decades, chances are you're familiar with automatic, real-time spell check and grammar check features. This is possible because of the work that has been done in natural language processing. Spell check needs to take into consideration not only whether a word is spelled correctly, whether it matches a word in the computer's lexicon, but also whether it's the right word for that instance. In English, we have a lot of homonyms. Those are words that sound the same but have different meanings. Now, you can have homonyms that are spelled exactly the same way, and those really aren't a problem for a spell checker, because the reader can pick up on what meaning you intended through context. Though if you're using natural language processing to do a translation, then the NLP system needs to be able to determine which meaning the original author intended. In my earlier example about making a withdrawal at the bank, there's a homonym, you know, two versions of bank, and they mean two different things. I could also talk about bank in the sense of a verb, as in banking off of something, but you get the point.
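As a toy illustration of resolving that kind of ambiguity through context, here's a Python sketch that scores each sense of "bank" by counting overlaps between the sentence and a small hand-made list of clue words per sense. Real systems learn these associations from data rather than from hard-coded lists like these.

```python
# Hand-made clue words for each sense, purely for illustration.
SENSES = {
    "financial institution": {"withdrawal", "deposit", "money", "account", "loan"},
    "riverbank":             {"river", "water", "shore", "fishing", "mud"},
}

def disambiguate(sentence):
    """Pick the sense of 'bank' whose clue words overlap most with the sentence."""
    context = set(sentence.lower().replace(".", "").split())
    scores = {sense: len(context & clues) for sense, clues in SENSES.items()}
    return max(scores, key=scores.get), scores

print(disambiguate("I'm going to the bank to make a withdrawal."))
# ('financial institution', {'financial institution': 1, 'riverbank': 0})

print(disambiguate("We went fishing down by the river bank."))
# ('riverbank', {'financial institution': 0, 'riverbank': 2})
```

It's crude, but it captures the same intuition as the windowed skip-gram approach from earlier: the words around an ambiguous word are what tell you which meaning is in play.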
There are also homonyms that sound the same but are spelled differently, and they have different meanings as well. So, for example, the dreaded to, as in T-O, too, as in T-O-O, and two, as in T-W-O, combo. Those are three words with three different applications and three different spellings. A good spell check algorithm will be able to determine whether you've used the correct one in any instance. So if you type "that's two sweet" when you mean "that's too sweet," using the number two in word form, the spell checker will give you the old heads up and say: I think you meant T-O-O, not T-W-O. Fun fact: I typed that sentence into Google Docs and it said you're totes fine, bra. It didn't notice it at all.

Grammar checkers have to be able to analyze sentence structure and word choice and compare them against the grammar programmed into the system. This might also help determine whether the word you used was the correct one. So, for example, affect versus effect. Affect is a verb; you affect something. Effect is usually a noun; it's typically the result of some action. So I could affect a drum, which is a dumb thing to say, and the effect might be that the sound I played hurt your ears. Now, if you spell the word correctly and the spell checker is only comparing the words you type against a lexicon to see if there's a match, you might not get an indication that anything is wrong, because the computer system is saying, well, that word is spelled correctly. It doesn't realize it's the wrong word. But if it has a way of checking grammar, it can also make sure you're using the right word in the right context.

Search engines such as Google use natural language processing to determine what it is you're looking for, right? So when you type in a search and you hit the search button, you might get a little notification that says, maybe you meant this other thing, or maybe you need to search for this terminology.
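Those "did you mean?" suggestions rest on ideas like edit distance. Here's a bare-bones Python sketch that suggests the closest word from a tiny invented vocabulary; real search engines layer query logs, phonetics, and context on top of this basic mechanic.

```python
def edit_distance(a, b):
    """Count the single-character insertions, deletions, and substitutions
    needed to turn word `a` into word `b` (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

VOCAB = ["franklin", "benjamin", "kite", "lightning", "electricity"]

def did_you_mean(word):
    """Suggest the nearest vocabulary word if it's within two edits."""
    best = min(VOCAB, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= 2 else word

print(did_you_mean("frankling"))   # franklin
print(did_you_mean("lightening"))  # lightning
```

A misspelling is usually only an edit or two away from the word the searcher meant, which is why this simple measure catches so many typos.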
That kind of suggestion is a useful feature, since not everyone thinks of search the same way. I could tell a dozen people to go on Google and pull up information about Benjamin Franklin and the story about the kite, and those folks might go and perform their searches in twelve different ways. But the search engine's job is to return the best results based on the query, which means it needs to suss out what the searcher is actually looking for. So even if the twelve people all type twelve different ways of looking up this information about Benjamin Franklin and the kite story, it should respond with the most relevant results. Maybe people get slightly different search results based upon the query, but they should be more or less the same. And it can also look out for you: it can give you suggestions for search terms should you use an incorrect spelling, or approximate a spelling, or something like that.

One of the areas of opportunity for natural language processing applications in the near future is handling the massive amounts of information in big data applications. So, for example, a lawyer might want to search historical legal results using natural language to look for precedents that might help his or her case in the courtroom. A pharmaceuticals company might need to search information about clinical trials, doctors' notes, patient testimonials, and related information. And the amount of information represented by big data is truly astounding. It's enormous. It's way too much for any human to sort through. So developing a method for computers to parse a query and return relevant results is highly desirable. For a computer to "understand," in air quotes, that context and be able to give you results based upon your questions would be incredibly valuable for lots of different industries.

And we started off talking about machine translation at the early stages of natural language processing. That's still a big area of research. Now you can get real-time translation tools.
You can use devices to translate from one language to another in real settings, including written language like signs. You can just hold a camera up and get an English translation of a sign that's written in another language, and of course vice versa. That tends to be marketed as a tool for travelers, but it really shows the amazing progress we've made in natural language processing since the old days of the word-for-word machine translation models that were made back during the Cold War. Now, we've still got a long way to go with natural language processing. We've seen some incredible improvements over the past few years, but machines still don't actually understand what we're saying or what we're writing, not on a conscious level anyway. Instead, they are able to refer back to rules, either explicitly stated, as in the older NLP models, or arrived at through deep learning. Now I'm going to take a quick break, but when we come back, I'll talk a bit about the history of the voice assistants we all know and love. But first, here's another word from our sponsors.

All right. So now we understand a bit about the technologies that make voice assistants possible, specifically speech recognition and natural language processing. There's obviously a lot more to the system than that. The system still has to process our requests or commands and return a result using more traditional computational processes. So while the interpretation side leans on speech recognition and natural language processing, there's still a lot of regular computation work that has to happen for a personal assistant, digital assistant, voice assistant, whatever you want to call them, to be able to respond to you. So let's take a quick stroll through the history of the major voice assistants out there, and I'm going to cover these in the order they were introduced to the public, more or less. I'm only focusing on the really big ones here.
There are lots of small ones out there, but I'm looking at the ones everyone has heard about. So that means the first one we get to talk about is Apple's Siri. Apple unveiled Siri on October fourth, two thousand eleven. And to be fair, Siri existed before this; it was not an Apple creation. Siri was actually an app produced by an independent developer company called Siri Incorporated, but Apple gobbled up that company in two thousand ten and brought it in house. Apple had previously relied upon another speech recognition program called VoiceOver, which had been used in Mac products and in all iPhones since the iPhone 3GS. Siri would become available starting with the iPhone 4S.

In this announcement, Apple pointed out that earlier implementations of voice commands required users to learn the syntax of the system. You had to follow a very specific set of rules in order to get anything to work. So you would give a command defined by the system; for example, you might say "call Mom" or "play Once in a Lifetime." You had to take this very structured approach to whatever it was you wanted to do. But that requires the user to adhere to rules created by the architects of the system, right? So Siri was meant to be different. It was meant to understand what you wanted on your terms, not based off a strict set of rules. Apple said that Siri would be able to interpret what you meant and would return relevant information to you in response. In the unveiling, they said that Siri is, quote, "your intelligent assistant that helps you get things done just by asking," end quote. During that demonstration, they showed off how Siri could parse different phrases that had the same underlying meaning. The example they gave originally was "What's the weather today?", and then they asked that same question five or six different times.
Scott Forstall, a vice president over at Apple, showed off how you could get the same weather information by asking for it in these different ways. Then they showed off how Siri could interoperate with other apps, such as Apple's Maps feature, or, through a partnership they had with Yelp. Siri could take a request, parse it, interpret it, send the appropriate request to the appropriate destination, and then serve up the response. The destination could be a web search, or it could be an action within a compatible app. You get the idea. So that was Siri.

Next: on July ninth, two thousand twelve, Google released Android Jelly Bean, a.k.a. Android 4.1, and one of the features included in that operating system update, at least for certain hardware upon release, was an offshoot of Google Search called Google Now. This feature would serve up predictive cards containing information that the system had flagged as potentially useful to you based off your activity. So let's say you spend a lot of time searching for stuff like baseball scores. Google Now would start serving you up cards that would give you scores from previous games before you could even search for them. You would just look at Google Now, scroll through, and see what the latest results were. Then you could actually scroll through the different cards, all of which were slowly dialing you in as a person, which was kind of creepy. And it relied a lot on natural language processing and your activities. Now, Google Now was not a voice assistant. This was sort of a one-way relationship: Google was analyzing information based on your activity and then serving up information to you that might be useful. But over time the company would phase out Google Now, and it gradually evolved into Google Assistant. There was also Google Voice Search, which allows you to do things like search by voice, so that also became incorporated into this. Google Assistant is a lot like Siri.
It responds to voice, and it can respond to anaphora, meaning it can keep track of subject matter and respond to follow-up questions that don't contain an explicit reference to the subject. So, for example, you could ask Google Assistant, "What is the weather going to be like in Atlanta?" And then, after you get a response, you might say, "What about in Seattle?" Now, you have not explicitly said, "What is the weather in Seattle?" You just said, "What about in Seattle?" However, Google Assistant can infer that you are still talking about the weather, only now within the context of a different location. Google Assistant debuted in May two thousand sixteen, so in a way this particular entry in our timeline spans two other debuts, because you had Google Now on one side and then Google Assistant later, but I figured it was important to acknowledge how Google Assistant grew out of the older Google Now feature.

On April second, two thousand fourteen, Microsoft introduced its own voice assistant at the Build developer conference. Microsoft's entry is named Cortana, after the AI character from the Halo series of video games. Microsoft integrated Cortana to work with Windows 10, Xbox One, Windows Mobile, and a few other platforms as well, including apps that were meant for other operating systems like iOS and Android. Cortana's US voice is that of Jen Taylor; she actually is the voice actress who provided the voice for the character of Cortana in the Halo games, which is kind of fun. And like Siri, Cortana can interface with apps as well as perform web searches.

In November two thousand fourteen, Amazon got into the game with Alexa and the Amazon Echo. Through the Amazon Echo, Alexa can serve not just as a voice assistant that can retrieve information and play streaming media and that kind of thing, but also as an interface in home automation applications. And, to be fair, so can Google Assistant through devices like Google Home. So you can use Alexa to interface directly with systems in your home.
If they are compatible, that is. And, not surprisingly, Alexa can interface with Amazon's ordering system, allowing users to order products from Amazon directly by speaking to Alexa. No shock there. According to Amazon, the developers were inspired by the Star Trek series of shows, in which characters would speak out loud to computer systems to call for information or send commands to make various stuff happen. Amazon also released a developer kit to allow independent developers to create what are called Alexa skills. There's an old episode of TechStuff where I interviewed some folks from Amazon to talk about this process. But essentially, developers submit skills to Amazon, which can then publish those skills and allow anyone who has an Alexa-enabled device to activate and make use of them. Individuals can even build their own personalized skills using a tool called Blueprints, which Amazon introduced in April two thousand eighteen.

Now, there are other examples I could point to. There's Samsung's Bixby, which it introduced in March two thousand seventeen. There's SoundHound's virtual assistant, called Hound, which launched in March two thousand sixteen. But these are the ones I hear about most frequently, so they're the ones I wanted to cover. And they all work on a similar principle. The implementations are all particular to their specific brands, but they work under similar foundational principles of natural language processing, speech recognition, et cetera. It's all about converging technologies that took decades of hard work to make possible.

Now, I want to thank listener Nate, who was the one who set me on this trail by asking about speech recognition, natural language processing, and these voice assistants. It was really interesting to dive into, very, very cool, fascinating stuff. Thanks a lot, Nate. If any of you out there have suggestions for future episodes of TechStuff, maybe it's a technology or a company or a person in tech, or maybe there's someone I should interview or have on as a guest host, send me a message.
The email address is techstuff@howstuffworks.com, or drop me a line on Facebook or Twitter. The handle at both of those is TechStuffHSW. Don't forget to follow us on Instagram, and I'll talk to you again really soon.

For more on this and thousands of other topics, visit howstuffworks.com.