WEBVTT - More on NLP and where voice assistants come from

0:00:04.120 --> 0:00:07.160
<v Speaker 1>Get in touch with technology with TechStuff from How

0:00:07.200 --> 0:00:13.880
<v Speaker 1>Stuff Works dot com. Hey there, and welcome to TechStuff.

0:00:13.920 --> 0:00:17.520
<v Speaker 1>I'm your host, Jonathan Strickland. I'm an executive producer with

0:00:17.560 --> 0:00:21.319
<v Speaker 1>How Stuff Works and I love all things tech, and

0:00:21.440 --> 0:00:26.720
<v Speaker 1>this is the second episode about natural language processing, NLP,

0:00:27.040 --> 0:00:31.040
<v Speaker 1>also natural language understanding, NLU. The two are related.

0:00:31.800 --> 0:00:35.080
<v Speaker 1>That describes the technologies and processes we use to

0:00:35.120 --> 0:00:39.080
<v Speaker 1>give machines the ability to interpret and respond to language

0:00:39.120 --> 0:00:43.479
<v Speaker 1>the way we use it, so not just understanding our input,

0:00:43.520 --> 0:00:47.839
<v Speaker 1>but also generating output that still follows the rules of

0:00:47.960 --> 0:00:51.479
<v Speaker 1>various languages. So it's all about getting machines to conform

0:00:51.600 --> 0:00:54.520
<v Speaker 1>to us rather than the other way around. If you

0:00:54.640 --> 0:00:58.200
<v Speaker 1>have not listened to the episode immediately before this one,

0:00:58.720 --> 0:01:00.760
<v Speaker 1>you should do that, because I'm about to pick

0:01:00.800 --> 0:01:02.920
<v Speaker 1>up where I left off, which was just after

0:01:03.000 --> 0:01:07.280
<v Speaker 1>ARPA pulled the plug on its Speech Understanding Research project,

0:01:07.920 --> 0:01:10.920
<v Speaker 1>and the research under the ARPA project had shown

0:01:11.000 --> 0:01:15.280
<v Speaker 1>that NLP was an even more challenging problem than had

0:01:15.319 --> 0:01:20.360
<v Speaker 1>previously been anticipated. Even the simplest approaches were creating enormous

0:01:20.400 --> 0:01:22.959
<v Speaker 1>demands on both the work programmers had to do to

0:01:23.000 --> 0:01:26.160
<v Speaker 1>build a system out and the processing the system would

0:01:26.200 --> 0:01:29.640
<v Speaker 1>have to rely upon in order to interpret language. Work

0:01:29.680 --> 0:01:35.280
<v Speaker 1>in the late nineteen seventies ranged into psychology. NLP researchers

0:01:35.440 --> 0:01:37.720
<v Speaker 1>felt a system needed to be able to identify a

0:01:37.840 --> 0:01:42.400
<v Speaker 1>user's needs and goals in order to function properly. It had

0:01:42.440 --> 0:01:46.240
<v Speaker 1>to understand not just the surface level meaning of a phrase,

0:01:46.760 --> 0:01:50.920
<v Speaker 1>but the underlying meaning of linguistic expressions as well. Only

0:01:50.960 --> 0:01:53.880
<v Speaker 1>then could you have a computer system that could collaborate

0:01:53.920 --> 0:01:56.560
<v Speaker 1>with a human being in a seamless way. So, in

0:01:56.560 --> 0:01:59.640
<v Speaker 1>other words, what they're saying is that you could translate

0:01:59.680 --> 0:02:03.080
<v Speaker 1>stuff or interpret stuff word by word, but unless you

0:02:03.080 --> 0:02:05.800
<v Speaker 1>have an understanding of what the person is trying to

0:02:05.840 --> 0:02:09.799
<v Speaker 1>actually accomplish, chances are the results you're going to get

0:02:09.800 --> 0:02:12.160
<v Speaker 1>back are not going to be as relevant as they

0:02:12.160 --> 0:02:15.440
<v Speaker 1>could be. And so that was where the psychology was

0:02:15.480 --> 0:02:19.480
<v Speaker 1>starting to take form. By the early nineteen eighties, which

0:02:19.720 --> 0:02:22.640
<v Speaker 1>marks the third phase of NLP development, according to

0:02:22.680 --> 0:02:26.160
<v Speaker 1>the researcher Karen Spärck Jones, who I talked about in

0:02:26.160 --> 0:02:29.800
<v Speaker 1>the last episode, researchers were coming to terms with the

0:02:29.840 --> 0:02:34.000
<v Speaker 1>idea that a scalable NLP system that relied upon the

0:02:34.040 --> 0:02:38.160
<v Speaker 1>old methods of building lexicons and syntax rules just was

0:02:38.200 --> 0:02:41.040
<v Speaker 1>not practical. It required far too much work on the

0:02:41.080 --> 0:02:43.880
<v Speaker 1>front end when designing a system to make a general

0:02:43.919 --> 0:02:48.040
<v Speaker 1>purpose NLP application. The problem was just way too

0:02:48.040 --> 0:02:52.880
<v Speaker 1>big to take that approach. Even with relatively narrow implementations

0:02:52.919 --> 0:02:57.080
<v Speaker 1>like designing a system that would parse technical documents, you think,

0:02:57.360 --> 0:02:59.799
<v Speaker 1>all right, well, the language used in technical documents is

0:02:59.840 --> 0:03:02.799
<v Speaker 1>a subset of the language you would encounter in the

0:03:02.880 --> 0:03:07.640
<v Speaker 1>quote unquote real world. Even with those use cases, the

0:03:07.720 --> 0:03:10.600
<v Speaker 1>old methods were proving to require far too much investment

0:03:10.639 --> 0:03:14.799
<v Speaker 1>in time, money, and effort on the design front. Spärck

0:03:14.919 --> 0:03:18.919
<v Speaker 1>Jones identifies the key focus during this phase as being

0:03:19.000 --> 0:03:24.160
<v Speaker 1>on grammar and logic. During this phase, researchers developed several

0:03:24.240 --> 0:03:27.680
<v Speaker 1>different grammar types. Now, grammars are sets of rules for

0:03:27.720 --> 0:03:31.440
<v Speaker 1>analyzing and formalizing language. I would love to go into

0:03:31.480 --> 0:03:34.679
<v Speaker 1>more detail about the different grammars that were developed during

0:03:34.680 --> 0:03:39.120
<v Speaker 1>this phase or adopted for computational models, but honestly, it

0:03:39.160 --> 0:03:44.040
<v Speaker 1>gets really, really heavy, really quickly. It gets extremely technical,

0:03:44.280 --> 0:03:46.680
<v Speaker 1>though not on a technological side, but more on the

0:03:46.760 --> 0:03:50.320
<v Speaker 1>linguistic side. And suffice it to say that a lot

0:03:50.360 --> 0:03:53.080
<v Speaker 1>of research and debate centered around what is the best

0:03:53.120 --> 0:03:56.440
<v Speaker 1>way to arrive at the meaning of language? How do

0:03:56.520 --> 0:04:00.080
<v Speaker 1>we get to that? How can you ascertain what

0:04:00.200 --> 0:04:03.400
<v Speaker 1>is meant by what was spoken or what was written?

0:04:03.760 --> 0:04:07.240
<v Speaker 1>The grammars were meant to direct NLP models to analyze

0:04:07.320 --> 0:04:11.680
<v Speaker 1>language in different ways that were computationally viable and that

0:04:11.720 --> 0:04:15.320
<v Speaker 1>wouldn't require the laborious process of programming everything in a

0:04:15.360 --> 0:04:19.280
<v Speaker 1>word for word style. Another big area of focus at

0:04:19.279 --> 0:04:23.320
<v Speaker 1>this time was on generation, meaning creating models that would

0:04:23.320 --> 0:04:28.040
<v Speaker 1>allow machines to generate natural language responses to users, including

0:04:28.080 --> 0:04:32.240
<v Speaker 1>responses that were extended, long examples of discourse, not just

0:04:32.920 --> 0:04:36.760
<v Speaker 1>a quick message. While machines wouldn't be able to think,

0:04:37.480 --> 0:04:39.880
<v Speaker 1>they would be able to put together a more sophisticated

0:04:39.960 --> 0:04:43.320
<v Speaker 1>response than chatbots like ELIZA that I mentioned in the

0:04:43.400 --> 0:04:46.800
<v Speaker 1>last episode could manage. So the idea being, how can

0:04:46.839 --> 0:04:51.120
<v Speaker 1>we make a machine that can communicate results to a

0:04:51.160 --> 0:04:54.880
<v Speaker 1>person in a way that just makes sense. It's almost

0:04:54.880 --> 0:04:57.440
<v Speaker 1>as if a normal human being is chatting with you.

0:04:58.200 --> 0:05:01.360
<v Speaker 1>But as we understand it, it's very difficult to do

0:05:01.440 --> 0:05:04.960
<v Speaker 1>this on an extended basis. You can do it for

0:05:05.360 --> 0:05:09.280
<v Speaker 1>responses to individual queries, but when you start trying to

0:05:09.320 --> 0:05:12.680
<v Speaker 1>create something that can carry on an actual conversation, that's

0:05:12.680 --> 0:05:16.120
<v Speaker 1>where things start to break down. In the nineties, work

0:05:16.200 --> 0:05:20.600
<v Speaker 1>in NLP focused on representing words as mathematical vectors.

0:05:21.279 --> 0:05:25.480
<v Speaker 1>Many words are related to one another, so for example,

0:05:25.720 --> 0:05:29.719
<v Speaker 1>hotel and motel are related. They don't mean exactly the

0:05:29.760 --> 0:05:33.640
<v Speaker 1>same thing, but they mean very similar things. Then you

0:05:33.720 --> 0:05:37.080
<v Speaker 1>have a term like bed and breakfast. A bed and

0:05:37.080 --> 0:05:40.120
<v Speaker 1>breakfast is similar again to a hotel or a motel.

0:05:40.200 --> 0:05:43.200
<v Speaker 1>It's a different thing, but it's related. So these words

0:05:43.240 --> 0:05:46.640
<v Speaker 1>have similarities. They also have differences between them, but they're

0:05:46.680 --> 0:05:49.560
<v Speaker 1>all more similar to each other than if I used

0:05:49.560 --> 0:05:52.520
<v Speaker 1>a different word like hospital. A bed and breakfast is

0:05:52.600 --> 0:05:54.880
<v Speaker 1>more like a hotel or a motel than it is

0:05:54.920 --> 0:05:57.880
<v Speaker 1>a hospital. So in other words, we can group words

0:05:57.920 --> 0:06:02.880
<v Speaker 1>together into vector spaces and calculate the quote unquote distances

0:06:02.920 --> 0:06:07.240
<v Speaker 1>between vectors, and that determines degrees of similarity, and this

0:06:07.320 --> 0:06:11.560
<v Speaker 1>is very helpful for both translation and natural language processing.
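
NOTE
Editor's sketch of the vector idea described above, assuming toy three-dimensional
vectors whose numbers are invented purely for illustration. Real systems learn
vectors with hundreds of dimensions from text. The names vectors, cosine_similarity,
and most_similar are hypothetical helpers, not part of any library.
```python
import math
# Toy 3-dimensional word vectors; the numbers are invented for illustration only.
vectors = {
    "hotel": [0.9, 0.8, 0.1],
    "motel": [0.85, 0.75, 0.15],
    "bed_and_breakfast": [0.8, 0.6, 0.3],
    "hospital": [0.2, 0.3, 0.9],
}
def cosine_similarity(a, b):
    """Return the cosine of the angle between a and b (closer to 1.0 means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
def most_similar(word):
    """Rank every other word by similarity to `word`, most similar first."""
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]), reverse=True)
```
With these made-up numbers, most_similar("bed_and_breakfast") ranks "hospital" last,
which mirrors the hotel, motel, and hospital comparison in the episode.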

0:06:12.040 --> 0:06:15.360
<v Speaker 1>There are ways to do this that even take context

0:06:15.440 --> 0:06:18.799
<v Speaker 1>into account. And this relates back to what was being

0:06:19.760 --> 0:06:26.240
<v Speaker 1>suggested by Warren Weaver when I talked about that memorandum.

0:06:26.279 --> 0:06:28.960
<v Speaker 1>There's a model called skip-gram, which is essentially what

0:06:29.040 --> 0:06:33.200
<v Speaker 1>he was talking about. This model takes a window of

0:06:33.240 --> 0:06:36.800
<v Speaker 1>words surrounding each word in a sentence to determine context,

0:06:36.920 --> 0:06:38.800
<v Speaker 1>so it's not looking at it just from a word

0:06:38.960 --> 0:06:42.440
<v Speaker 1>to word basis. Let's say that I write a phrase and

0:06:42.440 --> 0:06:46.520
<v Speaker 1>it says, I'm going to the bank to make a withdrawal. Now,

0:06:46.560 --> 0:06:48.560
<v Speaker 1>the word bank can actually refer to a couple of

0:06:48.560 --> 0:06:52.240
<v Speaker 1>different things, right? It could be a financial institution, which

0:06:52.279 --> 0:06:55.000
<v Speaker 1>is obviously what I do mean when I say that sentence.

0:06:55.320 --> 0:06:58.440
<v Speaker 1>But it could also mean the area right next to

0:06:58.480 --> 0:07:01.279
<v Speaker 1>a river, right, the bank of a river. The

0:07:01.320 --> 0:07:04.520
<v Speaker 1>skip-gram model would take each word in that sentence and

0:07:04.560 --> 0:07:07.440
<v Speaker 1>then pair it with a few other words that are close

0:07:07.520 --> 0:07:10.880
<v Speaker 1>by to determine the meaning of the phrase. So it's

0:07:10.880 --> 0:07:13.160
<v Speaker 1>looking at I'm going to the bank to make a

0:07:13.200 --> 0:07:17.800
<v Speaker 1>withdrawal. For bank, it might say to-bank, the-bank,

0:07:18.000 --> 0:07:22.640
<v Speaker 1>to-bank, make-bank, a-bank, withdrawal-bank. By looking

0:07:22.680 --> 0:07:26.440
<v Speaker 1>at these pairings, the system can figure out from context

0:07:26.880 --> 0:07:30.240
<v Speaker 1>that the bank I'm talking about is probably a financial institution.

0:07:30.520 --> 0:07:33.239
<v Speaker 1>I'm probably not making a withdrawal from a river bank.
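
NOTE
Editor's sketch of the windowed pairing described above. It only generates the
word pairs; a real skip-gram model would go on to train word vectors from pairs
like these. The function name skipgram_pairs and the window size of 4 are
assumptions chosen for illustration.
```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs
sentence = "i'm going to the bank to make a withdrawal".split()
# Collect the context words paired with "bank" when the window reaches 4 words each way.
bank_context = [ctx for center, ctx in skipgram_pairs(sentence, window=4) if center == "bank"]
```
Here bank_context includes "withdrawal", the kind of neighboring word that points
toward the financial sense of "bank" rather than the river sense.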

0:07:33.960 --> 0:07:38.600
<v Speaker 1>So it's a way of machine systems figuring out the

0:07:38.640 --> 0:07:42.120
<v Speaker 1>meaning of a phrase through contextual cues by using this

0:07:42.160 --> 0:07:45.800
<v Speaker 1>windowed approach. And again, Warren Weaver had proposed such

0:07:45.800 --> 0:07:48.800
<v Speaker 1>a thing. The vector approach would become more important as

0:07:48.800 --> 0:07:53.240
<v Speaker 1>computer scientists made advances in neural networks. That approach also

0:07:53.360 --> 0:07:56.920
<v Speaker 1>made machine translation much more effective because it no longer

0:07:57.000 --> 0:08:00.560
<v Speaker 1>looked for word for word matches, but rather matched meaning

0:08:00.880 --> 0:08:05.880
<v Speaker 1>based on vectors and probabilities. That's really important because once

0:08:05.920 --> 0:08:08.640
<v Speaker 1>you determine the meaning of a phrase in one language,

0:08:09.040 --> 0:08:13.320
<v Speaker 1>then you can look for a phrase in another language

0:08:13.360 --> 0:08:18.720
<v Speaker 1>that most closely resembles the meaning of the original.

0:08:18.760 --> 0:08:22.679
<v Speaker 1>This is the art of translation. A real translator, someone

0:08:22.680 --> 0:08:26.040
<v Speaker 1>who's translating from one language to another, is probably not

0:08:26.120 --> 0:08:28.880
<v Speaker 1>doing so word for word. Rather, they're doing meaning for

0:08:29.080 --> 0:08:33.120
<v Speaker 1>meaning to make certain that the intent of what is

0:08:33.160 --> 0:08:38.480
<v Speaker 1>being communicated gets through, not just the vocabulary. The nineteen nineties,

0:08:38.520 --> 0:08:42.480
<v Speaker 1>which Spärck Jones identifies as the fourth phase of NLP

0:08:42.600 --> 0:08:45.400
<v Speaker 1>development that would be the final phase in her report,

0:08:46.040 --> 0:08:50.960
<v Speaker 1>saw a more concentrated focus on lexicons over syntax, and

0:08:50.960 --> 0:08:55.000
<v Speaker 1>it also saw more practical applications of natural language processing,

0:08:55.320 --> 0:08:57.880
<v Speaker 1>as well as leveraging the World Wide Web to help train

0:08:58.000 --> 0:09:01.840
<v Speaker 1>natural language processing models. There was a rich source

0:09:02.440 --> 0:09:06.120
<v Speaker 1>of natural language on the World Wide Web. Pretty much every

0:09:06.120 --> 0:09:09.800
<v Speaker 1>permutation you could imagine from people who are very careful

0:09:10.160 --> 0:09:13.560
<v Speaker 1>in the way they construct sentences and paragraphs to people

0:09:13.559 --> 0:09:17.040
<v Speaker 1>who are much more cavalier in the way they use language,

0:09:17.040 --> 0:09:21.680
<v Speaker 1>whether purposefully or otherwise. And also that report from Spärck

0:09:21.800 --> 0:09:25.480
<v Speaker 1>Jones again is dated October two thousand one, so that's

0:09:25.520 --> 0:09:30.160
<v Speaker 1>where her work stops for that particular report. But nearly

0:09:30.160 --> 0:09:34.240
<v Speaker 1>two decades have passed since that time, So in that time,

0:09:34.280 --> 0:09:36.719
<v Speaker 1>what has changed? Well, I would argue we are now

0:09:36.720 --> 0:09:40.520
<v Speaker 1>in a new phase of NLP development, one marked largely

0:09:40.600 --> 0:09:43.680
<v Speaker 1>by the rise of a few key technologies. One of

0:09:43.679 --> 0:09:47.640
<v Speaker 1>those is cloud computing. Cloud computing has removed the necessity

0:09:47.760 --> 0:09:51.840
<v Speaker 1>to build in complex capabilities in end machines like a

0:09:51.880 --> 0:09:55.640
<v Speaker 1>smartphone or a computer terminal, So an organization can create

0:09:55.679 --> 0:10:00.480
<v Speaker 1>a cloud infrastructure which consists of powerful machines and databases.

0:10:00.679 --> 0:10:03.680
<v Speaker 1>Those machines could be real, they could be virtual. Virtual

0:10:03.760 --> 0:10:07.040
<v Speaker 1>machines are hosted on real hardware, but they're running virtual

0:10:07.200 --> 0:10:11.560
<v Speaker 1>implementations of various operating systems. So these machines provide the

0:10:11.600 --> 0:10:14.760
<v Speaker 1>processing power and they house the systems that are necessary

0:10:14.800 --> 0:10:17.959
<v Speaker 1>to parse language and respond appropriately. So you can think

0:10:17.960 --> 0:10:21.320
<v Speaker 1>of it as the brains of natural language processing. They

0:10:21.320 --> 0:10:24.439
<v Speaker 1>all exist on these very powerful computers that are in

0:10:24.559 --> 0:10:28.840
<v Speaker 1>data centers. The widespread availability of the Internet and the

0:10:28.880 --> 0:10:31.679
<v Speaker 1>fact that it's pretty easy to stay connected in many

0:10:31.720 --> 0:10:35.360
<v Speaker 1>parts of the world make this possible. So the end

0:10:35.480 --> 0:10:39.640
<v Speaker 1>user feels like the capabilities are actually housed on whatever

0:10:39.679 --> 0:10:41.719
<v Speaker 1>device he or she is using, like if it's a

0:10:41.760 --> 0:10:44.480
<v Speaker 1>smartphone or a computer, But in reality, all the work

0:10:44.559 --> 0:10:48.400
<v Speaker 1>is actually taking place potentially thousands of miles away in

0:10:48.440 --> 0:10:51.160
<v Speaker 1>a data center, and it's just being sent to you.

0:10:51.360 --> 0:10:54.520
<v Speaker 1>The queries are being sent to the center and

0:10:54.559 --> 0:10:58.240
<v Speaker 1>the responses are being sent back to your device. Another

0:10:58.280 --> 0:11:00.880
<v Speaker 1>big development that has helped significantly is the

0:11:00.920 --> 0:11:04.199
<v Speaker 1>pairing of artificial neural networks as well as

0:11:04.480 --> 0:11:07.679
<v Speaker 1>deep learning, the process of deep learning. So a neural

0:11:07.720 --> 0:11:10.920
<v Speaker 1>network processes information in a way similar to how our

0:11:10.960 --> 0:11:13.960
<v Speaker 1>brains do it. Every node in a neural network represents

0:11:13.960 --> 0:11:18.360
<v Speaker 1>a neuron and it executes an operation upon data

0:11:18.559 --> 0:11:21.920
<v Speaker 1>and then hands off this data, which has now been

0:11:21.960 --> 0:11:25.960
<v Speaker 1>altered, it's been transformed by this operation, to another layer

0:11:26.080 --> 0:11:29.560
<v Speaker 1>of neurons within the network, which do further processing, and

0:11:29.600 --> 0:11:31.920
<v Speaker 1>so on and so forth. The system as a whole

0:11:32.040 --> 0:11:36.520
<v Speaker 1>can evaluate calculations and assign confidence levels to them. Deep

0:11:36.600 --> 0:11:40.600
<v Speaker 1>learning passes information through numerous layers to transform data and,

0:11:40.679 --> 0:11:44.920
<v Speaker 1>in the context of natural language processing, extract meaning from

0:11:44.960 --> 0:11:47.720
<v Speaker 1>that information. Now I've got a bit more to say

0:11:47.720 --> 0:11:50.560
<v Speaker 1>about natural language processing in general, and then after that

0:11:50.640 --> 0:11:55.920
<v Speaker 1>I'm going to transition to talk about recent implementations like Siri, Alexa,

0:11:55.960 --> 0:11:59.800
<v Speaker 1>Google Assistant, and Cortana. But first let's take a quick

0:12:00.080 --> 0:12:10.000
<v Speaker 1>break and thank our sponsor. In two thousand and sixteen,

0:12:10.040 --> 0:12:14.280
<v Speaker 1>Google announced a system that could analyze syntax and recognize

0:12:14.320 --> 0:12:19.160
<v Speaker 1>the various elements of a sentence, including verbs, nouns, adjectives,

0:12:19.160 --> 0:12:22.800
<v Speaker 1>and other components. The system's name is sort of a

0:12:22.840 --> 0:12:27.760
<v Speaker 1>snapshot of the zeitgeist of the time. It was called, and I'm

0:12:27.840 --> 0:12:32.720
<v Speaker 1>not making this up, Parsey McParseface. It really was.

0:12:33.360 --> 0:12:37.760
<v Speaker 1>This is a parser, a piece of software that is meant

0:12:37.800 --> 0:12:42.880
<v Speaker 1>to analyze inputs and determine what the relationships are between

0:12:43.000 --> 0:12:46.840
<v Speaker 1>various components within the input. So it's parsing out the

0:12:46.960 --> 0:12:50.120
<v Speaker 1>meaning of a phrase by looking at the relationship between

0:12:50.160 --> 0:12:53.760
<v Speaker 1>all the different components. It was designed specifically for English

0:12:53.920 --> 0:12:57.920
<v Speaker 1>language inputs. In that same announcement, Google unveiled an open

0:12:57.960 --> 0:13:03.280
<v Speaker 1>source neural network framework called SyntaxNet. SyntaxNet tags

0:13:03.360 --> 0:13:07.319
<v Speaker 1>every word in an input with a part of speech tag,

0:13:07.679 --> 0:13:10.800
<v Speaker 1>and the tag describes the purpose of that word, what

0:13:10.880 --> 0:13:15.520
<v Speaker 1>purpose does it serve within the sentence, within the context

0:13:15.640 --> 0:13:18.600
<v Speaker 1>of that input. So, for example, it might be the

0:13:18.679 --> 0:13:21.920
<v Speaker 1>subject of the sentence, or it could be an object

0:13:22.200 --> 0:13:25.040
<v Speaker 1>of the sentence, or it might be the action, the

0:13:25.200 --> 0:13:28.720
<v Speaker 1>root, the user wishes to perform upon the object. So

0:13:29.520 --> 0:13:31.720
<v Speaker 1>if it identifies a verb, that tends to be the

0:13:31.840 --> 0:13:36.960
<v Speaker 1>root of the command. The system also determines the syntactic

0:13:37.040 --> 0:13:40.320
<v Speaker 1>relationship between all the words, so not just what each

0:13:40.360 --> 0:13:43.560
<v Speaker 1>word's purpose is, but how that word relates to all

0:13:43.679 --> 0:13:46.960
<v Speaker 1>the other words within the input, and then it creates

0:13:47.000 --> 0:13:52.080
<v Speaker 1>a dependency tree which illustrates which words depend upon others.
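
NOTE
Editor's sketch of what a part-of-speech-tagged dependency tree can look like as
data. The parse below is written by hand for the invented sentence "Alice saw Bob".
It is not SyntaxNet output and does not use SyntaxNet's API; the Token fields and
the root and dependents helpers are hypothetical names for illustration.
```python
from collections import namedtuple
# text: the word, pos: part-of-speech tag, head: index of the word it depends on
# (-1 marks the root), label: the name of the dependency relation.
Token = namedtuple("Token", "text pos head label")
# Hand-written parse of "Alice saw Bob"; a real parser would predict these fields.
parse = [
    Token("Alice", "NOUN", 1, "nsubj"),
    Token("saw", "VERB", -1, "ROOT"),
    Token("Bob", "NOUN", 1, "dobj"),
]
def root(tokens):
    """Return the token that has no head, i.e. the root of the dependency tree."""
    return next(t for t in tokens if t.head == -1)
def dependents(tokens, head_index):
    """Return the words that depend directly on the token at head_index."""
    return [t.text for t in tokens if t.head == head_index]
```
In this toy parse the verb "saw" is the root, and both "Alice" (the subject) and
"Bob" (the object) depend on it.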

0:13:52.640 --> 0:13:56.080
<v Speaker 1>SyntaxNet also makes use of beam search. That's the

0:13:56.120 --> 0:13:58.959
<v Speaker 1>strategy I talked about in the Speech Recognition podcast a

0:13:59.040 --> 0:14:05.200
<v Speaker 1>couple of podcasts ago, and that is to help eliminate ambiguity.

0:14:05.320 --> 0:14:10.320
<v Speaker 1>As sentence length increases, the number of possible interpretations of

0:14:10.360 --> 0:14:14.839
<v Speaker 1>that sentence also increases dramatically, right? The more complicated a

0:14:14.920 --> 0:14:18.840
<v Speaker 1>sentence is, the easier it is to misinterpret what that

0:14:18.960 --> 0:14:21.760
<v Speaker 1>sentence means, especially if you're looking at it from the

0:14:21.760 --> 0:14:24.480
<v Speaker 1>perspective of a machine. So how does the computer know

0:14:25.000 --> 0:14:29.320
<v Speaker 1>which interpretation is the right one? SyntaxNet takes a

0:14:29.480 --> 0:14:33.040
<v Speaker 1>sentence and starts to parse it, beginning with a left

0:14:33.040 --> 0:14:35.520
<v Speaker 1>to right approach for English, so it starts at the

0:14:35.560 --> 0:14:38.880
<v Speaker 1>beginning of the sentence and works its way through. Essentially,

0:14:38.920 --> 0:14:42.360
<v Speaker 1>it creates a hypothesis as to how the words relate

0:14:42.400 --> 0:14:45.080
<v Speaker 1>to each other. But as it goes along, it detects

0:14:45.120 --> 0:14:49.800
<v Speaker 1>possible alternate interpretations, so it starts to assign a probability

0:14:49.840 --> 0:14:54.040
<v Speaker 1>score to each interpretation, essentially how sure it is that

0:14:54.200 --> 0:14:56.800
<v Speaker 1>this is on the right track. And it will keep

0:14:56.920 --> 0:15:00.680
<v Speaker 1>multiple possible answers as it parses, so it doesn't toss

0:15:00.760 --> 0:15:04.120
<v Speaker 1>them aside immediately. It says, all right, right now,

0:15:04.440 --> 0:15:07.280
<v Speaker 1>I'm pretty sure answer A is correct, but I'm going

0:15:07.320 --> 0:15:10.320
<v Speaker 1>to hold on to B and C just in case. Now,

0:15:10.360 --> 0:15:13.920
<v Speaker 1>if one interpretation has a particularly low score and there

0:15:13.960 --> 0:15:17.720
<v Speaker 1>are several other potential interpretations that have higher scores, the

0:15:17.760 --> 0:15:20.760
<v Speaker 1>system will discard the low score with the assumption that

0:15:20.840 --> 0:15:22.960
<v Speaker 1>it just can't be the right answer; it just doesn't make

0:15:23.000 --> 0:15:27.720
<v Speaker 1>sense. In well-formed text, that is, formal text, something

0:15:27.760 --> 0:15:30.840
<v Speaker 1>that has been written in a very formal approach, Parsey

0:15:31.000 --> 0:15:33.680
<v Speaker 1>McParseface does a pretty good job. In fact, a

0:15:33.760 --> 0:15:38.480
<v Speaker 1>really good job. It has an accuracy rating that's approaching the

0:15:38.560 --> 0:15:43.040
<v Speaker 1>level of a human linguist that is trained in parsing sentences.
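
NOTE
Editor's sketch of the beam search strategy described a few cues earlier: keep
several scored interpretations alive and discard the lowest-scoring ones at each
word. The probabilities are invented, and each word is scored independently here
for brevity, whereas a real parser conditions each score on the choices made so
far. The function name beam_search is a hypothetical helper, not SyntaxNet code.
```python
import math
def beam_search(step_options, beam_width=2):
    """Keep only the `beam_width` most probable partial interpretations after each word."""
    beams = [([], 0.0)]  # each beam is (choices so far, log-probability)
    for options in step_options:
        candidates = []
        for choices, score in beams:
            for choice, prob in options.items():
                candidates.append((choices + [choice], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # discard the low-scoring interpretations
    return beams
# Invented per-word probabilities for the possible readings of "time flies fast".
steps = [
    {"NOUN": 0.7, "VERB": 0.3},
    {"VERB": 0.6, "NOUN": 0.4},
    {"ADV": 0.9, "ADJ": 0.1},
]
best_path, best_score = beam_search(steps, beam_width=2)[0]
```
A wider beam keeps more alternatives in play at the cost of more computation; a
beam width of 1 would reduce this to always taking the single best guess.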

0:15:43.920 --> 0:15:46.360
<v Speaker 1>Humans who have that kind of training average at around

0:15:46.880 --> 0:15:52.360
<v Speaker 1>[inaudible] percent accuracy, so Parsey McParseface is right behind them.

0:15:52.440 --> 0:15:57.120
<v Speaker 1>But the key phrase there is well formed text. If

0:15:57.160 --> 0:16:00.920
<v Speaker 1>you present Parsey McParseface with more loosey-goosey language,

0:16:01.200 --> 0:16:04.560
<v Speaker 1>such as what you might find on your average Internet website,

0:16:05.560 --> 0:16:08.360
<v Speaker 1>which I know is redundant, Parsey McParseface has a

0:16:08.400 --> 0:16:12.520
<v Speaker 1>much more modest ninety percent success rating. It's still impressive, but

0:16:12.560 --> 0:16:15.920
<v Speaker 1>it's a significant drop in accuracy. Now, these sort of

0:16:15.920 --> 0:16:18.960
<v Speaker 1>tools have been used in various Google products for a while,

0:16:19.160 --> 0:16:22.600
<v Speaker 1>not just Google Assistant, which is the one that people

0:16:22.640 --> 0:16:24.520
<v Speaker 1>tend to think about because it's the one we interact

0:16:24.560 --> 0:16:28.040
<v Speaker 1>with when we are speaking to Google, but also in

0:16:28.120 --> 0:16:30.920
<v Speaker 1>stuff like Gmail. If you've used Gmail and you've noticed

0:16:30.960 --> 0:16:34.320
<v Speaker 1>that sometimes you get automated responses popping up that you

0:16:34.400 --> 0:16:37.280
<v Speaker 1>can choose as an option, So instead of writing an email,

0:16:37.280 --> 0:16:40.360
<v Speaker 1>you just select sounds good or I'll see you then,

0:16:40.520 --> 0:16:43.120
<v Speaker 1>or whatever it may be. Then you have seen this

0:16:43.160 --> 0:16:46.000
<v Speaker 1>technology at work, or at least you've seen the product

0:16:46.080 --> 0:16:49.280
<v Speaker 1>of its work. Those automated responses are the result of

0:16:49.320 --> 0:16:54.080
<v Speaker 1>a natural language understanding system that's parsing that email, identifying

0:16:54.120 --> 0:16:57.200
<v Speaker 1>whatever the salient points are in the message, and then

0:16:57.240 --> 0:17:00.520
<v Speaker 1>generating what are hopefully logical responses to it, so you

0:17:00.560 --> 0:17:02.520
<v Speaker 1>can just choose that instead of taking the time to

0:17:02.520 --> 0:17:05.679
<v Speaker 1>actually type something in. One of the key elements in

0:17:05.800 --> 0:17:09.520
<v Speaker 1>natural language understanding is creating machines that can communicate with

0:17:09.640 --> 0:17:13.600
<v Speaker 1>us and explain how they arrived at a certain result. Now,

0:17:13.640 --> 0:17:16.880
<v Speaker 1>this falls into the concept of transparency, which is really

0:17:16.960 --> 0:17:19.919
<v Speaker 1>important when we're talking about artificial intelligence. There's a

0:17:20.000 --> 0:17:24.119
<v Speaker 1>real fear that AI and neural networks are careening toward

0:17:24.240 --> 0:17:28.320
<v Speaker 1>a black box scenario, and a black box describes any

0:17:28.400 --> 0:17:31.240
<v Speaker 1>system where the workings of the system are hidden from

0:17:31.240 --> 0:17:35.719
<v Speaker 1>our view. We cannot see how something works, and so

0:17:35.760 --> 0:17:38.159
<v Speaker 1>we can only make guesses as to what's going on.

0:17:38.760 --> 0:17:40.760
<v Speaker 1>I know a lot of gearheads who are exasperated

0:17:40.760 --> 0:17:44.719
<v Speaker 1>with the way vehicle manufacturers are creating more of their cars, trucks,

0:17:44.760 --> 0:17:48.879
<v Speaker 1>and other vehicles with systems that aren't easily accessible or modifiable.

0:17:49.320 --> 0:17:53.160
<v Speaker 1>They consider those cars to be black boxes. It makes

0:17:53.160 --> 0:17:55.480
<v Speaker 1>it much harder to work on a vehicle if you

0:17:55.520 --> 0:17:59.720
<v Speaker 1>don't have the proprietary tools and knowledge that are specifically

0:17:59.760 --> 0:18:02.840
<v Speaker 1>for that system. Now take that concept and apply it

0:18:02.840 --> 0:18:06.000
<v Speaker 1>to AI, and it gets pretty scary pretty fast, particularly

0:18:06.280 --> 0:18:09.000
<v Speaker 1>since we're relying on AI to do some important stuff

0:18:09.040 --> 0:18:13.239
<v Speaker 1>like drive cars, make stock option deals, or help with

0:18:13.320 --> 0:18:17.399
<v Speaker 1>healthcare issues, and so one area of work focuses on

0:18:17.440 --> 0:18:21.159
<v Speaker 1>giving machines the capability to explain themselves, not just to

0:18:21.200 --> 0:18:24.440
<v Speaker 1>provide an answer, but explain why they came up with

0:18:24.480 --> 0:18:28.120
<v Speaker 1>that answer. So imagine a chess playing computer. It's playing

0:18:28.119 --> 0:18:30.200
<v Speaker 1>a game of chess and it makes a move. Then

0:18:30.240 --> 0:18:33.040
<v Speaker 1>imagine being able to ask the computer, why did you

0:18:33.119 --> 0:18:36.200
<v Speaker 1>make that move, and then the computer could actually answer

0:18:36.280 --> 0:18:39.680
<v Speaker 1>the question, explaining the logic behind the move it made.

0:18:40.119 --> 0:18:43.920
<v Speaker 1>Now extend that concept to all sorts of different AI applications.

0:18:44.240 --> 0:18:46.880
<v Speaker 1>If an AI stock trader suddenly buys up a ton

0:18:46.880 --> 0:18:50.080
<v Speaker 1>of stocks, you might want to know exactly what prompted

0:18:50.160 --> 0:18:53.840
<v Speaker 1>that decision, why did it make that purchase? And you

0:18:53.880 --> 0:18:56.479
<v Speaker 1>can easily imagine situations in which you'd want to know

0:18:56.560 --> 0:18:59.480
<v Speaker 1>why a machine behaved the way it did. Why did

0:18:59.720 --> 0:19:03.399
<v Speaker 1>an autonomous car choose a particular route? Why did a

0:19:03.440 --> 0:19:07.920
<v Speaker 1>healthcare program suggest a particular diagnosis? Without getting those answers,

0:19:07.920 --> 0:19:11.040
<v Speaker 1>we're just putting our faith into machines blindly, and giving

0:19:11.040 --> 0:19:15.120
<v Speaker 1>a computer the ability to generate meaningful and, equally important,

0:19:15.240 --> 0:19:20.080
<v Speaker 1>relevant explanations would be extremely helpful. So what are some

0:19:20.119 --> 0:19:24.440
<v Speaker 1>of the uses of natural language processing technology? Well, one

0:19:24.520 --> 0:19:28.160
<v Speaker 1>fairly simple application is in spelling and grammar checking software.

0:19:28.200 --> 0:19:30.520
<v Speaker 1>If you've used a word processing program over the last

0:19:30.560 --> 0:19:33.480
<v Speaker 1>few years, or the last couple of decades, chances are you're

0:19:33.480 --> 0:19:37.960
<v Speaker 1>familiar with automatic real time spell check and grammar check features.

0:19:38.680 --> 0:19:40.760
<v Speaker 1>This is possible because of the work that has been

0:19:40.800 --> 0:19:44.120
<v Speaker 1>done in natural language processing. Spell check needs to take

0:19:44.160 --> 0:19:47.560
<v Speaker 1>into consideration not only if a word is spelled correctly,

0:19:47.600 --> 0:19:51.760
<v Speaker 1>if a word matches a word that's in the computer's lexicon,

0:19:52.320 --> 0:19:55.639
<v Speaker 1>but also if it's the right word for that instance.

0:19:56.000 --> 0:19:58.320
<v Speaker 1>In English, we have a lot of homonyms. Those are

0:19:58.320 --> 0:20:01.760
<v Speaker 1>words that sound the same but have different meanings.

0:20:02.080 --> 0:20:05.040
<v Speaker 1>Now you can have homonyms that are spelled exactly the

0:20:05.080 --> 0:20:07.960
<v Speaker 1>same way, and those really aren't a problem because the

0:20:07.960 --> 0:20:12.480
<v Speaker 1>reader can pick up on what meaning you intended through context. Though,

0:20:12.520 --> 0:20:15.960
<v Speaker 1>if you're using natural language processing to do a translation,

0:20:16.400 --> 0:20:18.879
<v Speaker 1>then the NLP system needs to be able to determine

0:20:18.960 --> 0:20:22.480
<v Speaker 1>which meaning the original author intended. In my earlier example

0:20:22.520 --> 0:20:26.640
<v Speaker 1>about making a withdrawal at the bank, there's a homonym,

0:20:26.680 --> 0:20:29.160
<v Speaker 1>you know, two versions of bank, but they mean

0:20:29.200 --> 0:20:32.040
<v Speaker 1>two different things. I could also talk about bank as

0:20:32.080 --> 0:20:34.400
<v Speaker 1>in the sense of a verb, as in banking off

0:20:34.520 --> 0:20:39.040
<v Speaker 1>of something, but you get the point. There are also

0:20:39.119 --> 0:20:42.760
<v Speaker 1>homonyms that sound the same but are spelled differently, and

0:20:42.800 --> 0:20:45.560
<v Speaker 1>they have different meanings as well. So for example, the

0:20:45.680 --> 0:20:49.480
<v Speaker 1>dreaded to as in T O, too as in T

0:20:49.760 --> 0:20:54.400
<v Speaker 1>O O, and two as in T W O combo. Those are

0:20:54.440 --> 0:20:58.159
<v Speaker 1>three words with three different applications, three different spellings. A

0:20:58.320 --> 0:21:01.359
<v Speaker 1>good spell check algorithm will be able to determine if

0:21:01.359 --> 0:21:04.960
<v Speaker 1>you've used the correct one in any instance. So if

0:21:04.960 --> 0:21:10.040
<v Speaker 1>you say that's two sweet, that's too sweet, but you're

0:21:10.240 --> 0:21:13.920
<v Speaker 1>using the number two just in word form, the spell

0:21:14.040 --> 0:21:16.280
<v Speaker 1>check will give you the old heads up and say

0:21:16.600 --> 0:21:19.399
<v Speaker 1>I think you meant T O O, not T W O.
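
NOTE
Editor's sketch, not part of the spoken audio. The difference between a plain lexicon lookup and a context-aware check described here can be illustrated roughly as follows. The tiny LEXICON, the DEGREE_FOLLOWERS set, and the one-word lookahead rule are assumptions made up for this example, not how any real spell checker works.
```python
LEXICON = {"that's", "two", "too", "to", "sweet", "apples", "i", "have"}  # hypothetical word list
DEGREE_FOLLOWERS = {"sweet"}  # hypothetical: adjectives that normally follow "too", not "two"
def lexicon_check(sentence: str) -> list[str]:
    """Return the words that are missing from the lexicon (plain dictionary lookup)."""
    return [w for w in sentence.lower().split() if w not in LEXICON]
def context_check(sentence: str) -> list[str]:
    """Flag 'two' when the next word is a known adjective, and suggest 'too'."""
    words = sentence.lower().split()
    hints = []
    for current, following in zip(words, words[1:]):
        if current == "two" and following in DEGREE_FOLLOWERS:
            hints.append(f"'{current} {following}': did you mean 'too {following}'?")
    return hints
```
The lexicon-only pass accepts the sentence because every word is spelled correctly; only the pass that looks at the neighboring word can notice the wrong homonym.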

0:21:20.160 --> 0:21:22.760
<v Speaker 1>Fun fact, I typed that sentence into Google Docs and

0:21:22.800 --> 0:21:26.000
<v Speaker 1>it said, you're totes fine, bro. Didn't notice it at all.

0:21:26.560 --> 0:21:30.040
<v Speaker 1>Grammar checkers have to be able to analyze sentence structure

0:21:30.160 --> 0:21:32.840
<v Speaker 1>and word choice and compare them against the grammar programmed for

0:21:32.880 --> 0:21:35.920
<v Speaker 1>the system. This might also help determine if the word

0:21:35.960 --> 0:21:39.240
<v Speaker 1>you used was the correct one. So, for example, affect

0:21:39.640 --> 0:21:45.080
<v Speaker 1>versus effect. Affect is a verb: you affect something. Effect

0:21:45.240 --> 0:21:49.159
<v Speaker 1>is usually a noun. It's typically the result of some action.

0:21:49.280 --> 0:21:52.520
<v Speaker 1>So I could affect a drum, which is a dumb

0:21:52.560 --> 0:21:55.320
<v Speaker 1>thing to say, and the effect might be that the

0:21:55.359 --> 0:21:58.680
<v Speaker 1>sound I played hurt your ears. Now, if you spell

0:21:58.760 --> 0:22:02.160
<v Speaker 1>the word correctly and the spell checker is only comparing

0:22:02.200 --> 0:22:04.800
<v Speaker 1>the words you type against a lexicon to see if

0:22:04.840 --> 0:22:07.159
<v Speaker 1>there's a match, you might not get an indication that

0:22:07.200 --> 0:22:10.000
<v Speaker 1>anything is wrong because the computer system is saying, well,

0:22:10.000 --> 0:22:12.959
<v Speaker 1>that word is spelled correctly. It doesn't realize it's the

0:22:12.960 --> 0:22:15.760
<v Speaker 1>wrong word. But if it has a way of checking grammar,

0:22:15.760 --> 0:22:17.760
<v Speaker 1>it can also make sure you're using the right word

0:22:17.840 --> 0:22:21.520
<v Speaker 1>in the right context. Search engines such as Google use

0:22:21.600 --> 0:22:24.440
<v Speaker 1>natural language processing to determine what it is you're looking

0:22:24.480 --> 0:22:26.800
<v Speaker 1>for, right? So when you're typing in a search and

0:22:26.840 --> 0:22:30.280
<v Speaker 1>you hit the search button, you might get a little

0:22:31.240 --> 0:22:34.920
<v Speaker 1>uh notification that says, maybe you meant this other thing,

0:22:35.040 --> 0:22:38.160
<v Speaker 1>or maybe you need to search for this terminology. That's

0:22:38.160 --> 0:22:41.040
<v Speaker 1>a useful feature since not everyone thinks of search the

0:22:41.119 --> 0:22:43.760
<v Speaker 1>same way. I could tell a dozen people to go

0:22:43.800 --> 0:22:48.159
<v Speaker 1>on Google and pull up information about Benjamin Franklin and

0:22:48.200 --> 0:22:51.479
<v Speaker 1>the story about the kite, and those folks might go

0:22:51.640 --> 0:22:54.639
<v Speaker 1>and perform their searches in twelve different ways. But the

0:22:54.680 --> 0:22:57.560
<v Speaker 1>search engine's job is to return the best results based

0:22:57.560 --> 0:23:00.480
<v Speaker 1>on the query, which means it needs to suss out

0:23:00.520 --> 0:23:04.160
<v Speaker 1>what the searcher is actually looking for. So even if

0:23:04.480 --> 0:23:08.160
<v Speaker 1>the twelve people all type twelve different ways of looking

0:23:08.240 --> 0:23:11.280
<v Speaker 1>up this information about Benjamin Franklin and the kite story,

0:23:11.880 --> 0:23:16.320
<v Speaker 1>it should respond with the most relevant results. And maybe

0:23:16.320 --> 0:23:19.919
<v Speaker 1>people get slightly different search results based upon the query,

0:23:20.000 --> 0:23:22.760
<v Speaker 1>but they should be more or less the same. And

0:23:22.800 --> 0:23:24.359
<v Speaker 1>it can also look out for you. It could give

0:23:24.359 --> 0:23:26.880
<v Speaker 1>you suggestions for search terms, should you use an incorrect

0:23:26.880 --> 0:23:31.920
<v Speaker 1>spelling or you approximate a spelling, or something like that.
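
NOTE
Editor's sketch, not part of the spoken audio. One simple way to offer a "maybe you meant" suggestion for an approximate spelling is fuzzy matching against known terms. This uses Python's standard difflib module; the KNOWN_TERMS vocabulary and the 0.75 cutoff are assumptions for illustration, and real search engines rely on far richer signals.
```python
import difflib
KNOWN_TERMS = ["benjamin", "franklin", "kite", "lightning", "experiment"]  # hypothetical index vocabulary
def suggest_query(query: str) -> str:
    """Replace each word with its closest known term when one is similar enough."""
    corrected = []
    for word in query.lower().split():
        matches = difflib.get_close_matches(word, KNOWN_TERMS, n=1, cutoff=0.75)
        corrected.append(matches[0] if matches else word)  # keep the word if nothing is close
    return " ".join(corrected)
```
Calling suggest_query on a query with a misspelled name would hand back the nearest known spelling, which the search page could then show as a suggestion.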

0:23:32.720 --> 0:23:36.320
<v Speaker 1>One of the areas of opportunity for natural language processing

0:23:36.320 --> 0:23:39.160
<v Speaker 1>applications in the near future is handling the massive amounts

0:23:39.200 --> 0:23:42.960
<v Speaker 1>of information in big data applications. So, for example, a

0:23:43.040 --> 0:23:47.159
<v Speaker 1>lawyer might want to search historical legal results using natural

0:23:47.240 --> 0:23:50.040
<v Speaker 1>language to look for precedents that might help his or

0:23:50.040 --> 0:23:53.639
<v Speaker 1>her case in the courtroom. A pharmaceuticals company might need

0:23:53.680 --> 0:23:58.080
<v Speaker 1>to search information about clinical trials, doctors' notes, patient testimonials,

0:23:58.119 --> 0:24:01.560
<v Speaker 1>and related information. And the amount of information represented by

0:24:01.560 --> 0:24:04.800
<v Speaker 1>big data is truly astounding. It's enormous. It's way too

0:24:04.880 --> 0:24:07.960
<v Speaker 1>much for any human to sort through. So developing a

0:24:07.960 --> 0:24:10.800
<v Speaker 1>method for computers to parse a query and return relevant

0:24:10.840 --> 0:24:14.520
<v Speaker 1>results is highly desirable. For a computer to understand that

0:24:14.680 --> 0:24:19.720
<v Speaker 1>context, understanding in air quotes, and being able to give

0:24:19.760 --> 0:24:23.840
<v Speaker 1>you results based upon your questions, that would be incredibly

0:24:23.920 --> 0:24:27.080
<v Speaker 1>valuable for lots of different industries. And we started off

0:24:27.119 --> 0:24:30.880
<v Speaker 1>talking about machine translation at the early stages of natural

0:24:30.960 --> 0:24:34.640
<v Speaker 1>language processing. That's still a big area of research. Now

0:24:35.119 --> 0:24:38.280
<v Speaker 1>you can get real time translation tools. You can use

0:24:38.320 --> 0:24:41.440
<v Speaker 1>devices to translate from one language to another in real settings,

0:24:41.440 --> 0:24:44.439
<v Speaker 1>including written languages like signs. You can just hold a

0:24:44.480 --> 0:24:48.280
<v Speaker 1>camera up and get an English translation of a

0:24:48.359 --> 0:24:51.320
<v Speaker 1>sign that's written in another language, and of course vice versa.

0:24:51.760 --> 0:24:54.200
<v Speaker 1>That tends to be marketed as a tool for travelers,

0:24:54.200 --> 0:24:56.919
<v Speaker 1>but it really shows the amazing progress we've made in

0:24:57.040 --> 0:25:00.119
<v Speaker 1>natural language processing from the old days of word for

0:25:00.200 --> 0:25:03.400
<v Speaker 1>word models for machine translation that were made back during

0:25:03.440 --> 0:25:05.880
<v Speaker 1>the Cold War. Now, we've still got a far way

0:25:05.920 --> 0:25:09.040
<v Speaker 1>to go with natural language processing. We've seen some incredible

0:25:09.080 --> 0:25:12.080
<v Speaker 1>improvements over the past few years, but machines still don't

0:25:12.119 --> 0:25:15.639
<v Speaker 1>actually understand what we're saying or what we're writing, not

0:25:15.720 --> 0:25:18.520
<v Speaker 1>on a conscious level anyway. Instead, they are able to

0:25:18.560 --> 0:25:22.000
<v Speaker 1>refer back to rules, either explicitly stated as in the

0:25:22.040 --> 0:25:25.760
<v Speaker 1>older NLP models, or those arrived at through deep learning.

0:25:26.320 --> 0:25:28.240
<v Speaker 1>Now I'm going to take a quick break, but when

0:25:28.240 --> 0:25:31.000
<v Speaker 1>we come back, I'll talk a bit about the history

0:25:31.080 --> 0:25:34.080
<v Speaker 1>of the voice assistants we all know and love. But first,

0:25:34.160 --> 0:25:44.400
<v Speaker 1>here's another word from our sponsors. All right, So now

0:25:44.400 --> 0:25:47.720
<v Speaker 1>we understand a bit about the technologies that make voice

0:25:47.720 --> 0:25:52.800
<v Speaker 1>assistants possible, specifically speech recognition and natural language processing. There's

0:25:52.840 --> 0:25:56.840
<v Speaker 1>obviously a lot more than that. Uh, the system, the

0:25:56.880 --> 0:26:00.920
<v Speaker 1>system can obviously process our requests or commands and return

0:26:00.960 --> 0:26:06.840
<v Speaker 1>a result using more traditional computational processes. So while the

0:26:06.880 --> 0:26:12.040
<v Speaker 1>interpretation side is on speech recognition and natural language processing,

0:26:12.359 --> 0:26:16.120
<v Speaker 1>there's still a lot of regular computation work that has

0:26:16.119 --> 0:26:21.320
<v Speaker 1>to happen for a personal assistant, digital assistant,

0:26:21.400 --> 0:26:23.520
<v Speaker 1>voice assistant, whatever you want to call them, to be

0:26:23.600 --> 0:26:26.080
<v Speaker 1>able to respond to you. So let's take a quick

0:26:26.080 --> 0:26:29.880
<v Speaker 1>stroll through the history of the major voice assistants out there,

0:26:30.440 --> 0:26:32.320
<v Speaker 1>and I'm going to cover these in the order they

0:26:32.320 --> 0:26:35.000
<v Speaker 1>were introduced to the public more or less, which means

0:26:35.000 --> 0:26:37.879
<v Speaker 1>our very first voice assistant that we'll be covering in

0:26:37.920 --> 0:26:41.280
<v Speaker 1>this because I'm only focusing on the really big ones. Uh.

0:26:41.320 --> 0:26:43.840
<v Speaker 1>There are lots of small ones out there, but I'm

0:26:43.880 --> 0:26:46.680
<v Speaker 1>looking at the ones everyone's heard about, So that means

0:26:46.720 --> 0:26:49.359
<v Speaker 1>the first one we get to talk about is Apple's Siri.

0:26:49.960 --> 0:26:53.159
<v Speaker 1>Apple unveiled Siri on October fourth, two thousand eleven, and

0:26:53.200 --> 0:26:56.399
<v Speaker 1>to be fair, Siri existed before this. It was not

0:26:56.480 --> 0:26:59.520
<v Speaker 1>an Apple creation. Siri was actually an app produced by

0:26:59.560 --> 0:27:03.719
<v Speaker 1>an independent developer company called Siri Incorporated, but Apple

0:27:03.800 --> 0:27:07.399
<v Speaker 1>gobbled up that company in and brought them in house.

0:27:07.840 --> 0:27:11.800
<v Speaker 1>And Apple had previously relied upon another speech recognition program

0:27:11.840 --> 0:27:14.480
<v Speaker 1>called VoiceOver, which had been used in Mac products

0:27:14.480 --> 0:27:19.000
<v Speaker 1>and all iPhones since the iPhone 3GS. Siri would

0:27:19.000 --> 0:27:22.959
<v Speaker 1>become available starting with the iPhone 4S. In this announcement,

0:27:23.480 --> 0:27:27.120
<v Speaker 1>Apple pointed out that earlier implementations of voice commands required

0:27:27.240 --> 0:27:30.120
<v Speaker 1>users to learn the syntax of the system. You had

0:27:30.160 --> 0:27:33.000
<v Speaker 1>to follow a very specific set of rules in order

0:27:33.040 --> 0:27:36.000
<v Speaker 1>to get anything to work. So you give a command

0:27:36.240 --> 0:27:38.640
<v Speaker 1>defined by the system. So for example, you might say

0:27:38.960 --> 0:27:43.440
<v Speaker 1>call mom or play once in a lifetime. You had

0:27:43.480 --> 0:27:47.720
<v Speaker 1>to do this very structured approach to whatever it was

0:27:47.760 --> 0:27:50.840
<v Speaker 1>you wanted to do. But that requires the user to

0:27:50.840 --> 0:27:54.560
<v Speaker 1>actually adhere to rules created by the architects of the system. Right,

0:27:54.600 --> 0:27:56.800
<v Speaker 1>So Siri was meant to be different. It was meant

0:27:56.880 --> 0:28:00.000
<v Speaker 1>to be able to understand what you wanted on your terms,

0:28:00.119 --> 0:28:03.240
<v Speaker 1>not based off a strict set of rules. Apple said

0:28:03.280 --> 0:28:05.560
<v Speaker 1>that Siri would be able to interpret what you meant

0:28:05.960 --> 0:28:08.560
<v Speaker 1>and would return relevant information to you in response. In

0:28:08.600 --> 0:28:11.919
<v Speaker 1>the unveiling, they said that Siri is quote, your intelligent

0:28:12.000 --> 0:28:15.240
<v Speaker 1>assistant that helps you get things done just by asking

0:28:15.680 --> 0:28:18.840
<v Speaker 1>end quote. During that demonstration, they showed off how Siri

0:28:18.960 --> 0:28:22.080
<v Speaker 1>could parse different phrases that had the same underlying meaning.

0:28:22.720 --> 0:28:26.840
<v Speaker 1>The example they gave originally was the weather today,

0:28:26.880 --> 0:28:30.000
<v Speaker 1>and then they asked that same question five or six

0:28:30.080 --> 0:28:33.600
<v Speaker 1>different times. Scott Forstall, vice president over at Apple, showed

0:28:33.600 --> 0:28:36.119
<v Speaker 1>off how you could get the same weather information by

0:28:36.160 --> 0:28:38.800
<v Speaker 1>asking it in these different ways. Then they showed off

0:28:38.800 --> 0:28:42.520
<v Speaker 1>how Siri could interoperate with other apps, such as Apple's

0:28:42.600 --> 0:28:45.520
<v Speaker 1>maps feature or through a partnership they had with Yelp.

0:28:45.920 --> 0:28:49.120
<v Speaker 1>Siri could take a request, it could parse it, interpret it,

0:28:49.280 --> 0:28:53.640
<v Speaker 1>send the appropriate, uh, request to the appropriate destination, and

0:28:53.680 --> 0:28:56.960
<v Speaker 1>then serve up the response. The destination could be a

0:28:57.000 --> 0:29:00.360
<v Speaker 1>web search, it could be an action within a compatible app.
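
NOTE
Editor's sketch, not part of the spoken audio. The flow described here (take a request, parse it, interpret it, route it to a destination, serve the response) can be mocked up with a keyword-based intent matcher and a dispatch table. Every name below (the cue sets, parse_intent, the stub handlers) is hypothetical; this is not Apple's implementation or API, just a toy showing how differently worded requests can land on the same intent.
```python
import re
WEATHER_CUES = {"weather", "forecast", "raincoat", "umbrella"}  # hypothetical cue words
FOOD_CUES = {"restaurant", "restaurants", "hungry"}  # hypothetical cue words
def parse_intent(utterance: str) -> str:
    """Map a free-form request onto one intent label using whole-word cues."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    if words & WEATHER_CUES:
        return "weather"
    if words & FOOD_CUES:
        return "local_search"
    return "web_search"  # default destination when no cue matches
def weather_handler(utterance: str) -> str:
    return "weather: stub forecast"  # stand-in for a real weather service
def local_search_handler(utterance: str) -> str:
    return "local_search: stub listings"  # stand-in for a reviews partner
def web_search_handler(utterance: str) -> str:
    return "web_search: " + utterance  # fall back to a plain web search
HANDLERS = {"weather": weather_handler, "local_search": local_search_handler, "web_search": web_search_handler}
def handle(utterance: str) -> str:
    """Parse the request, route it to the matching destination, and return its response."""
    return HANDLERS[parse_intent(utterance)](utterance)
```
With this toy, asking about the weather and asking whether you need a raincoat both reach the same weather handler, which is the behavior the demonstration was showing off.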

0:29:00.400 --> 0:29:04.840
<v Speaker 1>You get the idea. So that's Siri. Next, on July ninth,

0:29:04.880 --> 0:29:08.440
<v Speaker 1>two thousand twelve, Google released Android Jelly Bean, a.k.a.

0:29:08.600 --> 0:29:11.440
<v Speaker 1>Android four point one, and one of the features

0:29:11.440 --> 0:29:14.200
<v Speaker 1>included in that operating system update, at least for certain

0:29:14.240 --> 0:29:18.160
<v Speaker 1>hardware upon release, was an offshoot of Google Search called

0:29:18.240 --> 0:29:22.880
<v Speaker 1>Google Now. This feature would serve up predictive cards containing

0:29:22.880 --> 0:29:26.080
<v Speaker 1>information that the system had flagged as potentially being useful

0:29:26.120 --> 0:29:29.120
<v Speaker 1>to you based off your activity. So let's say you

0:29:29.120 --> 0:29:31.560
<v Speaker 1>spend a lot of time searching for stuff like baseball scores.

0:29:32.080 --> 0:29:35.200
<v Speaker 1>Google Now would start serving you up cards that would

0:29:35.240 --> 0:29:38.280
<v Speaker 1>give you scores from previous games before you could even

0:29:38.320 --> 0:29:40.400
<v Speaker 1>search for them. You would just look at Google Now

0:29:40.440 --> 0:29:43.280
<v Speaker 1>and you could scroll through and you see what the

0:29:43.360 --> 0:29:46.120
<v Speaker 1>latest results were. Then you could actually scroll through the

0:29:46.160 --> 0:29:48.800
<v Speaker 1>different cards, all of which were slowly dialing you in

0:29:48.880 --> 0:29:51.600
<v Speaker 1>as a person, which was kind of creepy. And it

0:29:51.640 --> 0:29:55.960
<v Speaker 1>relied a lot on natural language processing and your activities. Now,

0:29:56.000 --> 0:29:58.760
<v Speaker 1>Google Now was not a voice assistant. This was sort

0:29:58.760 --> 0:30:02.720
<v Speaker 1>of a one way relationship. Google was analyzing information based

0:30:02.760 --> 0:30:05.160
<v Speaker 1>on your activity and then serving up information to you

0:30:05.200 --> 0:30:08.000
<v Speaker 1>that might be useful. But over time the company would

0:30:08.040 --> 0:30:12.360
<v Speaker 1>phase out Google Now and it gradually evolved into Google Assistant.

0:30:12.680 --> 0:30:14.600
<v Speaker 1>There was also Google Voice that allows you to do

0:30:14.640 --> 0:30:18.320
<v Speaker 1>things like voice search, so that also became incorporated into this.

0:30:18.480 --> 0:30:22.120
<v Speaker 1>Google Assistant is a lot like Siri. It responds to voice.

0:30:22.280 --> 0:30:25.000
<v Speaker 1>It can respond to anaphora, meaning it can keep track

0:30:25.040 --> 0:30:27.760
<v Speaker 1>of subject matter and respond to follow up questions that

0:30:27.880 --> 0:30:31.800
<v Speaker 1>don't contain an explicit reference to the subject. So, for example,

0:30:31.840 --> 0:30:34.280
<v Speaker 1>you could ask Google Assistant, what is the weather going

0:30:34.320 --> 0:30:37.080
<v Speaker 1>to be like in Atlanta? And then after you get

0:30:37.080 --> 0:30:39.640
<v Speaker 1>a response, you might say, what about in Seattle? Now,

0:30:39.640 --> 0:30:43.960
<v Speaker 1>you have not explicitly said what is the weather in Seattle?
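
NOTE
Editor's sketch, not part of the spoken audio. The follow-up behavior in this example (a second question that leaves out the topic) can be sketched by remembering the last explicit topic between turns. The Conversation class, the CITIES list, and the single "weather" keyword are assumptions for illustration only and say nothing about how Google Assistant is actually built.
```python
import re
CITIES = ("atlanta", "seattle")  # hypothetical list of known locations
class Conversation:
    """Remembers the last topic so a follow-up question can omit it."""
    def __init__(self) -> None:
        self.topic = None  # nothing has been discussed yet
    def interpret(self, utterance: str) -> tuple:
        """Return (topic, city), reusing the stored topic when none is stated."""
        words = re.findall(r"[a-z]+", utterance.lower())
        if "weather" in words:
            self.topic = "weather"  # explicit topic: remember it for later turns
        city = next((w for w in words if w in CITIES), None)
        return (self.topic, city)
```
Feeding the Atlanta question and then the Seattle follow-up to one Conversation object yields the weather topic both times, with only the location changing.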

0:30:44.400 --> 0:30:47.600
<v Speaker 1>You just said what about in Seattle. However, Google Assistant

0:30:47.640 --> 0:30:50.080
<v Speaker 1>can infer that you are still talking about the weather,

0:30:50.200 --> 0:30:53.640
<v Speaker 1>only now within the context of a different location. Google

0:30:53.680 --> 0:30:57.640
<v Speaker 1>Assistant debuted in May, so in a way, this particular

0:30:57.800 --> 0:31:01.680
<v Speaker 1>entry in our timeline spans two other debuts, because we

0:31:01.720 --> 0:31:05.280
<v Speaker 1>had Google Now on one side and then Google Assistant later,

0:31:05.480 --> 0:31:07.480
<v Speaker 1>but I figured it was important to acknowledge how Google

0:31:07.480 --> 0:31:10.600
<v Speaker 1>Assistant grew out of the older Google Now feature. On

0:31:10.680 --> 0:31:15.240
<v Speaker 1>April second, two thousand fourteen, Microsoft introduced its own voice assistant

0:31:15.360 --> 0:31:19.680
<v Speaker 1>at the Build Developer Conference. Microsoft's entry is named Cortana,

0:31:19.840 --> 0:31:23.479
<v Speaker 1>after the AI character from the Halo series of video games.

0:31:23.880 --> 0:31:28.200
<v Speaker 1>Microsoft integrated Cortana to work with Windows ten, Xbox One,

0:31:28.400 --> 0:31:32.120
<v Speaker 1>Windows Mobile, and a few other platforms as well, including

0:31:32.160 --> 0:31:35.040
<v Speaker 1>apps that were meant for other operating systems like iOS

0:31:35.080 --> 0:31:39.520
<v Speaker 1>and Android. Cortana's US voice is that of Jen Taylor. She

0:31:39.600 --> 0:31:42.440
<v Speaker 1>actually is the voice actress who provided the voice for

0:31:42.480 --> 0:31:45.200
<v Speaker 1>the character of Cortana in the Halo games. That's kind

0:31:45.200 --> 0:31:48.360
<v Speaker 1>of fun, and like Siri, Cortana can interface with apps

0:31:48.360 --> 0:31:53.080
<v Speaker 1>as well as perform web searches. In November, Amazon got

0:31:53.120 --> 0:31:56.440
<v Speaker 1>into the game with Alexa and the Amazon Echo. Through

0:31:56.560 --> 0:31:59.560
<v Speaker 1>Amazon Echo, Alexa can serve not just as a voice

0:31:59.560 --> 0:32:03.000
<v Speaker 1>assistant that can retrieve information and play streaming media and

0:32:03.040 --> 0:32:05.360
<v Speaker 1>that kind of thing, but also as an interface in

0:32:05.440 --> 0:32:08.760
<v Speaker 1>home automation applications, and to be fair, so can Google

0:32:08.800 --> 0:32:11.960
<v Speaker 1>Assistant through devices like Google Home. So you can use

0:32:11.960 --> 0:32:14.920
<v Speaker 1>Alexa to interface directly with systems in your home if

0:32:14.920 --> 0:32:18.480
<v Speaker 1>they are compatible. And, not surprisingly, Alexa can interface with

0:32:18.520 --> 0:32:22.320
<v Speaker 1>Amazon's ordering system, allowing users to order products from Amazon

0:32:22.360 --> 0:32:26.720
<v Speaker 1>directly by speaking to Alexa. No shock there. According to Amazon,

0:32:27.040 --> 0:32:30.040
<v Speaker 1>developers were inspired by the Star Trek series of shows,

0:32:30.240 --> 0:32:32.600
<v Speaker 1>in which characters would speak out loud to computer systems and

0:32:32.640 --> 0:32:35.760
<v Speaker 1>call for information or send commands to make various stuff happen.

0:32:36.320 --> 0:32:39.840
<v Speaker 1>Amazon also released a developer kit to allow independent developers

0:32:39.880 --> 0:32:42.520
<v Speaker 1>to create what are called Alexa skills. There's an old

0:32:42.560 --> 0:32:45.080
<v Speaker 1>episode of tech Stuff where I interviewed some folks from

0:32:45.160 --> 0:32:49.000
<v Speaker 1>Amazon to talk about this process. But essentially, developers will

0:32:49.040 --> 0:32:52.400
<v Speaker 1>submit skills to Amazon, which can then publish those skills

0:32:52.400 --> 0:32:55.320
<v Speaker 1>and allow anyone who has an Alexa enabled device to

0:32:55.600 --> 0:32:58.760
<v Speaker 1>activate those skills and make use of them. Individuals can

0:32:58.760 --> 0:33:01.480
<v Speaker 1>even build up their own personalized skills using a

0:33:01.520 --> 0:33:06.280
<v Speaker 1>tool called Blueprints, which Amazon introduced in April. Now there

0:33:06.280 --> 0:33:09.480
<v Speaker 1>are other examples I could point to. There's Samsung's Bixby

0:33:09.560 --> 0:33:14.120
<v Speaker 1>which it introduced in March. There's SoundHound's virtual assistant

0:33:14.120 --> 0:33:18.440
<v Speaker 1>called Hound that launched in March of But these were

0:33:18.440 --> 0:33:20.760
<v Speaker 1>the ones that I really hear about the most frequently,

0:33:20.760 --> 0:33:22.360
<v Speaker 1>so they were the ones I wanted to kind of cover,

0:33:22.840 --> 0:33:25.520
<v Speaker 1>and they all work on a similar principle. The

0:33:25.560 --> 0:33:30.440
<v Speaker 1>implementations are all particular to their specific brands, but they

0:33:30.480 --> 0:33:36.160
<v Speaker 1>work under similar foundational principles of natural language processing,

0:33:36.840 --> 0:33:41.080
<v Speaker 1>speech recognition, et cetera. And it's all about converging technologies

0:33:41.120 --> 0:33:44.440
<v Speaker 1>that took decades of hard work to make possible. Now

0:33:44.480 --> 0:33:47.239
<v Speaker 1>I want to thank listener Nate, who was the one

0:33:47.280 --> 0:33:50.240
<v Speaker 1>who set me on this trail to ask about speech

0:33:50.240 --> 0:33:55.120
<v Speaker 1>recognition and natural language processing and these voice assistants. It was

0:33:55.200 --> 0:34:00.000
<v Speaker 1>really interesting to dive into, very very cool, fascinating stuff.

0:34:00.040 --> 0:34:02.400
<v Speaker 1>Thanks a lot, Nate. If any of you out there

0:34:02.440 --> 0:34:05.040
<v Speaker 1>have suggestions for future episodes of tech Stuff, maybe it's

0:34:05.040 --> 0:34:07.640
<v Speaker 1>a technology or a company or a person in tech.

0:34:07.920 --> 0:34:10.160
<v Speaker 1>Maybe there's someone I should interview or have on as

0:34:10.160 --> 0:34:13.080
<v Speaker 1>a guest host, send me a message. The email address

0:34:13.280 --> 0:34:16.400
<v Speaker 1>is tech Stuff at how stuff works dot com. Or

0:34:16.480 --> 0:34:18.720
<v Speaker 1>drop me a line on Facebook or Twitter. The handle

0:34:18.719 --> 0:34:20.840
<v Speaker 1>at both of those is TechStuffHSW.

0:34:21.440 --> 0:34:24.719
<v Speaker 1>Don't forget to follow us on Instagram and I'll talk

0:34:24.719 --> 0:34:33.799
<v Speaker 1>to you again really soon for more on this and

0:34:33.880 --> 0:34:36.399
<v Speaker 1>thousands of other topics, visit how stuff works dot

0:34:36.440 --> 0:34:46.560
<v Speaker 1>com.