WEBVTT - Happy Anniver-Siri

0:00:04.400 --> 0:00:07.800
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio.

0:00:12.360 --> 0:00:15.600
<v Speaker 1>Hey there, and welcome to TechStuff. I'm your host,

0:00:15.800 --> 0:00:19.040
<v Speaker 1>Jonathan Strickland. I'm an executive producer with iHeartRadio,

0:00:19.079 --> 0:00:23.560
<v Speaker 1>and I love all things tech. And uh, you know

0:00:23.600 --> 0:00:26.840
<v Speaker 1>what? Today's episode, I was gonna, I was gonna

0:00:26.840 --> 0:00:29.920
<v Speaker 1>make it a one-parter, but it turns out there's

0:00:30.000 --> 0:00:33.040
<v Speaker 1>just way too much stuff, not just about the topic

0:00:33.080 --> 0:00:37.239
<v Speaker 1>at hand, but the various components that make up this

0:00:37.440 --> 0:00:40.600
<v Speaker 1>topic that require me to do more than one. So

0:00:40.760 --> 0:00:43.519
<v Speaker 1>this is gonna likely be a two parter. But today

0:00:43.960 --> 0:00:47.239
<v Speaker 1>I thought we could look back at the development and

0:00:47.400 --> 0:00:53.160
<v Speaker 1>evolution of a famous AI personality. This virtual assistant celebrated

0:00:53.200 --> 0:00:57.480
<v Speaker 1>an anniversary recently, and I must apologize for being a

0:00:57.480 --> 0:01:02.480
<v Speaker 1>couple of days late with this, but this particular servant

0:01:03.280 --> 0:01:07.120
<v Speaker 1>debuted on October fourth, two thousand eleven, technically for the

0:01:07.200 --> 0:01:11.680
<v Speaker 1>second time, but the history of the actual technology dates

0:01:11.720 --> 0:01:15.480
<v Speaker 1>back much further. And of course, I'm talking about Siri,

0:01:15.920 --> 0:01:20.640
<v Speaker 1>Apple's virtual assistant that can interpret voice commands and return

0:01:20.720 --> 0:01:25.160
<v Speaker 1>results based on them. This is not just some dull

0:01:25.360 --> 0:01:30.680
<v Speaker 1>history lesson, however. Siri really has an incredible backstory, ranging

0:01:30.720 --> 0:01:33.959
<v Speaker 1>from a science fiction vision of the future to a

0:01:34.120 --> 0:01:39.240
<v Speaker 1>secret project intended to augment the decision making capabilities of

0:01:39.280 --> 0:01:45.880
<v Speaker 1>the United States military. Yeah, Siri had a pretty tough background.

0:01:46.640 --> 0:01:50.720
<v Speaker 1>The story of Siri is complicated, and not just because

0:01:50.880 --> 0:01:55.280
<v Speaker 1>of the internal history of developing the technology, but also

0:01:55.360 --> 0:01:59.440
<v Speaker 1>because the tool relies on a lot of converging technological

0:01:59.480 --> 0:02:04.600
<v Speaker 1>trends. There are elements of voice recognition, uh, speech to text,

0:02:05.080 --> 0:02:09.240
<v Speaker 1>natural language interpretation, and other technologies that fall under the

0:02:09.440 --> 0:02:14.600
<v Speaker 1>very broad umbrella of artificial intelligence. So get settled, it's

0:02:14.600 --> 0:02:17.760
<v Speaker 1>time to talk about Siri. Also, if you're listening to

0:02:17.760 --> 0:02:22.320
<v Speaker 1>this near Apple devices, I apologize because there's a good

0:02:22.400 --> 0:02:26.800
<v Speaker 1>chance those devices might start talking back at me. But

0:02:26.960 --> 0:02:29.440
<v Speaker 1>I refuse to do an episode where I just refer

0:02:29.560 --> 0:02:34.240
<v Speaker 1>to the subject as you know who. You could argue

0:02:34.720 --> 0:02:37.799
<v Speaker 1>that the origins of Siri can be found in a

0:02:37.840 --> 0:02:43.960
<v Speaker 1>promotional video that Apple produced back in nineteen eighty-seven to

0:02:44.040 --> 0:02:48.440
<v Speaker 1>show off a concept of an artificially intelligent smart assistant.

0:02:49.000 --> 0:02:52.520
<v Speaker 1>Now that alone is interesting, but what really is amazing

0:02:52.800 --> 0:02:57.360
<v Speaker 1>is that the arbitrary date they chose as the setting

0:02:57.360 --> 0:03:01.440
<v Speaker 1>for this video was two thousand eleven, probably September. We

0:03:01.560 --> 0:03:05.080
<v Speaker 1>know that because there is a part within the video

0:03:05.200 --> 0:03:08.840
<v Speaker 1>where a character asks for information that had been published

0:03:08.919 --> 0:03:14.080
<v Speaker 1>five years previously, and the published information had a publication

0:03:14.160 --> 0:03:17.480
<v Speaker 1>date of two thousand six. Now this means that the

0:03:17.560 --> 0:03:22.520
<v Speaker 1>actual debut of Siri as an Apple product was just

0:03:22.760 --> 0:03:27.160
<v Speaker 1>one month after the fictional events in that video from nineteen eighty-seven.

0:03:28.680 --> 0:03:31.520
<v Speaker 1>That's just a coincidence, but it's a cool one. The

0:03:31.639 --> 0:03:37.000
<v Speaker 1>Knowledge Navigator video shows a man walking into a study,

0:03:37.360 --> 0:03:42.080
<v Speaker 1>really nice one, and unfolding a tablet style computer device.

0:03:42.560 --> 0:03:45.080
<v Speaker 1>Then he walks off away to stare at stuff as

0:03:45.120 --> 0:03:49.640
<v Speaker 1>a virtual assistant reads off his messages and meetings on

0:03:49.680 --> 0:03:54.560
<v Speaker 1>his calendar. The virtual assistant appears as a video and

0:03:54.600 --> 0:03:57.520
<v Speaker 1>a little window on the screen of the tablet, and

0:03:57.560 --> 0:03:59.760
<v Speaker 1>it's you know, like shot from the shoulders up, kind

0:03:59.800 --> 0:04:03.000
<v Speaker 1>of the bust of a young man, and the

0:04:03.120 --> 0:04:06.440
<v Speaker 1>video takes up that one little corner of the tablet device.

0:04:06.440 --> 0:04:10.560
<v Speaker 1>So in this visualization, the virtual assistant isn't just a

0:04:10.640 --> 0:04:15.560
<v Speaker 1>disembodied voice. It also has a face. Also, everyone in

0:04:15.560 --> 0:04:19.720
<v Speaker 1>this video is extremely white, which I guess is kind

0:04:19.720 --> 0:04:24.080
<v Speaker 1>of a given for the time period and the people involved,

0:04:24.680 --> 0:04:28.680
<v Speaker 1>but it just comes across as so white. I mean,

0:04:29.160 --> 0:04:32.160
<v Speaker 1>we're doing this with the benefit of the glasses of

0:04:32.200 --> 0:04:35.400
<v Speaker 1>twenty twenty. I just wanted to throw that out there. Anyway,

0:04:35.560 --> 0:04:38.960
<v Speaker 1>the video goes on to have the real life man

0:04:39.160 --> 0:04:42.520
<v Speaker 1>who is a professor in this video, ask his virtual

0:04:42.520 --> 0:04:47.120
<v Speaker 1>assistant to pull up lecture notes uh and unread articles

0:04:47.160 --> 0:04:50.039
<v Speaker 1>that relate back to the lecture. He's asking for

0:04:50.040 --> 0:04:52.440
<v Speaker 1>the lecture notes of a lecture he gave a year ago.

0:04:52.839 --> 0:04:55.440
<v Speaker 1>He's giving essentially the same lecture now, but he wants

0:04:55.440 --> 0:04:58.440
<v Speaker 1>to update it with the latest information, and he even

0:04:58.480 --> 0:05:03.159
<v Speaker 1>asks the virtual assistant to summarize those unread articles that

0:05:03.200 --> 0:05:06.520
<v Speaker 1>had been published in the year since his last lecture.

0:05:06.760 --> 0:05:12.040
<v Speaker 1>The virtual assistant is thus aggregating information, analyzing that information

0:05:12.080 --> 0:05:15.880
<v Speaker 1>for context, and then delivering summaries, which is that's a

0:05:15.880 --> 0:05:21.279
<v Speaker 1>pretty sophisticated set of artificially intelligent tasks. He also, the

0:05:21.360 --> 0:05:25.680
<v Speaker 1>professor uses the device and virtual assistant to call and

0:05:25.760 --> 0:05:29.760
<v Speaker 1>collaborate with a peer in real time. Now, this was

0:05:29.839 --> 0:05:33.840
<v Speaker 1>not the only video that Apple would produce to showcase

0:05:33.920 --> 0:05:37.560
<v Speaker 1>this kind of general idea. However, arguably it is the

0:05:37.600 --> 0:05:43.000
<v Speaker 1>most famous of those videos. Now, as I said, Knowledge

0:05:43.080 --> 0:05:47.039
<v Speaker 1>Navigator came out of Apple, and Steve Jobs would later

0:05:47.080 --> 0:05:51.880
<v Speaker 1>play a pivotal role in how the company would introduce Siri,

0:05:52.560 --> 0:05:56.120
<v Speaker 1>but this was not a Steve Jobs project because Jobs

0:05:56.120 --> 0:05:59.840
<v Speaker 1>had been ousted from the company Apple, or he had

0:06:00.080 --> 0:06:03.240
<v Speaker 1>quit in disgust, depending upon which version of the story

0:06:03.600 --> 0:06:06.159
<v Speaker 1>you're listening to. Anyway, he had left a couple of

0:06:06.240 --> 0:06:10.279
<v Speaker 1>years before this video was produced. The Knowledge Navigator was

0:06:10.360 --> 0:06:14.200
<v Speaker 1>something that Apple CEO John Sculley had described in a

0:06:14.279 --> 0:06:18.640
<v Speaker 1>book titled Odyssey. Now, of course, in science fiction stories

0:06:19.400 --> 0:06:22.240
<v Speaker 1>we have no shortage of instances where a human is

0:06:22.279 --> 0:06:26.800
<v Speaker 1>interacting with a computer or otherwise artificially intelligent device like

0:06:26.839 --> 0:06:30.520
<v Speaker 1>a robot, but the Knowledge Navigator seemed to lay down

0:06:30.560 --> 0:06:35.160
<v Speaker 1>the foundations toward future products like Siri and the iPad,

0:06:35.440 --> 0:06:39.040
<v Speaker 1>not to mention the potential uses of the Internet, which

0:06:39.040 --> 0:06:44.080
<v Speaker 1>in nineteen eighty-seven was definitely a thing. It existed, but most of

0:06:44.120 --> 0:06:48.440
<v Speaker 1>the mainstream public remained unaware of it because the World Wide

0:06:48.440 --> 0:06:51.919
<v Speaker 1>Web wouldn't even come along for another few years. However,

0:06:52.360 --> 0:06:54.760
<v Speaker 1>while you can look at this video and say, ah,

0:06:55.480 --> 0:06:59.520
<v Speaker 1>this must be where Apple got that idea, they probably

0:06:59.560 --> 0:07:02.400
<v Speaker 1>got to work right away on Siri. Well, you'd be

0:07:02.480 --> 0:07:06.960
<v Speaker 1>wrong because the early work, in fact, the vast bulk

0:07:07.440 --> 0:07:09.880
<v Speaker 1>of the work on Siri to bring it to life,

0:07:10.440 --> 0:07:14.560
<v Speaker 1>didn't start at Apple at all. It didn't involve the company.

0:07:14.600 --> 0:07:19.200
<v Speaker 1>So our story now turns to a very different organization,

0:07:19.600 --> 0:07:25.640
<v Speaker 1>the Defense Advanced Research Projects Agency, better known as DARPA.

0:07:25.760 --> 0:07:29.600
<v Speaker 1>Now this is part of the United States Department of Defense.

0:07:30.080 --> 0:07:33.120
<v Speaker 1>Back in nineteen fifty eight, the then President of the

0:07:33.200 --> 0:07:39.080
<v Speaker 1>United States, Dwight D. Eisenhower, authorized the foundation of this agency,

0:07:39.280 --> 0:07:42.240
<v Speaker 1>though at the time it was called the Advanced Research

0:07:42.320 --> 0:07:46.880
<v Speaker 1>Projects Agency, or ARPA. Defense would be added later. This

0:07:46.960 --> 0:07:49.400
<v Speaker 1>agency would play a critical role in the evolution of

0:07:49.440 --> 0:07:53.960
<v Speaker 1>technologies in the United States, and the mission of DARPA

0:07:54.040 --> 0:07:58.520
<v Speaker 1>and ARPA before it is quote to make pivotal investments

0:07:58.600 --> 0:08:03.000
<v Speaker 1>in breakthrough technologies for national security, end quote, and

0:08:03.040 --> 0:08:07.240
<v Speaker 1>that wording is really precise. It's easy to imagine DARPA

0:08:07.320 --> 0:08:11.600
<v Speaker 1>as being housed in some enormous underground bunker filled with

0:08:11.640 --> 0:08:16.520
<v Speaker 1>scientists who are building out crazy devices like robo scorpions

0:08:16.640 --> 0:08:19.680
<v Speaker 1>or a blender that can also teleport or something. But

0:08:19.800 --> 0:08:26.080
<v Speaker 1>in reality, DARPA is more about funding research than conducting research. Now,

0:08:26.080 --> 0:08:29.520
<v Speaker 1>don't get me wrong, the agency relies heavily on experts

0:08:29.560 --> 0:08:33.240
<v Speaker 1>to evaluate proposals and consider to whom the agency should

0:08:33.280 --> 0:08:36.959
<v Speaker 1>send money. But the purpose of DARPA is to enable

0:08:37.120 --> 0:08:41.680
<v Speaker 1>others to do important work. DARPA has played a huge

0:08:41.840 --> 0:08:46.640
<v Speaker 1>role in countless technological breakthroughs this way. Much of the

0:08:46.679 --> 0:08:49.960
<v Speaker 1>technologies that would go on to power the Internet started

0:08:50.000 --> 0:08:53.400
<v Speaker 1>with ARPANET, a kind of precursor network to the

0:08:53.400 --> 0:08:57.400
<v Speaker 1>Internet and one that was funded by ARPA. Thus the

0:08:57.520 --> 0:09:01.600
<v Speaker 1>name. The DARPA Grand Challenges helped get self-driving

0:09:01.640 --> 0:09:05.880
<v Speaker 1>cars into gear. You know, pun intended. They also created

0:09:05.960 --> 0:09:09.720
<v Speaker 1>difficult scenarios for humanoid robots to go through. That was

0:09:09.760 --> 0:09:13.120
<v Speaker 1>a few years ago and was really cool. The competitions

0:09:13.200 --> 0:09:17.640
<v Speaker 1>DARPA hosts have specific goals and metrics, and that guides

0:09:17.720 --> 0:09:20.840
<v Speaker 1>the designers and engineers who are working on them as

0:09:20.840 --> 0:09:24.720
<v Speaker 1>they build out technologies. It's good to define your goal.

0:09:24.840 --> 0:09:28.080
<v Speaker 1>It really gives you focus when you're trying to develop

0:09:28.160 --> 0:09:31.360
<v Speaker 1>the technology to meet that goal. Winning a challenge is

0:09:31.400 --> 0:09:34.320
<v Speaker 1>a big deal, though the cash prize may not even

0:09:34.360 --> 0:09:37.880
<v Speaker 1>cover the amount of money participants have spent through the

0:09:37.880 --> 0:09:42.400
<v Speaker 1>development of those technologies, and there are entire businesses, or

0:09:42.559 --> 0:09:46.680
<v Speaker 1>at least divisions within businesses that can be borne out

0:09:46.679 --> 0:09:50.400
<v Speaker 1>of these challenges. The Grand Challenges are just one way

0:09:50.520 --> 0:09:55.200
<v Speaker 1>DARPA encourages technological development. Often, the agency will create a

0:09:55.240 --> 0:09:59.480
<v Speaker 1>specific goal such as the design of a robotic exoskeleton

0:09:59.559 --> 0:10:03.000
<v Speaker 1>that can help, you know, US soldiers carry heavy loads

0:10:03.160 --> 0:10:06.800
<v Speaker 1>while they are on foot for longer distances, and then

0:10:06.840 --> 0:10:10.439
<v Speaker 1>they'll send out an RFP, which is a request for proposal.

0:10:11.120 --> 0:10:14.680
<v Speaker 1>The agency considers the proposals that it receives from this

0:10:14.840 --> 0:10:19.040
<v Speaker 1>RFP and then decides which, if any, they will accept

0:10:19.160 --> 0:10:22.320
<v Speaker 1>and then fund. Then after a given amount of time,

0:10:22.400 --> 0:10:25.840
<v Speaker 1>you know, it's dependent upon the specific project, we find

0:10:25.880 --> 0:10:28.960
<v Speaker 1>out if anything comes out of it. Sometimes nothing does,

0:10:29.360 --> 0:10:33.360
<v Speaker 1>as some technological problems may prove more challenging than others

0:10:33.400 --> 0:10:37.680
<v Speaker 1>and may require more time to evolve the various technologies

0:10:37.720 --> 0:10:40.400
<v Speaker 1>to make it possible. So it might push the field,

0:10:40.640 --> 0:10:42.760
<v Speaker 1>but you might not have a finished product at the

0:10:42.800 --> 0:10:45.120
<v Speaker 1>end of it. Other times you do get a finished

0:10:45.120 --> 0:10:49.240
<v Speaker 1>product. Anyway, in two thousand three, a decade and a

0:10:49.280 --> 0:10:52.840
<v Speaker 1>half after the Knowledge Navigator videos came out of Apple,

0:10:53.480 --> 0:10:57.040
<v Speaker 1>DARPA identified a new opportunity, and this was one that

0:10:57.120 --> 0:11:00.960
<v Speaker 1>was born out of necessity. The challenge was that we

0:11:01.040 --> 0:11:04.360
<v Speaker 1>have access to way more information today than we did

0:11:04.360 --> 0:11:08.440
<v Speaker 1>in the past. So decades ago, military commanders had to

0:11:08.480 --> 0:11:12.960
<v Speaker 1>make decisions based on limited information. They'd rely a great

0:11:13.040 --> 0:11:17.280
<v Speaker 1>deal on their own expertise and experience in order to

0:11:17.400 --> 0:11:19.360
<v Speaker 1>make up for the fact that they only had part

0:11:19.400 --> 0:11:22.160
<v Speaker 1>of the picture. And while a great commander has a

0:11:22.200 --> 0:11:26.199
<v Speaker 1>better chance of making the right call than an inexperienced

0:11:26.200 --> 0:11:30.119
<v Speaker 1>commander would, the limited amount of information could still contribute

0:11:30.160 --> 0:11:33.840
<v Speaker 1>to disaster. You might be the greatest commander of all time,

0:11:34.400 --> 0:11:37.319
<v Speaker 1>but if you're lacking a key part of information, you

0:11:37.400 --> 0:11:41.160
<v Speaker 1>might make a decision that is terrible. So flash forward

0:11:41.200 --> 0:11:44.120
<v Speaker 1>to two thousand three, and now the story had kind

0:11:44.200 --> 0:11:48.800
<v Speaker 1>of flip flopped. Now military commanders would receive more information

0:11:48.840 --> 0:11:52.920
<v Speaker 1>than they could reasonably handle. The challenge now wasn't to

0:11:53.120 --> 0:11:56.120
<v Speaker 1>use intuition to make up for blind spots, but rather,

0:11:56.559 --> 0:11:59.600
<v Speaker 1>how do you synthesize all this information so that you

0:11:59.640 --> 0:12:03.960
<v Speaker 1>can make the right decision. Too much information was proving

0:12:04.000 --> 0:12:06.640
<v Speaker 1>to be kind of as big a problem as too

0:12:06.720 --> 0:12:11.240
<v Speaker 1>little information, at least in some cases, and so DARPA

0:12:11.240 --> 0:12:14.240
<v Speaker 1>wished to fund the development of a smart system that

0:12:14.320 --> 0:12:17.560
<v Speaker 1>could help commanders make sense of all the data coming

0:12:17.600 --> 0:12:21.840
<v Speaker 1>in from day to day. Now, DARPA projects tend to

0:12:21.880 --> 0:12:26.360
<v Speaker 1>be labyrinthian, with lots of bits and pieces and a

0:12:26.360 --> 0:12:30.160
<v Speaker 1>lot of different companies and research labs and more organizations

0:12:30.240 --> 0:12:33.800
<v Speaker 1>might tackle all or part of one of these projects.

0:12:34.400 --> 0:12:38.199
<v Speaker 1>The cognitive computing section of DARPA had a program called

0:12:38.360 --> 0:12:44.640
<v Speaker 1>Perceptive Assistant that Learns, or PAL, which seems nice. It

0:12:44.760 --> 0:12:47.520
<v Speaker 1>was this part of the program that would fund the

0:12:47.559 --> 0:12:52.200
<v Speaker 1>development of a virtual cognitive assistant. The amount of funding

0:12:52.640 --> 0:12:57.520
<v Speaker 1>was twenty two million dollars. What a great PAL. The

0:12:57.640 --> 0:13:02.880
<v Speaker 1>organization that landed this deal was SRI International,

0:13:03.240 --> 0:13:11.160
<v Speaker 1>itself an incredibly influential organization. It's a nonprofit scientific research institution.

0:13:11.520 --> 0:13:16.319
<v Speaker 1>Originally it was called the Stanford Research Institute because it

0:13:16.360 --> 0:13:20.000
<v Speaker 1>was established by the trustees of Stanford University back in

0:13:20.120 --> 0:13:24.120
<v Speaker 1>nineteen forty six, though the organization would separate from the

0:13:24.200 --> 0:13:28.160
<v Speaker 1>university formally in the nineteen seventies and become a standalone,

0:13:28.240 --> 0:13:33.480
<v Speaker 1>nonprofit scientific research lab. The organization has played a role

0:13:33.520 --> 0:13:38.120
<v Speaker 1>in advancing materials science, developing liquid crystal displays or

0:13:38.160 --> 0:13:43.280
<v Speaker 1>LCDs, creating telesurgery implementations, and more. And now

0:13:43.360 --> 0:13:46.720
<v Speaker 1>it was going to tackle DARPA's request for a cognitive

0:13:46.760 --> 0:13:52.360
<v Speaker 1>computer assistant. SRI International created a project called

0:13:52.400 --> 0:13:58.200
<v Speaker 1>the Cognitive Assistant that Learns and Organizes or KALO or

0:13:58.400 --> 0:14:01.320
<v Speaker 1>CALO if you prefer. And this appears to be another

0:14:01.360 --> 0:14:05.440
<v Speaker 1>case where they landed upon that acronym first and then

0:14:05.559 --> 0:14:09.480
<v Speaker 1>worked backward, as CALO seems to come from the Latin

0:14:09.520 --> 0:14:15.840
<v Speaker 1>word calonis, which means soldier's servant, and I probably mispronounced

0:14:15.840 --> 0:14:19.240
<v Speaker 1>that because even though I was a medievalist, it's almost

0:14:19.280 --> 0:14:23.720
<v Speaker 1>criminal I never took Latin. The concept, however, hearkens back

0:14:23.720 --> 0:14:26.280
<v Speaker 1>to some of what we would see in that Knowledge

0:14:26.400 --> 0:14:30.560
<v Speaker 1>Navigator video from nineteen eighty-seven: a system that would be able to

0:14:30.640 --> 0:14:36.400
<v Speaker 1>receive and interpret information, presumably from multiple sources, and provide

0:14:36.400 --> 0:14:41.040
<v Speaker 1>a meaningful presentation or even interpretation of that data to humans,

0:14:41.760 --> 0:14:44.880
<v Speaker 1>which is a pretty tall order, and let's break down

0:14:45.120 --> 0:14:47.400
<v Speaker 1>a bit of what an assistant would need to do

0:14:47.520 --> 0:14:50.920
<v Speaker 1>in order to accomplish this. We'll leave out the voice

0:14:50.920 --> 0:14:54.040
<v Speaker 1>activation parts for now, as that would not be absolutely

0:14:54.040 --> 0:14:56.080
<v Speaker 1>critical to make this work. You know, you might have

0:14:56.120 --> 0:14:59.680
<v Speaker 1>a system that gives daily briefings on its own, or

0:15:00.040 --> 0:15:02.680
<v Speaker 1>you might have one that you activate through text commands

0:15:02.760 --> 0:15:05.840
<v Speaker 1>or some other user interface. It wouldn't necessarily have to

0:15:05.880 --> 0:15:08.840
<v Speaker 1>be voice activated. But on the back end, what has

0:15:08.880 --> 0:15:11.680
<v Speaker 1>to happen for this to work? Well, presumably such a

0:15:11.680 --> 0:15:14.480
<v Speaker 1>system would need to pull in data from a number

0:15:14.560 --> 0:15:18.680
<v Speaker 1>of disparate sources, so the assistant wouldn't just be reciting

0:15:18.680 --> 0:15:23.600
<v Speaker 1>facts and figures that were coming from a centralized data server. Instead,

0:15:23.600 --> 0:15:27.040
<v Speaker 1>it might be assimilating data from numerous sources into a

0:15:27.120 --> 0:15:31.000
<v Speaker 1>cohesive presentation. On top of that, the data might be

0:15:31.000 --> 0:15:33.680
<v Speaker 1>in different formats, meaning the system would need to be

0:15:33.680 --> 0:15:37.800
<v Speaker 1>able to analyze the information inside different types of files.

0:15:38.880 --> 0:15:42.120
<v Speaker 1>This isn't an easy thing to do. There's a reason

0:15:42.280 --> 0:15:45.000
<v Speaker 1>we have a lot of specialized programs for working with

0:15:45.040 --> 0:15:49.120
<v Speaker 1>specific types of files. When I put together these podcasts,

0:15:49.680 --> 0:15:53.000
<v Speaker 1>I use a word processor for my notes, and I

0:15:53.160 --> 0:15:56.840
<v Speaker 1>use an audio editing piece of software to record and

0:15:57.080 --> 0:16:00.479
<v Speaker 1>edit the podcasts. Now I need both of those programs

0:16:00.680 --> 0:16:04.000
<v Speaker 1>because neither of them can do the job that the

0:16:04.040 --> 0:16:06.720
<v Speaker 1>other one does. I don't have like an all-purpose

0:16:06.760 --> 0:16:11.440
<v Speaker 1>program that does everything. Accessing different file formats, even in

0:16:11.480 --> 0:16:15.760
<v Speaker 1>the same general family of applications is tricky. Beyond that,

0:16:16.320 --> 0:16:20.360
<v Speaker 1>the way information can be presented within each file could

0:16:20.360 --> 0:16:23.880
<v Speaker 1>be very different. It's very possible for us to open

0:16:23.960 --> 0:16:28.800
<v Speaker 1>up multiple spreadsheets and even using the same basic spreadsheet

0:16:28.800 --> 0:16:31.160
<v Speaker 1>program, let's just say Excel. It's possible for us to

0:16:31.200 --> 0:16:35.240
<v Speaker 1>open up half a dozen Excel spreadsheets that are all

0:16:35.280 --> 0:16:38.680
<v Speaker 1>presenting the same information but doing so in different ways,

0:16:38.880 --> 0:16:41.760
<v Speaker 1>and that might not be obvious at casual glance. You

0:16:41.840 --> 0:16:44.960
<v Speaker 1>might look at one and the other and not immediately realize, oh,

0:16:45.160 --> 0:16:48.200
<v Speaker 1>these are both saying the same thing. Just think about

0:16:48.200 --> 0:16:51.000
<v Speaker 1>how information could be presented as a table or a

0:16:51.000 --> 0:16:55.560
<v Speaker 1>graph or a chart. The AI assistant would ideally be

0:16:55.640 --> 0:16:59.040
<v Speaker 1>able to access information no matter what format it was in,

0:16:59.560 --> 0:17:02.880
<v Speaker 1>no matter what version of that format it was in,

0:17:02.960 --> 0:17:05.199
<v Speaker 1>be able to interpret it and then be able to

0:17:05.240 --> 0:17:09.280
<v Speaker 1>deliver a meaningful analysis to the user. Now, as data

0:17:09.320 --> 0:17:13.560
<v Speaker 1>sets grow, this becomes increasingly difficult, which I should point

0:17:13.600 --> 0:17:16.600
<v Speaker 1>out is the whole reason DARPA wanted to fund research

0:17:16.640 --> 0:17:19.800
<v Speaker 1>into this in the first place. Military commanders were faced

0:17:19.840 --> 0:17:23.360
<v Speaker 1>with a growing mountain of information that was increasingly difficult

0:17:23.400 --> 0:17:28.600
<v Speaker 1>to parse. The analysis might also need to incorporate natural

0:17:28.720 --> 0:17:32.479
<v Speaker 1>language recognition features. And I've talked about natural language a

0:17:32.480 --> 0:17:35.480
<v Speaker 1>lot in previous episodes, but if we boil it down,

0:17:35.720 --> 0:17:38.679
<v Speaker 1>it's the language that we humans use to communicate with

0:17:38.720 --> 0:17:43.399
<v Speaker 1>one another. It's our natural way of expressing our thoughts.

0:17:43.440 --> 0:17:47.119
<v Speaker 1>But the way we humans process and communicate information is

0:17:47.240 --> 0:17:51.080
<v Speaker 1>different from how machines do it. We can be subtle.

0:17:51.400 --> 0:17:54.919
<v Speaker 1>We can use stuff like metaphors and allegories and just

0:17:55.080 --> 0:17:59.960
<v Speaker 1>different phrasing. Computers are, you know, a lot more literal. Hey,

0:18:00.119 --> 0:18:02.960
<v Speaker 1>if you break it down to the most basic unit

0:18:03.240 --> 0:18:06.600
<v Speaker 1>of machine information, you know, the bit. You see how

0:18:06.680 --> 0:18:10.560
<v Speaker 1>literal computers are. A bit is either a zero or

0:18:10.600 --> 0:18:13.600
<v Speaker 1>a one, or if you prefer, it's either off or

0:18:13.840 --> 0:18:18.159
<v Speaker 1>on, or no or yes. But using lots of bits,

0:18:18.359 --> 0:18:21.359
<v Speaker 1>we can describe information in a way that provides more

0:18:21.400 --> 0:18:24.320
<v Speaker 1>subtlety than just no or yes. But my point is that

0:18:24.359 --> 0:18:28.520
<v Speaker 1>computers don't naturally process information the way we do, and

0:18:28.600 --> 0:18:33.400
<v Speaker 1>so an entire branch of artificial intelligence called natural language

0:18:33.400 --> 0:18:37.880
<v Speaker 1>processing evolved to create ways for computers to interpret what

0:18:37.960 --> 0:18:42.680
<v Speaker 1>we mean when we express things within natural language. Making

0:18:42.720 --> 0:18:46.080
<v Speaker 1>this more complicated is that, of course, there's no one

0:18:46.240 --> 0:18:49.439
<v Speaker 1>way to say any given thing. We've got lots of

0:18:49.480 --> 0:18:53.040
<v Speaker 1>ways to express the same general thought. And added to that,

0:18:53.680 --> 0:18:58.400
<v Speaker 1>we have lots of different languages. There are around seven

0:18:58.440 --> 0:19:02.320
<v Speaker 1>thousand different languages spoken in the world today,

0:19:02.640 --> 0:19:04.919
<v Speaker 1>though you could probably get away with a couple of

0:19:05.040 --> 0:19:08.399
<v Speaker 1>dozen and cover the vast majority of the world's population

0:19:08.520 --> 0:19:11.840
<v Speaker 1>that way. But these languages have their own vocabularies, their

0:19:11.840 --> 0:19:16.119
<v Speaker 1>own syntaxes, their own expressions. So not only do we

0:19:16.200 --> 0:19:19.320
<v Speaker 1>have multiple ways of saying things within one language, we

0:19:19.400 --> 0:19:22.960
<v Speaker 1>have all these different languages to worry about. If you

0:19:23.000 --> 0:19:26.320
<v Speaker 1>were to send ten people into a room with an

0:19:26.320 --> 0:19:29.600
<v Speaker 1>AI assistant, and those ten people have a task they're

0:19:29.640 --> 0:19:33.000
<v Speaker 1>supposed to perform with the help of this AI assistant,

0:19:33.680 --> 0:19:36.240
<v Speaker 1>odds are no two people are going to go about

0:19:36.280 --> 0:19:40.240
<v Speaker 1>it exactly the same way. And yet a working virtual

0:19:40.280 --> 0:19:43.359
<v Speaker 1>assistant needs to be able to interpret and respond to

0:19:43.560 --> 0:19:47.120
<v Speaker 1>every case and do so reliably. On the back end,

0:19:47.440 --> 0:19:50.080
<v Speaker 1>an AI system needs to be able to interpret data

0:19:50.119 --> 0:19:53.480
<v Speaker 1>coming from different sources that may have very different ways

0:19:53.520 --> 0:19:58.720
<v Speaker 1>of expressing similar ideas. This is an enormous task. Now,

0:19:58.720 --> 0:20:01.560
<v Speaker 1>when we come back, I'll talk more about what SRI

0:20:01.680 --> 0:20:04.520
<v Speaker 1>was doing and how the military project would

0:20:04.520 --> 0:20:08.560
<v Speaker 1>evolve ultimately into Apple's Personal Assistant. But first let's take

0:20:08.880 --> 0:20:19.359
<v Speaker 1>a quick break. Now I've only scratched the surface of

0:20:19.440 --> 0:20:22.840
<v Speaker 1>what the creation of an AI assistant capable of

0:20:22.880 --> 0:20:27.280
<v Speaker 1>accessing information from numerous sources and making that information useful

0:20:27.800 --> 0:20:32.040
<v Speaker 1>really required. Let's talk a bit about the parameters of

0:20:32.080 --> 0:20:35.399
<v Speaker 1>this project itself. So if you remember I said that

0:20:35.480 --> 0:20:38.919
<v Speaker 1>the deal was initially for twenty two million dollars, and

0:20:39.000 --> 0:20:42.200
<v Speaker 1>that would end up funding the creation of a five

0:20:42.400 --> 0:20:47.720
<v Speaker 1>hundred person project, and the project spanned five years initially

0:20:47.880 --> 0:20:51.680
<v Speaker 1>to investigate the possibility of building out such an AI system.

0:20:51.720 --> 0:20:55.159
<v Speaker 1>Over time, more money would end up going into the

0:20:55.240 --> 0:20:58.760
<v Speaker 1>research system, and it totaled around a hundred fifty million

0:20:58.800 --> 0:21:01.399
<v Speaker 1>dollars by the end of the project. The lab

0:21:01.560 --> 0:21:04.920
<v Speaker 1>where it all went down would receive the charming nickname

0:21:05.200 --> 0:21:08.760
<v Speaker 1>nerd City. A large part of the project focused on

0:21:08.840 --> 0:21:13.159
<v Speaker 1>creating a program that could learn a user's behaviors. So

0:21:13.200 --> 0:21:17.359
<v Speaker 1>not only could this personal assistant respond to what you

0:21:17.400 --> 0:21:22.760
<v Speaker 1>were asking, it would gradually learn the way you behaved

0:21:22.840 --> 0:21:26.240
<v Speaker 1>and it would adapt to you to work more effectively.

0:21:26.800 --> 0:21:31.040
<v Speaker 1>Now this comes into the arena of pattern recognition. We

0:21:31.280 --> 0:21:34.840
<v Speaker 1>humans are pretty darn good at recognizing patterns. In fact,

0:21:35.400 --> 0:21:39.480
<v Speaker 1>we're so good that sometimes we will quote unquote recognize

0:21:39.560 --> 0:21:43.919
<v Speaker 1>a pattern even when there isn't a pattern there. In

0:21:43.960 --> 0:21:47.880
<v Speaker 1>some cases, this can come across as charming, such as

0:21:48.280 --> 0:21:52.040
<v Speaker 1>when we see a face in a cloud, right, that's

0:21:52.560 --> 0:21:55.880
<v Speaker 1>not really a pattern there. We're recognizing a pattern where

0:21:55.880 --> 0:21:58.639
<v Speaker 1>none really exists. It's all based on our perspective in

0:21:58.640 --> 0:22:02.560
<v Speaker 1>our imaginations. Now, in other cases, it's not so charming.

0:22:02.600 --> 0:22:05.159
<v Speaker 1>It can actually lead to faulty reasoning. So I'm going

0:22:05.200 --> 0:22:08.120
<v Speaker 1>to give you a very basic example that I hear

0:22:08.200 --> 0:22:11.880
<v Speaker 1>all the time, particularly now that we're in October and

0:22:11.960 --> 0:22:16.439
<v Speaker 1>there's some full moon weirdness going on. So there's a

0:22:16.480 --> 0:22:21.320
<v Speaker 1>fairly widespread belief that there's a connection between full moons

0:22:21.359 --> 0:22:25.280
<v Speaker 1>and an increase in the number of medical emergencies that happen.

0:22:25.359 --> 0:22:29.520
<v Speaker 1>Generally speaking, the idea is that people act irresponsibly during a full moon,

0:22:29.640 --> 0:22:33.760
<v Speaker 1>and that often results in injury, which means greater activity

0:22:33.800 --> 0:22:38.480
<v Speaker 1>at hospitals. Now, this belief is most likely due to

0:22:38.640 --> 0:22:43.680
<v Speaker 1>confirmation bias. That is, we already have a belief in place,

0:22:44.040 --> 0:22:46.880
<v Speaker 1>and the belief is that full moons lead to more

0:22:46.920 --> 0:22:51.000
<v Speaker 1>accidents because of people acting irresponsibly. That is what we believe.

0:22:51.720 --> 0:22:55.760
<v Speaker 1>It doesn't have evidence yet, and then when things do

0:22:55.920 --> 0:22:58.960
<v Speaker 1>get busy at a hospital and there happens to be

0:22:59.000 --> 0:23:03.159
<v Speaker 1>a full moon, we register that as evidence for our belief. Aha,

0:23:03.920 --> 0:23:07.840
<v Speaker 1>says the mistaken person. The full moon explains it. However,

0:23:08.200 --> 0:23:11.080
<v Speaker 1>on nights when it is busy but there is no

0:23:11.160 --> 0:23:14.160
<v Speaker 1>full moon, there's no hit, no one, no one takes

0:23:14.200 --> 0:23:17.280
<v Speaker 1>notice of how odd you know, it's crazy busy, but

0:23:17.359 --> 0:23:20.959
<v Speaker 1>there's no full moon tonight. We don't do that. Likewise,

0:23:21.520 --> 0:23:25.000
<v Speaker 1>if it happens to not be busy but there's a

0:23:25.040 --> 0:23:27.800
<v Speaker 1>full moon, you're also not likely to notice. You're not

0:23:27.880 --> 0:23:30.159
<v Speaker 1>likely to say, like, huh, it's not very busy tonight,

0:23:30.200 --> 0:23:33.560
<v Speaker 1>but there's a full moon out. So it's only when

0:23:33.800 --> 0:23:37.120
<v Speaker 1>you have the full moon and the busy hospital where

0:23:37.119 --> 0:23:41.360
<v Speaker 1>the evidence appears to support your belief and confirm your bias.
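
NOTE Editorial sketch, not part of the episode audio. The four cases just described (busy or quiet, full moon or not) form a two-by-two table, and the bias comes from only remembering one cell of it. The Python below counts all four cells and compares busy rates. The nightly counts are invented purely for illustration and are not real hospital data.

```python
from collections import Counter

# Hypothetical log of 100 nights as (full_moon, busy) pairs.
# These numbers are made up for illustration only.
nights = (
    [(True, True)] * 3        # full moon, busy: the "hits" people remember
    + [(True, False)] * 7     # full moon, quiet: rarely noticed
    + [(False, True)] * 27    # no full moon, busy: rarely noticed
    + [(False, False)] * 63   # no full moon, quiet
)

counts = Counter(nights)


def busy_rate(full_moon: bool) -> float:
    """Fraction of nights that were busy, for a given moon phase."""
    busy = counts[(full_moon, True)]
    quiet = counts[(full_moon, False)]
    return busy / (busy + quiet)


# If the two rates are about equal, the full moon tells you nothing
# about how busy the night will be, no matter how vivid the hits feel.
print(f"busy rate on full-moon nights: {busy_rate(True):.2f}")
print(f"busy rate on other nights:     {busy_rate(False):.2f}")
```

In this made-up log the busy rate is the same with or without a full moon, which is what "it falls apart" looks like once every cell of the table is counted.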

0:23:42.040 --> 0:23:44.480
<v Speaker 1>But in truth, when you take a step back and

0:23:44.560 --> 0:23:47.520
<v Speaker 1>you do an objective study and you look at the

0:23:47.640 --> 0:23:50.440
<v Speaker 1>times when a hospital is busy, and you look at

0:23:50.520 --> 0:23:52.439
<v Speaker 1>when there was a full moon, and you look to

0:23:52.440 --> 0:23:56.280
<v Speaker 1>see if there's any correlation, it falls apart. Now I

0:23:56.320 --> 0:23:58.959
<v Speaker 1>got a little off track there, But the point I

0:23:58.960 --> 0:24:03.040
<v Speaker 1>wanted to make is that we humans are biologically attuned

0:24:03.240 --> 0:24:08.080
<v Speaker 1>to recognizing patterns. It's very likely that pattern recognition is

0:24:08.080 --> 0:24:11.240
<v Speaker 1>one of the traits that really helped us survive thousands

0:24:11.240 --> 0:24:14.359
<v Speaker 1>of years ago, which is why it's so intrinsic in

0:24:14.400 --> 0:24:19.359
<v Speaker 1>the human experience. But building programs, computer systems that are

0:24:19.359 --> 0:24:23.880
<v Speaker 1>capable of identifying patterns and separating out what is signal

0:24:24.119 --> 0:24:28.000
<v Speaker 1>versus what is noise is its own really big challenge.

0:24:28.800 --> 0:24:31.280
<v Speaker 1>SRI was hoping to create a program that

0:24:31.320 --> 0:24:34.520
<v Speaker 1>could look for patterns in user behavior in order to

0:24:34.640 --> 0:24:38.879
<v Speaker 1>respond with greater precision and accuracy to user requests and

0:24:39.040 --> 0:24:43.680
<v Speaker 1>ultimately to anticipate future requests. Now we see the sort

0:24:43.720 --> 0:24:47.960
<v Speaker 1>of pattern recognition and response in lots of technology today.

0:24:48.000 --> 0:24:51.240
<v Speaker 1>There are several smart thermostats on the market right now,

0:24:51.440 --> 0:24:55.200
<v Speaker 1>for example, that can track when you tend to raise

0:24:55.480 --> 0:24:58.399
<v Speaker 1>or lower the temperature in your home, and after a while,

0:24:58.640 --> 0:25:01.480
<v Speaker 1>the thermostat learns that, hey, maybe you like it nice

0:25:01.480 --> 0:25:03.840
<v Speaker 1>and chilly at night, but you want it to be

0:25:03.960 --> 0:25:07.320
<v Speaker 1>warm and toasty in the morning, and so the thermostat

0:25:07.400 --> 0:25:10.840
<v Speaker 1>begins to adjust itself in preparation for that based on

0:25:10.920 --> 0:25:14.800
<v Speaker 1>your previous behaviors. Now that is a very simple example.
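
NOTE Editorial sketch, not part of the episode audio. One very simple way a thermostat could learn from past adjustments is to remember what temperature you chose at each hour and predict the average. The class name, method names, and default temperature below are assumptions made up for this example. Real products are likely to use something more sophisticated.

```python
from collections import defaultdict
from statistics import mean


class LearningThermostat:
    """Toy model: remembers manual adjustments, grouped by hour of day."""

    def __init__(self, default_temp: float = 20.0) -> None:
        self.default_temp = default_temp
        # hour of day (0-23) mapped to the temperatures the user chose then
        self.history = defaultdict(list)

    def record_adjustment(self, hour: int, temp: float) -> None:
        """Store one manual change made by the user."""
        self.history[hour].append(temp)

    def predicted_setpoint(self, hour: int) -> float:
        """Average of past choices for this hour, or the default if none."""
        temps = self.history.get(hour)
        return mean(temps) if temps else self.default_temp


thermostat = LearningThermostat()
thermostat.record_adjustment(22, 17.0)  # chilly at night
thermostat.record_adjustment(22, 18.0)
thermostat.record_adjustment(7, 22.0)   # toasty in the morning
print(thermostat.predicted_setpoint(22))
print(thermostat.predicted_setpoint(7))
```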

0:25:15.359 --> 0:25:18.960
<v Speaker 1>Extrapolate that out and you begin to imagine a technology

0:25:19.000 --> 0:25:22.639
<v Speaker 1>that is anticipating what you need or want, perhaps before

0:25:22.680 --> 0:25:26.320
<v Speaker 1>you're even aware of it yourself, which can get kind

0:25:26.359 --> 0:25:29.480
<v Speaker 1>of creepy but also sort of magical. But in truth,

0:25:29.520 --> 0:25:34.639
<v Speaker 1>it's because this system is detecting patterns that we aren't

0:25:34.680 --> 0:25:38.679
<v Speaker 1>even able to recognize ourselves. The danger there, of course,

0:25:39.200 --> 0:25:43.159
<v Speaker 1>is that the systems can sometimes mistakenly identify a pattern

0:25:43.520 --> 0:25:46.120
<v Speaker 1>when in fact there's not really a pattern there. Very

0:25:46.160 --> 0:25:48.720
<v Speaker 1>similar to the case I was explaining about with the

0:25:48.840 --> 0:25:52.800
<v Speaker 1>full moon and the busy hospital. Even computer systems can

0:25:52.800 --> 0:25:56.640
<v Speaker 1>make those sort of mistakes, and depending upon the implementation,

0:25:56.920 --> 0:25:59.960
<v Speaker 1>that can be a real problem. But that's a that's

0:26:00.000 --> 0:26:02.960
<v Speaker 1>an issue for a different podcast. Now. When it comes

0:26:02.960 --> 0:26:06.919
<v Speaker 1>to humans, pattern recognition is so ingrained in most of

0:26:07.000 --> 0:26:09.760
<v Speaker 1>us that it can actually be kind of hard to explain.

0:26:10.000 --> 0:26:13.280
<v Speaker 1>You notice, when something happens, and if that same thing

0:26:13.359 --> 0:26:17.080
<v Speaker 1>happens later with the same general results as the first time,

0:26:17.560 --> 0:26:22.120
<v Speaker 1>it reinforces your first perception of that thing, and if

0:26:22.119 --> 0:26:24.760
<v Speaker 1>it happens over and over, your brain essentially comes to

0:26:24.840 --> 0:26:29.280
<v Speaker 1>understand that when I see X happen, I can expect

0:26:29.400 --> 0:26:33.280
<v Speaker 1>Y to follow, and from that you might eventually realize

0:26:33.320 --> 0:26:36.240
<v Speaker 1>that there are other correlating factors that may or may

0:26:36.240 --> 0:26:39.919
<v Speaker 1>not be present. When this goes on. With computers, the

0:26:39.960 --> 0:26:43.399
<v Speaker 1>goal is to create systems that can analyze input, whether

0:26:43.480 --> 0:26:46.679
<v Speaker 1>that input is an image file or typed text or

0:26:46.760 --> 0:26:50.439
<v Speaker 1>spoken words or whatever, and it first has to interpret

0:26:50.560 --> 0:26:54.320
<v Speaker 1>that input, has to identify it and figure out the

0:26:54.400 --> 0:26:58.480
<v Speaker 1>defining features and attributes of that input, then compare that

0:26:58.880 --> 0:27:02.199
<v Speaker 1>against known patterns to see if the input matches or

0:27:02.359 --> 0:27:05.880
<v Speaker 1>doesn't match those patterns. And in a way, you can

0:27:05.920 --> 0:27:08.439
<v Speaker 1>think of this as a computer system receiving input and

0:27:08.480 --> 0:27:12.639
<v Speaker 1>asking the question have I seen this before? And if so,

0:27:13.200 --> 0:27:17.640
<v Speaker 1>what is the correct response? If the input matches no pattern,

0:27:18.040 --> 0:27:21.000
<v Speaker 1>the system then has to have the correct response for that.

0:27:21.520 --> 0:27:24.919
<v Speaker 1>So a very simple example might just be a failed state,

0:27:25.000 --> 0:27:28.040
<v Speaker 1>in which case the virtual assistant might reply with something

0:27:28.080 --> 0:27:30.920
<v Speaker 1>like I'm sorry, I don't know how to do that yet,

0:27:31.320 --> 0:27:35.320
<v Speaker 1>or something along those lines. Now, remember earlier I mentioned

0:27:35.520 --> 0:27:37.760
<v Speaker 1>that we humans have a lot of different ways to

0:27:37.840 --> 0:27:42.520
<v Speaker 1>say the same general thing. For example, with my smart speaker,

0:27:42.840 --> 0:27:45.480
<v Speaker 1>I might ask it to turn the lights on full,

0:27:45.720 --> 0:27:47.760
<v Speaker 1>meaning I want them to be all the way up.

0:27:48.359 --> 0:27:52.080
<v Speaker 1>I might say make the lights. I might just say

0:27:52.240 --> 0:27:55.240
<v Speaker 1>make it brighter. And the system has to take this input,

0:27:55.680 --> 0:27:59.160
<v Speaker 1>analyze it, and make a statistical determination to guess at

0:27:59.280 --> 0:28:03.119
<v Speaker 1>what it is that I actually want to have happen. I

0:28:03.200 --> 0:28:06.880
<v Speaker 1>say guess because in each case we're really looking at

0:28:06.880 --> 0:28:09.760
<v Speaker 1>a system that has multiple options when it comes to

0:28:09.880 --> 0:28:13.919
<v Speaker 1>a response, and each option gets a probability assigned to

0:28:13.960 --> 0:28:18.600
<v Speaker 1>it based on how closely that option matches with the input,

0:28:19.119 --> 0:28:22.400
<v Speaker 1>So I might say make it brighter, and the underlying

0:28:22.440 --> 0:28:26.560
<v Speaker 1>system recognizes that there's a good chance I mean increase

0:28:26.640 --> 0:28:29.160
<v Speaker 1>the brightness of the lights of the room I'm in,

0:28:29.760 --> 0:28:35.640
<v Speaker 1>and the system has determined that that's the most probable answer. Right,

0:28:35.640 --> 0:28:39.120
<v Speaker 1>it's probably correct, so it goes with that, but still

0:28:39.200 --> 0:28:41.400
<v Speaker 1>kind of a guess. Now, there are a lot of

0:28:41.400 --> 0:28:44.440
<v Speaker 1>different ways to go about doing this, but the one

0:28:44.520 --> 0:28:48.160
<v Speaker 1>you hear about a lot would be artificial neural networks.
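
NOTE Editorial sketch, not part of the episode audio. Before getting to neural networks, the guessing step itself can be shown with something much cruder: score each possible response by how many trigger words it shares with the request, turn the scores into probabilities, pick the most probable one, and fall back to an apology when nothing matches. The intent names, trigger words, and threshold are all invented for this example. Real assistants use far richer models.

```python
# Hypothetical intents and trigger words, invented for this sketch.
INTENTS = {
    "lights_brighter": {"lights", "brighter", "bright", "full", "up"},
    "lights_dimmer": {"lights", "dimmer", "dim", "lower", "down"},
    "play_music": {"play", "music", "song"},
}
FALLBACK = "I'm sorry, I don't know how to do that yet."


def score_intents(utterance: str) -> dict:
    """Assign each intent a probability based on word overlap."""
    words = set(utterance.lower().split())
    overlaps = {name: len(words & triggers) for name, triggers in INTENTS.items()}
    total = sum(overlaps.values())
    if total == 0:
        return {}  # nothing matched any known pattern
    return {name: n / total for name, n in overlaps.items()}


def respond(utterance: str, threshold: float = 0.5) -> str:
    """Pick the most probable intent, or fall back if unsure."""
    scores = score_intents(utterance)
    if not scores:
        return FALLBACK
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else FALLBACK


print(respond("make it brighter"))
print(respond("turn the lights on full"))
print(respond("order a pizza"))
```

The first two requests are worded differently but land on the same intent, and the third matches nothing, so it gets the fallback reply.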

0:28:48.560 --> 0:28:51.600
<v Speaker 1>I've talked a lot about these in recent episodes, so

0:28:51.760 --> 0:28:54.680
<v Speaker 1>we'll just give kind of the quick overview. So you've

0:28:54.720 --> 0:28:59.480
<v Speaker 1>got a computer system that has artificial neurons. These are called nodes,

0:29:00.040 --> 0:29:03.160
<v Speaker 1>and the job of a node is to accept incoming

0:29:03.240 --> 0:29:07.240
<v Speaker 1>input from two or more sources. The node is then

0:29:07.280 --> 0:29:10.760
<v Speaker 1>to perform an operation on those inputs, and then it

0:29:10.800 --> 0:29:13.760
<v Speaker 1>generates an output, which it then passes on to other

0:29:13.920 --> 0:29:17.000
<v Speaker 1>nodes further in the system. You can think of the

0:29:17.080 --> 0:29:20.560
<v Speaker 1>nodes as existing in a series of levels, with the

0:29:20.560 --> 0:29:23.120
<v Speaker 1>top level being where input comes in and the bottom

0:29:23.200 --> 0:29:27.280
<v Speaker 1>level being where the ultimate output comes out. So the

0:29:27.400 --> 0:29:31.760
<v Speaker 1>nodes one level down accept incoming inputs, then perform other

0:29:31.800 --> 0:29:34.880
<v Speaker 1>operations on them and pass it further down the chain

0:29:34.960 --> 0:29:38.640
<v Speaker 1>and so on until ultimately you get an output or response.

0:29:38.680 --> 0:29:42.160
<v Speaker 1>Now that's a gross oversimplification of what's going on, but

0:29:42.320 --> 0:29:45.920
<v Speaker 1>generally you get the idea of the process. Now, let's

0:29:45.920 --> 0:29:48.560
<v Speaker 1>complicate things a little bit to get these sort of

0:29:48.600 --> 0:29:52.240
<v Speaker 1>neural networks to generate the results you want. One thing

0:29:52.280 --> 0:29:56.240
<v Speaker 1>you can do is mess with how each node values

0:29:56.440 --> 0:30:00.320
<v Speaker 1>or weighs each of the inputs coming into that node.

0:30:01.040 --> 0:30:04.480
<v Speaker 1>So I'm going to use some names human names for

0:30:04.600 --> 0:30:08.280
<v Speaker 1>nodes here just to make things easier to understand. Let's

0:30:08.280 --> 0:30:12.960
<v Speaker 1>say we've got a node named Billy. Billy is on

0:30:13.000 --> 0:30:15.840
<v Speaker 1>the second layer of nodes, so it's one layer down

0:30:15.880 --> 0:30:19.680
<v Speaker 1>from where direct input comes into the system. So there

0:30:19.680 --> 0:30:24.240
<v Speaker 1>are nodes above Billy that are sending information to Billy.

0:30:24.480 --> 0:30:27.360
<v Speaker 1>We'll say that the two nodes that give Billy information

0:30:27.360 --> 0:30:31.920
<v Speaker 1>are named Sue and Jim Bob. Sue and Jim Bob

0:30:32.320 --> 0:30:35.800
<v Speaker 1>send Billy information, and it's Billy's job to determine what

0:30:36.040 --> 0:30:39.160
<v Speaker 1>further information to send down the pipeline. Like I need

0:30:39.200 --> 0:30:42.280
<v Speaker 1>to do an operation based on this bits of these

0:30:42.320 --> 0:30:44.680
<v Speaker 1>bits of information that are coming to me, and then

0:30:44.720 --> 0:30:47.680
<v Speaker 1>I have to come up with a result. Only Billy

0:30:47.760 --> 0:30:51.200
<v Speaker 1>has been told that Sue's information tends to be a

0:30:51.240 --> 0:30:56.000
<v Speaker 1>little more important than Jim Bob's information is, and so if

0:30:56.040 --> 0:30:58.600
<v Speaker 1>there's a question as to what to do, it's better

0:30:58.640 --> 0:31:03.360
<v Speaker 1>to lean more on Sue's information than on Jim Bob's information.
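
NOTE Editorial sketch, not part of the episode audio. Billy's job can be written as a weighted sum of the two incoming values, squashed into the range 0 to 1. The specific weights (0.8 for Sue, 0.2 for Jim Bob), the zero bias, and the choice of a sigmoid as the node's operation are assumptions picked for illustration. The episode only says that Sue counts for more.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def billy(sue: float, jim_bob: float,
          w_sue: float = 0.8, w_jim_bob: float = 0.2,
          bias: float = 0.0) -> float:
    """One node: weight each input, add them up, apply the operation."""
    return sigmoid(w_sue * sue + w_jim_bob * jim_bob + bias)


# The same signal moves Billy's output more when it comes from Sue
# than when it comes from Jim Bob.
print(billy(sue=1.0, jim_bob=0.0))
print(billy(sue=0.0, jim_bob=1.0))
```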

0:31:03.880 --> 0:31:07.600
<v Speaker 1>We would call this weighting, as in W E I

0:31:07.720 --> 0:31:11.840
<v Speaker 1>G H T I N G. Computer scientists weight the

0:31:11.960 --> 0:31:15.440
<v Speaker 1>inputs going into nodes in order to train a system

0:31:15.520 --> 0:31:19.360
<v Speaker 1>to generate the results the scientists want. One way

0:31:19.400 --> 0:31:22.600
<v Speaker 1>to do this is through a process called back propagation.

0:31:23.320 --> 0:31:27.440
<v Speaker 1>Back propagation is when you know what result you want

0:31:27.640 --> 0:31:30.400
<v Speaker 1>the system to arrive at. So let's use the classic

0:31:30.440 --> 0:31:34.360
<v Speaker 1>example of identifying pictures that have cats in them. As

0:31:34.400 --> 0:31:37.760
<v Speaker 1>a human, you can quickly determine if a photo has

0:31:37.760 --> 0:31:40.560
<v Speaker 1>a cat in it or not. You'll spot it right away.

0:31:40.680 --> 0:31:44.680
<v Speaker 1>So you feed a picture through this system and you

0:31:44.720 --> 0:31:47.280
<v Speaker 1>wait for the system to tell you if yes, there's

0:31:47.280 --> 0:31:50.600
<v Speaker 1>a kitty cat in the picture or no, the image is

0:31:50.720 --> 0:31:53.840
<v Speaker 1>cat free. And let's say that the picture you fed

0:31:53.880 --> 0:31:56.479
<v Speaker 1>to the system in fact does have a cat in it.

0:31:56.680 --> 0:31:58.640
<v Speaker 1>You can see it, but when you feed it through

0:31:58.640 --> 0:32:01.400
<v Speaker 1>the system, the system fails to find the cat

0:32:01.520 --> 0:32:04.600
<v Speaker 1>and says nope, there's no cat here. Well, you know

0:32:05.040 --> 0:32:08.040
<v Speaker 1>that the system got it wrong. So what you might

0:32:08.080 --> 0:32:10.920
<v Speaker 1>do as a computer scientist is you look at that

0:32:11.080 --> 0:32:14.400
<v Speaker 1>final level of nodes right at the output level to

0:32:14.480 --> 0:32:17.840
<v Speaker 1>see which factors led those nodes to come to the

0:32:17.880 --> 0:32:21.600
<v Speaker 1>conclusion that there was no cat in the photo. You

0:32:21.680 --> 0:32:24.200
<v Speaker 1>then look at the inputs that are coming into those

0:32:24.240 --> 0:32:26.959
<v Speaker 1>nodes and you see how they are weighted, and you

0:32:27.080 --> 0:32:31.000
<v Speaker 1>change the weights of those inputs in order to force

0:32:31.120 --> 0:32:34.440
<v Speaker 1>that last level of nodes to say, oh, no, there

0:32:34.480 --> 0:32:37.040
<v Speaker 1>definitely is a cat here. And so on. You move

0:32:37.320 --> 0:32:40.640
<v Speaker 1>up from the output level and you go up level

0:32:40.720 --> 0:32:45.000
<v Speaker 1>by level, tweaking the weightings of incoming data so that

0:32:45.080 --> 0:32:48.720
<v Speaker 1>the system is tweaked to more accurately determine if a

0:32:48.760 --> 0:32:51.719
<v Speaker 1>photo has a cat in it. Now, this takes a

0:32:51.760 --> 0:32:55.760
<v Speaker 1>lot of work, and it also means using huge data sets.

0:32:55.840 --> 0:32:59.520
<v Speaker 1>You know, you're feeding hundreds of thousands or millions of images,

0:32:59.760 --> 0:33:02.800
<v Speaker 1>some of them with cats, some of them without, and

0:33:02.920 --> 0:33:05.280
<v Speaker 1>training the system over and over again to train it

0:33:05.360 --> 0:33:08.560
<v Speaker 1>before you start feeding it brand new images to see

0:33:08.560 --> 0:33:11.240
<v Speaker 1>if it still works. And this can be a laborious

0:33:11.280 --> 0:33:14.240
<v Speaker 1>process to train a machine learning system, but the result

0:33:14.320 --> 0:33:16.840
<v Speaker 1>is that you end up with a system that hopefully

0:33:17.080 --> 0:33:19.640
<v Speaker 1>is pretty accurate at doing whatever it was you were

0:33:19.680 --> 0:33:22.840
<v Speaker 1>training it to do, you know, like recognize cats. But

0:33:22.920 --> 0:33:26.960
<v Speaker 1>that's just one approach to machine learning. There are others.
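
NOTE Editorial sketch, not part of the episode audio. The adjust-the-weights-level-by-level loop described above can be shown on a toy problem. Instead of cat photos, each example here is just two made-up numbers with a 0 or 1 label. The network size (two hidden nodes, one output node), the learning rate, the number of passes, and the squared-error measure are all assumptions chosen to keep the example small. In practice the weight changes are computed automatically from the error rather than tweaked by hand.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy stand-in for "cat / no cat": two features per example and a label.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]


def forward(x, w_hidden, w_out):
    """Run one example through both levels; last weight in each list is a bias."""
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    out = sigmoid(w_out[0] * hidden[0] + w_out[1] * hidden[1] + w_out[2])
    return hidden, out


def mean_squared_error(w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - y) ** 2
               for x, y in DATA) / len(DATA)


def train(epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    w_out = [rng.uniform(-1, 1) for _ in range(3)]
    initial = mean_squared_error(w_hidden, w_out)
    for _ in range(epochs):
        for x, y in DATA:
            hidden, out = forward(x, w_hidden, w_out)
            # Output level: how wrong was the answer, and in which direction?
            delta_out = (out - y) * out * (1.0 - out)
            # One level up: share the blame according to the current weights.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(2)]
            # Nudge the output node's weights and bias.
            for j in range(2):
                w_out[j] -= lr * delta_out * hidden[j]
            w_out[2] -= lr * delta_out
            # Nudge each hidden node's weights and bias.
            for j in range(2):
                w_hidden[j][0] -= lr * delta_hidden[j] * x[0]
                w_hidden[j][1] -= lr * delta_hidden[j] * x[1]
                w_hidden[j][2] -= lr * delta_hidden[j]
    return initial, mean_squared_error(w_hidden, w_out), (w_hidden, w_out)


initial_error, final_error, _ = train()
print(f"error before training: {initial_error:.4f}")
print(f"error after training:  {final_error:.4f}")
```

Because every example carries a known right answer, this is the supervised flavor of training. The error after training should come out well below the error before it.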

0:33:27.600 --> 0:33:30.600
<v Speaker 1>Some like the version I just described, fall into a

0:33:30.640 --> 0:33:37.040
<v Speaker 1>broad category called supervised learning. Others are in unsupervised learning.

0:33:37.320 --> 0:33:42.520
<v Speaker 1>In fact, Kalo was largely built through unsupervised learning, meaning

0:33:42.880 --> 0:33:46.880
<v Speaker 1>the machine had to train itself as it performed tasks

0:33:47.320 --> 0:33:51.240
<v Speaker 1>using inputs that hadn't been curated specifically for training purposes.

0:33:51.280 --> 0:33:54.200
<v Speaker 1>It's just an enormous amount of information coming in that

0:33:54.320 --> 0:33:57.400
<v Speaker 1>the system has to process. So, in other words, for Kalo,

0:33:57.480 --> 0:34:00.160
<v Speaker 1>the system wasn't dealing with like a stack of a

0:34:00.200 --> 0:34:04.480
<v Speaker 1>million photos, seventy percent of which had cats and the rest of which didn't.

0:34:04.920 --> 0:34:08.200
<v Speaker 1>Kalo was working with real world information and attempting to

0:34:08.239 --> 0:34:12.000
<v Speaker 1>suss out what to do with it in real time. Now,

0:34:12.040 --> 0:34:16.080
<v Speaker 1>to go into how unsupervised machine learning works would require

0:34:16.080 --> 0:34:19.080
<v Speaker 1>a full episode on its own, but it is a

0:34:19.120 --> 0:34:23.279
<v Speaker 1>fascinating and complicated subject, so I probably will tackle it

0:34:23.320 --> 0:34:25.600
<v Speaker 1>at some point. I'm just gonna spare you guys for

0:34:25.680 --> 0:34:28.520
<v Speaker 1>right now. The real point I'm making is that SRI

0:34:28.640 --> 0:34:32.040
<v Speaker 1>International spent years building out systems that could

0:34:32.080 --> 0:34:35.920
<v Speaker 1>do a wide range of tasks based on inputs. Pattern

0:34:35.960 --> 0:34:39.600
<v Speaker 1>recognition was actually just one relatively small piece of that.

0:34:40.200 --> 0:34:43.040
<v Speaker 1>Creating an ability to pull data from different sources in

0:34:43.040 --> 0:34:46.759
<v Speaker 1>a meaningful way is its own incredibly challenging problem, as

0:34:46.800 --> 0:34:50.680
<v Speaker 1>I alluded to earlier, particularly as the number of sources

0:34:50.680 --> 0:34:53.920
<v Speaker 1>you're pulling from and the variety of formats the data

0:34:54.000 --> 0:34:57.120
<v Speaker 1>is in begins to increase, it becomes easier for the

0:34:57.120 --> 0:35:00.960
<v Speaker 1>system to make mistakes as you throw more variety at it,

0:35:01.080 --> 0:35:04.800
<v Speaker 1>and it requires a lot of refinement. Frankly, it's actually

0:35:04.960 --> 0:35:08.480
<v Speaker 1>a task that's so big I have trouble grasping it.

0:35:09.120 --> 0:35:13.719
<v Speaker 1>The Kalo project became the largest AI program in history

0:35:13.800 --> 0:35:17.480
<v Speaker 1>up to that point. It was an incredible achievement. It

0:35:17.520 --> 0:35:22.040
<v Speaker 1>brought together different disciplines of artificial intelligence into a cohesive

0:35:22.120 --> 0:35:26.080
<v Speaker 1>project with a solid goal. By the two thousands, artificial

0:35:26.080 --> 0:35:31.120
<v Speaker 1>intelligence was a sprawling collection of computer science disciplines, each

0:35:31.160 --> 0:35:34.480
<v Speaker 1>with incredible depth to them. So you might find an

0:35:34.480 --> 0:35:37.520
<v Speaker 1>expert in one field of AI who would have little

0:35:37.560 --> 0:35:41.400
<v Speaker 1>to no experience with another branch under the same general

0:35:41.440 --> 0:35:45.440
<v Speaker 1>discipline of artificial intelligence. There was a prevailing feeling that

0:35:45.520 --> 0:35:48.680
<v Speaker 1>the various branches of AI had each become so complex

0:35:49.000 --> 0:35:52.960
<v Speaker 1>they would never work together. The Kalo project proved that wrong.

0:35:53.680 --> 0:35:57.000
<v Speaker 1>When we come back, i'll explain how part of this

0:35:57.120 --> 0:36:00.600
<v Speaker 1>military project would break away to become the virtual assistant,

0:36:01.120 --> 0:36:05.160
<v Speaker 1>ultimately finding its way onto iOS devices. But first let's

0:36:05.160 --> 0:36:19.000
<v Speaker 1>take another quick break. Adam Cheyer, whose name I'm likely mispronouncing,

0:36:19.000 --> 0:36:21.880
<v Speaker 1>and I apologize, but he was an engineer at SRI

0:36:22.000 --> 0:36:24.480
<v Speaker 1>working on Kalo, and he worked with a

0:36:24.560 --> 0:36:27.839
<v Speaker 1>team that had the daunting task of assimilating the work

0:36:28.040 --> 0:36:31.720
<v Speaker 1>that was being done by twenty seven different engineering teams

0:36:32.440 --> 0:36:36.839
<v Speaker 1>into a cohesive virtual assistant. So, as I mentioned just

0:36:36.960 --> 0:36:40.000
<v Speaker 1>before the break, the disciplines of AI had each gotten

0:36:40.160 --> 0:36:45.000
<v Speaker 1>very deep, very broad, and required a lot of specialization.

0:36:45.320 --> 0:36:48.759
<v Speaker 1>So you have these different engineering teams working within various disciplines,

0:36:49.280 --> 0:36:52.399
<v Speaker 1>and it was Cheyer's team that needed to bring all

0:36:52.400 --> 0:36:56.040
<v Speaker 1>these together and make it into a working, coherent whole.

0:36:56.560 --> 0:36:59.880
<v Speaker 1>The results were really phenomenal. Now I'll give you a

0:37:00.040 --> 0:37:04.799
<v Speaker 1>hypothetical use for Kalo. Let's say that you've got a

0:37:04.800 --> 0:37:08.640
<v Speaker 1>project team and there are ten people on your team,

0:37:08.760 --> 0:37:12.520
<v Speaker 1>including you, and let's say there's a meeting that's on

0:37:12.560 --> 0:37:16.879
<v Speaker 1>the books for tomorrow morning at a particular conference room,

0:37:16.920 --> 0:37:19.400
<v Speaker 1>and it's supposed to be a status update meeting for

0:37:19.440 --> 0:37:22.840
<v Speaker 1>the project. It turns out that two out of the

0:37:22.920 --> 0:37:25.360
<v Speaker 1>ten people on your team are no longer able to

0:37:25.440 --> 0:37:29.960
<v Speaker 1>make the meeting due to last minute high priority conflicts,

0:37:30.040 --> 0:37:33.359
<v Speaker 1>so they've had to cancel out of the meeting. KALO

0:37:33.440 --> 0:37:36.319
<v Speaker 1>would be able to detect the change in status of

0:37:36.360 --> 0:37:38.799
<v Speaker 1>those two people and say, all right, these two are

0:37:38.880 --> 0:37:42.640
<v Speaker 1>no longer going to the meeting. Then KALO could determine

0:37:42.719 --> 0:37:46.200
<v Speaker 1>how important those two people were to the overall team,

0:37:46.320 --> 0:37:49.719
<v Speaker 1>essentially saying what are their roles? What what role are

0:37:49.719 --> 0:37:53.080
<v Speaker 1>they performing within the context of this team, and is

0:37:53.120 --> 0:37:56.200
<v Speaker 1>it a critical role for this meeting. It can also

0:37:56.200 --> 0:37:58.680
<v Speaker 1>look at the importance of the meeting itself, like, oh, well,

0:37:58.719 --> 0:38:01.440
<v Speaker 1>this is a status update, so it's really just to

0:38:01.520 --> 0:38:04.600
<v Speaker 1>keep the team, you know, informed of what's going on.

0:38:05.440 --> 0:38:08.120
<v Speaker 1>It's not a mission critical type of meeting. It could

0:38:08.120 --> 0:38:11.160
<v Speaker 1>take all that into account. Then KALO can make a

0:38:11.160 --> 0:38:14.359
<v Speaker 1>determination on its own whether or not it should keep

0:38:14.400 --> 0:38:17.239
<v Speaker 1>the meeting in place and go ahead just without those

0:38:17.280 --> 0:38:20.720
<v Speaker 1>two people and maybe just send updates to those two people,

0:38:21.320 --> 0:38:24.960
<v Speaker 1>or to cancel the meeting entirely notifying all the participants

0:38:25.000 --> 0:38:28.600
<v Speaker 1>about it. Then look at the different calendars of those participants,

0:38:28.920 --> 0:38:33.040
<v Speaker 1>book a new meeting, including securing a space for that

0:38:33.160 --> 0:38:36.799
<v Speaker 1>meeting and sending out new invites. It would even be

0:38:36.880 --> 0:38:39.400
<v Speaker 1>able to look at the purpose of the meeting and

0:38:39.480 --> 0:38:43.279
<v Speaker 1>flag information that's relevant to that meeting, essentially creating a

0:38:43.320 --> 0:38:47.640
<v Speaker 1>sort of meeting dossier on demand. So it's really, you know,

0:38:47.760 --> 0:38:53.000
<v Speaker 1>incredibly sophisticated stuff. Now, that was the fully fledged Kalo,

0:38:53.800 --> 0:38:58.000
<v Speaker 1>but an offshoot of this project, or maybe it's it's

0:38:58.040 --> 0:39:00.480
<v Speaker 1>better to say it was a smaller sister project that

0:39:00.520 --> 0:39:02.960
<v Speaker 1>existed at the same time it launched in two thousand three.

0:39:03.000 --> 0:39:07.440
<v Speaker 1>Along with Kalo. This other one was called Vanguard, at

0:39:07.480 --> 0:39:10.000
<v Speaker 1>least within SRI, and it was taking a

0:39:10.000 --> 0:39:15.000
<v Speaker 1>more scaled down approach of building out an assistant and

0:39:15.040 --> 0:39:19.280
<v Speaker 1>looking at how it could be useful on mobile devices. Now, again,

0:39:19.280 --> 0:39:22.319
<v Speaker 1>this was in two thousand three, before smartphones would really

0:39:22.360 --> 0:39:26.120
<v Speaker 1>become a mainstream product because Apple wouldn't even introduce the

0:39:26.160 --> 0:39:29.440
<v Speaker 1>iPhone until two thousand seven. But SRI was

0:39:29.480 --> 0:39:32.920
<v Speaker 1>working on implementations of a more limited virtual assistant and

0:39:32.960 --> 0:39:36.880
<v Speaker 1>then showing it off to companies like Motorola. One person

0:39:37.160 --> 0:39:40.840
<v Speaker 1>at Motorola who was really impressed with this work was

0:39:40.880 --> 0:39:45.319
<v Speaker 1>a guy named Dag Kittlaus. Kittlaus attempted to convince his

0:39:45.360 --> 0:39:49.239
<v Speaker 1>superiors at Motorola that Vanguard was a really important piece

0:39:49.280 --> 0:39:53.200
<v Speaker 1>of work, but he didn't find any real interest over

0:39:53.320 --> 0:39:57.279
<v Speaker 1>at Motorola, so he did something fairly brazen. In two

0:39:57.280 --> 0:40:00.600
<v Speaker 1>thousand seven, he quit his job at Motorola and he

0:40:00.719 --> 0:40:04.800
<v Speaker 1>joined SRI International with the intent of exploring ways to

0:40:04.960 --> 0:40:09.080
<v Speaker 1>spin off a new business that would develop an implementation

0:40:09.280 --> 0:40:14.480
<v Speaker 1>of the Kalo Vanguard virtual assistant, but for the consumer market.

0:40:15.040 --> 0:40:19.080
<v Speaker 1>The result would be a new company called Siri, S

0:40:19.120 --> 0:40:21.799
<v Speaker 1>I R I, which is kind of the way you

0:40:21.800 --> 0:40:24.840
<v Speaker 1>would say SRI if you were trying to

0:40:24.880 --> 0:40:27.480
<v Speaker 1>pronounce it as if it were an acronym as opposed

0:40:27.560 --> 0:40:32.280
<v Speaker 1>to an initialism. Adam Cheyer, after some convincing from Kittlaus,

0:40:32.840 --> 0:40:36.480
<v Speaker 1>joined the venture as the vice president of engineering. Kittlaus

0:40:36.600 --> 0:40:40.320
<v Speaker 1>would be the CEO. Tom Gruber, who had studied

0:40:40.320 --> 0:40:44.120
<v Speaker 1>computer science at Stanford and then pioneered work in various

0:40:44.160 --> 0:40:48.160
<v Speaker 1>fields of artificial intelligence, would become the chief technology officer

0:40:48.360 --> 0:40:53.640
<v Speaker 1>for the company. Interestingly, the Siri team didn't initially call

0:40:53.920 --> 0:41:00.000
<v Speaker 1>their own virtual assistant project Siri. Instead, the new spinoff company,

0:41:00.520 --> 0:41:04.960
<v Speaker 1>Siri, would call their virtual assistant HAL, H A L,

0:41:05.440 --> 0:41:08.719
<v Speaker 1>after the AI system in the book and film two

0:41:08.800 --> 0:41:11.960
<v Speaker 1>thousand one. They did take an extra step to reassure

0:41:12.000 --> 0:41:15.719
<v Speaker 1>people that this time HAL would behave itself. So, if

0:41:15.719 --> 0:41:18.480
<v Speaker 1>you're not familiar with the story of two thousand one,

0:41:19.040 --> 0:41:24.239
<v Speaker 1>the artificially intelligent computer system HAL begins to malfunction and

0:41:24.280 --> 0:41:26.880
<v Speaker 1>begins to interpret its mission in such a way that

0:41:27.000 --> 0:41:29.920
<v Speaker 1>it compels it to start killing off the crew inside

0:41:29.920 --> 0:41:33.560
<v Speaker 1>a spacecraft, kind of a worst case scenario with AI.

0:41:34.200 --> 0:41:37.480
<v Speaker 1>While Siri began to get off the ground, it was

0:41:37.560 --> 0:41:41.759
<v Speaker 1>licensing technologies from SRI to power the virtual assistant,

0:41:42.120 --> 0:41:44.839
<v Speaker 1>and it also began to hire the talent needed to

0:41:44.960 --> 0:41:48.799
<v Speaker 1>bring this idea to life. At the same time, Apple

0:41:49.160 --> 0:41:52.319
<v Speaker 1>was pushing the smartphone industry into the limelight with the

0:41:52.360 --> 0:41:54.880
<v Speaker 1>introduction of the first iPhone. This was all happening in

0:41:54.920 --> 0:41:58.200
<v Speaker 1>two thousand seven. It was clear that the push for

0:41:58.280 --> 0:42:01.480
<v Speaker 1>a virtual assistant was coming at just the right time,

0:42:01.600 --> 0:42:06.880
<v Speaker 1>as Apple's implementation of smartphone technology was a grand slam

0:42:06.920 --> 0:42:11.040
<v Speaker 1>home run, to use a sports analogy. It soon became

0:42:11.080 --> 0:42:14.239
<v Speaker 1>obvious that the future of computing was going to be,

0:42:14.320 --> 0:42:18.480
<v Speaker 1>at least in large part, mobile. That in turn opened

0:42:18.520 --> 0:42:21.640
<v Speaker 1>up opportunities to create new ways to interact with mobile

0:42:21.640 --> 0:42:24.560
<v Speaker 1>devices in order to do the stuff we needed to

0:42:24.640 --> 0:42:28.280
<v Speaker 1>do. Now, it's obvious to say this, but mobile devices

0:42:28.320 --> 0:42:32.200
<v Speaker 1>have a very different user interface from your typical computer.

0:42:32.600 --> 0:42:35.760
<v Speaker 1>Interacting with a handheld computer by tapping on a screen

0:42:35.960 --> 0:42:40.880
<v Speaker 1>or talking to it creates different opportunities for crafting experiences

0:42:41.280 --> 0:42:44.520
<v Speaker 1>than someone sitting down to a computer with a keyboard

0:42:44.520 --> 0:42:48.799
<v Speaker 1>and mouse. There's a potential need for a voice activated

0:42:48.880 --> 0:42:51.760
<v Speaker 1>personal assistant that could help you carry out your tasks,

0:42:51.800 --> 0:42:56.720
<v Speaker 1>particularly ones that might need multiple steps. Siri the company

0:42:57.000 --> 0:43:00.160
<v Speaker 1>came along just as the need for Siri the app

0:43:00.360 --> 0:43:03.040
<v Speaker 1>was beginning to take shape, so it was the right

0:43:03.080 --> 0:43:07.280
<v Speaker 1>place at the right time. In two thousand seven, Apple

0:43:07.360 --> 0:43:10.960
<v Speaker 1>had not yet opened up the opportunity for independent app

0:43:11.000 --> 0:43:15.200
<v Speaker 1>developers to submit apps for the iPhone. That wouldn't actually

0:43:15.200 --> 0:43:18.160
<v Speaker 1>happen until July tenth, two thousand eight, essentially a year

0:43:18.200 --> 0:43:21.960
<v Speaker 1>after the iPhone had debuted. The Siri team was still

0:43:22.360 --> 0:43:25.600
<v Speaker 1>hard at work building out the virtual assistant app they

0:43:25.600 --> 0:43:28.719
<v Speaker 1>had in mind in two thousand eight. While they

0:43:28.760 --> 0:43:32.440
<v Speaker 1>were licensing technology from SRI International, you know,

0:43:32.480 --> 0:43:35.839
<v Speaker 1>from the Vanguard and the CALO projects, they still

0:43:35.880 --> 0:43:38.120
<v Speaker 1>had to build out the systems that would actually power

0:43:38.200 --> 0:43:42.640
<v Speaker 1>Siri on the back end. Generally speaking, their approach was

0:43:42.719 --> 0:43:45.560
<v Speaker 1>to create an app where a person could ask Siri a

0:43:45.680 --> 0:43:49.319
<v Speaker 1>question and the app would record that request as a

0:43:49.360 --> 0:43:53.000
<v Speaker 1>little audio file, send that audio file to a server

0:43:53.160 --> 0:43:55.879
<v Speaker 1>in a data center, and the first step then would

0:43:55.920 --> 0:44:00.200
<v Speaker 1>be to transcribe the audio file into text, so we're

0:44:00.200 --> 0:44:03.479
<v Speaker 1>talking about speech to text here. Then the system would

0:44:03.480 --> 0:44:07.400
<v Speaker 1>need to parse the request. What is actually being asked here?

0:44:07.480 --> 0:44:11.719
<v Speaker 1>What is the command or request saying? Now, in some systems,

0:44:12.080 --> 0:44:15.440
<v Speaker 1>a computer will break down a sentence into its various components,

0:44:15.480 --> 0:44:19.000
<v Speaker 1>you know, a subject, verb, and object, and then try

0:44:19.080 --> 0:44:22.560
<v Speaker 1>to figure out what is actually being said. Adam Cheyer

0:44:22.680 --> 0:44:26.759
<v Speaker 1>took a different approach with his team. They taught their

0:44:26.800 --> 0:44:31.399
<v Speaker 1>system the meaning of real world objects. So, rather than

0:44:31.480 --> 0:44:34.760
<v Speaker 1>trying to parse out what a sentence meant by first

0:44:34.880 --> 0:44:38.760
<v Speaker 1>figuring out what's the subject, what's the verb, and what's

0:44:38.800 --> 0:44:42.560
<v Speaker 1>the object that the subject is acting upon, Siri started

0:44:42.560 --> 0:44:46.040
<v Speaker 1>off by looking at real world concepts within the request.

0:44:46.719 --> 0:44:50.319
<v Speaker 1>Siri would then map the request against a list of

0:44:50.400 --> 0:44:55.480
<v Speaker 1>possible responses and then employ that statistical probability model that

0:44:55.560 --> 0:44:59.120
<v Speaker 1>I mentioned earlier. What are the odds that someone was

0:44:59.160 --> 0:45:02.960
<v Speaker 1>asking for directions to an Italian restaurant versus asking

0:45:03.040 --> 0:45:06.640
<v Speaker 1>Siri to provide a recipe for an Italian dish, for example.

0:45:07.120 --> 0:45:10.439
<v Speaker 1>So if I activate my virtual assistant and say I

0:45:10.520 --> 0:45:15.279
<v Speaker 1>want linguini, that's a pretty broad thing to say, right?

0:45:15.440 --> 0:45:17.799
<v Speaker 1>The app has to guess at whether I mean I

0:45:17.880 --> 0:45:21.719
<v Speaker 1>want to go someplace that serves linguini or I want

0:45:21.719 --> 0:45:25.080
<v Speaker 1>to make it myself. Now, my personal app would have

0:45:25.200 --> 0:45:29.000
<v Speaker 1>learned by my behaviors that I am very lazy and

0:45:29.000 --> 0:45:31.960
<v Speaker 1>would realize that I am actually asking for someone to

0:45:32.000 --> 0:45:35.880
<v Speaker 1>bring me linguini. So there's no doubt Siri would return

0:45:35.920 --> 0:45:39.160
<v Speaker 1>results of Italian restaurants that deliver as a result from

0:45:39.160 --> 0:45:42.359
<v Speaker 1>my request. And keep in mind, Siri was intended to

0:45:42.440 --> 0:45:45.319
<v Speaker 1>learn from user behaviors and attune itself to those

0:45:45.360 --> 0:45:50.520
<v Speaker 1>behaviors over time. Beyond that, Siri would pull information from

0:45:50.600 --> 0:45:54.320
<v Speaker 1>multiple sources to provide results. So if I asked about

0:45:54.320 --> 0:45:57.960
<v Speaker 1>a restaurant, Siri would provide all sorts of data about

0:45:58.040 --> 0:46:01.440
<v Speaker 1>the restaurant, from user reviews, to directions to the restaurant,

0:46:01.520 --> 0:46:04.640
<v Speaker 1>to menu items to what price range I might expect

0:46:05.160 --> 0:46:08.440
<v Speaker 1>at that place. Siri could also tap into other stuff

0:46:08.480 --> 0:46:12.680
<v Speaker 1>like the phone's location, and thus give relevant answers based

0:46:12.719 --> 0:46:15.640
<v Speaker 1>on my location, so I wouldn't have to worry about

0:46:15.680 --> 0:46:19.000
<v Speaker 1>getting irrelevant search results if I happened to be far

0:46:19.120 --> 0:46:23.359
<v Speaker 1>from home, right? Siri wouldn't suggest that I go and

0:46:23.440 --> 0:46:25.480
<v Speaker 1>get food from a place that's right down the street

0:46:25.480 --> 0:46:28.320
<v Speaker 1>from my house in Atlanta while I happen to be

0:46:28.360 --> 0:46:31.719
<v Speaker 1>in New York City, for example. The team also gave

0:46:31.800 --> 0:46:35.680
<v Speaker 1>Siri a bit of an attitude. Siri could be sassy

0:46:35.840 --> 0:46:38.279
<v Speaker 1>and had a bit of a potty mouth. In fact,

0:46:38.320 --> 0:46:41.600
<v Speaker 1>Siri would occasionally drop an F bomb here or there. Now,

0:46:41.600 --> 0:46:45.920
<v Speaker 1>according to Kittlaus, the goal was eventually to offer extensions

0:46:45.960 --> 0:46:48.719
<v Speaker 1>to Siri so that end users could kind of pick

0:46:48.800 --> 0:46:53.600
<v Speaker 1>the app's personality. Maybe you want a no-nonsense virtual

0:46:53.600 --> 0:46:56.920
<v Speaker 1>assistant that just provides the information you need and that's it.

0:46:57.760 --> 0:47:01.600
<v Speaker 1>Maybe you wanted more of a goofy sidekick, or

0:47:01.640 --> 0:47:04.960
<v Speaker 1>maybe you wanted a virtual assistant who could give you

0:47:05.000 --> 0:47:08.520
<v Speaker 1>some serious attitude on occasion. The goal down the line

0:47:08.600 --> 0:47:10.880
<v Speaker 1>was to create options for people to kind of shape

0:47:10.960 --> 0:47:14.040
<v Speaker 1>their experience, but that would end up on the cutting

0:47:14.120 --> 0:47:18.600
<v Speaker 1>room floor due to a very big reason. The Siri

0:47:18.719 --> 0:47:24.239
<v Speaker 1>app made its debut in the iPhone app store. In January,

0:47:24.280 --> 0:47:28.120
<v Speaker 1>three weeks after it debuted, Kittlaus received a phone

0:47:28.120 --> 0:47:32.080
<v Speaker 1>call from an unlisted number, a call that he almost

0:47:32.320 --> 0:47:35.720
<v Speaker 1>didn't even answer, but when he did answer, the person

0:47:35.800 --> 0:47:37.600
<v Speaker 1>on the other end of the call happened to be

0:47:37.719 --> 0:47:42.120
<v Speaker 1>Steve Jobs, the CEO of Apple. Jobs was over the

0:47:42.160 --> 0:47:45.040
<v Speaker 1>moon about Siri and wanted to meet with Kittlaus

0:47:45.080 --> 0:47:48.919
<v Speaker 1>to discuss some pretty enormous options, the biggest one being

0:47:48.960 --> 0:47:53.240
<v Speaker 1>that Apple itself would acquire Siri. Now, at the time, Siri

0:47:53.239 --> 0:47:56.200
<v Speaker 1>the company was working on developing a version of the

0:47:56.239 --> 0:47:59.920
<v Speaker 1>app for Android phones, having reached a deal with Verizon

0:48:00.080 --> 0:48:02.920
<v Speaker 1>to create a version of Siri that could be

0:48:03.000 --> 0:48:06.520
<v Speaker 1>the default app on all Verizon Android phones moving forward.

0:48:07.200 --> 0:48:11.680
<v Speaker 1>The Apple deal would ultimately derail that agreement, as Jobs

0:48:11.760 --> 0:48:16.080
<v Speaker 1>was insistent that Siri be an Apple exclusive. In fact,

0:48:16.400 --> 0:48:22.480
<v Speaker 1>when Apple would introduce Siri on October fourth, two thousand eleven,

0:48:23.440 --> 0:48:26.760
<v Speaker 1>it seemed like it was being presented as a purely

0:48:26.960 --> 0:48:32.600
<v Speaker 1>Apple product, that it didn't have a life outside of

0:48:32.680 --> 0:48:35.120
<v Speaker 1>Apple at all. It came across as it just being

0:48:35.400 --> 0:48:40.360
<v Speaker 1>Apple all along. And of course, the day after Apple

0:48:40.600 --> 0:48:45.400
<v Speaker 1>would introduce Siri to the public, Steve Jobs himself passed away,

0:48:45.680 --> 0:48:49.319
<v Speaker 1>October five, two thousand eleven. But that part of the

0:48:49.320 --> 0:48:52.399
<v Speaker 1>story will have to wait for part two because, as

0:48:52.400 --> 0:48:56.480
<v Speaker 1>I said, this is going longer than I anticipated. So

0:48:56.520 --> 0:48:59.719
<v Speaker 1>in our next episode we'll pick up probably actually a

0:48:59.719 --> 0:49:02.759
<v Speaker 1>little earlier than where I'm leaving off here, actually, because

0:49:02.760 --> 0:49:06.359
<v Speaker 1>there's still some other details we should talk about as

0:49:06.360 --> 0:49:10.640
<v Speaker 1>far as how Siri works and the actual arrangement of

0:49:10.719 --> 0:49:14.320
<v Speaker 1>Apple's acquisition, and then we'll talk about how the app

0:49:14.520 --> 0:49:18.800
<v Speaker 1>has evolved and changed under Apple's ownership, and we'll also explore,

0:49:18.840 --> 0:49:22.120
<v Speaker 1>you know, a little bit about Siri's distant cousins like

0:49:22.320 --> 0:49:26.799
<v Speaker 1>Alexa and Google Assistant and others, because all of these

0:49:26.840 --> 0:49:31.440
<v Speaker 1>work in similar ways, though they have their own specific

0:49:32.120 --> 0:49:36.680
<v Speaker 1>processes to handle requests, and so if you do an

0:49:36.680 --> 0:49:40.359
<v Speaker 1>apples-to-apples comparison, it does break down ultimately once

0:49:40.400 --> 0:49:43.600
<v Speaker 1>you start getting down to how things are working in

0:49:43.760 --> 0:49:46.520
<v Speaker 1>detail on the back end. So I won't go into

0:49:47.040 --> 0:49:50.640
<v Speaker 1>full mode on those because it would require multiple episodes

0:49:50.640 --> 0:49:53.920
<v Speaker 1>on that. But we will talk more about Siri and

0:49:54.320 --> 0:49:57.120
<v Speaker 1>what has happened in the years since its acquisition in

0:49:57.120 --> 0:50:00.120
<v Speaker 1>our next episode. If you guys have suggestions for future

0:50:00.120 --> 0:50:02.960
<v Speaker 1>topics I should tackle on Tech Stuff, let me know.

0:50:03.320 --> 0:50:05.399
<v Speaker 1>The best way to do that is to reach out

0:50:05.480 --> 0:50:08.919
<v Speaker 1>on Twitter. The handle we use is Tech Stuff H

0:50:09.120 --> 0:50:13.120
<v Speaker 1>S W, and I'll talk to you again really soon.

0:50:18.239 --> 0:50:21.279
<v Speaker 1>Tech Stuff is an I Heart Radio production. For more

0:50:21.360 --> 0:50:24.720
<v Speaker 1>podcasts from I Heart Radio, visit the I Heart Radio app,

0:50:24.880 --> 0:50:28.040
<v Speaker 1>Apple Podcasts, or wherever you listen to your favorite shows.