WEBVTT - Amazon Part 2: The Obsessive Secrecy Around Alexa

0:00:00.160 --> 0:00:03.760
<v Speaker 1>Jeff Adams is a computer scientist. In two thousand and eleven,

0:00:03.880 --> 0:00:07.120
<v Speaker 1>he was at a really tiny startup, just thirteen employees,

0:00:07.440 --> 0:00:10.360
<v Speaker 1>called Yap. They were working on technology that could

0:00:10.400 --> 0:00:15.080
<v Speaker 1>transcribe voicemail into text messages. And then one day Jeff's

0:00:15.080 --> 0:00:19.239
<v Speaker 1>fairy godmother showed up in the form of executives from Amazon.

0:00:19.720 --> 0:00:22.000
<v Speaker 1>We went to a show and some people from Amazon

0:00:22.160 --> 0:00:24.640
<v Speaker 1>came and talked to us, and they might be interested

0:00:24.720 --> 0:00:28.360
<v Speaker 1>in possibly even acquiring us. And I thought, that's absurd.

0:00:28.480 --> 0:00:32.120
<v Speaker 1>What could Amazon possibly be interested in speech for? Maybe

0:00:32.440 --> 0:00:35.600
<v Speaker 1>they want you to be able to, you know, call

0:00:35.680 --> 0:00:38.600
<v Speaker 1>up on the phone and order a book. Jeff figured

0:00:38.600 --> 0:00:41.040
<v Speaker 1>the talks wouldn't amount to much, but over the course

0:00:41.080 --> 0:00:45.000
<v Speaker 1>of the next few months they became serious. Naturally, he

0:00:45.080 --> 0:00:48.280
<v Speaker 1>wanted to know why Amazon was interested. They said, you know,

0:00:48.360 --> 0:00:51.000
<v Speaker 1>don't ask, it's our business. And I thought, well, it's

0:00:51.000 --> 0:00:53.000
<v Speaker 1>gonna be my business too, and you might want to

0:00:53.040 --> 0:00:58.280
<v Speaker 1>know whether we can do it. Um, but they wouldn't.

0:00:58.280 --> 0:01:00.160
<v Speaker 1>They wouldn't say anything. They said, I'm sorry, we're not

0:01:00.200 --> 0:01:02.639
<v Speaker 1>at liberty to discuss any of this. There was one

0:01:02.720 --> 0:01:05.120
<v Speaker 1>hint that Amazon would give. So one of the things

0:01:05.200 --> 0:01:07.600
<v Speaker 1>they did let us know was that the guy who

0:01:07.600 --> 0:01:10.440
<v Speaker 1>was kind of running this was Greg Hart, who we

0:01:10.560 --> 0:01:14.080
<v Speaker 1>knew was the right hand person of Jeff Bezos. So

0:01:14.160 --> 0:01:16.280
<v Speaker 1>they go to all the meetings with them. They are

0:01:16.319 --> 0:01:20.360
<v Speaker 1>like the confidant, the consigliere, or whatever. Uh, and so

0:01:20.400 --> 0:01:23.800
<v Speaker 1>we knew, okay, something this does have visibility at the

0:01:23.840 --> 0:01:27.520
<v Speaker 1>highest levels, and Bezos must be behind it in in

0:01:27.680 --> 0:01:31.000
<v Speaker 1>some form. As the deal was being finalized, Jeff's team

0:01:31.040 --> 0:01:34.760
<v Speaker 1>traveled to Florence, Italy, to a computer speech conference along

0:01:34.800 --> 0:01:38.479
<v Speaker 1>with several Amazon managers. They even stayed in a villa together,

0:01:38.959 --> 0:01:41.399
<v Speaker 1>but the Amazon people didn't want to be seen with

0:01:41.440 --> 0:01:45.320
<v Speaker 1>the voice transcription people during the conference. From day to

0:01:45.400 --> 0:01:49.120
<v Speaker 1>day in the conference, we had to let we we

0:01:49.200 --> 0:01:51.800
<v Speaker 1>I rented a van and we drove down into town

0:01:51.880 --> 0:01:54.240
<v Speaker 1>and to the conference together. I had to let them

0:01:54.240 --> 0:01:57.160
<v Speaker 1>out around the block, around the corner. It's like being

0:01:57.200 --> 0:01:59.920
<v Speaker 1>a teenager and trying to hide the person you're dating

0:02:00.160 --> 0:02:02.280
<v Speaker 1>from your parents. First of all, they didn't want anyone

0:02:02.280 --> 0:02:04.160
<v Speaker 1>to know that they were from Amazon. They didn't want

0:02:04.160 --> 0:02:06.320
<v Speaker 1>anyone to see them with us, so we had to

0:02:06.440 --> 0:02:08.480
<v Speaker 1>avoid them at the conference. We would like, you know,

0:02:08.560 --> 0:02:12.280
<v Speaker 1>look at each other and kind of smile from across

0:02:12.360 --> 0:02:15.000
<v Speaker 1>the room or across the courtyard or whatever. But we

0:02:15.000 --> 0:02:18.160
<v Speaker 1>couldn't sit together, we couldn't talk together, or whatever. That

0:02:18.280 --> 0:02:22.440
<v Speaker 1>fall, the deal was done. Amazon buys the speech company,

0:02:22.520 --> 0:02:25.480
<v Speaker 1>and Jeff is still living in mystery. We had been

0:02:25.520 --> 0:02:28.200
<v Speaker 1>employees of Amazon for like a week, but they had

0:02:28.240 --> 0:02:31.040
<v Speaker 1>still not told us any anything. They said, no, we

0:02:31.080 --> 0:02:34.000
<v Speaker 1>have to tell you in a closed room. So we

0:02:34.040 --> 0:02:36.360
<v Speaker 1>all came out to Seattle. We all got in the

0:02:36.400 --> 0:02:41.360
<v Speaker 1>room together. They closed the door, locked the door, put

0:02:41.680 --> 0:02:46.200
<v Speaker 1>paper over the window in the door, closed the exterior windows. Uh.

0:02:46.440 --> 0:02:51.320
<v Speaker 1>It was very secret, very mysterious, and hush-hush. Jeff

0:02:51.360 --> 0:02:55.520
<v Speaker 1>and his team leaned forward. This was the moment. They said,

0:02:55.919 --> 0:02:59.519
<v Speaker 1>imagine something the size of a Coke can that's sitting

0:02:59.560 --> 0:03:02.000
<v Speaker 1>on your table, and we're going to sell this for twenty dollars,

0:03:04.720 --> 0:03:06.760
<v Speaker 1>uh and people will be able to talk to it.

0:03:06.840 --> 0:03:10.200
<v Speaker 1>And we thought, you know, we can't do this. The

0:03:10.280 --> 0:03:13.360
<v Speaker 1>technology isn't there yet, it doesn't exist yet. And of

0:03:13.400 --> 0:03:17.359
<v Speaker 1>course they did do it by inventing the technology, because

0:03:17.360 --> 0:03:22.440
<v Speaker 1>the product they're talking about becomes Alexa, Amazon's AI virtual assistant,

0:03:22.720 --> 0:03:26.000
<v Speaker 1>and that twenty dollar Coke can is the Echo smart speaker.

0:03:26.360 --> 0:03:29.640
<v Speaker 1>Though it ends up selling for more than twenty dollars. Alexa

0:03:29.800 --> 0:03:33.480
<v Speaker 1>is one of Amazon's first forays into artificial intelligence, and

0:03:33.520 --> 0:03:37.080
<v Speaker 1>it would bring Amazon into people's living rooms, further weaving

0:03:37.120 --> 0:03:43.760
<v Speaker 1>it into their lives. You're listening to Foundering. I'm your host,

0:03:43.960 --> 0:03:47.320
<v Speaker 1>Brad Stone. In this episode, we're going to tell the

0:03:47.400 --> 0:03:51.080
<v Speaker 1>story of the creation of Alexa. How Jeff Bezos almost

0:03:51.160 --> 0:03:55.440
<v Speaker 1>singlehandedly conceived the idea for a voice activated computer and

0:03:55.640 --> 0:04:00.400
<v Speaker 1>drove the device's creation. Alexa helped to solidify Amazon's image

0:04:00.400 --> 0:04:04.560
<v Speaker 1>as an innovator and in the process challenge conventional notions

0:04:04.560 --> 0:04:07.840
<v Speaker 1>of privacy. We'll tell you more after a quick break.

0:04:19.080 --> 0:04:22.000
<v Speaker 1>In the last episode, we talked about how Amazon survived

0:04:22.000 --> 0:04:25.000
<v Speaker 1>the dot com bust, how Jeff Bezos said to colleagues,

0:04:25.240 --> 0:04:27.320
<v Speaker 1>the only way out of this is to invent our

0:04:27.320 --> 0:04:31.080
<v Speaker 1>way out. During those next few years, Bezos launched Amazon

0:04:31.160 --> 0:04:35.159
<v Speaker 1>Prime and the Kindle. But there's another innovation from this era,

0:04:35.560 --> 0:04:38.880
<v Speaker 1>one that generates huge profits for Amazon and sets the

0:04:38.960 --> 0:04:42.680
<v Speaker 1>stage for the creation of its pioneering voice assistant. It's

0:04:42.720 --> 0:04:47.560
<v Speaker 1>Amazon's cloud service called AWS, or Amazon Web Services. It's

0:04:47.600 --> 0:04:51.440
<v Speaker 1>the most difficult of Bezos's inventions for lay people to understand,

0:04:52.080 --> 0:04:54.799
<v Speaker 1>because as much as it may already seem like Amazon

0:04:54.920 --> 0:04:58.279
<v Speaker 1>is everywhere: a massive store, a network of warehouses, and

0:04:58.320 --> 0:05:01.039
<v Speaker 1>a fleet of delivery people, there's one more way in

0:05:01.040 --> 0:05:05.960
<v Speaker 1>which Amazon is ubiquitous, indispensable. Amazon is literally holding up

0:05:06.040 --> 0:05:09.520
<v Speaker 1>much of the Internet today. AWS is a highly profitable,

0:05:09.640 --> 0:05:14.320
<v Speaker 1>fifty billion dollar annual business for Amazon. When you watch Netflix,

0:05:14.480 --> 0:05:18.440
<v Speaker 1>you're often streaming video from Amazon. When you send a Snapchat,

0:05:18.640 --> 0:05:23.120
<v Speaker 1>you're using Amazon servers. By two thousand and ten, Bezos

0:05:23.240 --> 0:05:25.760
<v Speaker 1>was asking everyone in the company, what are you doing

0:05:25.800 --> 0:05:30.039
<v Speaker 1>for AWS? He wanted Amazon to deepen and exploit the

0:05:30.080 --> 0:05:33.000
<v Speaker 1>early lead it already had in running massive, cutting edge

0:05:33.040 --> 0:05:36.320
<v Speaker 1>data centers. Around this time, he was at lunch with

0:05:36.360 --> 0:05:39.800
<v Speaker 1>Greg Hart, the guy described as his consigliere, and the

0:05:39.880 --> 0:05:43.200
<v Speaker 1>conversation turned to how Google was starting to let people search

0:05:43.640 --> 0:05:47.640
<v Speaker 1>just by talking into their smartphones. Here's Greg. And

0:05:47.680 --> 0:05:49.680
<v Speaker 1>I said, look at how convenient this is. And I just

0:05:49.800 --> 0:05:53.640
<v Speaker 1>did a Google voice search on pizza near me, um.

0:05:53.760 --> 0:05:57.000
<v Speaker 1>And you know, it's just so much faster than actually typing,

0:05:57.360 --> 0:05:59.080
<v Speaker 1>you know, with your fingers or your thumbs, pizza near

0:05:59.120 --> 0:06:01.920
<v Speaker 1>me on your phone, waiting for results to come back. Now,

0:06:02.120 --> 0:06:04.960
<v Speaker 1>Bezos has long been a believer in voice. Here he

0:06:05.080 --> 0:06:07.160
<v Speaker 1>is all the way back in the year two thousand

0:06:07.560 --> 0:06:12.040
<v Speaker 1>talking to Charlie Rose, I believe that for mobile commerce,

0:06:12.440 --> 0:06:14.640
<v Speaker 1>the thing that's going to be the biggest part of

0:06:14.640 --> 0:06:17.919
<v Speaker 1>that is voice, and I think there's a lot that

0:06:18.000 --> 0:06:20.640
<v Speaker 1>can happen in the In the short term, it'll be

0:06:20.720 --> 0:06:23.520
<v Speaker 1>kind of a stilted special purpose language for talking to

0:06:23.560 --> 0:06:25.839
<v Speaker 1>Amazon dot Com. But in the long term it could

0:06:25.839 --> 0:06:29.400
<v Speaker 1>even be natural language processing. I think that is a

0:06:29.480 --> 0:06:33.240
<v Speaker 1>real mind bender. Now, a decade later, it seemed like

0:06:33.279 --> 0:06:36.360
<v Speaker 1>this early prediction might come true. A few weeks after

0:06:36.400 --> 0:06:39.680
<v Speaker 1>their fateful lunch, Bezos sent an email to Greg that read,

0:06:40.240 --> 0:06:43.240
<v Speaker 1>we should build a twenty dollar device with its brains in

0:06:43.240 --> 0:06:47.200
<v Speaker 1>the cloud that's completely controlled by your voice. The central

0:06:47.240 --> 0:06:50.279
<v Speaker 1>insight here was that the product could be inexpensive because

0:06:50.279 --> 0:06:53.400
<v Speaker 1>its brains resided in Amazon's data centers and could be

0:06:53.480 --> 0:06:57.000
<v Speaker 1>constantly improved, So if you buy it, the product would

0:06:57.000 --> 0:07:00.159
<v Speaker 1>be upgrading itself and you wouldn't even know it. A

0:07:00.200 --> 0:07:03.440
<v Speaker 1>few weeks later, Greg started recruiting people inside the company

0:07:03.480 --> 0:07:07.200
<v Speaker 1>for the project. He proceeded with total secrecy. The only

0:07:07.240 --> 0:07:09.760
<v Speaker 1>thing he told colleagues was that it had the potential

0:07:09.800 --> 0:07:12.200
<v Speaker 1>to be bigger than the Kindle, and at the time

0:07:12.240 --> 0:07:15.440
<v Speaker 1>that seemed like laughable that that we could create something

0:07:15.480 --> 0:07:17.560
<v Speaker 1>that could be bigger than Kindle, because Kindle, you know,

0:07:17.560 --> 0:07:22.480
<v Speaker 1>at that time was now four years into its life,

0:07:23.760 --> 0:07:27.480
<v Speaker 1>dominating the e reader and e ink market and was

0:07:27.560 --> 0:07:30.400
<v Speaker 1>changing the way that people thought about reading. Next, Greg

0:07:30.400 --> 0:07:33.960
<v Speaker 1>acquired Yap, the speech recognition company where Jeff Adams worked.

0:07:34.480 --> 0:07:36.480
<v Speaker 1>That team didn't think he knew what he was doing either.

0:07:37.280 --> 0:07:39.520
<v Speaker 1>Amazon wanted people to be able to talk to a

0:07:39.600 --> 0:07:43.160
<v Speaker 1>device across a room and have it understand them. That's called

0:07:43.200 --> 0:07:47.320
<v Speaker 1>far field speech recognition, and the technology for it didn't exist.

0:07:47.720 --> 0:07:52.960
<v Speaker 1>Here's Jeff Adams again. The big problem is speech recognition

0:07:53.000 --> 0:07:56.840
<v Speaker 1>at the time really relied on a close talking microphone.

0:07:56.880 --> 0:07:59.400
<v Speaker 1>You needed to capture the speech close to the person's mouth.

0:07:59.800 --> 0:08:01.880
<v Speaker 1>And they were talking about, Oh, I'm going to be

0:08:01.880 --> 0:08:04.080
<v Speaker 1>in the garage and this thing is gonna coach me

0:08:04.160 --> 0:08:07.320
<v Speaker 1>through changing my car's oil, and it's gonna be over

0:08:07.360 --> 0:08:09.000
<v Speaker 1>on the other side of the garage, and I'm gonna

0:08:09.040 --> 0:08:11.480
<v Speaker 1>be shouting things to it, and it's gonna be shouting

0:08:11.480 --> 0:08:15.520
<v Speaker 1>instructions back or whatever. And we thought, you can't do that.

0:08:15.560 --> 0:08:19.200
<v Speaker 1>There's there's too many reflective surfaces in the room. They're

0:08:19.200 --> 0:08:22.480
<v Speaker 1>gonna mess up the audio. Basically, if you're shouting across

0:08:22.520 --> 0:08:25.600
<v Speaker 1>the room, that introduces lots of echo and the computer

0:08:25.680 --> 0:08:28.920
<v Speaker 1>gets confused. So I realized that this far field issue

0:08:28.960 --> 0:08:30.520
<v Speaker 1>was going to be a problem. I didn't want to

0:08:30.560 --> 0:08:33.240
<v Speaker 1>say anything in in front of the group. I didn't wanna,

0:08:33.320 --> 0:08:38.480
<v Speaker 1>you know, um, appear unsupportive. But afterwards I tracked down

0:08:38.480 --> 0:08:41.400
<v Speaker 1>Greg Hart in the hallway and I said, Greg, we're

0:08:41.400 --> 0:08:43.560
<v Speaker 1>excited about this, but I think you should know that

0:08:43.920 --> 0:08:47.400
<v Speaker 1>what you want to do, the technology isn't there yet.

0:08:47.440 --> 0:08:50.720
<v Speaker 1>It doesn't exist yet. We don't we we don't have

0:08:50.960 --> 0:08:54.720
<v Speaker 1>that technology to solve this far field speech problem. And

0:08:55.000 --> 0:08:58.960
<v Speaker 1>he was unflapped. He said, uh, I appreciate that, thank

0:08:59.000 --> 0:09:01.679
<v Speaker 1>you for telling me, but solve it. We are Amazon,

0:09:02.000 --> 0:09:04.839
<v Speaker 1>We've got resources, hire as many people as you need,

0:09:04.880 --> 0:09:07.679
<v Speaker 1>take as long as you want, but you know,

0:09:07.720 --> 0:09:11.360
<v Speaker 1>solve the problem. The team gave itself the code name Doppler,

0:09:11.600 --> 0:09:14.319
<v Speaker 1>as in the Doppler effect, which describes the way a

0:09:14.400 --> 0:09:18.240
<v Speaker 1>sound wave moves with respect to a listener. They hotly

0:09:18.280 --> 0:09:21.079
<v Speaker 1>debated everything from what to name the device, to what

0:09:21.120 --> 0:09:23.679
<v Speaker 1>it should do, to how to market it to the public.

0:09:24.520 --> 0:09:27.920
<v Speaker 1>At first, they met with Bezos once a month. He

0:09:27.920 --> 0:09:31.960
<v Speaker 1>would get deeply involved in the technology. Um, not just

0:09:32.160 --> 0:09:35.280
<v Speaker 1>the business or the product, but deeply involved in the technology.

0:09:35.360 --> 0:09:37.400
<v Speaker 1>He I would say, he stayed very close to the

0:09:37.440 --> 0:09:41.200
<v Speaker 1>project throughout. As the project progressed, those meetings would increase

0:09:41.440 --> 0:09:44.880
<v Speaker 1>in frequency, um, and, uh, and by the end we

0:09:44.880 --> 0:09:47.400
<v Speaker 1>were meeting with him, you know, leading up to launch,

0:09:47.440 --> 0:09:49.440
<v Speaker 1>I would say, there probably wasn't a day that went

0:09:49.480 --> 0:09:51.960
<v Speaker 1>by that we didn't have at least one meeting with

0:09:52.040 --> 0:09:55.000
<v Speaker 1>Jeff, um, and you know, sometimes there were multiple meetings

0:09:55.000 --> 0:09:59.880
<v Speaker 1>with Jeff in a single day. Um, that is both

0:10:00.280 --> 0:10:03.320
<v Speaker 1>a blessing and a curse. The blessing came in the

0:10:03.360 --> 0:10:06.360
<v Speaker 1>form of money. The team could spend whatever they needed

0:10:06.400 --> 0:10:09.760
<v Speaker 1>to break through the technical obstacles. This is a big

0:10:09.800 --> 0:10:13.120
<v Speaker 1>project that Jeff is personally invested in. It was his

0:10:13.240 --> 0:10:17.120
<v Speaker 1>idea originally. He can be very demanding. Um, he can

0:10:17.160 --> 0:10:20.680
<v Speaker 1>push teams, um, but at the same time, like, being

0:10:20.720 --> 0:10:24.240
<v Speaker 1>able to go through that experience and have his brain

0:10:24.400 --> 0:10:31.360
<v Speaker 1>on your side is an immensely powerful opportunity and experience

0:10:31.480 --> 0:10:34.520
<v Speaker 1>if you let it, like if you get really frustrated

0:10:34.520 --> 0:10:36.200
<v Speaker 1>by it. And at times some of the people on

0:10:36.280 --> 0:10:39.600
<v Speaker 1>my product team would say, it feels like Jeff is

0:10:39.600 --> 0:10:43.600
<v Speaker 1>the product manager for Alexa. Bezos set the vision for Alexa.

0:10:44.160 --> 0:10:46.760
<v Speaker 1>He wanted it to be the Star Trek computer. He

0:10:46.840 --> 0:10:51.600
<v Speaker 1>pictured a versatile, conversational machine that could respond to any question.

0:10:52.160 --> 0:10:54.000
<v Speaker 1>You should be able to sit in your living room

0:10:54.240 --> 0:10:58.280
<v Speaker 1>and ask Alexa anything. It sounds simple, but it's not

0:10:58.800 --> 0:11:00.840
<v Speaker 1>because they needed a lot of data to train the

0:11:00.880 --> 0:11:04.760
<v Speaker 1>AI algorithms. For example, for Alexa to tell you the weather,

0:11:05.120 --> 0:11:07.559
<v Speaker 1>it would have to understand the phrasing of your question

0:11:07.800 --> 0:11:10.880
<v Speaker 1>and your dialect from across a noisy room, and sort

0:11:10.960 --> 0:11:14.240
<v Speaker 1>through databases for the right answer. This would take Alexa

0:11:14.320 --> 0:11:17.679
<v Speaker 1>a step further than Siri or Google Voice, which only worked

0:11:17.679 --> 0:11:20.880
<v Speaker 1>when you spoke directly into your phone. But the Alexa

0:11:20.960 --> 0:11:23.520
<v Speaker 1>team just didn't have the data to get the AI smart.

0:11:24.000 --> 0:11:27.959
<v Speaker 1>The early prototypes worked so badly that even Amazon employees

0:11:28.000 --> 0:11:31.280
<v Speaker 1>didn't really want to test them. Bezos was getting impatient.

0:11:31.720 --> 0:11:34.480
<v Speaker 1>He actually walked out of a number of internal meetings

0:11:34.480 --> 0:11:37.800
<v Speaker 1>in frustration. Then Greg and his colleagues struck upon an

0:11:37.840 --> 0:11:43.240
<v Speaker 1>idea they called Project AMPED. Basically, rent houses or apartments

0:11:43.280 --> 0:11:46.600
<v Speaker 1>in cities all over the US, and we would put

0:11:46.640 --> 0:11:51.120
<v Speaker 1>devices in those apartments. They were all camouflaged, and they

0:11:51.160 --> 0:11:54.080
<v Speaker 1>were not just Amazon devices. There were other companies' devices

0:11:54.160 --> 0:11:57.120
<v Speaker 1>there as well, um, some of which were visible, some

0:11:57.200 --> 0:12:01.080
<v Speaker 1>of which were not visible. And characteristic of all Amazon projects,

0:12:01.120 --> 0:12:05.720
<v Speaker 1>secrecy was the priority. Employees were careful to conceal Amazon's

0:12:05.720 --> 0:12:08.079
<v Speaker 1>identity when they set up the room. And that was

0:12:08.120 --> 0:12:10.400
<v Speaker 1>all about obfuscation, you know, we didn't want people to

0:12:10.480 --> 0:12:13.600
<v Speaker 1>understand what company we were working for. Um, and so

0:12:13.760 --> 0:12:15.800
<v Speaker 1>you could see an Xbox, you could see this, you

0:12:15.840 --> 0:12:18.199
<v Speaker 1>could see that, um, and then we would have all

0:12:18.200 --> 0:12:21.440
<v Speaker 1>these Alexa devices hidden throughout the room, or Echo devices,

0:12:21.960 --> 0:12:25.680
<v Speaker 1>prototype devices. And then they brought in testers, thousands of

0:12:25.720 --> 0:12:28.640
<v Speaker 1>people paid an hourly wage, coming in at all hours

0:12:28.640 --> 0:12:30.640
<v Speaker 1>of the day and days of the week to train

0:12:30.679 --> 0:12:34.920
<v Speaker 1>the machines. Then we would bring in we would recruit participants,

0:12:35.320 --> 0:12:38.600
<v Speaker 1>have people with different accents, you know, male, female, different

0:12:38.600 --> 0:12:41.520
<v Speaker 1>ages who would come in and we would ask them

0:12:41.600 --> 0:12:45.640
<v Speaker 1>to do a mixture of reading scripted things and then

0:12:45.720 --> 0:12:48.319
<v Speaker 1>also talking in a much more off-the-cuff, à la

0:12:48.320 --> 0:12:51.600
<v Speaker 1>carte fashion to ask for things that people would

0:12:51.640 --> 0:12:55.920
<v Speaker 1>be we hoped asking Alexa when we launched. Amazon conducted

0:12:55.920 --> 0:12:58.800
<v Speaker 1>the data gathering effort with such stealth that neighbors started

0:12:58.840 --> 0:13:02.280
<v Speaker 1>to get suspicious. House that we rented in Boston that

0:13:03.440 --> 0:13:05.400
<v Speaker 1>the neighbors thought that because we had a lot of

0:13:05.400 --> 0:13:08.000
<v Speaker 1>cars showing up and individuals getting out on their own

0:13:08.040 --> 0:13:11.240
<v Speaker 1>and coming in and spending I think maybe an hour,

0:13:11.920 --> 0:13:13.680
<v Speaker 1>and so there's a lot of sort of you know,

0:13:13.720 --> 0:13:16.679
<v Speaker 1>transient in and out traffic, and neighbors thought that maybe

0:13:16.679 --> 0:13:19.760
<v Speaker 1>there was you know, a drug running ring or something

0:13:19.800 --> 0:13:22.240
<v Speaker 1>else going on, and so the police actually showed up.

0:13:22.480 --> 0:13:26.680
<v Speaker 1>Despite the attention from police, Bezos loved the inventiveness of

0:13:26.679 --> 0:13:30.960
<v Speaker 1>the data gathering program. When we first took AMPED to him,

0:13:32.240 --> 0:13:36.959
<v Speaker 1>his response was effectively like, now you're talking, like let's

0:13:37.000 --> 0:13:40.120
<v Speaker 1>do this, um, like, tell me if you want more money.

0:13:40.440 --> 0:13:43.720
<v Speaker 1>Greg says the project AMPED ran in thirteen cities and

0:13:43.760 --> 0:13:47.080
<v Speaker 1>included over ten thousand people. The devices were placed all over

0:13:47.120 --> 0:13:49.400
<v Speaker 1>the room, and so we were trying to capture a

0:13:49.440 --> 0:13:53.960
<v Speaker 1>massive amount of acoustic data about, you know, how noise

0:13:54.040 --> 0:13:56.760
<v Speaker 1>performs in a room in all kinds of different rooms,

0:13:56.760 --> 0:14:00.160
<v Speaker 1>in bedrooms and living rooms, in kitchens, in bathrooms. The

0:14:00.200 --> 0:14:04.920
<v Speaker 1>result: Amazon basically solved the far field voice problem. It

0:14:05.000 --> 0:14:07.959
<v Speaker 1>took six months for this company to solve a problem

0:14:08.000 --> 0:14:11.960
<v Speaker 1>that had stumped speech scientists for decades. But since the

0:14:12.000 --> 0:14:15.120
<v Speaker 1>project was top secret, they weren't able to tell anyone,

0:14:15.440 --> 0:14:19.640
<v Speaker 1>not even the speech science community, about their big technological breakthrough.

0:14:20.560 --> 0:14:24.119
<v Speaker 1>Here's Ahmed Bouzid. He had been working in voice technology

0:14:24.200 --> 0:14:27.320
<v Speaker 1>for almost twenty years when Amazon tried to recruit him

0:14:27.400 --> 0:14:30.720
<v Speaker 1>in early twenty fifteen. I turned them down. I said,

0:14:30.840 --> 0:14:33.720
<v Speaker 1>you're not showing me anything. You have never done anything

0:14:33.760 --> 0:14:36.880
<v Speaker 1>in voice, and you're telling me that you are going

0:14:36.920 --> 0:14:39.640
<v Speaker 1>to any uly to to go from DC and go

0:14:39.720 --> 0:14:42.720
<v Speaker 1>to Seattle and they'll let you. Just trust us. I know, no,

0:14:42.920 --> 0:14:44.800
<v Speaker 1>I don't think so. I mean, I've seen the Kindle

0:14:44.880 --> 0:14:48.160
<v Speaker 1>and it's fine, it's okay, but Kindle is like, you know,

0:14:49.040 --> 0:14:50.800
<v Speaker 1>it's a toy compared to what you guys are trying

0:14:50.800 --> 0:14:54.000
<v Speaker 1>to do. So anyway, I said no twice. Eventually they

0:14:54.040 --> 0:14:57.480
<v Speaker 1>flew Ahmed out to Amazon headquarters to see Alexa in person.

0:14:57.880 --> 0:15:00.400
<v Speaker 1>So it was, you know, on the table, and so

0:15:00.440 --> 0:15:02.880
<v Speaker 1>the first thing I did is I asked it for

0:15:02.880 --> 0:15:04.960
<v Speaker 1>for music, play some Miles Davis, and it

0:15:05.080 --> 0:15:09.400
<v Speaker 1>did. I'm like, okay, cool. And then I said,

0:15:09.440 --> 0:15:12.320
<v Speaker 1>you know, what time is it? And it stopped and told

0:15:12.320 --> 0:15:14.280
<v Speaker 1>me the time, and then it continued playing Miles Davis,

0:15:14.320 --> 0:15:17.640
<v Speaker 1>which was great. Interacting with Alexa was almost a moving

0:15:17.720 --> 0:15:21.360
<v Speaker 1>moment for Ahmed. He felt like decades of scientific research

0:15:21.400 --> 0:15:24.880
<v Speaker 1>had been realized in this little device, and I, you know,

0:15:24.960 --> 0:15:27.600
<v Speaker 1>I was like, Okay, this is amazing, This is amazing.

0:15:28.480 --> 0:15:31.080
<v Speaker 1>This is clearly an important moment in technology. And this

0:15:31.120 --> 0:15:32.720
<v Speaker 1>is what I've been doing all my life, right, solving

0:15:32.720 --> 0:15:35.360
<v Speaker 1>the problem problems. See, you know, we people who are

0:15:35.400 --> 0:15:38.680
<v Speaker 1>in the speech world are always saying this expression, you know,

0:15:38.720 --> 0:15:41.120
<v Speaker 1>speech is around the corner. We've been saying it since the

0:15:41.120 --> 0:15:44.120
<v Speaker 1>mid nineties, speech is around the corner, because we do believe

0:15:44.160 --> 0:15:47.000
<v Speaker 1>that if you do speech well, a lot of stuff

0:15:47.920 --> 0:15:52.120
<v Speaker 1>becomes easy to do for people. Ahmed was particularly impressed

0:15:52.120 --> 0:15:56.240
<v Speaker 1>by two innovations that Amazon made. One that they solved

0:15:56.240 --> 0:16:00.120
<v Speaker 1>the far field speech issue. And secondly, Alexa,

0:16:00.200 --> 0:16:03.760
<v Speaker 1>it was fast. The speed, right? The fact that it's

0:16:03.800 --> 0:16:05.720
<v Speaker 1>just I was, I was just amazed, Like, how the

0:16:05.720 --> 0:16:09.680
<v Speaker 1>hell does it come back within within like two seconds? I mean,

0:16:09.880 --> 0:16:12.200
<v Speaker 1>if I if I type, you know, if I want

0:16:12.200 --> 0:16:14.640
<v Speaker 1>to launch a page on my browser, or it takes

0:16:14.680 --> 0:16:17.840
<v Speaker 1>like sometimes like three four or five seconds. Right, How

0:16:17.920 --> 0:16:19.520
<v Speaker 1>is how the hell is this thing doing all of

0:16:19.520 --> 0:16:23.520
<v Speaker 1>these things, getting my voice going to the cloud, processing it,

0:16:24.040 --> 0:16:27.640
<v Speaker 1>coming back talking back within two seconds. It's like a

0:16:27.680 --> 0:16:30.360
<v Speaker 1>magic even for someone like me who was been like

0:16:30.440 --> 0:16:33.640
<v Speaker 1>you should be jaded by by by then, Right, okay,

0:16:33.680 --> 0:16:36.200
<v Speaker 1>I know exactly how it's happened. I was astonished. Right,

0:16:36.440 --> 0:16:39.080
<v Speaker 1>how the hell did they do that? We'll be right back.

0:16:50.080 --> 0:16:53.400
<v Speaker 1>I want to take a moment to address Amazon's obsessive secrecy.

0:16:54.040 --> 0:16:57.360
<v Speaker 1>Many companies are secretive, but tech companies today have the

0:16:57.400 --> 0:16:59.800
<v Speaker 1>eyes of the world on them, so they are taking

0:17:00.000 --> 0:17:03.600
<v Speaker 1>corporate secrecy to a whole other level. Like that story

0:17:03.680 --> 0:17:07.679
<v Speaker 1>Jeff Adams, the speech scientist, told, that the Amazon executives

0:17:07.840 --> 0:17:10.600
<v Speaker 1>were so secretive when they acquired a company that they

0:17:10.600 --> 0:17:14.160
<v Speaker 1>wouldn't even be seen together at a conference. That's typical

0:17:14.160 --> 0:17:18.040
<v Speaker 1>of Amazon. This intense drive for secrecy comes straight from

0:17:18.119 --> 0:17:21.480
<v Speaker 1>Jeff Bezos. He wants to tightly control the messaging around

0:17:21.480 --> 0:17:25.639
<v Speaker 1>Amazon's new products. The idea is that complete secrecy pays

0:17:25.720 --> 0:17:29.600
<v Speaker 1>off with a surprising, almost magical reveal once the product

0:17:29.680 --> 0:17:34.119
<v Speaker 1>is launched. That meant keeping Alexa under wraps until launch,

0:17:34.560 --> 0:17:38.480
<v Speaker 1>preserving the details of how it actually worked, and choosing

0:17:38.680 --> 0:17:42.520
<v Speaker 1>the perfect voice. Here's Greg Hart. How do you define

0:17:42.560 --> 0:17:46.000
<v Speaker 1>the characteristics that you want the voice to have? And

0:17:46.040 --> 0:17:48.120
<v Speaker 1>so there was sort of a brief that we wrote

0:17:48.160 --> 0:17:51.639
<v Speaker 1>up about the qualities you wanted the person to be knowledgeable.

0:17:51.680 --> 0:17:54.560
<v Speaker 1>We very quickly early on, decided the first voice would

0:17:54.560 --> 0:17:57.199
<v Speaker 1>be female. We knew there would be additional voices, but

0:17:57.280 --> 0:17:59.800
<v Speaker 1>we felt that the first voice should be female in

0:18:00.160 --> 0:18:04.400
<v Speaker 1>because, yeah, so that's the logical question: why? In part

0:18:04.480 --> 0:18:07.439
<v Speaker 1>because the we knew that the device would be in

0:18:07.560 --> 0:18:12.040
<v Speaker 1>the kitchen, and we felt that a female voice would

0:18:12.080 --> 0:18:15.680
<v Speaker 1>be more open and inviting and warm than a male

0:18:15.800 --> 0:18:18.719
<v Speaker 1>voice would be in that environment, and more appropriate in

0:18:18.760 --> 0:18:22.720
<v Speaker 1>that environment. Not because of any sexist things, but just

0:18:22.760 --> 0:18:25.480
<v Speaker 1>because of the fact that we knew, um that we

0:18:25.600 --> 0:18:28.920
<v Speaker 1>knew the way that, um, what's the right way to

0:18:28.960 --> 0:18:34.959
<v Speaker 1>say this. We had seen evidence that people respond differently

0:18:35.080 --> 0:18:39.040
<v Speaker 1>to male computer voices than to female computer voices, and

0:18:39.040 --> 0:18:42.760
<v Speaker 1>they respond more positively to female computer voices, and we

0:18:42.880 --> 0:18:45.520
<v Speaker 1>wanted because the device was in the home, we wanted

0:18:45.560 --> 0:18:48.919
<v Speaker 1>it to be a device that everybody would respond positively to.

0:18:49.400 --> 0:18:52.280
<v Speaker 1>Putting a female voice in the kitchen, so to speak,

0:18:52.440 --> 0:18:55.879
<v Speaker 1>would turn out to be a somewhat controversial choice, and

0:18:55.960 --> 0:18:59.520
<v Speaker 1>Amazon isn't the only company to do this. Siri, Google

0:18:59.600 --> 0:19:04.439
<v Speaker 1>Voice and Microsoft Cortana are all women by default. So

0:19:04.600 --> 0:19:07.640
<v Speaker 1>Amazon moved forward in their search for a voice. They

0:19:07.680 --> 0:19:10.640
<v Speaker 1>contracted with the same studio that had developed the voice

0:19:10.640 --> 0:19:13.520
<v Speaker 1>of Siri for Apple. Okay, I set up your meeting

0:19:13.520 --> 0:19:16.480
<v Speaker 1>with David tomorrow. Shall I schedule it? Who is voiced

0:19:16.480 --> 0:19:19.960
<v Speaker 1>by Susan Bennett, a career voice actress. Hi, my name

0:19:20.000 --> 0:19:22.800
<v Speaker 1>is Susan Bennett, and I am a voice actor and

0:19:23.440 --> 0:19:27.240
<v Speaker 1>the original voice of Siri. So to find their Alexa,

0:19:27.520 --> 0:19:30.760
<v Speaker 1>the studio had half a dozen female voice actresses read

0:19:30.800 --> 0:19:35.320
<v Speaker 1>for hours. They read entire books and random articles, and

0:19:35.400 --> 0:19:39.600
<v Speaker 1>finally Greg Hart and Jeff Bezos picked one woman. Her

0:19:39.680 --> 0:19:44.520
<v Speaker 1>identity hasn't been revealed in all this time. It's amazing

0:19:44.560 --> 0:19:47.600
<v Speaker 1>to me as an Amazon reporter that Amazon has still

0:19:47.640 --> 0:19:49.800
<v Speaker 1>been able to keep the voice of Alexa a secret.

0:19:50.200 --> 0:19:54.520
<v Speaker 1>In I started canvassing voice over actors, asking if they

0:19:54.600 --> 0:19:58.639
<v Speaker 1>knew the identity of Alexa. No one knew. Finally, I

0:19:58.680 --> 0:20:00.919
<v Speaker 1>got a tip from someone who had worked with the

0:20:00.960 --> 0:20:04.800
<v Speaker 1>studio that Amazon contracted with, and they said that Alexa

0:20:04.880 --> 0:20:07.600
<v Speaker 1>was voiced by an actress and singer from Boulder, Colorado

0:20:07.880 --> 0:20:12.080
<v Speaker 1>named Nina Rolle. I reached out to Rolle numerous times. She

0:20:12.119 --> 0:20:14.960
<v Speaker 1>wouldn't confirm that she's the voice of Alexa, but she

0:20:15.040 --> 0:20:18.879
<v Speaker 1>didn't deny it either. Eventually I confirmed with enough people

0:20:18.880 --> 0:20:23.000
<v Speaker 1>that I feel confident she's the one. She's Alexa. Here's

0:20:23.000 --> 0:20:24.960
<v Speaker 1>a clip from her website, an ad that she did

0:20:25.000 --> 0:20:29.000
<v Speaker 1>for Time Warner Cable. Thanks for choosing Time Warner. Now

0:20:29.000 --> 0:20:31.760
<v Speaker 1>that you've ordered your installation, let's set up a time

0:20:31.800 --> 0:20:35.760
<v Speaker 1>to get a crew over to your place. And here's Alexa.

0:20:36.240 --> 0:20:39.800
<v Speaker 1>Time Warner Cable, also simply known as Time Warner, was

0:20:39.800 --> 0:20:42.960
<v Speaker 1>an American cable television company. It was ranked the second

0:20:43.040 --> 0:20:45.720
<v Speaker 1>largest cable company in the United States. But imagine what

0:20:45.800 --> 0:20:48.720
<v Speaker 1>it might be like to be Nina Rolle. Your voice

0:20:48.760 --> 0:20:52.080
<v Speaker 1>is piped into millions of homes every day, each new

0:20:52.119 --> 0:20:55.439
<v Speaker 1>iteration of Alexa, each update for Amazon, you have to

0:20:55.480 --> 0:20:59.040
<v Speaker 1>record something new, Like when Amazon Fresh released a product

0:20:59.040 --> 0:21:02.879
<v Speaker 1>called the Single Cow Burger, she had to be available. Single

0:21:02.920 --> 0:21:05.719
<v Speaker 1>Cow Burger, a beef burger made with meat from just

0:21:05.800 --> 0:21:09.560
<v Speaker 1>a single cow. She's chained to Alexa. It sounds like

0:21:09.600 --> 0:21:14.080
<v Speaker 1>a life of obscurity and loneliness. Compare that to Siri

0:21:14.240 --> 0:21:18.159
<v Speaker 1>and Susan Bennett. She's spoken on CNN and countless talk shows.

0:21:18.520 --> 0:21:20.240
<v Speaker 1>She's even been able to use her fame as the

0:21:20.320 --> 0:21:25.040
<v Speaker 1>voice of Siri as leverage for many more opportunities. Greg

0:21:25.080 --> 0:21:28.240
<v Speaker 1>saw the burden that Amazon's secrecy put on the human

0:21:28.359 --> 0:21:33.520
<v Speaker 1>behind the voice of Alexa. I never met her um

0:21:34.560 --> 0:21:37.160
<v Speaker 1>and I don't even know that I might have. I may

0:21:37.160 --> 0:21:39.480
<v Speaker 1>not have spoken with her once, but I never met her.

0:21:40.040 --> 0:21:43.600
<v Speaker 1>And it would be interesting to be in her shoes now,

0:21:43.680 --> 0:21:46.840
<v Speaker 1>because it's on the one hand, it's it's incredible that

0:21:46.920 --> 0:21:50.040
<v Speaker 1>this thing that you contributed to is now so ubiquitous.

0:21:50.119 --> 0:21:52.480
<v Speaker 1>On the other hand, I would think it would be,

0:21:52.640 --> 0:21:56.960
<v Speaker 1>you know, maybe a little bit unsettling. Amazon introduced the

0:21:57.000 --> 0:22:01.160
<v Speaker 1>Echo in November two thousand fourteen were re strived from Amazon.

0:22:01.359 --> 0:22:04.080
<v Speaker 1>I didn't know what it was. With a YouTube video,

0:22:04.520 --> 0:22:12.720
<v Speaker 1>Alexa, play rock music. Rock music. Alexa, stop. We want

0:22:12.720 --> 0:22:16.000
<v Speaker 1>to try Alexa, what time is it? The time is three.

0:22:17.119 --> 0:22:19.159
<v Speaker 1>You actually don't have to yell at it. Okay. It

0:22:19.240 --> 0:22:21.760
<v Speaker 1>uses far-field technology so it can hear you from

0:22:21.840 --> 0:22:24.880
<v Speaker 1>anywhere in the room, So I can just hear you anywhere. Yes,

0:22:25.600 --> 0:22:29.280
<v Speaker 1>that promise was enough. A responsive computer that can tell

0:22:29.320 --> 0:22:32.840
<v Speaker 1>the time or the weather, play music, and answer questions

0:22:32.840 --> 0:22:36.000
<v Speaker 1>from across the room. Tens of thousands of people joined

0:22:36.000 --> 0:22:39.159
<v Speaker 1>a waiting list to receive a device. Here's Bezos at

0:22:39.200 --> 0:22:42.800
<v Speaker 1>the Recode Technology conference in two thousand sixteen saying that

0:22:42.840 --> 0:22:47.000
<v Speaker 1>the future is Alexa. He's speaking to Walt Mossberg. But

0:22:47.080 --> 0:22:49.879
<v Speaker 1>it has been a dream ever since, you know, people started,

0:22:49.920 --> 0:22:52.080
<v Speaker 1>you know, in the early days of science fiction to

0:22:52.160 --> 0:22:55.280
<v Speaker 1>have a computer that you can talk to. So are

0:22:55.320 --> 0:22:58.520
<v Speaker 1>you deeply committed to this becoming a huge part of

0:22:58.560 --> 0:23:02.119
<v Speaker 1>your business and what you... Absolutely. We've been working on it.

0:23:02.200 --> 0:23:04.320
<v Speaker 1>You know, we worked on it. We have more than

0:23:04.359 --> 0:23:09.160
<v Speaker 1>a thousand people dedicated to Alexa and the Echo ecosystem,

0:23:09.240 --> 0:23:12.320
<v Speaker 1>and it's a and there's so much more to come.

0:23:12.760 --> 0:23:15.960
<v Speaker 1>Bezos was deploying his playbook. For experiments that produced

0:23:16.000 --> 0:23:20.399
<v Speaker 1>promising sparks, he poured gasoline on them. Amazon ramped up

0:23:20.480 --> 0:23:24.600
<v Speaker 1>hiring. The Alexa team ballooned to ten thousand employees, and

0:23:24.680 --> 0:23:28.480
<v Speaker 1>Bezos paid around ten million dollars for the company's first

0:23:28.600 --> 0:23:32.280
<v Speaker 1>ever Super Bowl ad, starring Alec Baldwin and Missy Elliott.

0:23:33.920 --> 0:23:38.560
<v Speaker 1>Alexa, stop. How'd you do that? It's my Amazon

0:23:38.680 --> 0:23:44.119
<v Speaker 1>Echo. I can stream music, order things, and watch this. I

0:23:44.240 --> 0:23:50.719
<v Speaker 1>like to turn on the lights. Wow. Inside the company,

0:23:50.880 --> 0:23:53.560
<v Speaker 1>employees noticed that he seemed to take real joy

0:23:53.680 --> 0:23:57.280
<v Speaker 1>in the work. Here's Ahmed. Jeff, you know, the CEO

0:23:57.600 --> 0:24:01.920
<v Speaker 1>and the guy in charge, uh was I would say

0:24:01.960 --> 0:24:04.480
<v Speaker 1>he was obsessed with Alexa. I mean, he had this

0:24:04.600 --> 0:24:07.320
<v Speaker 1>saying that I am happiest. He used to say this,

0:24:07.440 --> 0:24:10.720
<v Speaker 1>I'm happiest when I'm working on Alexa, um, and

0:24:10.760 --> 0:24:13.159
<v Speaker 1>so he you know, it was his favorite time of

0:24:13.200 --> 0:24:15.399
<v Speaker 1>the day. He used to meet with the Alexa team and

0:24:15.440 --> 0:24:18.720
<v Speaker 1>work and work on Alexa. So he was directly involved

0:24:18.960 --> 0:24:23.359
<v Speaker 1>in the early days. Bezos was frequently asked about privacy.

0:24:23.520 --> 0:24:26.600
<v Speaker 1>At the two thousand sixteen Code Conference, he promised to

0:24:26.680 --> 0:24:30.399
<v Speaker 1>be a good steward of the sensitive personal data that

0:24:30.440 --> 0:24:33.080
<v Speaker 1>Alexa was sure to pick up. This is going to

0:24:33.160 --> 0:24:36.399
<v Speaker 1>get much deeper into our lives. So so what are

0:24:36.440 --> 0:24:38.560
<v Speaker 1>what are the privacy... I think that if you take

0:24:38.600 --> 0:24:44.000
<v Speaker 1>the totality of, you know, privacy, um, and our ability

0:24:44.040 --> 0:24:46.879
<v Speaker 1>to store large amounts of information to use it in

0:24:46.920 --> 0:24:49.960
<v Speaker 1>ways that customers actually do want us to use it.

0:24:50.080 --> 0:24:52.560
<v Speaker 1>So there are benefits. And I think one of the

0:24:52.560 --> 0:24:54.719
<v Speaker 1>things that you have to do is when you collect

0:24:54.720 --> 0:24:57.240
<v Speaker 1>and store data, you have to be clear about what

0:24:57.280 --> 0:25:00.359
<v Speaker 1>you're doing. You have to and not just you know,

0:25:00.920 --> 0:25:05.000
<v Speaker 1>subsection seventeen, paragraph three clearly as you can see at

0:25:05.000 --> 0:25:07.720
<v Speaker 1>our privacy policy, we were allowed to do that. But

0:25:07.720 --> 0:25:10.480
<v Speaker 1>you have to figure out ways to be kind of

0:25:10.520 --> 0:25:14.560
<v Speaker 1>obviously clear? He stood on stage and promised that the

0:25:14.600 --> 0:25:19.040
<v Speaker 1>privacy policy would be obviously clear. What he wasn't saying

0:25:19.520 --> 0:25:22.120
<v Speaker 1>was that Alexa was so smart in part because there

0:25:22.119 --> 0:25:25.960
<v Speaker 1>were real life humans listening through the machines, helping to

0:25:26.080 --> 0:25:29.399
<v Speaker 1>craft many of Alexa's answers and to fix its errors

0:25:29.440 --> 0:25:32.040
<v Speaker 1>by listening to what a subset of users said. The

0:25:32.160 --> 0:25:35.440
<v Speaker 1>company kept Alexa owners in the dark about how this

0:25:35.600 --> 0:25:39.520
<v Speaker 1>aspect of their devices worked. Ruthie Hope Slatis was one

0:25:39.560 --> 0:25:42.760
<v Speaker 1>of those people being paid to listen. Several years ago,

0:25:43.080 --> 0:25:45.840
<v Speaker 1>she saw an ad from a temp agency seeking someone

0:25:45.880 --> 0:25:49.280
<v Speaker 1>with an English or journalism degree. They offered twelve dollars

0:25:49.280 --> 0:25:52.920
<v Speaker 1>an hour to transcribe audio recordings. There was a kind

0:25:52.920 --> 0:25:55.760
<v Speaker 1>of a generic ad on Craigslist. I don't think

0:25:55.840 --> 0:25:59.960
<v Speaker 1>it mentioned Amazon at all. She applied. She passed a

0:26:00.080 --> 0:26:03.480
<v Speaker 1>security check and a grammar test, and then she was

0:26:03.560 --> 0:26:06.720
<v Speaker 1>let in on her task. She would listen to conversations

0:26:06.800 --> 0:26:09.600
<v Speaker 1>picked up by the Echo's microphones and type them up

0:26:09.800 --> 0:26:13.320
<v Speaker 1>and then feed the information back into the Amazon system.

0:26:13.359 --> 0:26:15.359
<v Speaker 1>She said it seemed like a good job. At first,

0:26:15.680 --> 0:26:19.280
<v Speaker 1>we heard the customer using it in you know, ordering

0:26:19.320 --> 0:26:22.080
<v Speaker 1>flour and asking what time it was and asking to

0:26:22.119 --> 0:26:24.879
<v Speaker 1>be told a joke and so forth, and it seemed

0:26:25.040 --> 0:26:28.080
<v Speaker 1>pretty cool. I mean a little invasive, for sure, but

0:26:28.080 --> 0:26:32.080
<v Speaker 1>but pretty cool overall. Ruthie was there to make Alexa's

0:26:32.119 --> 0:26:37.040
<v Speaker 1>responses seem more nuanced, more intuitive, more human. This was

0:26:37.119 --> 0:26:39.600
<v Speaker 1>right around the time Amazon started selling the Echo to

0:26:39.720 --> 0:26:43.960
<v Speaker 1>customers on a limited basis. She assumed at first that

0:26:44.040 --> 0:26:46.639
<v Speaker 1>the folks she was listening to had signed up to

0:26:46.680 --> 0:26:51.200
<v Speaker 1>help improve Amazon's speech recognition software, and knew that someone

0:26:51.400 --> 0:26:55.080
<v Speaker 1>might be listening. Who were all of these all of

0:26:55.119 --> 0:26:58.560
<v Speaker 1>these many, many voices and people? Were they various people

0:26:58.640 --> 0:27:02.920
<v Speaker 1>who work for Amazon who agreed to bring this home? Were

0:27:02.960 --> 0:27:07.080
<v Speaker 1>they you know? But then if they were, would they

0:27:07.160 --> 0:27:11.440
<v Speaker 1>really be ordering sex toys, you know, um and talking

0:27:11.480 --> 0:27:14.159
<v Speaker 1>dirty to it and all of the grotesque things that

0:27:14.200 --> 0:27:18.240
<v Speaker 1>we occasionally heard. Almost immediately, it was clear to Ruthie

0:27:18.240 --> 0:27:21.000
<v Speaker 1>that people liked to toy with Alexa in ways that

0:27:21.080 --> 0:27:24.520
<v Speaker 1>Amazon maybe didn't intend. What can we ask her? How

0:27:24.560 --> 0:27:27.080
<v Speaker 1>how will she respond if we abuse her, if we

0:27:27.160 --> 0:27:30.480
<v Speaker 1>talked dirty to her, if we asked her to marry us?

0:27:30.560 --> 0:27:33.399
<v Speaker 1>If we you know, that sort of thing. Most of

0:27:33.440 --> 0:27:40.080
<v Speaker 1>the sexual stuff seemed innocuous enough. Percent of it wasn't disturbing.

0:27:40.280 --> 0:27:43.600
<v Speaker 1>You know, it was sometimes humorous and sometimes you know,

0:27:43.720 --> 0:27:48.159
<v Speaker 1>kind of silly or weird or entertaining even you know,

0:27:48.240 --> 0:27:50.439
<v Speaker 1>like somebody trying to figure out what kind of dildo

0:27:50.520 --> 0:27:53.520
<v Speaker 1>to order or something like that, you know, but sometimes

0:27:53.720 --> 0:27:57.600
<v Speaker 1>it did get disturbing if it was a man who

0:27:57.920 --> 0:28:03.320
<v Speaker 1>was talking to her, like he would talk in an

0:28:03.359 --> 0:28:06.320
<v Speaker 1>abusive way to a woman, Like you could hear it

0:28:06.359 --> 0:28:09.120
<v Speaker 1>in his voice, And she wondered if men were being

0:28:09.160 --> 0:28:13.040
<v Speaker 1>so rude in part because Alexa was female, and being

0:28:13.080 --> 0:28:16.440
<v Speaker 1>a woman, even an ai woman, meant that Alexa would

0:28:16.440 --> 0:28:20.320
<v Speaker 1>be on the receiving end of misogyny. Even children did this.

0:28:20.720 --> 0:28:24.760
<v Speaker 1>There were a lot of kids who would talk abusively

0:28:24.840 --> 0:28:28.760
<v Speaker 1>to her, and it felt as though they were exorcising

0:28:29.240 --> 0:28:32.240
<v Speaker 1>some sort of anger that they had towards their parents

0:28:32.320 --> 0:28:35.679
<v Speaker 1>or their teacher or something like that. There was a

0:28:35.800 --> 0:28:40.080
<v Speaker 1>lot of psychology that I thought was fascinating. Ruthie felt

0:28:40.120 --> 0:28:42.640
<v Speaker 1>like the way people spoke to Alexa was so private,

0:28:42.920 --> 0:28:46.200
<v Speaker 1>so intimate. They would never ever speak like this if

0:28:46.200 --> 0:28:49.160
<v Speaker 1>they knew someone was listening. They're in their private home.

0:28:49.360 --> 0:28:51.760
<v Speaker 1>They think that no one will ever hear the words

0:28:51.800 --> 0:28:54.400
<v Speaker 1>that are coming out of their mouth. Ever, they do

0:28:54.720 --> 0:28:57.640
<v Speaker 1>know that they're talking to a robot, but at the

0:28:57.680 --> 0:29:01.880
<v Speaker 1>same time they're speaking to a robot as though it's

0:29:02.000 --> 0:29:06.080
<v Speaker 1>a sentient being. Ruthie's job was a secret. It wouldn't become

0:29:06.120 --> 0:29:09.600
<v Speaker 1>public until years later that people like her were helping

0:29:09.640 --> 0:29:14.120
<v Speaker 1>to improve the software by analyzing real Alexa recordings. My

0:29:14.280 --> 0:29:18.160
<v Speaker 1>colleagues first interviewed Ruthie in two thousand nineteen, and after

0:29:18.200 --> 0:29:21.120
<v Speaker 1>we published a story describing what she and her colleagues did,

0:29:21.520 --> 0:29:26.720
<v Speaker 1>Amazon acknowledged the listening program. Technically, Alexa's terms of use

0:29:26.800 --> 0:29:30.200
<v Speaker 1>gave Amazon wide latitude to do basically whatever it wanted

0:29:30.200 --> 0:29:33.040
<v Speaker 1>with the recordings, as long as the company was using

0:29:33.080 --> 0:29:37.240
<v Speaker 1>that audio to improve the software. But many customers felt duped.

0:29:37.520 --> 0:29:39.720
<v Speaker 1>They thought that when they spoke to their Alexa device,

0:29:39.960 --> 0:29:43.680
<v Speaker 1>they were only dealing with clever software, and Bezos had

0:29:43.720 --> 0:29:46.920
<v Speaker 1>gone back on his word, his precise promise to be

0:29:47.040 --> 0:29:50.440
<v Speaker 1>transparent with how he used his customers' data. It seemed

0:29:50.440 --> 0:29:54.720
<v Speaker 1>that while he highly prioritized Amazon's corporate privacy, its secrecy,

0:29:55.040 --> 0:29:57.959
<v Speaker 1>he was being careless with the privacy of his customers,

0:29:58.760 --> 0:30:01.680
<v Speaker 1>and in two thousand eighteen, Alexa had a huge

0:30:01.720 --> 0:30:05.960
<v Speaker 1>privacy mishap. Here's my colleague Priya Anand describing the

0:30:06.000 --> 0:30:10.640
<v Speaker 1>infamous incident. What happened there is a family in Portland

0:30:10.640 --> 0:30:14.880
<v Speaker 1>said that their echo randomly sent recorded conversations to a

0:30:15.000 --> 0:30:18.120
<v Speaker 1>contact of theirs, a contact of the man in the house's.

0:30:18.680 --> 0:30:21.200
<v Speaker 1>And what happened at the time was Amazon said this

0:30:21.240 --> 0:30:25.400
<v Speaker 1>is how it worked. They said that Alexa interpreted something

0:30:26.000 --> 0:30:29.240
<v Speaker 1>those people were saying in their background conversation as Alexa,

0:30:29.360 --> 0:30:32.440
<v Speaker 1>so it woke up. Then it misunderstood a different phrase

0:30:32.480 --> 0:30:36.320
<v Speaker 1>in their background conversation as send that message. It understood

0:30:36.320 --> 0:30:39.800
<v Speaker 1>a different phrase in their background conversation. After that, the

0:30:39.840 --> 0:30:43.520
<v Speaker 1>device asked to whom and thought that they responded with

0:30:43.560 --> 0:30:47.040
<v Speaker 1>a name, So it ended up sending these conversations to

0:30:47.360 --> 0:30:49.920
<v Speaker 1>one of their contacts through this crazy chain of events.

0:30:49.960 --> 0:30:53.120
<v Speaker 1>It's like a horror movie. After the couple's friend reached

0:30:53.160 --> 0:30:55.479
<v Speaker 1>out to them and said, hey, I think your Alexa

0:30:55.520 --> 0:30:58.320
<v Speaker 1>has been hacked, it turned out they weren't hacked. This

0:30:58.400 --> 0:31:01.800
<v Speaker 1>was Alexa mishearing. But at the end of the day,

0:31:01.840 --> 0:31:05.000
<v Speaker 1>this kind of scenario where Alexa could misinterpret a series

0:31:05.000 --> 0:31:08.400
<v Speaker 1>of things people were saying in their own home, interpret

0:31:08.440 --> 0:31:12.440
<v Speaker 1>them as requests, and send these people's recordings to someone else.

0:31:12.960 --> 0:31:17.120
<v Speaker 1>If this kind of unlikely scenario can happen, and did happen,

0:31:17.680 --> 0:31:20.800
<v Speaker 1>then who's to say it can't happen again? What else

0:31:20.800 --> 0:31:25.400
<v Speaker 1>can Alexa mix up? Even so, people were undeterred. Over

0:31:25.440 --> 0:31:28.560
<v Speaker 1>the next few years, Amazon would sell tens of millions

0:31:28.560 --> 0:31:32.560
<v Speaker 1>of Alexa devices and inspire imitators from Google and Apple.

0:31:33.080 --> 0:31:37.080
<v Speaker 1>Its popularity cemented Bezos's notion of himself as an inventor

0:31:37.480 --> 0:31:40.760
<v Speaker 1>and the general public's perception of Amazon as an innovator.

0:31:41.480 --> 0:31:44.320
<v Speaker 1>But Alexa hasn't quite met the goals its creators had

0:31:44.360 --> 0:31:47.560
<v Speaker 1>for it. Most people only use it as a kitchen timer,

0:31:47.880 --> 0:31:50.719
<v Speaker 1>a music player, a source for the occasional weather report.

0:31:51.280 --> 0:31:55.400
<v Speaker 1>It's never been conversational in the way Bezos imagined. No

0:31:55.440 --> 0:31:58.120
<v Speaker 1>one would call it the Star Trek computer, but that

0:31:58.240 --> 0:32:02.720
<v Speaker 1>hasn't stopped consumers. An estimated one in three US households

0:32:02.880 --> 0:32:06.480
<v Speaker 1>currently have a smart speaker. In the years since Alexa launched,

0:32:06.680 --> 0:32:10.040
<v Speaker 1>people have invited more items that integrate AI assistants into

0:32:10.040 --> 0:32:14.600
<v Speaker 1>their homes. That includes stuff like Nest thermostats, Ring cameras,

0:32:14.640 --> 0:32:18.760
<v Speaker 1>and Sonos speakers. Whenever a story about an Alexa privacy

0:32:18.800 --> 0:32:22.320
<v Speaker 1>breach gets published, it makes a big splash, but it

0:32:22.400 --> 0:32:26.320
<v Speaker 1>hardly ever changes anything. Because Alexa marked a turning point,

0:32:26.840 --> 0:32:30.520
<v Speaker 1>It was another doorway into a more surveilled world and

0:32:30.600 --> 0:32:34.520
<v Speaker 1>placed Amazon alongside other big tech companies like Google and

0:32:34.560 --> 0:32:40.440
<v Speaker 1>Facebook at the hot controversial center of the battle over privacy. Finally,

0:32:40.760 --> 0:32:43.960
<v Speaker 1>Alexa turned Amazon into a company its customers interacted

0:32:44.000 --> 0:32:47.400
<v Speaker 1>with almost every single day, not just once a week

0:32:47.640 --> 0:32:49.440
<v Speaker 1>or a few times a month when they wanted to

0:32:49.480 --> 0:32:53.000
<v Speaker 1>buy something online. And there was another way Bezos would

0:32:53.000 --> 0:32:56.600
<v Speaker 1>accomplish the same goal by producing his own TV shows

0:32:56.640 --> 0:33:00.840
<v Speaker 1>and movies. Bezos's risky expansion into Hollywood, that's

0:33:00.920 --> 0:33:14.040
<v Speaker 1>next time on Foundering: The Amazon Story. Foundering is hosted

0:33:14.080 --> 0:33:17.880
<v Speaker 1>by me, Brad Stone. Shawn Wen is our executive producer.

0:33:18.360 --> 0:33:21.920
<v Speaker 1>Priya Anand and Matt Day contributed reporting to this episode.

0:33:22.360 --> 0:33:26.680
<v Speaker 1>Raymondo is our audio engineer. Molly Nugent is our associate producer.

0:33:27.280 --> 0:33:30.680
<v Speaker 1>Mark Milian, Anne VanderMey, Robin Ajello, and Molly

0:33:30.760 --> 0:33:34.320
<v Speaker 1>Schuetz are our story editors. Francesca Levy is the head of

0:33:34.320 --> 0:33:37.760
<v Speaker 1>Bloomberg Podcasts. Be sure to subscribe, and if you like

0:33:37.840 --> 0:33:41.680
<v Speaker 1>our show, leave a review. Most importantly, tell your friends.

0:33:42.000 --> 0:33:42.880
<v Speaker 1>See you next time.