WEBVTT - Teaching Computers to See

0:00:15.356 --> 0:00:15.796
<v Speaker 1>Pushkin.

0:00:21.316 --> 0:00:24.036
<v Speaker 2>Over the past year, we've heard a lot about artificial

0:00:24.076 --> 0:00:28.676
<v Speaker 2>intelligence models that are really good at manipulating language. We've

0:00:28.676 --> 0:00:33.116
<v Speaker 2>heard somewhat less about AI that deals with images. It's

0:00:33.156 --> 0:00:36.956
<v Speaker 2>called computer vision, and it's a huge deal, which you

0:00:36.996 --> 0:00:40.876
<v Speaker 2>know obviously. Like language, vision is this core part of

0:00:40.916 --> 0:00:44.916
<v Speaker 2>the experience of being human. And on a more practical level,

0:00:45.316 --> 0:00:48.516
<v Speaker 2>computer vision is key for self-driving cars, and for

0:00:48.636 --> 0:00:52.476
<v Speaker 2>drones and for all kinds of industrial robots. As it

0:00:52.516 --> 0:00:56.236
<v Speaker 2>turns out, there was this one key moment in the

0:00:56.276 --> 0:01:00.516
<v Speaker 2>development of modern AI, for both vision and language, and

0:01:00.556 --> 0:01:03.916
<v Speaker 2>if you understand this moment, you understand a lot about

0:01:03.916 --> 0:01:13.036
<v Speaker 2>how AI works today. I'm Jacob Goldstein. This is What's

0:01:13.036 --> 0:01:15.556
<v Speaker 2>Your Problem, the show where I talk to people who

0:01:15.636 --> 0:01:19.996
<v Speaker 2>are trying to make technological progress. My guest today played

0:01:20.036 --> 0:01:23.436
<v Speaker 2>a central role in that key moment in AI history.

0:01:23.956 --> 0:01:25.196
<v Speaker 2>Her name is Fei-Fei

0:01:25.316 --> 0:01:25.676
<v Speaker 2>Li.

0:01:26.156 --> 0:01:29.716
<v Speaker 2>She's a Stanford computer scientist, the author of a memoir

0:01:29.756 --> 0:01:33.556
<v Speaker 2>called The Worlds I See, the former chief scientist of

0:01:33.676 --> 0:01:37.516
<v Speaker 2>AI and machine learning at Google, and just generally one

0:01:37.556 --> 0:01:40.676
<v Speaker 2>of the most important innovators in the history of computer vision.

0:01:41.476 --> 0:01:44.476
<v Speaker 2>I started our conversation with really a pretty general question.

0:01:45.116 --> 0:01:48.236
<v Speaker 2>I asked Fei-Fei just to explain what computer vision

0:01:48.356 --> 0:01:49.996
<v Speaker 2>is and why it's so important.

0:01:53.076 --> 0:01:59.236
<v Speaker 3>So computer vision is about enabling computers and machines to

0:01:59.596 --> 0:02:05.316
<v Speaker 3>have visual intelligence. What is visual intelligence? Well, the best

0:02:06.036 --> 0:02:12.716
<v Speaker 3>example comes from humans who are extremely visually intelligent animals,

0:02:13.156 --> 0:02:18.116
<v Speaker 3>so that we can make an omelet by knowing what

0:02:18.316 --> 0:02:20.996
<v Speaker 3>is in our fridge. How do we go and take

0:02:21.076 --> 0:02:23.916
<v Speaker 3>the egg out, how do we take the tomato out?

0:02:24.036 --> 0:02:29.036
<v Speaker 3>How do we plan the cooking of the omelet? How

0:02:29.076 --> 0:02:32.996
<v Speaker 3>do we interact with every ingredient, and how do we

0:02:33.116 --> 0:02:38.636
<v Speaker 3>understand all the changes of the objects? All this

0:02:39.036 --> 0:02:40.596
<v Speaker 3>is part of visual intelligence.

0:02:41.076 --> 0:02:43.276
<v Speaker 2>Yeah, I mean you write in your book that vision

0:02:43.316 --> 0:02:47.676
<v Speaker 2>isn't just an application of our intelligence, it is synonymous

0:02:47.716 --> 0:02:50.116
<v Speaker 2>with our intelligence, which is something I want to talk

0:02:50.156 --> 0:02:53.756
<v Speaker 2>more about. But before we get into human vision and

0:02:53.796 --> 0:02:57.036
<v Speaker 2>how that led you into computer vision, just give me

0:02:57.116 --> 0:03:00.356
<v Speaker 2>a sense of some of the applications, both the current

0:03:00.436 --> 0:03:05.916
<v Speaker 2>applications of computer vision and potential future applications of computer vision.

0:03:06.396 --> 0:03:09.556
<v Speaker 3>In fact, we're already using computer vision to do a

0:03:09.636 --> 0:03:14.116
<v Speaker 3>lot of things. The most obvious example is all kinds

0:03:14.116 --> 0:03:18.516
<v Speaker 3>of driver-assistance programs. Right, we're not even talking about

0:03:18.516 --> 0:03:22.236
<v Speaker 3>self-driving cars. We're talking about lane detection, or talking

0:03:22.276 --> 0:03:28.076
<v Speaker 3>about avoiding curbs, pedestrian alerts. You know, we are

0:03:28.196 --> 0:03:33.516
<v Speaker 3>using computer vision in our healthcare system, in radiology, in pathology,

0:03:34.156 --> 0:03:39.476
<v Speaker 3>or, you know, in the protection of species. A lot of

0:03:40.556 --> 0:03:44.956
<v Speaker 3>the camera traps in the deep forests are

0:03:45.076 --> 0:03:50.476
<v Speaker 3>using computer vision to track different animals. So

0:03:50.756 --> 0:03:53.796
<v Speaker 3>we're using computer vision already on a daily basis.

0:03:54.036 --> 0:03:57.916
<v Speaker 2>And then when you dream of some applications that are

0:03:57.956 --> 0:04:00.476
<v Speaker 2>not here yet but that might be here in whatever

0:04:00.516 --> 0:04:02.236
<v Speaker 2>five or ten years, what do you think of? What's

0:04:02.236 --> 0:04:03.236
<v Speaker 2>at the top of the list?

0:04:04.156 --> 0:04:07.156
<v Speaker 3>So when I dream of computer vision, I dream of

0:04:07.316 --> 0:04:10.676
<v Speaker 3>all kinds of robotic applications, from self-driving cars

0:04:10.836 --> 0:04:14.916
<v Speaker 3>to personal robots using computer vision. I dream of our

0:04:14.956 --> 0:04:20.276
<v Speaker 3>biodiversity being mapped using computer vision. I dream of exploration

0:04:20.596 --> 0:04:21.716
<v Speaker 3>using computer vision.

0:04:21.956 --> 0:04:25.596
<v Speaker 2>Wonderful. So I want to talk about your work in

0:04:25.716 --> 0:04:31.796
<v Speaker 2>computer vision, which goes back, well, decades now, and I

0:04:31.876 --> 0:04:37.436
<v Speaker 2>want to start with work not on computers actually, but

0:04:37.556 --> 0:04:45.356
<v Speaker 2>on human beings, right, on understanding how humans process

0:04:45.436 --> 0:04:48.116
<v Speaker 2>visual information, right, how we make sense of what we're seeing.

0:04:48.516 --> 0:04:51.636
<v Speaker 2>And in the book, you write in particular about this

0:04:52.716 --> 0:04:57.076
<v Speaker 2>nineteen ninety-six paper with a boring name that was

0:04:57.116 --> 0:05:00.196
<v Speaker 2>a huge deal. It was called "Speed of Processing in

0:05:00.236 --> 0:05:04.156
<v Speaker 2>the Human Visual System." Tell me about that paper and

0:05:04.156 --> 0:05:05.156
<v Speaker 2>what it meant.

0:05:05.356 --> 0:05:11.156
<v Speaker 3>It's a paper using EEG, which is recording electrical

0:05:11.196 --> 0:05:17.236
<v Speaker 3>brain waves to make a link between how fast can

0:05:17.356 --> 0:05:22.276
<v Speaker 3>humans make a very complex visual decision when they see something.

0:05:22.356 --> 0:05:27.236
<v Speaker 3>And the particular decision humans were to make is to

0:05:28.276 --> 0:05:34.356
<v Speaker 3>separate images containing animals from images not containing animals.

0:05:34.676 --> 0:05:38.876
<v Speaker 3>And if you think about it, the pool of possibilities is

0:05:38.996 --> 0:05:45.316
<v Speaker 3>extremely complex. It's actually mathematically just an infinite possibility because

0:05:45.756 --> 0:05:48.236
<v Speaker 3>there are so many different types of animals, so many

0:05:48.276 --> 0:05:53.196
<v Speaker 3>different types of non-animals. That's infinite for practical purposes.

0:05:53.716 --> 0:05:56.396
<v Speaker 3>And then you put them in photos, you can get

0:05:56.636 --> 0:06:01.956
<v Speaker 3>infinite possibilities of photos. Yet you show them one by

0:06:02.116 --> 0:06:06.116
<v Speaker 3>one to humans, they make decisions really quickly, and they

0:06:06.116 --> 0:06:08.076
<v Speaker 3>make correct decisions really quickly.

0:06:09.196 --> 0:06:14.436
<v Speaker 2>Really quickly, but like mind-bogglingly quickly. At the time, right,

0:06:14.476 --> 0:06:18.676
<v Speaker 2>it was shocking just how fast it was, right, milliseconds.

0:06:18.916 --> 0:06:22.556
<v Speaker 3>Yeah. So the thing is we kind of sort of

0:06:22.756 --> 0:06:26.436
<v Speaker 3>know we're good at seeing, right, as a species. We

0:06:26.556 --> 0:06:29.196
<v Speaker 3>know we open our eyes we see the world, but

0:06:29.476 --> 0:06:32.476
<v Speaker 3>we don't really know how good and how fast.

0:06:33.036 --> 0:06:35.596
<v Speaker 2>And we underestimate it. It's a rare case

0:06:35.596 --> 0:06:38.436
<v Speaker 2>where human beings underestimate ourselves.

0:06:38.796 --> 0:06:44.436
<v Speaker 3>Exactly. This is a rigorous scientific study that put a time, an actual

0:06:44.596 --> 0:06:48.916
<v Speaker 3>time to that speed of visual intelligence, and it's using

0:06:48.996 --> 0:06:53.316
<v Speaker 3>modern techniques. It's very smart and very, very exciting.

0:06:54.756 --> 0:06:58.596
<v Speaker 2>What did it mean to you when you saw that result?

0:06:58.676 --> 0:06:59.676
<v Speaker 2>When you read that paper?

0:07:00.436 --> 0:07:03.556
<v Speaker 3>When I read that paper, it meant north star. Let

0:07:03.556 --> 0:07:07.116
<v Speaker 3>me explain what north star means. As north star?

0:07:07.236 --> 0:07:07.516
<v Speaker 2>Okay?

0:07:07.596 --> 0:07:12.756
<v Speaker 3>Yeah. As a scientist, I'm driven by finding answers to

0:07:12.796 --> 0:07:18.076
<v Speaker 3>the most audacious questions. But as Einstein has said,

0:07:18.676 --> 0:07:24.796
<v Speaker 3>in scientific inquiry, the hardest job is not finding

0:07:24.876 --> 0:07:30.236
<v Speaker 3>solutions, it's asking the right question. Because, you know, when

0:07:30.276 --> 0:07:32.996
<v Speaker 3>we talk about visual intelligence, it's such a

0:07:33.116 --> 0:07:37.756
<v Speaker 3>vast topic. What is the topic to pursue? What is

0:07:37.796 --> 0:07:41.996
<v Speaker 3>the question to ask that is fundamental to visual intelligence?

0:07:42.436 --> 0:07:45.836
<v Speaker 3>And how do we unlock it? When we read

0:07:45.876 --> 0:07:54.156
<v Speaker 3>that Simon Thorpe paper, it convinced me that complex object categorization,

0:07:54.876 --> 0:08:00.556
<v Speaker 3>the ability to classify you know, animal versus no animal,

0:08:01.116 --> 0:08:07.076
<v Speaker 3>chair versus, you know, table, hot dog versus hamburger.

0:08:07.316 --> 0:08:10.956
<v Speaker 3>You know, this is fundamental to humans. It's a

0:08:10.956 --> 0:08:15.276
<v Speaker 3>building block of visual intelligence. It has a neural

0:08:15.356 --> 0:08:22.996
<v Speaker 3>correlate in the human brain that shows how evolutionarily optimized

0:08:23.196 --> 0:08:27.476
<v Speaker 3>it is. So with all that evidence, it convinced me

0:08:28.036 --> 0:08:32.036
<v Speaker 3>object categorization is a north star to pursue.

0:08:32.676 --> 0:08:35.076
<v Speaker 2>And you were a grad student at the time, right,

0:08:35.156 --> 0:08:38.516
<v Speaker 2>this is sort of the thing any ambitious grad

0:08:38.556 --> 0:08:40.476
<v Speaker 2>student is going to be doing. It's like, I know,

0:08:40.476 --> 0:08:42.836
<v Speaker 2>I'm interested in this field, but I need my question,

0:08:42.956 --> 0:08:45.316
<v Speaker 2>I need my thing, right, And so now you've got

0:08:45.356 --> 0:08:49.596
<v Speaker 2>your thing, yes, and it's categorization in particular. And you

0:08:49.676 --> 0:08:55.916
<v Speaker 2>describe how earlier theories of how humans process visual input

0:08:56.076 --> 0:08:59.836
<v Speaker 2>were not so categorization-focused, right. It was kind of

0:08:59.876 --> 0:09:02.316
<v Speaker 2>like if you just sort of thought from first principles,

0:09:02.556 --> 0:09:04.756
<v Speaker 2>you would think, well, we see color and we see

0:09:04.796 --> 0:09:06.516
<v Speaker 2>shapes and then we kind of make sense of it.

0:09:06.796 --> 0:09:09.556
<v Speaker 2>But with this paper and related work, what it shows is

0:09:09.596 --> 0:09:12.876
<v Speaker 2>like, that's actually not it, right? And in fact, our brains,

0:09:13.276 --> 0:09:15.356
<v Speaker 2>you write about how there are specific regions of the

0:09:15.356 --> 0:09:19.596
<v Speaker 2>brain, like this region is just the face categorization region,

0:09:19.596 --> 0:09:21.796
<v Speaker 2>and this region is the, like, places we go

0:09:21.876 --> 0:09:24.636
<v Speaker 2>all the time region. And so it's a really different

0:09:24.716 --> 0:09:29.196
<v Speaker 2>and interesting way of thinking about seeing, and it's fundamentally

0:09:29.276 --> 0:09:35.036
<v Speaker 2>about just incredibly quickly putting things into categories. And so

0:09:35.356 --> 0:09:38.956
<v Speaker 2>you decide to take this idea of vision and categorization

0:09:39.076 --> 0:09:43.196
<v Speaker 2>and try and figure out how to get

0:09:43.236 --> 0:09:45.636
<v Speaker 2>computers to do this, right, how to get computers to

0:09:45.716 --> 0:09:52.036
<v Speaker 2>be able to categorize objects from the world. And you

0:09:53.076 --> 0:09:58.636
<v Speaker 2>start building these data sets essentially of labeled images, right,

0:09:58.676 --> 0:10:01.596
<v Speaker 2>And you build what seems in retrospect like a relatively

0:10:01.596 --> 0:10:04.836
<v Speaker 2>small one at Caltech, and then you decide to build

0:10:04.996 --> 0:10:07.276
<v Speaker 2>a really big one. Right. It comes to be called

0:10:07.316 --> 0:10:10.236
<v Speaker 2>ImageNet. It's a thing you're famous for, nerd

0:10:10.276 --> 0:10:15.196
<v Speaker 2>famous for. And I want to talk about building ImageNet, right?

0:10:15.236 --> 0:10:18.276
<v Speaker 2>So tell me about deciding to build what becomes ImageNet.

0:10:19.556 --> 0:10:24.756
<v Speaker 3>So ImageNet is the north star. To me,

0:10:25.356 --> 0:10:28.356
<v Speaker 3>I was in the field long enough because I finished

0:10:28.356 --> 0:10:31.876
<v Speaker 3>my PhD, I started my own lab. I had this

0:10:32.236 --> 0:10:38.996
<v Speaker 3>unwavering faith and belief that, you know, unlocking object recognition

0:10:39.356 --> 0:10:42.316
<v Speaker 3>is a north star, is

0:10:42.356 --> 0:10:45.596
<v Speaker 3>a critical north star. And I became impatient because I

0:10:45.676 --> 0:10:49.836
<v Speaker 3>realized we were not making enough progress. I realized that,

0:10:51.276 --> 0:10:55.716
<v Speaker 3>especially algorithmically, we were like running in circles a little

0:10:55.756 --> 0:11:01.356
<v Speaker 3>bit, optimizing very small algorithms that are not really

0:11:01.396 --> 0:11:05.356
<v Speaker 3>getting to the essence of the problem, and part of

0:11:05.396 --> 0:11:09.116
<v Speaker 3>the essence, which a lot of people overlooked, is actually

0:11:09.316 --> 0:11:12.956
<v Speaker 3>the scale of the problem. What was really bothering me

0:11:13.636 --> 0:11:17.116
<v Speaker 3>is that we were not seeing the problem. We're not

0:11:17.196 --> 0:11:21.396
<v Speaker 3>seeing the mathematical problem with the scale thinking, because it's

0:11:21.436 --> 0:11:27.076
<v Speaker 3>not just about being big. It's about the mathematical reason

0:11:27.636 --> 0:11:31.076
<v Speaker 3>of why we should go big, and it's a

0:11:31.156 --> 0:11:35.156
<v Speaker 3>very deep reason. In general, it's a reason for what

0:11:35.196 --> 0:11:39.276
<v Speaker 3>we call generalization. You have to learn enough to be

0:11:39.436 --> 0:11:44.036
<v Speaker 3>able to see everything. And that mindset...

0:11:43.836 --> 0:11:45.716
<v Speaker 2>You've got to see a lot of pictures of things that are

0:11:45.756 --> 0:11:47.836
<v Speaker 2>cats and not cats to understand what is.

0:11:47.836 --> 0:11:52.956
<v Speaker 3>Right. That mindset was just not, you know,

0:11:53.036 --> 0:11:57.836
<v Speaker 3>that's a big data mindset. It was not in the

0:11:57.876 --> 0:11:59.316
<v Speaker 3>world at all at that time.

0:11:59.556 --> 0:12:02.476
<v Speaker 2>So how did you get there? Because

0:12:02.516 --> 0:12:05.556
<v Speaker 2>what you end up doing is building this just gargantuan

0:12:06.316 --> 0:12:10.356
<v Speaker 2>thing full of labeled images, bigger than anybody'd ever

0:12:10.356 --> 0:12:12.116
<v Speaker 2>built before. Like, how did you arrive at that?

0:12:12.116 --> 0:12:15.396
<v Speaker 3>That's a great question. I think that's actually the most

0:12:15.436 --> 0:12:17.956
<v Speaker 3>fun but difficult part of the book to write, is

0:12:18.236 --> 0:12:24.156
<v Speaker 3>you know, like, to dig into my own brain. In hindsight,

0:12:24.236 --> 0:12:29.756
<v Speaker 3>it's just little by little the insight and the realization, the epiphany.

0:12:30.196 --> 0:12:32.956
<v Speaker 3>But honestly, I don't know how to analyze my own brain.

0:12:33.236 --> 0:12:40.236
<v Speaker 3>I had the mathematical intuition that scale makes a difference,

0:12:40.276 --> 0:12:44.476
<v Speaker 3>a bigger difference than most people give it credit for. I also

0:12:44.636 --> 0:12:51.276
<v Speaker 3>had the neurocognitive science inspiration that early human development was

0:12:51.356 --> 0:12:55.476
<v Speaker 3>exposure to the world in continuous ways. We don't like

0:12:55.676 --> 0:12:59.676
<v Speaker 3>lock the baby in a dark room and show them,

0:12:59.836 --> 0:13:04.316
<v Speaker 3>you know, one hundred cats. They just go out and experience.

0:13:04.676 --> 0:13:08.116
<v Speaker 3>You know, that experience is actually driven by big data.

0:13:09.276 --> 0:13:13.156
<v Speaker 3>Maybe I was also inspired by this Internet age coming

0:13:13.236 --> 0:13:16.316
<v Speaker 3>our way, right? Like, that part, I do think it's

0:13:16.356 --> 0:13:21.316
<v Speaker 3>a little bit a moment of just being alone, and somehow

0:13:22.196 --> 0:13:26.996
<v Speaker 3>all the stars aligned in my head. I decided I'm

0:13:27.036 --> 0:13:32.996
<v Speaker 3>going to try the craziest thing, and I did have

0:13:33.676 --> 0:13:36.796
<v Speaker 3>a faith and belief that it was the right thing

0:13:36.876 --> 0:13:37.156
<v Speaker 3>to do.

0:13:37.316 --> 0:13:39.796
<v Speaker 2>And specifically, like, what was this thing that you were

0:13:39.796 --> 0:13:40.316
<v Speaker 2>going to build?

0:13:41.036 --> 0:13:46.876
<v Speaker 3>I'm gonna get the entire Internet of images, consisting of

0:13:47.396 --> 0:13:50.316
<v Speaker 3>all the objects I can get my hands on that

0:13:50.396 --> 0:13:55.156
<v Speaker 3>humans have ever taken pictures of and catalog them in

0:13:55.236 --> 0:14:00.796
<v Speaker 3>a gigantic, big database. And I will use that to

0:14:00.876 --> 0:14:05.756
<v Speaker 3>do two things. To train machines to recognize the entire

0:14:05.876 --> 0:14:11.476
<v Speaker 3>world of objects, and also to benchmark everybody's progress. You

0:14:11.516 --> 0:14:15.876
<v Speaker 3>know, by everybody, I mean the international community of computer vision scientists.

0:14:15.876 --> 0:14:18.556
<v Speaker 2>So you will have this database and then everyone can

0:14:18.636 --> 0:14:21.676
<v Speaker 2>train their computer vision models on your database and see

0:14:21.676 --> 0:14:22.996
<v Speaker 2>how they do on new images.

0:14:23.316 --> 0:14:23.756
<v Speaker 3>Yes.

0:14:25.876 --> 0:14:28.036
<v Speaker 2>So you have to decide. There's this interesting part of the

0:14:28.076 --> 0:14:29.836
<v Speaker 2>book where you're like, Okay, I want to build a

0:14:29.876 --> 0:14:34.476
<v Speaker 2>database with everything in it. How many categories of everything

0:14:34.556 --> 0:14:37.116
<v Speaker 2>are there? Right, somebody's actually done that research. If you

0:14:37.196 --> 0:14:41.076
<v Speaker 2>take all the things, how many kinds of things are there?

0:14:41.276 --> 0:14:42.076
<v Speaker 2>What's the number?

0:14:43.916 --> 0:14:48.476
<v Speaker 3>The number is the Biederman number, and the Biederman number

0:14:48.916 --> 0:14:54.796
<v Speaker 3>is... I'm proud of really giving Professor Biederman

0:14:54.916 --> 0:14:59.076
<v Speaker 3>that credit. Yeah, nobody noticed that number. He's

0:14:59.116 --> 0:15:03.156
<v Speaker 3>a cognitive scientist who wrote a very, very good, but

0:15:03.276 --> 0:15:06.236
<v Speaker 3>I don't think famous, paper in the nineteen

0:15:06.276 --> 0:15:12.196
<v Speaker 3>eighties, guesstimating or estimating with a back-of-the-envelope

0:15:12.236 --> 0:15:17.036
<v Speaker 3>computation how many visual concepts humans see. And that

0:15:17.196 --> 0:15:20.756
<v Speaker 3>is a very hard number. How do you interrogate a

0:15:20.836 --> 0:15:25.116
<v Speaker 3>person and say, list me all the visual concepts. It's impossible.

0:15:25.556 --> 0:15:30.196
<v Speaker 3>But he had a way of using a dictionary and using

0:15:30.316 --> 0:15:33.956
<v Speaker 3>visual structure to estimate, and he put a number of

0:15:33.996 --> 0:15:36.436
<v Speaker 3>thirty thousand visual concepts.

0:15:36.876 --> 0:15:40.396
<v Speaker 2>There are thirty thousand different sort of kinds of things, right,

0:15:40.476 --> 0:15:44.276
<v Speaker 2>that people can identify, differentiate. Yeah, it's a lot.

0:15:44.476 --> 0:15:45.236
<v Speaker 3>That's a lot.

0:15:46.036 --> 0:15:51.636
<v Speaker 2>Yeah. And every concept you're setting out, if you're setting out... sorry,

0:15:51.676 --> 0:15:53.356
<v Speaker 2>is that your number? Is that your number?

0:15:53.556 --> 0:15:57.356
<v Speaker 3>That was my number. I was obsessed with that number,

0:15:57.956 --> 0:16:00.996
<v Speaker 3>and I was obsessed in a way that I feel

0:16:00.996 --> 0:16:03.476
<v Speaker 3>I was kind of crazy because nobody was obsessed with

0:16:03.516 --> 0:16:06.636
<v Speaker 3>that number. Nobody even knew. I think my book is

0:16:06.676 --> 0:16:10.076
<v Speaker 3>the book that gave the number a name, which is the

0:16:10.236 --> 0:16:12.716
<v Speaker 3>Biederman number, and I'm very proud of that.

0:16:13.196 --> 0:16:17.156
<v Speaker 2>Can you just rattle off some of the categories?

0:16:17.356 --> 0:16:23.276
<v Speaker 3>Star-nosed mole. Star-nosed mole is a category to itself. That's

0:16:23.356 --> 0:16:31.436
<v Speaker 3>my favorite, one of my favorite categories. And gardenia, the

0:16:31.756 --> 0:16:37.676
<v Speaker 3>flower, and Windsor chair. There were hundreds of dogs. I

0:16:37.756 --> 0:16:43.076
<v Speaker 3>remember there were different kinds of cars, like

0:16:43.356 --> 0:16:51.956
<v Speaker 3>sports sedans and monocycles. It's a lot.

0:16:54.836 --> 0:16:57.876
<v Speaker 2>So Fei-Fei Li has her number, she has her big idea.

0:16:58.476 --> 0:17:02.836
<v Speaker 2>She knows what she needs to build a gigantic image database.

0:17:03.356 --> 0:17:04.836
<v Speaker 2>But how do you actually do that?

0:17:05.756 --> 0:17:19.436
<v Speaker 1>We'll have the answer in just a minute.

0:17:19.756 --> 0:17:24.316
<v Speaker 2>Okay, so you've got your giant north star task ahead

0:17:24.316 --> 0:17:28.236
<v Speaker 2>of you. Not only do you have, you know, thirty

0:17:28.276 --> 0:17:33.876
<v Speaker 2>thousand ish categories to deal with, presumably for each category

0:17:33.916 --> 0:17:38.916
<v Speaker 2>you need many many thousands. So it's thousands of images

0:17:38.956 --> 0:17:43.356
<v Speaker 2>per category, tens of thousands of categories. What is the

0:17:43.476 --> 0:17:44.276
<v Speaker 2>order of magnitude?

0:17:44.396 --> 0:17:47.236
<v Speaker 3>We're talking about tens of millions.

0:17:47.276 --> 0:17:50.476
<v Speaker 2>Tens of millions. And this is not a time

0:17:51.236 --> 0:17:53.676
<v Speaker 2>where you can do this in an automated or semi

0:17:53.676 --> 0:17:55.276
<v Speaker 2>automated way like you could now.

0:17:55.476 --> 0:17:58.476
<v Speaker 3>No, I mean the point is, the machines cannot do it.

0:17:58.636 --> 0:18:01.596
<v Speaker 3>We have to. This is a north star to push

0:18:01.716 --> 0:18:04.196
<v Speaker 3>machines towards that, So you have to do it by

0:18:04.236 --> 0:18:06.436
<v Speaker 3>human hand, and the good news...

0:18:06.276 --> 0:18:11.516
<v Speaker 2>Downloading and labeling, yeah, millions or tens of millions

0:18:11.516 --> 0:18:12.236
<v Speaker 2>of images.

0:18:11.996 --> 0:18:17.916
<v Speaker 3>Downloading, cleaning, labeling, and yes, that was the task.

0:18:18.516 --> 0:18:21.116
<v Speaker 2>So now you're like Henry Ford or something, right? Now

0:18:21.156 --> 0:18:24.756
<v Speaker 2>you need an assembly line, you need a factory for

0:18:25.596 --> 0:18:26.716
<v Speaker 2>creating this database.

0:18:26.956 --> 0:18:30.516
<v Speaker 3>Yeah, you can put it that way. And we needed

0:18:30.836 --> 0:18:35.636
<v Speaker 3>a global workforce, and eventually we found them on Amazon

0:18:35.676 --> 0:18:39.116
<v Speaker 3>Mechanical Turk. It's an online global market.

0:18:39.476 --> 0:18:42.596
<v Speaker 2>It's a market for project-based work, right, people doing

0:18:42.836 --> 0:18:47.716
<v Speaker 2>project-based work. And so how long does it

0:18:47.756 --> 0:18:49.396
<v Speaker 2>take you to build this thing? And how big is

0:18:49.396 --> 0:18:50.756
<v Speaker 2>it when it's done?

0:18:51.276 --> 0:18:54.236
<v Speaker 3>It took us three years. When it was done, it was

0:18:54.316 --> 0:19:02.516
<v Speaker 3>fifteen million hand-cleaned, sorted, curated, labeled images across twenty

0:19:02.596 --> 0:19:04.076
<v Speaker 3>two thousand categories.

0:19:04.356 --> 0:19:06.556
<v Speaker 2>So now you have this thing. It's called ImageNet,

0:19:06.636 --> 0:19:10.556
<v Speaker 2>and basically the function of it is it itself is

0:19:10.596 --> 0:19:12.916
<v Speaker 2>not useful, right, It is there to train. Well, it's

0:19:12.956 --> 0:19:15.756
<v Speaker 2>useful as a means to an end. It's there for

0:19:15.996 --> 0:19:20.276
<v Speaker 2>people who have models that aim to teach computers vision,

0:19:20.876 --> 0:19:23.996
<v Speaker 2>to see and understand, to train their models. Now there's

0:19:24.076 --> 0:19:27.836
<v Speaker 2>this giant database. I mean people talk about this as

0:19:28.156 --> 0:19:30.636
<v Speaker 2>kind of one of the beginnings of big data.

0:19:30.876 --> 0:19:36.236
<v Speaker 3>Yes, yeah, I think it should be properly recognized as

0:19:36.316 --> 0:19:39.676
<v Speaker 3>the beginning of big data in AI, because before this,

0:19:40.676 --> 0:19:44.196
<v Speaker 3>there isn't this concept of big data in AI. It

0:19:45.356 --> 0:19:48.836
<v Speaker 3>was just a paradigm shift from that point of view.

0:19:49.956 --> 0:19:53.116
<v Speaker 2>And so you create this contest where people can come

0:19:53.876 --> 0:19:56.996
<v Speaker 2>and train their models on ImageNet, on this giant

0:19:57.076 --> 0:20:00.916
<v Speaker 2>database that you've built, and then and then in the

0:20:00.916 --> 0:20:04.196
<v Speaker 2>contest their models will be shown new images, images not

0:20:04.316 --> 0:20:07.436
<v Speaker 2>in the database, and you'll see how good they are.

0:20:07.636 --> 0:20:10.476
<v Speaker 2>And for a while it's like going okay, right, but

0:20:11.356 --> 0:20:14.596
<v Speaker 2>kind of slow. Like, in the book you write

0:20:14.596 --> 0:20:16.476
<v Speaker 2>about like you get a little worried. You've built this

0:20:16.596 --> 0:20:18.756
<v Speaker 2>giant thing with people all around the world and it's

0:20:18.796 --> 0:20:22.396
<v Speaker 2>not for a while leading to the breakthroughs that you

0:20:22.436 --> 0:20:23.196
<v Speaker 2>had imagined.

0:20:23.356 --> 0:20:28.276
<v Speaker 3>Yeah, first of all, we open sourced this.

0:20:28.596 --> 0:20:32.516
<v Speaker 3>Even though we spent a lot of sweat

0:20:32.516 --> 0:20:35.836
<v Speaker 3>and tears, you know, building this, we knew the

0:20:35.876 --> 0:20:38.756
<v Speaker 3>real value is to open source. So we gave it

0:20:38.836 --> 0:20:44.996
<v Speaker 3>for free to the whole community. And then I wanted

0:20:45.076 --> 0:20:49.156
<v Speaker 3>everybody to use it. I wanted to see this driving

0:20:50.636 --> 0:20:53.276
<v Speaker 3>all of us towards the north star. I wanted the

0:20:53.276 --> 0:20:59.476
<v Speaker 3>field to work out. But it wasn't like an overnight success.

0:20:59.636 --> 0:21:02.516
<v Speaker 3>It wasn't like everybody was running around saying, oh my god,

0:21:02.556 --> 0:21:08.476
<v Speaker 3>there's ImageNet to use. And of course we were not.

0:21:10.036 --> 0:21:13.516
<v Speaker 3>You know, we were disappointed, but we were not sitting

0:21:13.556 --> 0:21:15.356
<v Speaker 3>there crying. We were just disappointed.

0:21:16.436 --> 0:21:19.116
<v Speaker 2>And so there is this big moment right after a

0:21:19.156 --> 0:21:21.956
<v Speaker 2>few years at one of the contests, there's a new

0:21:22.036 --> 0:21:24.956
<v Speaker 2>model in three years. So tell me about that moment.

0:21:25.636 --> 0:21:30.676
<v Speaker 3>So the result of two thousand and twelve came in

0:21:31.036 --> 0:21:34.236
<v Speaker 3>and we saw this result coming out of Professor Geoff

0:21:34.276 --> 0:21:39.236
<v Speaker 3>Hinton's lab using a neural network, and the difference, the

0:21:39.956 --> 0:21:45.396
<v Speaker 3>error reduction compared to previous years was just much bigger,

0:21:46.196 --> 0:21:49.636
<v Speaker 3>you know, and we started to realize this is a

0:21:49.756 --> 0:21:58.876
<v Speaker 3>very very significant moment because there's a serious, serious breakthrough

0:21:59.436 --> 0:22:05.196
<v Speaker 3>in terms of the results of ImageNet, which is

0:22:05.236 --> 0:22:08.636
<v Speaker 3>the north star problem. Right. So it was so important

0:22:08.676 --> 0:22:13.716
<v Speaker 3>for me that I, you know, bought a last-minute plane

0:22:13.756 --> 0:22:20.076
<v Speaker 3>ticket to fly to Italy to announce the ImageNet Challenge

0:22:20.116 --> 0:22:22.716
<v Speaker 3>winner that year.

0:22:22.756 --> 0:22:24.396
<v Speaker 2>And you weren't going to go otherwise?

0:22:25.036 --> 0:22:27.396
<v Speaker 3>I wasn't planning to go because I was still a

0:22:27.516 --> 0:22:31.636
<v Speaker 3>nursing mom, so I was taking you know, I was

0:22:32.236 --> 0:22:35.556
<v Speaker 3>mostly working from home in that month.

0:22:36.196 --> 0:22:39.996
<v Speaker 3>But I was like, this is so important that I

0:22:40.076 --> 0:22:40.796
<v Speaker 3>needed to go.

0:22:41.236 --> 0:22:44.796
<v Speaker 2>And so I mean, so this was someone working with

0:22:45.036 --> 0:22:48.716
<v Speaker 2>Geoff Hinton and using a neural network. Like, today, Geoff Hinton,

0:22:48.796 --> 0:22:51.596
<v Speaker 2>you know, if you know two names in AI, Geoff

0:22:51.636 --> 0:22:53.396
<v Speaker 2>Hinton is probably one of them. People call him the

0:22:53.396 --> 0:22:56.756
<v Speaker 2>godfather of kind of modern AI, right, and neural networks

0:22:56.956 --> 0:22:59.796
<v Speaker 2>are essentially the thing that has worked right both for

0:22:59.996 --> 0:23:04.076
<v Speaker 2>vision and for language. You know, ChatGPT's a neural network,

0:23:04.836 --> 0:23:06.876
<v Speaker 2>and so this was a moment that was like, oh, this

0:23:06.956 --> 0:23:08.876
<v Speaker 2>technique that a lot of people thought wasn't going to

0:23:08.916 --> 0:23:11.076
<v Speaker 2>work, had kind of given up on, it's back.

0:23:11.556 --> 0:23:15.756
<v Speaker 3>Yeah, yeah, exactly. I think it's actually a parallel story

0:23:15.796 --> 0:23:23.516
<v Speaker 3>of two groups of people that had that determination seeing

0:23:23.596 --> 0:23:28.916
<v Speaker 3>something that, you know, maybe the mainstream wasn't seeing, and

0:23:28.956 --> 0:23:36.316
<v Speaker 3>then had the resilience and just perseverance to keep marching on.

0:23:36.636 --> 0:23:39.836
<v Speaker 3>I was doing my North Star pursuit. I was doing

0:23:39.836 --> 0:23:43.156
<v Speaker 3>the big data approach. They were doing the neural network algorithm,

0:23:43.436 --> 0:23:44.556
<v Speaker 3>and then we converged.

0:23:45.316 --> 0:23:48.876
<v Speaker 2>Uh huh. That's really elegant, right, because it's like your

0:23:48.916 --> 0:23:52.076
<v Speaker 2>big data is just sitting there and you don't maybe

0:23:52.236 --> 0:23:54.916
<v Speaker 2>entirely know it, but you kind of need a neural

0:23:54.956 --> 0:23:58.036
<v Speaker 2>network to come in and train on it. Right, And

0:23:58.076 --> 0:24:00.396
<v Speaker 2>they're over there building their neural network and they may

0:24:00.476 --> 0:24:01.996
<v Speaker 2>or may not know it, but they need the big

0:24:02.076 --> 0:24:04.836
<v Speaker 2>data that you're over here building, and then when it

0:24:04.836 --> 0:24:06.196
<v Speaker 2>comes together, it's like, hey.

0:24:06.076 --> 0:24:10.436
<v Speaker 3>It works. Yeah. So I think that's how science progresses.

0:24:10.636 --> 0:24:15.356
<v Speaker 3>It's kind of spiraling up, and sometimes it takes a

0:24:15.396 --> 0:24:18.716
<v Speaker 3>couple more threads. It's not a single spiral. I

0:24:18.876 --> 0:24:25.116
<v Speaker 3>remember very vividly that one of the critiques,

0:24:25.196 --> 0:24:28.236
<v Speaker 3>one of the main critiques of ImageNet by my

0:24:28.356 --> 0:24:32.196
<v Speaker 3>colleagues is this is too big. We cannot even fit

0:24:32.276 --> 0:24:34.996
<v Speaker 3>this into memory. What are you doing? What are you

0:24:35.116 --> 0:24:39.396
<v Speaker 3>making this giant data set for when we cannot even

0:24:40.636 --> 0:24:44.196
<v Speaker 3>you know, put it on a chip. And as that

0:24:44.276 --> 0:24:46.436
<v Speaker 3>was happening, GPU was happening.

0:24:46.476 --> 0:24:50.676
<v Speaker 2>So GPUs are the Nvidia chips that have now made

0:24:50.716 --> 0:24:52.556
<v Speaker 2>Nvidia one of the biggest companies in the world.

0:24:52.756 --> 0:24:55.476
<v Speaker 2>But they were figuring out that GPUs are particularly good

0:24:55.516 --> 0:24:57.196
<v Speaker 2>for the neural network.

0:24:56.956 --> 0:24:59.076
<v Speaker 3>Exactly, exactly.

0:24:58.876 --> 0:25:01.876
<v Speaker 2>So that moment, this moment when you guys come together

0:25:01.956 --> 0:25:04.556
<v Speaker 2>and kind of create this you know, new era of

0:25:04.596 --> 0:25:07.836
<v Speaker 2>computing really that we're still living in of AI is

0:25:07.876 --> 0:25:13.556
<v Speaker 2>about ten years ago now, right, yeah, so just bring

0:25:13.596 --> 0:25:16.636
<v Speaker 2>me to the present. Like that happened, then where are

0:25:16.636 --> 0:25:19.116
<v Speaker 2>we now? I mean, it's it's kind of the same universe, right,

0:25:19.156 --> 0:25:21.596
<v Speaker 2>It has advanced a lot, but the basic premise of

0:25:22.436 --> 0:25:27.636
<v Speaker 2>you have neural networks training on vast, vast databases of images, like,

0:25:27.836 --> 0:25:29.316
<v Speaker 2>it's basically the same.

0:25:29.196 --> 0:25:32.636
<v Speaker 3>Right. So from a conceptual point of view, you're right,

0:25:32.716 --> 0:25:35.596
<v Speaker 3>at that time, I was downloading the Internet of images

0:25:35.876 --> 0:25:38.596
<v Speaker 3>to be honest. Now, the Internet of Images is just

0:25:38.636 --> 0:25:41.516
<v Speaker 3>so vast, I don't know who can download it all.

0:25:42.036 --> 0:25:48.276
<v Speaker 3>And then the GPUs are mind-bogglingly advanced. Right, but

0:25:48.316 --> 0:25:50.596
<v Speaker 3>you're right, the ingredients are still the same.

0:25:54.036 --> 0:25:56.196
<v Speaker 2>We'll be back in a minute with the lightning round.

0:26:05.796 --> 0:26:09.356
<v Speaker 2>Let's do a lightning round. Okay, what's one thing you

0:26:09.436 --> 0:26:10.796
<v Speaker 2>learned running a dry cleaning shop?

0:26:13.996 --> 0:26:17.836
<v Speaker 3>That's... I have to think. I think

0:26:17.876 --> 0:26:24.196
<v Speaker 3>I learned resilience because my goal is to be a scientist.

0:26:24.316 --> 0:26:27.196
<v Speaker 3>But if it takes running a dry cleaner shop to

0:26:27.476 --> 0:26:31.916
<v Speaker 3>get there in the most detoured way, I'll have to

0:26:31.956 --> 0:26:32.276
<v Speaker 3>do that.

0:26:32.796 --> 0:26:35.636
<v Speaker 2>So you write in the book about a high school teacher

0:26:35.636 --> 0:26:38.076
<v Speaker 2>who was a very big and important influence on you,

0:26:38.236 --> 0:26:41.036
<v Speaker 2>and how your advisors in grad school were an important influence.

0:26:41.076 --> 0:26:44.076
<v Speaker 2>And now you have been a mentor to many people.

0:26:44.396 --> 0:26:47.116
<v Speaker 2>So I'm curious, what's one tip for finding a mentor?

0:26:48.836 --> 0:26:55.516
<v Speaker 3>For finding a mentor? That's a great question. I trusted

0:26:55.596 --> 0:27:01.676
<v Speaker 3>them. At different stages, this trust meant different things. I trusted

0:27:01.716 --> 0:27:08.836
<v Speaker 3>their genuine intention, I trusted their wisdom, I trusted their

0:27:08.916 --> 0:27:16.876
<v Speaker 3>vas and I trusted their, you know, belief in me.

0:27:17.556 --> 0:27:21.676
<v Speaker 3>So that was how I was lucky to find my mentors.

0:27:22.556 --> 0:27:24.316
<v Speaker 2>What's one tip for being a mentor?

0:27:25.116 --> 0:27:32.116
<v Speaker 3>Being a mentor is really about respecting the person,

0:27:32.276 --> 0:27:36.316
<v Speaker 3>the soul, and helping them to find their North Star,

0:27:36.436 --> 0:27:37.556
<v Speaker 3>to find their passion.

0:27:39.076 --> 0:27:41.316
<v Speaker 2>If everything goes well, what problem will you be trying

0:27:41.316 --> 0:27:42.996
<v Speaker 2>to solve in five years?

0:27:43.836 --> 0:27:50.636
<v Speaker 3>I'm trying to usher in machines being so helpful and

0:27:50.756 --> 0:27:56.596
<v Speaker 3>collaborative for humans, whether it's productivity or our wellbeing. If

0:27:56.596 --> 0:28:03.516
<v Speaker 3>this includes sensors, smart sensors, virtual agents, or real robots,

0:28:04.276 --> 0:28:07.796
<v Speaker 3>I think it all. You know, I'm very excited by that.

0:28:13.556 --> 0:28:15.916
<v Speaker 2>Fei-Fei Li is a professor of computer science at

0:28:15.956 --> 0:28:19.236
<v Speaker 2>Stanford and the author of the book The Worlds I See.

0:28:20.636 --> 0:28:24.876
<v Speaker 2>Today's show was produced by Edith Russolo and Gabriel Hunter Cheng.

0:28:25.396 --> 0:28:29.276
<v Speaker 2>It was edited by Karen Shakerdge and engineered by Sarah Bruguiere.

0:28:29.756 --> 0:28:33.036
<v Speaker 2>You can email us at problem at Pushkin dot FM.

0:28:33.716 --> 0:28:36.276
<v Speaker 2>I'm Jacob Goldstein, and we'll be back next week with

0:28:36.356 --> 0:28:36.636
<v Speaker 2>another

0:28:36.676 --> 0:28:37.556
<v Speaker 1>episode of What's Your Problem.