Speaker 1: Pushkin.

Speaker 2: Over the past year, we've heard a lot about artificial intelligence models that are really good at manipulating language. We've heard somewhat less about AI that deals with images. It's called computer vision, and it's a huge deal, which you know, obviously. Like language, vision is this core part of the experience of being human. And on a more practical level, computer vision is key for self-driving cars, and for drones, and for all kinds of industrial robots. As it turns out, there was this one key moment in the development of modern AI, for both vision and language, and if you understand this moment, you understand a lot about how AI works today. I'm Jacob Goldstein. This is What's Your Problem, the show where I talk to people who are trying to make technological progress. My guest today played a central role in that key moment in AI history. Her name is Fei-Fei Li. She's a Stanford computer scientist, the author of a memoir called The Worlds I See, the former chief scientist of AI and machine learning at Google, and just generally one of the most important innovators in the history of computer vision. I started our conversation with really a pretty general question. I asked Fei-Fei just to explain what computer vision is and why it's so important.

Speaker 3: So computer vision is about enabling computers and machines to have visual intelligence. What is visual intelligence? Well, the best example comes from humans, who are extremely visually intelligent animals. We can make an omelet by knowing what is in our fridge. How do we go and take the egg out? How do we take the tomato out? How do we plan the cooking of the omelet? How do we interact with every ingredient, and how do we understand all the changes of the objects? All this is part of visual intelligence.
Speaker 2: Yeah, I mean, you write in your book that vision isn't just an application of our intelligence, it is synonymous with our intelligence, which is something I want to talk more about. But before we get into human vision and how that led you into computer vision, just give me a sense of some of the applications, both the current applications of computer vision and potential future applications of computer vision.

Speaker 3: In fact, we're already using computer vision to do a lot of things. The most obvious example is all kinds of driver's assistance programs. Right, we're not even talking about self-driving cars. We're talking about lane detection, or talking about avoiding curbs, or pedestrian alerts. You know, we are using computer vision in our healthcare system, in radiology, in pathology. Or, you know, in protecting species. A lot of the camera traps in the deep forests are using computer vision to track different animals. So we're using computer vision already on a daily basis.

Speaker 2: And then when you dream of some applications that are not here yet but that might be here in, whatever, five or ten years, what do you think of? What's at the top of the list?

Speaker 3: So when I dream of computer vision, I dream of all kinds of robotic applications, from self-driving cars to personal robots using computer vision. I dream of our biodiversity being mapped using computer vision. I dream of exploration using computer vision.

Speaker 2: Wonderful. So I want to talk about your work in computer vision, which goes back, well, decades now. And I want to start with work not on computers, actually, but on human beings, right? On understanding of how humans process visual information, right, how we make sense of what we're seeing. And in the book, you write in particular about this nineteen ninety six paper with a boring name that was a huge deal. It was called "Speed of Processing in the Human Visual System." Tell me about that paper and what it meant.
Speaker 3: It's a paper using EEG, which is recording electrical brain waves, to make a link to how fast humans can make a very complex visual decision when they see something. And the particular decision humans were to make is to separate images containing animals from images not containing animals. And if you think about it, the pool of possibilities is extremely complex. It's actually mathematically just an infinite possibility, because there are so many different types of animals, so many different types of non-animals. That's infinite for practical purposes. And then you put them in photos, you can get infinite possibilities of photos. Yet you show them one by one to humans, they make decisions really quickly, and they make correct decisions really quickly.

Speaker 2: Not just really quickly, but, like, mind-bogglingly quickly. At the time, right, it was shocking just how fast it was, right? Milliseconds.

Speaker 3: Yeah. So the thing is, we kind of sort of know we're good at seeing, right, as a species. We know we open our eyes, we see the world, but we don't really know how good and how fast.

Speaker 2: And this is, we underestimate. It's a rare case where human beings underestimate ourselves.

Speaker 3: Exactly. This is a rigorous scientific study that put a time, an actual time, to that speed of visual intelligence, and it's using modern techniques. It's very smart and very, very exciting.

Speaker 2: What did it mean to you when you saw that result? When you read that paper?

Speaker 3: When I read that paper, it meant North Star. Let me explain what North Star means.

Speaker 2: Okay.

Speaker 3: Yeah. As a scientist, I'm driven by finding answers to the most audacious questions. But as Einstein has said, in scientific inquiry, the hardest job is not finding solutions, it's asking the right question. Because, you know, when we talk about visual intelligence, it's such a vast topic. What is the topic to pursue? What is the question to ask that is fundamental to visual intelligence?
Speaker 3: And how do we unlock it? When we read that Simon Thorpe paper, it convinced me that complex object categorization, the ability to classify, you know, animal versus no animal, chair versus table, hot dog versus hamburger, you know, this is fundamental to humans. It's a building block of visual intelligence. It has a neural correlate in the human brain that shows how evolutionarily optimized it is. So with all that evidence, it convinced me object categorization is a North Star to pursue.

Speaker 2: And you were a grad student at the time, right? This is sort of the thing that any ambitious grad student is going to be doing. It's like, I know I'm interested in this field, but I need my question, I need my thing, right? And so now you've got your thing.

Speaker 3: Yes.

Speaker 2: And it's categorization in particular. And you describe how earlier theories of how humans process visual input were not so categorization-focused, right? It was kind of like, if you just sort of thought from first principles, you would think, well, we see color and we see shapes and then we kind of make sense of it. But with this paper and related work, it shows that's actually not it, right? And in fact our brains, you write about how there are specific regions of the brain, like this region is just the face, the face categorization region, and this region is the, like, places we go all the time region. And so it's a really different and interesting way of thinking about seeing, and it's fundamentally about just incredibly quickly putting things into categories. And so you decide to take this idea of vision and categorization and try and figure out how to get computers to do this, right? How to get computers to be able to categorize objects from the world.
Speaker 2: And you start building these data sets, essentially, of labeled images, right? And you build what seems in retrospect like a relatively small one at Caltech, and then you decide to build a really big one, right? It comes to be called ImageNet. It's a thing you're famous for, nerd famous for. And I want to talk about building ImageNet, right? So tell me about deciding to build what becomes ImageNet.

Speaker 3: So ImageNet is the North Star to me. I was in the field long enough, because I finished my PhD, I started my own lab. I had this unwavering faith and belief that, you know, unlocking object recognition is a North Star, is a critical North Star. And I became impatient because I realized we were not making enough progress. I realized that, especially algorithmically, we were like running in circles a little bit, optimizing very small algorithms that are not really getting to the essence of the problem. And part of the essence, which a lot of people overlooked, is actually the scale of the problem. What was really bothering me is that we were not seeing the problem, we're not seeing the mathematical problem, with the scale thinking. Because it's not just about being big. It's about the mathematical reason of why we should go big, and it's a very deep reason. In general, it's a reason for what we call generalization. You have to learn enough to be able to see everything, and that means...

Speaker 2: You've got to see a lot of pictures of things that are cats and not cats to understand what a cat is.

Speaker 3: Right. That mindset, that's a big data mindset, was just not in the world at all at that time.

Speaker 2: So how did you get there? Because what you end up doing is building this just gargantuan, uh, thing full of labeled images, bigger than anybody'd ever built before. Like, how did you arrive at that?

Speaker 3: That's a great question.
Speaker 3: I think that's actually the most fun but difficult part of the book to write, you know, like digging into my own brain. In hindsight, it's just little by little, the insight and the realization, the epiphany. But honestly, I don't know how to analyze my own brain. I had the mathematical intuition that scale makes a difference, a bigger difference than most people give credit to. I also had the neurocognitive science inspiration that early human development was exposure to the world in continuous ways. We don't, like, lock the baby in a dark room and show them, you know, one hundred cats. They just go out and experience. You know, that experience is actually driven by big data. Maybe I was also inspired by this Internet age coming our way, right? Like, that part, I do think it's a little bit a moment of just being alone, and somehow all the stars aligned in my head. I decided I'm going to try the craziest thing, and I did have a faith and belief that it was the right thing to do.

Speaker 2: And specifically, like, what was this thing that you were going to build?

Speaker 3: I'm gonna get the entire Internet of images, consisting of all the objects I can get my hands on that humans have ever taken pictures of, and catalog them in a gigantic, big database. And I will use that to do two things: to train machines to recognize the entire world of objects, and also to benchmark everybody's progress. You know, everybody, I mean the international community of computer vision scientists.

Speaker 2: So you will have this database, and then everyone can train their computer vision models on your database and see how they do on new images.

Speaker 3: Yes.

Speaker 2: So you have to decide. There's this interesting part of the book where you're like, okay, I want to build a database with everything in it. How many categories of everything are there? Right? Somebody's actually done that research. If you take all the things, how many kinds of things are there?
Speaker 2: What's the number?

Speaker 3: The number is the Biederman number. And the Biederman number is... I'm proud of really giving Professor Biederman that credit. Yeah, nobody noticed that number. He's a cognitive scientist who wrote a very, very good, but I don't think famous, paper in the nineteen eighties, guesstimating, or estimating with a back-of-the-envelope computation, how many visual concepts humans see. And that is a very hard number. How do you interrogate a person and say, list me all the visual concepts? It's impossible. But he had a way of using the dictionary and using visual structure to estimate, and he put a number of thirty thousand visual concepts.

Speaker 2: There are thirty thousand different sort of kinds of things, right, that people can identify, differentiate. Yeah, it's a lot.

Speaker 3: That's a lot.

Speaker 2: Yeah. And every concept you're setting out... if you're setting out, sorry, is that your number? Is that your number?

Speaker 3: That was my number. I was obsessed with that number, and I was obsessed in a way that I feel I was kind of crazy, because nobody was obsessed with that number. Nobody even knew it. I think my book is the book that gave the number a name, which is the Biederman number, and I'm very proud of that.

Speaker 2: Can you just rattle off some of the categories?

Speaker 3: Star-nosed mole. Star-nosed mole, a category unto itself. That's my favorite, one of my favorite categories. And gardenia, the flower. Windsor chair. There were hundreds of dogs. I remember there were different kinds of cars, like sports sedan, and monocycles. It's a lot.

Speaker 2: So Fei-Fei has her number, she has her big idea. She knows what she needs to build: a gigantic image database. But how do you actually do that?

Speaker 1: We'll have the answer in just a minute.

Speaker 2: Okay, so you've got your giant North Star task ahead of you.
Speaker 2: Not only do you have, you know, thirty thousand-ish categories to deal with, presumably for each category you need many, many thousands of images. So it's thousands of images per category, tens of thousands of categories. What is the order of magnitude?

Speaker 3: We're talking about tens of millions.

Speaker 2: Tens of millions. And this is not a time where you can do this in an automated or semi-automated way like you could now.

Speaker 3: No. I mean, the point is, the machines cannot do it. This is a North Star to push machines towards that. So you have to do it by human hand.

Speaker 2: Downloading and labeling, yeah, millions or tens of millions of images.

Speaker 3: Downloading, cleaning, labeling. Yes, that was the task.

Speaker 2: So now you're like Henry Ford or something, right? Now you need an assembly line, you need a factory for creating this database.

Speaker 3: Yeah, you can put it that way. And we needed a global workforce, and eventually we found them on Amazon Mechanical Turk. It's an online global market.

Speaker 2: It's a market for project-based work, right? People doing project-based work. And so, so how long does it take you to build this thing? And how big is it when it's done?

Speaker 3: It took us three years. When it was done, it's fifteen million hand-cleaned, sorted, curated, labeled images across twenty-two thousand categories.
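As a quick back-of-the-envelope check on that order of magnitude, here is a minimal sketch in Python. Only the fifteen million and twenty-two thousand figures come from the conversation; the per-category average is an assumption chosen to match them.

```python
# Order-of-magnitude check on the ImageNet numbers quoted above:
# tens of thousands of categories, each with hundreds to thousands
# of hand-labeled images, lands in the tens of millions.
categories = 22_000        # categories in the finished dataset
images_per_category = 680  # assumed average (~15M / 22K), for illustration
total = categories * images_per_category
print(f"{total:,} images")  # 14,960,000 -> "tens of millions"
```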
Speaker 2: So now you have this thing. It's called ImageNet, and basically the function of it is... it itself is not useful, right? It is there to train... well, it's useful as a means to an end. It's there for people who have models that aim to teach computers to see and understand, to train their models. Now there's this giant database. I mean, people talk about this as kind of one of the beginnings of big data.

Speaker 3: Yes. Yeah, I think it should be properly recognized as the beginning of big data in AI, because before this, there isn't this concept of big data in AI. It was just a paradigm shift from that point of view.

Speaker 2: And so you create this contest where people can come and train their models on ImageNet, on this giant database that you've built. And then, in the contest, their models will be shown new images, images not in the database, and you'll see how good they are. And for a while it's, like, going okay, right? But kind of slow. Like, in the book you write about how you get a little worried. You've built this giant thing with people all around the world, and it's not, for a while, leading to the breakthroughs that you had imagined.

Speaker 3: Yeah. First of all, we open-sourced this. Even though we spent a lot of sweat and tears, you know, building this, we knew the real value is to open-source it. So we gave it for free to the whole community. And then I wanted everybody to use it. I wanted to see this driving all of us towards the North Star. I wanted the field to work on it. But it wasn't like an overnight success. It wasn't like everybody's running around saying, oh my god, there's ImageNet to use. And of course we were... you know, we were disappointed, but we were not sitting there crying. We were just disappointed.
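To make the contest setup described above concrete, here is a minimal sketch of an ImageNet-style benchmark: models train on the labeled database, then get scored on held-out images they have never seen. This uses today's PyTorch and torchvision conventions purely for illustration; it is not the original challenge code, and the data path and model choice are hypothetical placeholders.

```python
import torch
from torchvision import datasets, models, transforms

# Labeled images arranged one folder per category, ImageNet-style.
# "data/val" is a hypothetical path to held-out evaluation images.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
val_set = datasets.ImageFolder("data/val", transform=preprocess)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=64)

# A stand-in model; the training loop itself is omitted, since the
# point here is the scoring protocol.
model = models.resnet18(num_classes=len(val_set.classes))
model.eval()

# Benchmark score = fraction of unseen images assigned the right category.
correct = total = 0
with torch.no_grad():
    for images, labels in val_loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"top-1 accuracy: {correct / total:.1%}")
```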
Speaker 2: And so there is this big moment, right? After a few years, at one of the contests, there's a new model. So tell me about that moment.

Speaker 3: So the results of two thousand and twelve came in, and we saw this result coming out of Professor Geoff Hinton's lab using a neural network, and the error reduction compared to previous years was just much bigger, you know. And we started to realize this is a very, very significant moment, because there's a serious, serious breakthrough in terms of the results on ImageNet, which is the North Star problem, right? So it was so important for me that I, you know, bought a last-minute plane ticket to fly to Italy to announce the ImageNet Challenge winner that year.

Speaker 2: And you weren't going to go otherwise.

Speaker 3: I wasn't planning to go, because I was still a nursing mom. So I was, you know, mostly working from home in that month. But I was like, this is so important that I needed to go.

Speaker 2: And so, I mean, so this was someone working with Geoff Hinton and using a neural network. Like, today, Geoff Hinton... you know, if you know two names in AI, Geoff Hinton is probably one of them. People call him the godfather of kind of modern AI, right? And neural networks are essentially the thing that has worked, right, both for vision and for language. You know, ChatGPT is a neural network. And so this was a moment where it was like, oh, this technique that a lot of people thought wasn't going to work, had kind of given up on, is back.

Speaker 3: Yeah, yeah, exactly. I think it's actually a parallel story of two groups of people that had that determination, seeing something that, you know, maybe the mainstream wasn't seeing, and then had the resilience and just perseverance to keep marching on. I was doing my North Star pursuit, I was doing the big data approach. They were doing the neural network algorithm, and then we converged.
Speaker 2: That's really elegant, right? Because it's like, your big data is just sitting there, and you don't maybe entirely know it, but you kind of need a neural network to come in and train on it, right? And they're over there building their neural network, and they may or may not know it, but they need the big data that you're over here building. And then when it comes together, it's like, hey...

Speaker 3: It works. Yeah. So I think that's how science progresses. It's kind of spiraling up, and sometimes it takes a couple more threads. It's not a single spiral. I remember very vividly that one of the critiques, one of the main critiques of ImageNet by my colleagues, is: this is too big. We cannot even fit this into memory. What are you doing? What are you making this giant data set for, when we cannot even, you know, put it on a chip? And as that was happening, GPUs were happening.

Speaker 2: So GPUs are the chips made by Nvidia, now one of the biggest companies in the world. But they were figuring out that GPUs are particularly good for the neural networks.

Speaker 3: Exactly, exactly.

Speaker 2: So that moment, this moment when you guys come together and kind of create this, you know, new era of computing, really, that we're still living in, of AI, is about ten years ago now, right? Yeah. So just bring me to the present. Like, that happened, then where are we now? I mean, it's kind of the same universe, right? It has advanced a lot, but the basic premise of, you have neural networks training on vast, vast databases of images, like, it's basically the same.

Speaker 3: Right. So from my conceptual point of view, you're right. At that time, I was downloading the Internet of images, to be honest. Now the Internet of images is just so vast, I don't know who can download it all. And then the GPUs are mind-bogglingly advanced, right? But you're right, the ingredients are still the same.
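A minimal sketch of why GPUs fit neural networks so well: a network layer is mostly one big parallel array operation, exactly the workload graphics chips were built for. The snippet below uses modern PyTorch as a stand-in; the 2012 system used its own hand-written GPU code, so this only illustrates the idea.

```python
import torch

# A toy convolutional layer applied to a batch of images.
conv = torch.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
images = torch.randn(32, 3, 224, 224)  # batch of 32 RGB images

# The same computation runs on CPU or GPU; moving it to the GPU's
# massively parallel hardware is what made training on
# ImageNet-scale data practical.
if torch.cuda.is_available():
    conv, images = conv.cuda(), images.cuda()

out = conv(images)
print(out.shape)  # torch.Size([32, 64, 222, 222])
```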
Speaker 2: We'll be back in a minute with the lightning round.

Speaker 2: Let's do a lightning round. Okay. What's one thing you learned running a dry cleaning shop?

Speaker 3: Ha, that's a very... I have to think. I think I learned resilience, because my goal is to be a scientist. But if it takes running a dry cleaning shop to get there, in the most detoured way, I'll have to do that.

Speaker 2: So you write in the book about a high school teacher who was a very big and important influence on you, and how your advisors in grad school were an important influence. And now you have been a mentor to many people. So I'm curious, what's one tip for finding a mentor?

Speaker 3: For finding a mentor? That's a great question. I trusted them. At different stages, this trust meant different things. I trusted their genuine intention, I trusted their wisdom, I trusted their values, and I trusted their, you know, belief in me. So that was how I was lucky to find my mentors.

Speaker 2: What's one tip for being a mentor?

Speaker 3: Being a mentor is really about respecting the person, the soul, and helping them to find their North Star, to find their passion.

Speaker 2: If everything goes well, what problem will you be trying to solve in five years?

Speaker 3: I'm trying to usher in machines being so helpful and collaborative for humans, whether it's productivity or our well-being. Whether this includes sensors, smart sensors, virtual agents, or real robots, I think it all... you know, I'm very excited by that.

Speaker 2: Fei-Fei Li is a professor of computer science at Stanford and the author of the book The Worlds I See. Today's show was produced by Edith Russolo and Gabriel Hunter Chang. It was edited by Karen Chakerji and engineered by Sarah Bouger. You can email us at problem at Pushkin dot FM. I'm Jacob Goldstein, and we'll be back next week with another episode of What's Your Problem.