Speaker 1: Pushkin.

Speaker 1: For a long time now, we've had a lot of technological innovation in virtual things, in bits, you know: the Internet, digital images, large language models, etc. We have had noticeably less innovation in actual things, things made of atoms, things that would hurt if you dropped them on your foot. Now that seems to be changing. People are using innovations in bits, improvements in computing and communications and AI, to drive innovation in actual things, everything from batteries to garbage cans to airplanes. Next up: robots. I'm Jacob Goldstein, and this is What's Your Problem, the show where I talk to people who are trying to make technological progress. My guest today is Peter Chen. He's the co-founder and CEO of Covariant. Peter's work at Covariant was partly inspired by the work of Fei-Fei Li, who, coincidentally, is the AI researcher I interviewed just last week on the show. Peter's problem is this: how do you take the AI breakthroughs of the past decade or so and make them work in robots?

Speaker 2: To really tell the story of robotics, we have to tell the story of robotics even without AI. Robotics, for a very long time, is a field that you would actually find in mechanical engineering departments of universities. It's largely a hardware problem. It's a control problem: how can you design the motor well? How can you design the gearbox well?

Speaker 1: Yeah, right.

Speaker 2: Can you design the control algorithm so that you can get the robot to an exact XYZ location in the 3D physical world, without oscillating around?

Speaker 1: Making the thing move. How do you build the parts that make the thing move the way we...

Speaker 2: Want it to move, exactly.
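To make the "control problem" concrete, here is a minimal sketch of the kind of feedback loop being described: a toy PID controller driving a single axis to a setpoint without oscillating. The gains, dynamics, and numbers are hypothetical illustrations, not anyone's production code.

```python
# A toy sketch of the classical control problem: drive a single axis toward
# a target position without overshooting and oscillating.

def pid_step(target, position, velocity, integral, dt,
             kp=8.0, ki=0.5, kd=6.0):
    """One PID update; returns (control_force, updated_integral)."""
    error = target - position
    integral += error * dt
    # The derivative term (here, measured velocity) is what damps oscillation.
    force = kp * error + ki * integral - kd * velocity
    return force, integral

# Simulate a unit-mass axis being driven from x = 0 to x = 1.
x, v, integral, dt = 0.0, 0.0, 0.0, 0.01
for _ in range(1000):
    force, integral = pid_step(1.0, x, v, integral, dt)
    v += force * dt   # toy dynamics: F = m * a with m = 1
    x += v * dt
print(f"final position: {x:.4f}")  # settles near 1.0 with well-tuned gains
```

With badly tuned gains, the simulated axis rings around the target instead of settling, which is exactly the oscillation problem decades of control engineering went into solving.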
Speaker 2: It's all about telling this piece of machinery that we call a robot to do exactly the thing that we tell it to do, which turned out to be, obviously, a fairly difficult engineering problem, and that's why people have worked on it for many decades. But it has gotten really good.

Speaker 1: And so this is like the classic kind of image you see from a car assembly line, of a robot arm, you know, welding a part onto the body of a car, again and again and again, all day long.

Speaker 2: Yeah, exactly right.

Speaker 1: So robots are clearly good at welding the same part onto the same car a million times. What are the limits of that approach? What were the problems people were bumping up against?

Speaker 2: Yeah, so the problem is that in order to use that kind of robotics, it puts really big limitations on your environment, right? You basically need to be able to reduce your task to be solvable by repeated motion. And so if you look at how these kinds of assembly lines that use classical robots work, they always feed the material into exactly the same place. So no matter how things came in from their suppliers and whatnot, you always need to load them up in exactly the same way, because there's really no adaptivity at all in these robots. They're just executing the same thing again and again.

Speaker 1: It just has to be... they're very precise, but their whole environment has to be super homogeneous, the same every time.

Speaker 2: Exactly. So that's problem one, and that's very difficult. Not everything can be reduced that way.
Speaker 2: The second problem with it is, even in the case that you can reduce the problem to that kind of pure mechanical repeated motion, it's still very expensive, because you still need to program a robot to do that one specific task, and if you change your task slightly, you need to reprogram everything, typically from scratch. And that means robots are not just extremely rigid, which limits the range of capabilities that they can reach and do; they're also very expensive, even on that very fixed, rigid capability.

Speaker 1: Right. And so you need something that's the same every time, and you need to be doing a lot of it, because otherwise the economy of scale just doesn't work out. It's too expensive to try and get the robot to do something else.

Speaker 2: Exactly.

Speaker 1: So I know that as a student, as an undergrad and a grad student, if I have it right, you worked in the lab of this professor at Berkeley who for a long time had been trying to teach robots to fold towels.

Speaker 2: Yes.

Speaker 1: Which is an amazing problem, because it's one of those ones that seems so simple, right? It seems way easier than riveting parts onto a car or whatever, but it turned out to be in fact much harder for robots. And I feel like that's telling. Why was that so hard for robots? And what do we learn from that?

Speaker 2: When you do welding on a car body, as we have discussed, you can reduce the problem to just simple mechanical repeated motion. But because a piece of fabric is flexible, is deformable, it can come in many, many different kinds of shapes. It has many different possibilities.

Speaker 1: It's much more complex than a car body, weirdly. Not intuitively, but when you think about it, it's like, oh, it could have a vast number of possibilities.

Speaker 2: It has a lot more possibilities in how it can present itself to you.
Speaker 2: And exactly because of that, recall the first limitation of traditional robotics, which is that it can only work with problems that can be reduced to repeated motion. Towel folding, and folding apparel items, is exactly one of those things that cannot be reduced.

Speaker 1: Because it's just a little bit different every time. The towel is a little bit of a different shape. It might be sitting on the table and it'll be folded over some weird...

Speaker 2: Exactly. If it's folded onto itself, how much is it folded onto itself? How much wrinkle does it have? All of those make a big difference in terms of what the robot should do with it. So it's a really good example of something that traditional robotics cannot solve, and you really need AI to solve it.

Speaker 1: And when your co-founder started working on the problem, it was sort of before the kind of modern era of AI that we're living in now, right? And I read that there was this big moment in kind of the origin of the company, which was when this database, essentially, of labeled images called ImageNet was released. Which was interesting to me, because I just talked to Fei-Fei Li for this.

Speaker 2: Fei-Fei is actually one of the advisors and investors.

Speaker 1: Just a coincidence. For the record, she didn't put me in touch with you. But tell me about sort of why ImageNet was meaningful in the birth of your company.

Speaker 2: Yeah, there was actually a bigger lesson here than just robotics. The bigger lesson is that artificial intelligence is actually becoming simpler and simpler. When you look at the field of artificial intelligence fifteen, twenty years ago, it used to be many different subfields.
Speaker 2: The people that worked on robotics had a completely different tool set than the people that worked on computer vision, and the people that were working on computer vision had a different tool set from the people that were working on natural language processing.

Speaker 1: Well, and they feel different, right? Teaching computers to understand an image, you know, to sort of deal with an image of the world and understand what it means, feels quite different than teaching a computer to have a conversation. It's not obvious that you would use the same tools to do those.

Speaker 2: Exactly. And that definitely was the consensus. Like, why would it be the same?

Speaker 1: Why would it be the same?

Speaker 2: Yeah, why would it be the same? It feels very different. It feels like you need different kinds of data. And I would just say it: essentially, the field of AI is becoming more and more unified. The methodology, the model that you would use, is becoming more and more similar, and sometimes it's even the same, across these very different fields of robotics, computer vision, language.

Speaker 1: So it's basically, you build a neural network, and then you just train it on a bunch of images, or train it on a bunch of documents, and what you train it on is what determines sort of what it's good for. Is it like that, basically?

Speaker 2: It's like, instead of each subfield coming up with something different... think about it as hand-programmed intelligence, right? Like, let's try to break down what a sentence means, break it into different parts. Or when you approach a computer vision problem, let's try to come up with different features: some features represent an edge, some features represent the background.
168 00:09:24,236 --> 00:09:27,956 Speaker 2: Instead of like trying to manually program the kind of 169 00:09:28,036 --> 00:09:32,316 Speaker 2: quite human intelligence into AI, you're basically taking a step 170 00:09:32,356 --> 00:09:34,876 Speaker 2: back and say, I'm just going to create a very 171 00:09:34,916 --> 00:09:39,516 Speaker 2: flexible learning mechanism, which is an artificial new net, and 172 00:09:39,556 --> 00:09:41,436 Speaker 2: then we're just going to feed it a lot of data. 173 00:09:41,916 --> 00:09:44,796 Speaker 2: And if you have different types of problems that you're solving, 174 00:09:45,036 --> 00:09:48,316 Speaker 2: you're just feeding like this new net different types of data, 175 00:09:49,156 --> 00:09:51,476 Speaker 2: but they're still like really the same kind of mechanism. 176 00:09:51,516 --> 00:09:55,036 Speaker 2: Like then that is a drastic departure from how artificial 177 00:09:55,076 --> 00:09:56,836 Speaker 2: intelligence used to be done. 178 00:09:57,196 --> 00:10:00,316 Speaker 1: And so in that world, then the differentiation is just 179 00:10:00,436 --> 00:10:03,916 Speaker 1: in the data set that you are feeding the totally model. 180 00:10:03,916 --> 00:10:07,076 Speaker 2: Totally and in fact, like I mean, this is like 181 00:10:07,196 --> 00:10:10,156 Speaker 2: jumping way forward in time. We were still talking about 182 00:10:10,156 --> 00:10:13,156 Speaker 2: image net. But if you look at really the most 183 00:10:13,236 --> 00:10:17,196 Speaker 2: popular technologies today, these large language models. When you use 184 00:10:17,276 --> 00:10:22,236 Speaker 2: different companies large language models, it's really what you're saying, 185 00:10:22,276 --> 00:10:26,596 Speaker 2: like as if I'm using GPT chat, gipt four versus 186 00:10:26,676 --> 00:10:32,836 Speaker 2: using Google's bought backed by Germini versus Anthpics, Claude or 187 00:10:33,036 --> 00:10:34,836 Speaker 2: coheres command model. 188 00:10:34,636 --> 00:10:37,836 Speaker 1: You're just naming all the different big language large language models. Now. 189 00:10:37,916 --> 00:10:39,836 Speaker 2: Yeah, and like when you think about these different big 190 00:10:39,916 --> 00:10:43,236 Speaker 2: language models, you're just really referring to the different data 191 00:10:43,236 --> 00:10:44,956 Speaker 2: sets that's see behind. 192 00:10:44,636 --> 00:10:47,236 Speaker 1: That they were trained on. So this is interesting kind 193 00:10:47,276 --> 00:10:50,316 Speaker 1: of abstract big picture talk. I want to kind of 194 00:10:50,436 --> 00:10:54,596 Speaker 1: map this now onto the story of Covariate, the story 195 00:10:54,676 --> 00:10:57,756 Speaker 1: of your company. Right, so we're going back in time. Now, 196 00:10:58,956 --> 00:11:01,196 Speaker 1: tell me about the moment when you decided to start 197 00:11:01,196 --> 00:11:02,316 Speaker 1: the company. What's going on? 198 00:11:02,916 --> 00:11:04,876 Speaker 2: Yeah, so the moment that we decide to start a 199 00:11:04,956 --> 00:11:09,636 Speaker 2: company is you'll refer to this image net mode like 200 00:11:09,716 --> 00:11:14,036 Speaker 2: this like large data set that actually first time like 201 00:11:14,156 --> 00:11:18,796 Speaker 2: really taught people that you can train a network to 202 00:11:18,916 --> 00:11:21,116 Speaker 2: solve one specific task really work. 203 00:11:21,636 --> 00:11:27,916 Speaker 1: So this is twenty twelve. 
Speaker 1: So this is interesting, kind of abstract, big-picture talk. I want to map this now onto the story of Covariant, the story of your company. So we're going back in time now. Tell me about the moment when you decided to start the company. What's going on?

Speaker 2: Yeah, so the moment that we decided to start the company is, you referred to this ImageNet moment, this large data set that for the first time really taught people that you can train a network to solve one specific task really well.

Speaker 1: So this is twenty twelve. That's when a neural net trains on ImageNet and does a really good job, way better than any model has ever done, of identifying objects...

Speaker 2: In images, exactly right. That was really significant, because it means if you can collect a lot of data for a single task, and if you can get a group of PhDs to work on a model for that single task, you basically have AI. You can solve that task really well. I mean, you might need to iterate on your data, you might need to iterate on your algorithm, but ultimately you can solve that task really well.

Speaker 1: You're basically saying, if you can gather the data, if you can gather a shitload of data of whatever kind, then you can get AI around...

Speaker 2: For that kind, right. And that is why you saw artificial intelligence really start working after twenty twelve. Google, Facebook, all of these companies have a lot of AI-based applications. But they are largely not democratized, because in order to get any single AI working, you still need a lot of data and you still need a team of PhDs to work on it. So it was a huge breakthrough in AI, but it was not sufficient for really widespread usage, just because the barrier to create one AI is so high. Okay, and then comes the second inflection point in the history of AI, which is really the start of foundation models. I'm talking about the most initial version of GPT, right? These large language models that are trained on multiple tasks so that they're incredibly generalizable: you can ask one to do something new and it can do it really well. And also, it performs better at a single task than specialized models.
Speaker 1: And so just to be clear: until a few years ago, people thought, reasonably, that if you want to build AI to, whatever, translate language, you would work really hard on that. You would try and build an AI specifically designed to be really good at translating text from one language to another. But this really surprising result, this really surprising thing that emerged from just work people were doing, was that in fact, that's not the best way to get AI to translate language. It's: just throw everything you can, all the words on the internet, at an AI model, and just say, figure out everything about language, figure out how to answer questions about history, and figure out how to translate, and figure out how to give me a recipe for, you know, pasta. And it turns out that that technique gets you better results at each specific thing than trying to build a specialized model.

Speaker 2: Exactly. And that is really the magic of foundation models. And that's the thing that was not obvious to people outside of OpenAI for a very long time. And because we came from OpenAI, a lot of the founding team at Covariant came from OpenAI, we saw that insight earlier, and that insight allowed us to start Covariant to build foundation models for robotics way before other people even believed in the approach.

Speaker 1: Even people in the field.

Speaker 2: Even people in the field.

Speaker 1: So, yeah, when did you go to OpenAI? You went to work at OpenAI.

Speaker 2: We went to OpenAI when it was about ten-ish people, sometime in twenty sixteen.

Speaker 1: Okay. And when do you sort of personally have this realization? You're not the only one to have it, but when do you see the power of foundation models?
Speaker 2: There are two things to it. The first thing is that early on at OpenAI, we believed in the idea of scaling, really scaling up the model and scaling up the data sets, and you actually see models getting increasingly smarter as you scale them up. So one is that. And then the other one is, I would say we had conviction in foundation models for robotics probably earlier than foundation models for language. And this is the one key thing: if you think about building a large language model that tries to compress the whole internet of knowledge, you still need to compress many things that are not quite related to each other. Maybe you're browsing Wikipedia and you have to recite the composition of materials of soil on the moon, and you also need to learn how to play chess. There's really nothing in common between these two things, these two parts of the knowledge, but you are asking one AI model to learn all of them. And the thing that made a lot of sense to us is that there's only one physical world. Even when you have many different robots that need to do different things in different factories, different warehouses, they are still interacting with the same physical world. And so building a foundation model for robotics has this amazing property of grounding: no matter what kind of task you're asking this foundation model to learn, it's learning with the same set of physics, the same surroundings.

Speaker 1: The literal ground is what grounds the models.

Speaker 2: The literal ground, exactly.

Speaker 1: And the models have to understand just how the physical world works, that if you drop a thing, it will fall, etc.
Speaker 2: And if something is deformable, when you push it, it moves a little bit. If something is rigid, it slides. If something is rollable, it rolls away. These are the types of things that, no matter where you are on Earth and what type of robot body you're using, are the same. And so if you can have one single foundation model that can learn from all of these different data, it would be incredibly powerful.

Speaker 1: So just to state it clearly: you're at OpenAI, you're seeing the power of foundation models, you decide to leave and start the company that is now Covariant. What are you setting out to do when you start the company?

Speaker 2: Yeah. So when we started Covariant, we had this really strong conviction that there should be a future that has a lot of autonomous robots doing all the things that are repetitive, injury-prone, dangerous, and so that can really revolutionize the physical world, make it a lot more abundant. And to really enable that future of autonomous robots, you need really smart AI. And because of the insight that we just talked about, we believed that AI had to be a foundation model. We believed you should have a single model that learns from all these different robots together and becomes smarter together.

Speaker 1: So the basic idea, the dream, is to build one AI foundation model for robots, basically.

Speaker 2: Yeah.

Speaker 1: In the same way that you can ask ChatGPT anything in language and it can answer you in language about any different thing, you have a model where you just sort of make it be the brain of any robot, and that robot can sort of see the world and move and pick things up and behave in the world.
Speaker 2: Exactly. And there's one key problem. Unlike foundation models for language, where you can scrape the whole internet of text as your pre-training data, there's nothing equivalent in the case of robotics. I mean, there are some images online, there are some YouTube videos online, but by and large, they don't really give you the same type of data in the form of robots interacting with the world. And the big problem is that there are just not that many robots doing interesting things in the world. And so a big chunk of what we set out to build as a company is recognizing that we need to build a foundation model for robotics, and in order to build a foundation model for robotics, you need to have large data sets, and in order to create large data sets, you have to have robots that are actually creating value for customers, in production, at scale, because if you're only collecting data in your own lab, there's only so much data. And so I would say the last six years of Covariant have largely been focused on building autonomous robot systems that work really well for customers, that are doing interesting things at a level of autonomy and reliability that has not been hit before.

Speaker 1: In other words, Peter and his colleagues at Covariant built robot arms that businesses are paying to use out in the world. But to some extent, those robot arms are just a means to an end, because they're not just doing warehouse work. They're collecting more and more data that Peter and his colleagues are feeding back into their AI model to try and make it get better and better. In a minute: what those robot arms are actually doing out in the world, and what they're learning.
Speaker 1: So let's talk about what the robots you have built are doing out in the world. Besides generating the data that you will use to train the next generation of robots, what are they actually doing in the world right now, today?

Speaker 2: Yeah. So when we started the company, we surveyed the landscape pretty carefully, and we selected warehousing and logistics as the primary sector that we focus on today. When you are shopping online, when you click a button and something shows up the next day, there's a tremendous amount of complexity behind that, the back-end logistics of getting things to you. It's typically estimated that each item is touched fifteen to twenty times between when you click a button to buy and when it shows up.

Speaker 1: And that fifteen to twenty touches of getting a thing from the warehouse to your door is your opportunity.

Speaker 2: Exactly right. And combined with that, people don't want to drive in the middle of the night two hours into a suburb to work in a warehouse. It's the kind of job that has an extremely high turnover rate, not the kind of job that people stay in for a very long time. And so there's a tremendous amount of desire for more robotics and more automation in those environments, to do picking up objects, sorting them into the right compartments of boxes, and then packing them nicely and shipping them out to you as a customer.

Speaker 1: Tell me a little bit more specifically. I mean, what's one thing one of your robots is doing today in a warehouse somewhere on the earth?

Speaker 2: Yeah, so we'll get pretty detailed and geeky here, and I'll actually tell you a little bit of the nitty-gritty details of a warehouse. Let me describe what the robot is doing: a tote full of items comes up to the robot, and then it needs to grab one thing at a time...

Speaker 1: A tote is like a tote bag, with a bunch of different stuff...
Speaker 2: With a bunch of different stuff in it. And because this stuff is all laid out in a chaotic way, and it's overlapping with each other, if you're not careful, you might drag out multiple items at the same time. And these items all have different shapes. They might be transparent, they might be reflective, they might be hard to see.

Speaker 1: This is hard. To go back to our earlier discussion: riveting parts onto a car is easy, folding a towel is hard. This is hard because it's heterogeneous. Things look different, they come in differently every time. This is hard for a sort of classical robot to do.

Speaker 2: Impossible for a classical robot. Impossible.

Speaker 1: Not just hard, impossible. Can your robots do it?

Speaker 2: Yeah, they can do it, extremely well.

Speaker 1: How did you solve it?

Speaker 2: How does it work? At the end of the day, the way that it operates is very similar to how the human vision system works. We have two eyes, and with two eyes looking at something, we can figure out the depth of a certain item, because our two eyes can triangulate a single point in the 3D world. It's the same kind of mechanism. You can just use multiple regular cameras, just like the one that you have on your iPhone, and by having multiple ones of those, you give the neural net the ability to triangulate what's happening.

Speaker 1: Just the way our two eyes allow us to see depth, essentially.

Speaker 2: Exactly right.
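The two-camera geometry being described can be sketched in a few lines. For a rectified stereo pair, depth follows from the disparity between where each camera sees the same point; a learned system exploits this relationship implicitly rather than evaluating a formula, and every number below is made up for illustration.

```python
# Toy rectified-stereo triangulation: two cameras a known baseline apart see
# the same point at different pixel columns; that difference (the disparity)
# determines depth. Focal length, baseline, and pixels are hypothetical.

def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Depth in meters of a point seen at pixel columns x_left and x_right."""
    disparity = x_left - x_right  # in pixels; bigger disparity = closer point
    return focal_px * baseline_m / disparity

# Two cameras 10 cm apart with an 800-pixel focal length observe one item.
print(depth_from_disparity(412.0, 380.0, focal_px=800.0, baseline_m=0.10))
# -> 2.5 (the item is about 2.5 meters from the cameras)
```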
Speaker 1: And are there other things, like weight? I mean, the arm is going to be picking things up. Like weight, or whether, you know, presumably there could be a shirt in a plastic bag, could be a box. Some things are rigid, some things are deformable.

Speaker 2: Yeah. So what we have found is that if you just have a visual understanding of the world that is as robust as a human's, you go a really long way. When I pick up a cup, I'm not doing a lot of calculations on how my fingers are placed, exactly what force gets translated to the cup to make sure it holds, right?

Speaker 1: It's part of the miracle of being a person, though, right? It's a really hard problem to pick something up.

Speaker 2: It's a very hard problem, but your brain subconsciously solves it for you. Your system-one thinking somewhat solves that.

Speaker 1: Without thinking. You don't have to think about it.

Speaker 2: Exactly. And you can imagine, when you do this... in fact, even if my fingers are numb, I can still do this perfectly, just because the brain acquires this intuitive understanding of interaction with the physical world so well that you can do it.

Speaker 1: So basically, vision gets you most of the way there.

Speaker 2: I would say vision, and then the ability to intuit physics from the visual input that you get.

Speaker 1: Tell me about that second one. "Intuit" is a wild word, though. I mean, "intuit" in this context means, sort of, make inferences? I mean... yeah, okay, yeah.

Speaker 2: By "intuit," I mean it's not doing some kind of detailed physics calculation.

Speaker 1: It's not doing math.

Speaker 2: It's not doing math. It's doing a kind of high-level pattern matching of: well, based on how these things look, this is likely going to be a successful way to approach the item and interact with it.
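Here is what "pattern matching, not physics" might look like in code: a hedged sketch of a small network that scores candidate grasp poses straight from pixels, with no force or contact calculation anywhere. This is purely illustrative, not Covariant's model; every layer size and name is an assumption.

```python
# A small convolutional net maps an image crop of an item to a score for
# each candidate grasp pose: pure pattern matching, no physics anywhere.
import torch
import torch.nn as nn

class GraspScorer(nn.Module):
    def __init__(self, num_grasp_candidates=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One predicted success probability per candidate grasp pose.
        self.head = nn.Linear(32, num_grasp_candidates)

    def forward(self, image):
        return torch.sigmoid(self.head(self.features(image)))

scorer = GraspScorer()
camera_crop = torch.rand(1, 3, 128, 128)  # stand-in for a real camera image
scores = scorer(camera_crop)
best = scores.argmax(dim=1)               # pick the most promising grasp
print(scores.shape, best)                 # torch.Size([1, 32]) and an index
```

Trained on outcomes (did the grasp succeed?), a model like this learns "how these things look" to "how to approach them," with the physics left implicit in the data.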
Speaker 1: What's next?

Speaker 2: What is next, immediately, is very exciting. We are now getting to a place, and by "we" I mean we as in the AI community, where we have enough computation power and algorithmic and modeling understanding to allow us to extract a lot out of data.

Speaker 1: So from any given amount of data, you can get more.

Speaker 2: You can get more out of it, right.

Speaker 1: It's exciting for you, because data is such a constraint on what you're trying to do.

Speaker 2: Exactly. And as we're building up this large robotic data set by tapping into a lot of these deployments, it gives us the ability to get even more out of the data sets that we're building, and allows us to build smarter and better robotics foundation models that perform better at the current tasks that they're supposed to do, and also power more robots.

Speaker 1: When people talk about concerns around AI, they often kind of jokingly use the phrase "killer robots," which is usually like a metaphor or something. But in your instance, because you are building robots, and because you are building, by design, a model that is supposed to be used for lots of different purposes, I can in fact very easily imagine killer-robot applications of your work. That seems like a very plausible thing someone could do with it. Is that something you think about, worry about?

Speaker 2: I would say, very fortunately, in the very near to medium term use cases, we are very safe, because all of these industrial robots are very much confined to the stations that they're designed into. Industrial robots are heavy machinery that is subject to regulations, and there are very careful design guidelines and compliance requirements for them. They are already safe by design.
Speaker 1: You're saying your model is built for robots, basically for robot arms. Is that essentially what the model you're building is, really a foundation model for robot arms that are built to just be in one place, pick things up, put them down, that sort of thing? It's not a foundation model that you could map onto a car or something, or even onto a robot that walks around. It wouldn't work for that.

Speaker 2: That would not be the near-term use case. Because the near-term use cases are more in this safe-by-construction setting, it allows us to not worry about that problem, and in fact, there's basically no way to misuse the technology in the way we deploy it. But I do agree with you: as we actually unleash this model to a set of use cases where these robots can actually interact with the world in a lot more freeform way, the safety considerations become a lot more important, and there's definitely a lot more work that needs to be done for that to be reliable.

Speaker 1: We'll be back in a minute with the lightning round.

Speaker 1: There is a lightning round that we're going to do now, for the end of the interview. What household chore do you wish that a robot could do?

Speaker 2: Cleaning up the kitchen. Yeah, I don't like the cleanup of it.

Speaker 1: That seems like a really hard one, like putting stuff away, basically. Wiping the counter, maybe less hard, but putting stuff away seems like a really hard job for a robot.

Speaker 2: It is. And these are the types of jobs that start to get to hardware limitations. These are the types of jobs where you probably do want a humanoid robot, something that kind of moves and conforms to the human standard of interacting with the world.
542 00:30:29,156 --> 00:30:31,796 Speaker 1: Because the kitchen is optimized, right, or you would have 543 00:30:31,836 --> 00:30:34,556 Speaker 1: to redesign the kitchen for a robot, and but then 544 00:30:34,756 --> 00:30:37,396 Speaker 1: that would suck because then you couldn't get your plates 545 00:30:37,396 --> 00:30:40,516 Speaker 1: because they'd be in some random spot or whatever. Okay, 546 00:30:40,556 --> 00:30:44,476 Speaker 1: so you left open ai in twenty seventeen. In the 547 00:30:44,516 --> 00:30:48,436 Speaker 1: past year, open ai became like this household word, GPT 548 00:30:48,596 --> 00:30:52,876 Speaker 1: became a household word. Were you surprised as a you know, 549 00:30:53,076 --> 00:30:56,516 Speaker 1: old school former open AI guy, were you surprised by 550 00:30:57,036 --> 00:31:00,236 Speaker 1: how how wild the world went for GPT or by 551 00:31:00,276 --> 00:31:01,516 Speaker 1: how good it was how soon? 552 00:31:03,516 --> 00:31:07,116 Speaker 2: I was definitely surprised by the speed of it. I 553 00:31:07,196 --> 00:31:10,076 Speaker 2: was surprised by the speed of bull of the technologists 554 00:31:10,116 --> 00:31:14,716 Speaker 2: development and the speed of adoption. But I was not 555 00:31:14,836 --> 00:31:18,156 Speaker 2: surprised by the fact that it could be dispig and 556 00:31:18,356 --> 00:31:19,236 Speaker 2: it could be bigger. 557 00:31:19,676 --> 00:31:22,676 Speaker 1: You know, when you were talking about sort of warehouse 558 00:31:22,796 --> 00:31:27,516 Speaker 1: and getting data from you know, picking and packing basically, 559 00:31:27,836 --> 00:31:31,116 Speaker 1: I thought, of course, as anyone would, of Amazon, and 560 00:31:31,156 --> 00:31:34,676 Speaker 1: I've read that they're working on some kind of robot 561 00:31:34,756 --> 00:31:36,876 Speaker 1: arm I feel like they would just have so much 562 00:31:37,836 --> 00:31:40,236 Speaker 1: data that they could gather if they wanted, just because 563 00:31:40,236 --> 00:31:41,916 Speaker 1: they're so big, they have so many warehouse I mean 564 00:31:41,916 --> 00:31:44,276 Speaker 1: the same way that say Google just gets tons of 565 00:31:44,436 --> 00:31:46,476 Speaker 1: data every day with every Google search and the way 566 00:31:46,516 --> 00:31:48,796 Speaker 1: people like I feel like that would be very hard 567 00:31:48,796 --> 00:31:50,436 Speaker 1: to compete with, But. 568 00:31:50,476 --> 00:31:52,196 Speaker 2: We also don't need to compete with them. They are 569 00:31:52,276 --> 00:31:56,516 Speaker 2: also a very large role that Amazon is not serving, 570 00:31:56,996 --> 00:31:59,676 Speaker 2: and there are a lot of customers that don't have 571 00:31:59,796 --> 00:32:05,276 Speaker 2: the same degree of engineering team data access as Amazon, and. 572 00:32:05,676 --> 00:32:07,596 Speaker 1: You could be the shop by of you could be 573 00:32:07,636 --> 00:32:09,116 Speaker 1: the Shopify of warehouse robots. 574 00:32:09,476 --> 00:32:12,596 Speaker 2: There are all of these people that still need help, 575 00:32:12,636 --> 00:32:14,356 Speaker 2: and we very gladly help them. 576 00:32:14,716 --> 00:32:17,276 Speaker 1: What was the first robot you personally built. 577 00:32:19,756 --> 00:32:21,356 Speaker 2: I think it's probably one of the first pick and 578 00:32:21,436 --> 00:32:23,116 Speaker 2: place robot e Covariant. 579 00:32:23,596 --> 00:32:25,396 Speaker 1: You didn't build them when you were kid there. 
Speaker 2: I didn't build them when I was a kid.

Speaker 1: What made you get into robots?

Speaker 2: I would say I'm an AI person first and a robot person second, and a big part of the interest in robotics is probably driven by my interest in AI. We have just not made as much progress for AI in the physical world as AI in the digital world, and to a large degree, I think we have to make progress there, because ultimately we live in the physical world. You're creating all this intelligence and these amazing things in the digital world; that is all great, but where's AI in the physical world? There's remarkably little progress there, despite how much AI has moved forward. And so to some degree, it's driven by a conviction that AI has to progress forward, and AI will have a large impact in the physical world.

Speaker 1: What do you understand about robots that most people don't?

Speaker 2: I think the most interesting thing about robots, that I would say we understand at Covariant and that maybe people outside of the company don't, is: making one robot work is obviously hard and fun, but making a lot of robots work at scale, for a lot of customers, takes a lot of operational discipline. It's about doing many, many things right. Before robots go into a facility, how should you prep the site so the robots actually work? Shipping robots at scale is a competency that requires a lot of operational excellence. And that is something that, when most people think about robots, they don't think about; they think about the sexy, interesting technology. They don't think about having to nail a thousand small steps well in order to have robots actually have an impact in the world at scale.

Speaker 1: That's the leap from the academic lab to being a real company selling real products in the world. Great. Anything else you want to talk about?
Speaker 2: No, I enjoyed this conversation.

Speaker 1: Yeah, likewise. Thank you. Peter Chen is the co-founder and CEO of Covariant. Today's show was produced by Edith Russolo and Gabriel Hunter Chang. It was edited by Karen Chakerji and engineered by Sarah Bugueer. You can email us at problem@pushkin.fm. I'm Jacob Goldstein, and we'll be back next week with another episode of What's Your Problem.