WEBVTT - Using AI to Build Better Robots

0:00:15.356 --> 0:00:21.876
<v Speaker 1>Pushkin. For a long time now, we've had a lot

0:00:21.876 --> 0:00:25.876
<v Speaker 1>of technological innovation in virtual things in bits, you know,

0:00:26.316 --> 0:00:30.636
<v Speaker 1>the Internet, digital images, large language models, etc. We have

0:00:30.716 --> 0:00:36.076
<v Speaker 1>had noticeably less innovation in actual things and things made

0:00:36.116 --> 0:00:39.116
<v Speaker 1>of atoms, things that would hurt if you dropped them

0:00:39.116 --> 0:00:42.956
<v Speaker 1>on your foot. Now that seems to be changing. People

0:00:43.036 --> 0:00:47.316
<v Speaker 1>are using innovations in bits, improvements in computing and communications

0:00:47.316 --> 0:00:51.916
<v Speaker 1>and AI to drive innovation in actual things, everything from

0:00:51.956 --> 0:01:03.236
<v Speaker 1>batteries to garbage cans to airplanes. Next up: robots. I'm

0:01:03.316 --> 0:01:05.836
<v Speaker 1>Jacob Goldstein, and this is What's Your Problem, the show

0:01:05.876 --> 0:01:08.156
<v Speaker 1>where I talk to people who are trying to make

0:01:08.276 --> 0:01:12.796
<v Speaker 1>technological progress. My guest today is Peter Chen. He's the

0:01:12.836 --> 0:01:17.236
<v Speaker 1>co-founder and CEO of Covariant. Peter's work at Covariant

0:01:17.276 --> 0:01:20.356
<v Speaker 1>was partly inspired by the work of Fei-Fei Li, who

0:01:20.396 --> 0:01:24.756
<v Speaker 1>coincidentally is the AI researcher I interviewed just last week

0:01:24.876 --> 0:01:28.556
<v Speaker 1>on the show. Peter's problem is this: how do you

0:01:28.596 --> 0:01:31.476
<v Speaker 1>take the AI breakthroughs of the past decade or so

0:01:32.236 --> 0:01:34.596
<v Speaker 1>and make them work in robots?

0:01:39.116 --> 0:01:44.316
<v Speaker 2>So to really tell the story of robotics, like we

0:01:44.396 --> 0:01:47.956
<v Speaker 2>have to tell the story of robotics even without AI

0:01:48.476 --> 0:01:51.276
<v Speaker 2>Like, robotics, for a very long time, it's a field

0:01:51.476 --> 0:01:56.196
<v Speaker 2>that you would actually find in mechanical engineering departments of universities.

0:01:56.236 --> 0:01:59.556
<v Speaker 2>Like it's largely a hardware problem. It's a control problem,

0:01:59.596 --> 0:02:02.196
<v Speaker 2>like how can you design the motor well, how can

0:02:02.236 --> 0:02:03.636
<v Speaker 2>you design the gearbox well?

0:02:03.756 --> 0:02:04.116
<v Speaker 1>Yeah, right?

0:02:04.236 --> 0:02:07.636
<v Speaker 2>How can you design, like, the control algorithm so that you

0:02:07.676 --> 0:02:11.476
<v Speaker 2>can get the robot to an exact XYZ location in

0:02:11.556 --> 0:02:15.036
<v Speaker 2>the 3D physical world, like, without oscillating around, and

0:02:15.076 --> 0:02:15.516
<v Speaker 2>you can.

0:02:15.436 --> 0:02:17.756
<v Speaker 1>Making the thing move? How do you build the parts

0:02:17.756 --> 0:02:19.396
<v Speaker 1>that make the thing move the way we want it to move?

0:02:19.396 --> 0:02:22.956
<v Speaker 2>Exactly. Like, it's all about, like, telling

0:02:23.196 --> 0:02:26.156
<v Speaker 2>this piece of machinery that we call robot to do

0:02:26.196 --> 0:02:28.596
<v Speaker 2>the thing that's exactly what we tell it to do,

0:02:29.036 --> 0:02:32.236
<v Speaker 2>which turned out to be like obviously a fairly difficult

0:02:32.236 --> 0:02:34.916
<v Speaker 2>engineering problem, and that's why people have worked on it

0:02:34.996 --> 0:02:38.396
<v Speaker 2>for many decades. But it has gotten really good.

0:02:38.476 --> 0:02:40.876
<v Speaker 1>And so this is like this is like the classic

0:02:40.996 --> 0:02:43.996
<v Speaker 1>kind of image you see from a car assembly line

0:02:44.036 --> 0:02:48.676
<v Speaker 1>of, like, a robot arm, you know, whatever, welding a

0:02:48.796 --> 0:02:52.356
<v Speaker 1>part onto the body of a car again and again

0:02:52.516 --> 0:02:53.396
<v Speaker 1>again all day long.

0:02:53.476 --> 0:02:54.796
<v Speaker 2>Yeah, exactly right.

0:02:54.636 --> 0:02:57.516
<v Speaker 1>So they're good at... robots are clearly good at welding

0:02:57.516 --> 0:02:59.996
<v Speaker 1>the same part onto the same car a million times.

0:03:00.356 --> 0:03:02.716
<v Speaker 1>What are the limits of that approach? What were the

0:03:02.836 --> 0:03:05.236
<v Speaker 1>problems people were bumping up against?

0:03:05.556 --> 0:03:08.516
<v Speaker 2>Yeah, so the problem is that in order to use

0:03:08.596 --> 0:03:13.316
<v Speaker 2>that kind of robotics, it has really big limitations

0:03:13.356 --> 0:03:15.916
<v Speaker 2>on your environment. Right. You basically need to be able

0:03:15.916 --> 0:03:22.076
<v Speaker 2>to reduce your task to be solvable by repeated motion. Right,

0:03:22.236 --> 0:03:24.876
<v Speaker 2>And so if you look at like, how like these

0:03:24.956 --> 0:03:28.356
<v Speaker 2>kind of assembly lines that use classical robots, they always

0:03:28.516 --> 0:03:32.436
<v Speaker 2>feed the material into exactly the same place. So no

0:03:32.476 --> 0:03:34.476
<v Speaker 2>matter how they came in from

0:03:34.516 --> 0:03:37.236
<v Speaker 2>their suppliers and whatnot, you always need to load them

0:03:37.316 --> 0:03:39.956
<v Speaker 2>up in exactly the same way because like there's really

0:03:39.996 --> 0:03:43.476
<v Speaker 2>no adaptivity at all that these robots have because they're

0:03:43.516 --> 0:03:45.236
<v Speaker 2>just executing the same thing again and again.

0:03:45.436 --> 0:03:47.876
<v Speaker 1>It just has to be... they're very precise, but their

0:03:47.916 --> 0:03:51.916
<v Speaker 1>whole environment has to be super homogeneous, the same every

0:03:51.956 --> 0:03:52.996
<v Speaker 1>time.

0:03:53.076 --> 0:03:56.836
<v Speaker 2>Exactly. So, like, that's problem one, but that's very difficult. Like,

0:03:56.876 --> 0:03:59.956
<v Speaker 2>not everything can be reduced that way. The second problem

0:03:59.956 --> 0:04:02.356
<v Speaker 2>with it is even in the case that you can

0:04:02.396 --> 0:04:07.156
<v Speaker 2>reduce the problem to that kind of pure mechanical repeated motion,

0:04:07.716 --> 0:04:10.796
<v Speaker 2>it's still very expensive because you still need to program

0:04:10.836 --> 0:04:14.116
<v Speaker 2>a robot to do that one specific task, and if

0:04:14.156 --> 0:04:17.356
<v Speaker 2>you change your task slightly, you need to reprogram everything,

0:04:17.796 --> 0:04:21.676
<v Speaker 2>typically from scratch, and that means, like, robots are not

0:04:21.756 --> 0:04:26.196
<v Speaker 2>just extremely rigid, which limits like the range of capabilities

0:04:26.236 --> 0:04:30.396
<v Speaker 2>that they can reach and do. And it's also very

0:04:30.476 --> 0:04:34.196
<v Speaker 2>expensive even on the very fixed, rigid capability as well.

0:04:34.116 --> 0:04:37.156
<v Speaker 1>Right. And so you need something that's the same every time,

0:04:37.196 --> 0:04:38.996
<v Speaker 1>and you need to be doing a lot of it

0:04:39.116 --> 0:04:42.996
<v Speaker 1>because otherwise the economy of scale just doesn't work out.

0:04:42.996 --> 0:04:45.036
<v Speaker 1>It's too expensive to try and get the robot to

0:04:45.076 --> 0:04:45.716
<v Speaker 1>do something else.

0:04:46.076 --> 0:04:46.516
<v Speaker 2>Exactly.

0:04:47.036 --> 0:04:50.796
<v Speaker 1>So I know that you, as a student,

0:04:50.876 --> 0:04:53.396
<v Speaker 1>as an undergrad and a grad student, if I have

0:04:53.436 --> 0:04:56.436
<v Speaker 1>it right, you worked in the lab of this professor

0:04:56.436 --> 0:04:59.196
<v Speaker 1>at Berkeley who for a long time had been trying

0:04:59.236 --> 0:05:02.996
<v Speaker 1>to teach robots to fold towels. Yes, which is an

0:05:03.036 --> 0:05:06.116
<v Speaker 1>amazing problem because it's one of those ones that seems

0:05:06.156 --> 0:05:09.836
<v Speaker 1>so simple, right, it seems like way easier than riveting

0:05:09.916 --> 0:05:12.636
<v Speaker 1>parts onto a car or whatever, but turned out to

0:05:12.636 --> 0:05:16.876
<v Speaker 1>be in fact much harder for robots, right. And I

0:05:16.876 --> 0:05:19.596
<v Speaker 1>feel like that's telling. Like, why was that so hard

0:05:19.636 --> 0:05:20.156
<v Speaker 1>for robots, and what do we learn from that?

0:05:20.196 --> 0:05:22.276
<v Speaker 2>When you do

0:05:22.836 --> 0:05:25.476
<v Speaker 2>welding on a car body, like as we have discussed,

0:05:25.476 --> 0:05:28.716
<v Speaker 2>like you can reduce the problem to just simple mechanical

0:05:28.956 --> 0:05:33.276
<v Speaker 2>repeated motion. But because, like, a piece of fabric

0:05:33.396 --> 0:05:37.596
<v Speaker 2>is flexible, is deformable, like it can come in many

0:05:37.596 --> 0:05:40.836
<v Speaker 2>many different kinds of shapes. It has many different possibilities, right?

0:05:40.756 --> 0:05:43.676
<v Speaker 1>And it's much more complex than a car body.

0:05:43.716 --> 0:05:46.596
<v Speaker 1>Weirdly. Not intuitively, exactly, but when you think about it, it's like,

0:05:46.676 --> 0:05:49.516
<v Speaker 1>oh it could have a little fast possibilities.

0:05:50.196 --> 0:05:52.436
<v Speaker 2>It has a lot more possibilities in how it can

0:05:52.476 --> 0:05:55.916
<v Speaker 2>present itself to you. And exactly because of that, like,

0:05:56.036 --> 0:06:00.916
<v Speaker 2>recall like the first limitation of traditional robotics, which is

0:06:00.956 --> 0:06:03.716
<v Speaker 2>it can only work with problems that can be reduced

0:06:03.716 --> 0:06:09.916
<v Speaker 2>to repeated motion. And towel folding, and folding apparel

0:06:09.956 --> 0:06:12.876
<v Speaker 2>items is exactly one of those things that cannot be

0:06:12.956 --> 0:06:14.076
<v Speaker 2>reduced because.

0:06:13.836 --> 0:06:16.676
<v Speaker 1>it's just a little bit different every time. The towel's

0:06:16.676 --> 0:06:18.116
<v Speaker 1>a little bit of a different shape. It might be

0:06:18.156 --> 0:06:21.116
<v Speaker 1>sitting on the table a little folded over, some weird...

0:06:20.996 --> 0:06:24.076
<v Speaker 2>Exactly. Like, if it's folded onto itself, like, how

0:06:24.156 --> 0:06:26.436
<v Speaker 2>much is it folded onto itself? Like, how much wrinkle

0:06:26.556 --> 0:06:28.756
<v Speaker 2>does it have? Like all of those make a big

0:06:28.796 --> 0:06:32.316
<v Speaker 2>difference in terms of what the robots should do with it,

0:06:32.356 --> 0:06:35.036
<v Speaker 2>and so that like it's a really good example of

0:06:35.316 --> 0:06:40.116
<v Speaker 2>something that traditional robotics cannot solve, and you really need

0:06:40.156 --> 0:06:41.956
<v Speaker 2>AI to solve it.

0:06:42.076 --> 0:06:45.596
<v Speaker 1>And when your co-founder started working on the problem,

0:06:45.676 --> 0:06:49.556
<v Speaker 1>it was sort of before the kind of modern era

0:06:49.596 --> 0:06:52.116
<v Speaker 1>of AI that we're living in now, right, And I

0:06:53.116 --> 0:06:56.716
<v Speaker 1>read that there was this big moment in kind of

0:06:56.756 --> 0:07:03.636
<v Speaker 1>the origin of the company was when this database essentially

0:07:03.716 --> 0:07:08.156
<v Speaker 1>of labeled images called ImageNet was released, which was

0:07:08.196 --> 0:07:10.316
<v Speaker 1>interesting to me because I just talked to Fei-Fei Li

0:07:10.436 --> 0:07:10.716
<v Speaker 1>for this.

0:07:10.756 --> 0:07:16.396
<v Speaker 2>So Fei-Fei is actually one of our advisors and investors.

0:07:16.436 --> 0:07:18.556
<v Speaker 1>Just a coincidence. For the record, she didn't put me

0:07:18.596 --> 0:07:21.796
<v Speaker 1>in touch with you, but tell me about sort of

0:07:21.956 --> 0:07:26.596
<v Speaker 1>why ImageNet was meaningful in the

0:07:26.636 --> 0:07:28.076
<v Speaker 1>birth of your company.

0:07:28.476 --> 0:07:33.196
<v Speaker 2>Yeah, there was actually a bigger lesson here than just robotics.

0:07:33.556 --> 0:07:38.676
<v Speaker 2>The bigger lesson is that artificial intelligence is actually becoming

0:07:38.796 --> 0:07:42.396
<v Speaker 2>simpler and simpler, Like when you look at the field

0:07:42.396 --> 0:07:46.516
<v Speaker 2>of artificial intelligence fifteen, twenty years ago, like, it used

0:07:46.516 --> 0:07:48.676
<v Speaker 2>to be many different subfields. Like the people that

0:07:48.716 --> 0:07:51.676
<v Speaker 2>worked on robotics had a completely different tool set than

0:07:51.716 --> 0:07:54.036
<v Speaker 2>the people that work on computer vision, and the people

0:07:54.036 --> 0:07:56.316
<v Speaker 2>that are working on computer vision had a different tool

0:07:56.396 --> 0:07:58.756
<v Speaker 2>sets from people that are working on natural language processing.

0:07:59.116 --> 0:08:02.356
<v Speaker 1>Well, and they feel different, Right, Like, teaching computers to

0:08:02.556 --> 0:08:07.316
<v Speaker 1>understand an image, you know, to sort of deal with

0:08:07.436 --> 0:08:09.716
<v Speaker 1>an image of the world and understand what it means,

0:08:09.916 --> 0:08:14.196
<v Speaker 1>feels quite different than teaching a computer to have a conversation.

0:08:14.396 --> 0:08:16.436
<v Speaker 1>It doesn't... it's not obvious that you would use the

0:08:16.436 --> 0:08:17.996
<v Speaker 1>same tools to do those.

0:08:17.636 --> 0:08:21.356
<v Speaker 2>Exactly. And that definitely was the consensus.

0:08:21.236 --> 0:08:22.116
<v Speaker 1>Like, why would it be the same?

0:08:22.396 --> 0:08:23.876
<v Speaker 2>Yeah, why would it be the same? Like, it feels

0:08:23.956 --> 0:08:25.556
<v Speaker 2>very different. Like, it feels like you need a different kind

0:08:25.596 --> 0:08:29.356
<v Speaker 2>of data. And I would just, like, say, essentially,

0:08:29.396 --> 0:08:32.516
<v Speaker 2>the field of AI is becoming more and more unified,

0:08:32.836 --> 0:08:35.756
<v Speaker 2>Like the methodology, the model that you would use is

0:08:35.756 --> 0:08:39.236
<v Speaker 2>actually becoming more and more similar, and sometimes it's even the

0:08:39.276 --> 0:08:43.876
<v Speaker 2>same across these very different fields of robotics, computer vision, language.

0:08:43.996 --> 0:08:46.676
<v Speaker 1>So it's basically, you build a neural network and then

0:08:46.716 --> 0:08:49.116
<v Speaker 1>you just train it on a bunch of images or

0:08:49.196 --> 0:08:52.036
<v Speaker 1>train it on a bunch of documents, and what you

0:08:52.156 --> 0:08:55.036
<v Speaker 1>train it on is what determines sort of what it's

0:08:55.076 --> 0:08:55.476
<v Speaker 1>good for.

0:08:55.996 --> 0:08:59.596
<v Speaker 3>It is like that. That basically is it. Like,

0:08:59.636 --> 0:09:05.436
<v Speaker 3>it's like, instead of each subfield coming

0:09:05.556 --> 0:09:08.956
<v Speaker 3>up with different... and think about it as, like, hand

0:09:09.156 --> 0:09:13.276
<v Speaker 3>programmed intelligence, right? Like, let's try to break down what

0:09:13.316 --> 0:09:16.716
<v Speaker 3>a sentence means, like, break it into different parts. And when

0:09:16.756 --> 0:09:18.996
<v Speaker 3>you approach a computer vision problem, oh, let's try to

0:09:19.036 --> 0:09:21.996
<v Speaker 3>come up with different features and some features represent an edge,

0:09:22.076 --> 0:09:23.836
<v Speaker 3>some features represent the background.

0:09:24.236 --> 0:09:27.956
<v Speaker 2>Instead of like trying to manually program the kind of

0:09:28.036 --> 0:09:32.316
<v Speaker 2>quite human intelligence into AI, you're basically taking a step

0:09:32.356 --> 0:09:34.876
<v Speaker 2>back and say, I'm just going to create a very

0:09:34.916 --> 0:09:39.516
<v Speaker 2>flexible learning mechanism, which is an artificial neural net, and

0:09:39.556 --> 0:09:41.436
<v Speaker 2>then we're just going to feed it a lot of data.

0:09:41.916 --> 0:09:44.796
<v Speaker 2>And if you have different types of problems that you're solving,

0:09:45.036 --> 0:09:48.316
<v Speaker 2>you're just feeding, like, this neural net different types of data,

0:09:49.156 --> 0:09:51.476
<v Speaker 2>but they're still like really the same kind of mechanism.

0:09:51.516 --> 0:09:55.036
<v Speaker 2>And that is a drastic departure from how artificial

0:09:55.076 --> 0:09:56.836
<v Speaker 2>intelligence used to be done.

0:09:57.196 --> 0:10:00.316
<v Speaker 1>And so in that world, then the differentiation is just

0:10:00.436 --> 0:10:03.916
<v Speaker 1>in the data set that you are feeding the model.

0:10:03.916 --> 0:10:07.076
<v Speaker 2>Totally, totally. And in fact, like, I mean, this is like

0:10:07.196 --> 0:10:10.156
<v Speaker 2>jumping way forward in time. We were still talking about

0:10:10.156 --> 0:10:13.156
<v Speaker 2>ImageNet. But if you look at really the most

0:10:13.236 --> 0:10:17.196
<v Speaker 2>popular technologies today, these large language models. When you use

0:10:17.276 --> 0:10:22.236
<v Speaker 2>different companies' large language models, it's really what you're saying.

0:10:22.276 --> 0:10:26.596
<v Speaker 2>Like, if I'm using ChatGPT, GPT-4, versus

0:10:26.676 --> 0:10:32.836
<v Speaker 2>using Google's Bard backed by Gemini versus Anthropic's Claude or

0:10:33.036 --> 0:10:34.836
<v Speaker 2>Cohere's Command model.

0:10:34.636 --> 0:10:37.836
<v Speaker 1>You're just naming all the different big language... large language models now.

0:10:37.916 --> 0:10:39.836
<v Speaker 2>Yeah, and like when you think about these different big

0:10:39.916 --> 0:10:43.236
<v Speaker 2>language models, you're just really referring to the different data

0:10:43.236 --> 0:10:44.956
<v Speaker 2>sets that sit behind.

0:10:44.636 --> 0:10:47.236
<v Speaker 1>That they were trained on. So this is interesting kind

0:10:47.276 --> 0:10:50.316
<v Speaker 1>of abstract big picture talk. I want to kind of

0:10:50.436 --> 0:10:54.596
<v Speaker 1>map this now onto the story of Covariant, the story

0:10:54.676 --> 0:10:57.756
<v Speaker 1>of your company. Right, so we're going back in time. Now,

0:10:58.956 --> 0:11:01.196
<v Speaker 1>tell me about the moment when you decided to start

0:11:01.196 --> 0:11:02.316
<v Speaker 1>the company. What's going on?

0:11:02.916 --> 0:11:04.876
<v Speaker 2>Yeah, so the moment that we decided to start a

0:11:04.956 --> 0:11:09.636
<v Speaker 2>company is, as you referred to, this ImageNet moment, like,

0:11:09.716 --> 0:11:14.036
<v Speaker 2>this, like, large data set that actually, for the first time, like,

0:11:14.156 --> 0:11:18.796
<v Speaker 2>really taught people that you can train a network to

0:11:18.916 --> 0:11:21.116
<v Speaker 2>solve one specific task really well.

0:11:21.636 --> 0:11:27.916
<v Speaker 1>So this is twenty twelve. That's when a neural net

0:11:28.436 --> 0:11:32.516
<v Speaker 1>trains on ImageNet and does a really good job,

0:11:32.676 --> 0:11:35.436
<v Speaker 1>way better than anybody has ever done, than any

0:11:35.476 --> 0:11:39.316
<v Speaker 1>model has ever done, of identifying objects in...

0:11:38.716 --> 0:11:42.516
<v Speaker 2>Exactly right, exactly. That was really significant because that

0:11:42.676 --> 0:11:45.716
<v Speaker 2>means if you can collect a lot of data for

0:11:45.996 --> 0:11:48.756
<v Speaker 2>a single task, and if you can get a group

0:11:48.796 --> 0:11:51.996
<v Speaker 2>of PhDs to work on a model for that single task,

0:11:52.356 --> 0:11:55.076
<v Speaker 2>you basically have AI. Like you can solve that task

0:11:55.196 --> 0:11:58.756
<v Speaker 2>really well. I mean like you might need to iterate

0:11:58.796 --> 0:12:00.956
<v Speaker 2>on your data, you might need to iterate on your algorithm,

0:12:01.076 --> 0:12:03.276
<v Speaker 2>but ultimately you can solve that task really well.

0:12:03.436 --> 0:12:07.356
<v Speaker 1>You're basically saying, if you can gather the data, you

0:12:07.396 --> 0:12:11.316
<v Speaker 1>can gather a shitload of data of whatever kind, then

0:12:11.396 --> 0:12:14.356
<v Speaker 1>you can get AI around.

0:12:14.156 --> 0:12:16.756
<v Speaker 2>For that, for that kind, right. And which is, like,

0:12:16.876 --> 0:12:20.596
<v Speaker 2>why you saw artificial intelligence, like, really start working, like,

0:12:20.676 --> 0:12:24.116
<v Speaker 2>after twenty twelve, Like like Google, Facebook, all of these

0:12:24.116 --> 0:12:27.476
<v Speaker 2>companies like have a lot of AI based applications, but

0:12:27.516 --> 0:12:31.236
<v Speaker 2>they are largely not democratized because like in order to

0:12:31.316 --> 0:12:33.956
<v Speaker 2>get any single AI working, you still need a lot

0:12:33.996 --> 0:12:36.636
<v Speaker 2>of data and you still need a team of PhDs

0:12:36.676 --> 0:12:39.996
<v Speaker 2>to work on it. So it was a huge breakthrough

0:12:40.076 --> 0:12:43.556
<v Speaker 2>in AI, but it was not sufficient for really widespread

0:12:43.636 --> 0:12:46.276
<v Speaker 2>usage just because the barrier to create one AI is

0:12:46.316 --> 0:12:51.596
<v Speaker 2>so high. Okay, and then comes the second inflection point

0:12:51.916 --> 0:12:54.596
<v Speaker 2>in the history of AI, which is really the start

0:12:54.796 --> 0:12:58.876
<v Speaker 2>of foundation models. So like I'm talking about the most

0:12:58.876 --> 0:13:01.796
<v Speaker 2>initial version of GPT, right, Like I'm talking about like

0:13:02.236 --> 0:13:05.716
<v Speaker 2>these large language models that are trained on multiple tasks

0:13:06.036 --> 0:13:09.396
<v Speaker 2>so that they're incredibly generalizable. Like, you can ask it to

0:13:09.396 --> 0:13:11.556
<v Speaker 2>do something new and it can do it really well.

0:13:12.076 --> 0:13:16.796
<v Speaker 2>And also it performs better at a single task than a specialized model.

0:13:16.916 --> 0:13:19.116
<v Speaker 1>And so just to be clear, like until a few

0:13:19.236 --> 0:13:23.036
<v Speaker 1>years ago, people thought reasonably that if you want to

0:13:23.076 --> 0:13:28.516
<v Speaker 1>build AI to, whatever, translate language, right, you would work

0:13:28.556 --> 0:13:31.316
<v Speaker 1>really hard on that. You would try and build an

0:13:31.356 --> 0:13:36.916
<v Speaker 1>AI specifically designed to be really good at translating text

0:13:36.996 --> 0:13:41.996
<v Speaker 1>from one language to another. But this really surprising result,

0:13:42.076 --> 0:13:45.236
<v Speaker 1>this really surprising thing that emerged from just work people

0:13:45.236 --> 0:13:47.796
<v Speaker 1>were doing, was, in fact, that's not the best way

0:13:48.116 --> 0:13:52.636
<v Speaker 1>to get AI to translate language. It's just throw everything

0:13:52.716 --> 0:13:55.316
<v Speaker 1>you can, all the words on the internet at an

0:13:55.356 --> 0:13:58.876
<v Speaker 1>AI model and just say, figure out everything about language,

0:13:58.876 --> 0:14:02.316
<v Speaker 1>figure out how to answer questions about history, and figure

0:14:02.356 --> 0:14:05.076
<v Speaker 1>out how to translate, and figure out how to give

0:14:05.116 --> 0:14:08.396
<v Speaker 1>me a recipe for you know, pasta. And it turns

0:14:08.396 --> 0:14:12.316
<v Speaker 1>out that that technique gets you better results at each

0:14:12.356 --> 0:14:15.556
<v Speaker 1>specific thing than trying to build a specialized model.

0:14:15.596 --> 0:14:18.916
<v Speaker 2>Exactly. And that is really the magic of foundation models. And

0:14:18.956 --> 0:14:22.756
<v Speaker 2>that's the thing that was not obvious to people outside

0:14:22.756 --> 0:14:26.196
<v Speaker 2>of OpenAI for a very long time. And because

0:14:26.236 --> 0:14:28.236
<v Speaker 2>we came from OpenAI, a lot of the founding

0:14:28.276 --> 0:14:30.676
<v Speaker 2>team at Covariant came from OpenAI. We saw that

0:14:30.756 --> 0:14:36.796
<v Speaker 2>insight earlier, and that insight allowed us to start Covariant

0:14:36.796 --> 0:14:40.916
<v Speaker 2>to build foundation models for robotics way before other people

0:14:40.956 --> 0:14:42.236
<v Speaker 2>even believed in the approach.

0:14:42.236 --> 0:14:45.796
<v Speaker 1>Even people in the field? Even people in the field.

0:14:45.916 --> 0:14:47.916
<v Speaker 1>So, yeah, so you were... when did you go

0:14:47.956 --> 0:14:49.876
<v Speaker 1>to OpenAI? You went to work at

0:14:49.876 --> 0:14:50.396
<v Speaker 1>OpenAI.

0:14:50.756 --> 0:14:53.876
<v Speaker 2>We went to OpenAI when it was about ten-ish

0:14:54.036 --> 0:14:57.356
<v Speaker 2>people, like, sometime in twenty sixteen.

0:14:57.036 --> 0:15:03.116
<v Speaker 1>Okay. And when do you sort of personally have

0:15:03.276 --> 0:15:06.836
<v Speaker 1>this realization and you're not the only one to have it,

0:15:06.876 --> 0:15:11.236
<v Speaker 1>but when do you see the power of foundation models?

0:15:12.916 --> 0:15:15.156
<v Speaker 2>There are two things on it, Like, so the first

0:15:15.196 --> 0:15:20.436
<v Speaker 2>thing is that early on at OpenAI, we believed

0:15:20.516 --> 0:15:24.476
<v Speaker 2>in the idea of scaling, like really scaling up the

0:15:24.476 --> 0:15:27.236
<v Speaker 2>model and scaling up the data sets. And you actually

0:15:27.276 --> 0:15:31.796
<v Speaker 2>see models getting like getting increasingly smarter as you actually

0:15:31.796 --> 0:15:34.196
<v Speaker 2>scale them up. So one is that, and then the

0:15:34.236 --> 0:15:39.396
<v Speaker 2>other one is I would say we we had conviction

0:15:39.596 --> 0:15:44.156
<v Speaker 2>in foundation model for robotics probably earlier than foundation model

0:15:44.196 --> 0:15:46.556
<v Speaker 2>for language. And this is like the one key thing

0:15:46.636 --> 0:15:51.276
<v Speaker 2>is that if you think about building a large language

0:15:51.316 --> 0:15:54.796
<v Speaker 2>model that tried to compress the whole internet of knowledge,

0:15:55.036 --> 0:15:58.276
<v Speaker 2>you still need to compress many things that are not

0:15:58.396 --> 0:16:01.396
<v Speaker 2>quite related to each other. Right, Like maybe you're browsing

0:16:01.436 --> 0:16:05.716
<v Speaker 2>on Wikipedia and you have to recite the composition of

0:16:05.836 --> 0:16:09.876
<v Speaker 2>materials of soil on the moon, and you also need

0:16:09.916 --> 0:16:12.436
<v Speaker 2>to learn how to play chess. There's

0:16:12.476 --> 0:16:15.476
<v Speaker 2>really nothing in common with these two things. Yeah, these

0:16:15.476 --> 0:16:17.556
<v Speaker 2>two parts of the knowledge, but you are asking one

0:16:17.596 --> 0:16:21.036
<v Speaker 2>AI model to learn all of these. And the place

0:16:21.116 --> 0:16:23.676
<v Speaker 2>that make a lot of sense to us is that

0:16:24.476 --> 0:16:27.356
<v Speaker 2>there's only one physical world. Like even when you have

0:16:27.716 --> 0:16:30.956
<v Speaker 2>many different robots that need to do different things in

0:16:31.036 --> 0:16:34.556
<v Speaker 2>different factories, different warehouses, they are still interacting in the

0:16:34.596 --> 0:16:37.956
<v Speaker 2>same physical world. And so building a foundation model for

0:16:38.076 --> 0:16:42.676
<v Speaker 2>robotics like has this amazing property of grounding that like,

0:16:42.756 --> 0:16:45.516
<v Speaker 2>no matter what kind of tasks that you're asking this

0:16:45.596 --> 0:16:48.396
<v Speaker 2>foundation model to learn, well, it's just learning with the

0:16:48.396 --> 0:16:50.596
<v Speaker 2>same set of physics. Like, the same surroundings.

0:16:50.756 --> 0:16:52.876
<v Speaker 1>It's the literal ground with the models.

0:16:52.636 --> 0:16:54.596
<v Speaker 2>The literal ground, exactly.

0:16:54.516 --> 0:16:57.916
<v Speaker 1>And the models have to understand just how the physical world works,

0:16:57.956 --> 0:17:01.596
<v Speaker 1>and that if you drop a thing, it will fall, etc.

0:17:02.116 --> 0:17:05.556
<v Speaker 2>And if it's something that is, like, deformable, you

0:17:05.596 --> 0:17:07.316
<v Speaker 2>push it, it would move a little bit. If something

0:17:07.436 --> 0:17:10.916
<v Speaker 2>is rigid, like, it would slide. If something is rollable,

0:17:10.956 --> 0:17:11.716
<v Speaker 2>it would roll away.

0:17:11.796 --> 0:17:11.916
<v Speaker 1>Like.

0:17:12.116 --> 0:17:13.996
<v Speaker 2>These are the type of things that like, no matter

0:17:14.436 --> 0:17:17.076
<v Speaker 2>where you are on Earth and what type of robot

0:17:17.116 --> 0:17:19.836
<v Speaker 2>body you're using, it's the same, right? And so,

0:17:19.956 --> 0:17:22.196
<v Speaker 2>like, if you can build one single foundation model that

0:17:22.276 --> 0:17:25.076
<v Speaker 2>can learn from all of these different data, it would

0:17:25.076 --> 0:17:26.236
<v Speaker 2>be incredibly powerful.

0:17:26.676 --> 0:17:29.756
<v Speaker 1>So just to just to state it clearly, So you're

0:17:29.836 --> 0:17:33.036
<v Speaker 1>at OpenAI, you're seeing the power of foundation models,

0:17:33.636 --> 0:17:36.476
<v Speaker 1>you decide to leave and start the company that is

0:17:36.556 --> 0:17:39.676
<v Speaker 1>now Covariant, Like what are you setting out to do

0:17:39.876 --> 0:17:41.756
<v Speaker 1>when you start the company?

0:17:42.276 --> 0:17:46.596
<v Speaker 2>Yeah, So when we started Covariant, we had this really

0:17:46.636 --> 0:17:51.556
<v Speaker 2>strong conviction that like, there should be a future that

0:17:51.756 --> 0:17:54.996
<v Speaker 2>has a lot of autonomous robots doing all the things

0:17:55.036 --> 0:18:00.716
<v Speaker 2>that are repetitive, injury prone, dangerous, like and so that

0:18:00.876 --> 0:18:04.316
<v Speaker 2>can really revolutionize the physical world, make it a lot

0:18:04.356 --> 0:18:09.676
<v Speaker 2>more abundant and to really enable that future of autonomous robots,

0:18:09.836 --> 0:18:12.716
<v Speaker 2>you need really smart AI, and like because of the

0:18:12.756 --> 0:18:15.316
<v Speaker 2>insight that we just talked about, like, we believe that

0:18:15.356 --> 0:18:17.596
<v Speaker 2>AI had to be a foundation model. We believe like

0:18:17.636 --> 0:18:20.356
<v Speaker 2>you should have a single model that learns from all these

0:18:20.356 --> 0:18:23.036
<v Speaker 2>different robots together and becomes smarter together.

0:18:23.596 --> 0:18:28.116
<v Speaker 1>So I mean, so the basic idea is, the dream

0:18:28.196 --> 0:18:35.476
<v Speaker 1>is to build one AI foundation model for robots. Basically, yeah,

0:18:35.636 --> 0:18:37.756
<v Speaker 1>in the same way that you can ask ChatGPT

0:18:37.956 --> 0:18:40.196
<v Speaker 1>anything in language and it can answer you in language

0:18:40.196 --> 0:18:42.796
<v Speaker 1>about any different thing. You have a model where you

0:18:42.996 --> 0:18:45.876
<v Speaker 1>just sort of make it be the brain of any

0:18:45.996 --> 0:18:48.716
<v Speaker 1>robot and that robot can sort of see the world

0:18:48.796 --> 0:18:51.356
<v Speaker 1>and move and pick things up and behave in the

0:18:51.396 --> 0:18:52.436
<v Speaker 1>world exactly.

0:18:53.076 --> 0:18:56.396
<v Speaker 2>And there's one key problem, Like the key problem is

0:18:56.436 --> 0:19:01.636
<v Speaker 2>that unlike foundation model for language, where you can scrape

0:19:01.676 --> 0:19:04.596
<v Speaker 2>the whole Internet of text as your pre training data,

0:19:05.036 --> 0:19:09.036
<v Speaker 2>there's nothing that is equivalent in the case of robotics.

0:19:09.436 --> 0:19:11.236
<v Speaker 2>I mean, there are some images online, there are some

0:19:11.316 --> 0:19:14.556
<v Speaker 2>YouTube videos online, but by and large, like they don't

0:19:14.596 --> 0:19:20.036
<v Speaker 2>really give you the same type of data that are

0:19:20.036 --> 0:19:22.316
<v Speaker 2>in the form of robots interacting with the world. And

0:19:22.356 --> 0:19:25.116
<v Speaker 2>the big problem is because they're just not that many robots,

0:19:25.116 --> 0:19:28.556
<v Speaker 2>they're just doing interesting things in the world. And so

0:19:28.636 --> 0:19:31.436
<v Speaker 2>a big chunk of what we set out to build

0:19:31.436 --> 0:19:34.156
<v Speaker 2>as a company is recognizing that we need to build

0:19:34.276 --> 0:19:37.436
<v Speaker 2>foundation model for robotics. And in order to be a

0:19:37.476 --> 0:19:40.916
<v Speaker 2>foundation model for robotics, you need to have large data sets.

0:19:41.196 --> 0:19:43.676
<v Speaker 2>And in order to create large data sets, you have

0:19:43.836 --> 0:19:47.836
<v Speaker 2>to have robots that are actually creating value for customers

0:19:47.876 --> 0:19:51.476
<v Speaker 2>in production at scale, because if you're only collecting data

0:19:51.516 --> 0:19:53.996
<v Speaker 2>in your own lab, there's only so much data. And

0:19:54.436 --> 0:19:59.076
<v Speaker 2>so I would say the last six years of Covariant

0:19:59.076 --> 0:20:04.956
<v Speaker 2>is largely focused on really building autonomous robot systems that

0:20:05.076 --> 0:20:08.476
<v Speaker 2>work really well for customers, and they're doing interesting things

0:20:08.516 --> 0:20:11.996
<v Speaker 2>to a level of autonomy and reliability that have not

0:20:12.076 --> 0:20:13.236
<v Speaker 2>been hit before.

0:20:16.196 --> 0:20:19.036
<v Speaker 4>In other words, Peter and his colleagues at Covariant built

0:20:19.276 --> 0:20:22.756
<v Speaker 4>robot arms that businesses are paying to use out in

0:20:22.796 --> 0:20:26.476
<v Speaker 4>the world. But to some extent, those robot arms are

0:20:26.516 --> 0:20:29.316
<v Speaker 4>just a means to an end because they're not just

0:20:29.396 --> 0:20:32.356
<v Speaker 4>doing warehouse work. They're collecting more and more.

0:20:32.276 --> 0:20:35.436
<v Speaker 1>Data that Peter and his colleagues are feeding back into

0:20:35.476 --> 0:20:38.276
<v Speaker 1>their AI model to try and make it get better

0:20:38.476 --> 0:20:43.076
<v Speaker 1>and better. In a minute, what those robot arms are

0:20:43.156 --> 0:20:46.876
<v Speaker 1>actually doing out in the world and what they're learning.

0:20:56.916 --> 0:21:02.036
<v Speaker 1>So let's talk about what the robots you have built

0:21:02.436 --> 0:21:05.356
<v Speaker 1>are doing out in the world. Besides generating the data

0:21:05.396 --> 0:21:08.116
<v Speaker 1>that you will use to train the next generation of robots,

0:21:08.116 --> 0:21:11.196
<v Speaker 1>what are they actually doing in the world right now today.

0:21:11.356 --> 0:21:14.636
<v Speaker 2>Yeah, so when we started the company, we surveyed the

0:21:14.716 --> 0:21:19.156
<v Speaker 2>landscape pretty carefully and then we selected warehouse and logistics

0:21:19.156 --> 0:21:22.836
<v Speaker 2>as the primary sector that we focus on today. When

0:21:22.876 --> 0:21:26.196
<v Speaker 2>you are shopping online, like when you click a

0:21:26.196 --> 0:21:30.436
<v Speaker 2>button and something shows up next day, there's tremendous amount

0:21:30.636 --> 0:21:34.676
<v Speaker 2>of complexities behind that back end logistics of getting things

0:21:34.716 --> 0:21:38.196
<v Speaker 2>to you. And typically it's estimated that each item is

0:21:38.276 --> 0:21:41.476
<v Speaker 2>touched fifteen to twenty times between when you click a

0:21:41.516 --> 0:21:43.396
<v Speaker 2>button to buy and when it shows up.

0:21:44.716 --> 0:21:48.476
<v Speaker 1>In contrast, that fifteen to twenty touches of getting a

0:21:48.516 --> 0:21:52.596
<v Speaker 1>thing from the warehouse to your door is your opportunity.

0:21:52.316 --> 0:21:56.556
<v Speaker 2>Exactly right, And like combining with that, people don't want

0:21:56.596 --> 0:21:59.596
<v Speaker 2>to drive in the middle of the night two hours

0:21:59.596 --> 0:22:03.116
<v Speaker 2>into a suburb to work in a warehouse, Like it's

0:22:03.196 --> 0:22:06.156
<v Speaker 2>just a kind of job that has extremely high turnover rate,

0:22:06.236 --> 0:22:08.156
<v Speaker 2>Like not the kind of job that people stay there

0:22:08.196 --> 0:22:11.196
<v Speaker 2>for a very long time, and so there's a tremendous

0:22:11.196 --> 0:22:15.756
<v Speaker 2>amount of desire for more robotics and more automations in

0:22:15.796 --> 0:22:20.596
<v Speaker 2>those environments to do picking up objects, sorting them into

0:22:20.636 --> 0:22:24.636
<v Speaker 2>the right compartment of boxes, and then packing it nicely

0:22:24.716 --> 0:22:27.996
<v Speaker 2>and then shipping it out to you as a customer.

0:22:28.196 --> 0:22:30.556
<v Speaker 1>Tell me a little bit more specifically, I mean, what's

0:22:30.636 --> 0:22:34.236
<v Speaker 1>one thing one of your robots is doing today in

0:22:35.036 --> 0:22:36.916
<v Speaker 1>a warehouse somewhere on the earth.

0:22:37.236 --> 0:22:41.756
<v Speaker 2>Yeah, so we would get like pretty detailed and geeky here,

0:22:41.796 --> 0:22:43.756
<v Speaker 2>and so we'd actually tell you a little bit of

0:22:43.836 --> 0:22:48.996
<v Speaker 2>like the nitty-gritty details of like warehouse. Like, so

0:22:49.076 --> 0:22:50.916
<v Speaker 2>let me describe what the robot is doing. Like the

0:22:50.996 --> 0:22:55.316
<v Speaker 2>robot is doing, is I have a tote full of

0:22:55.316 --> 0:22:57.956
<v Speaker 2>items that come up to the robots, and then I

0:22:57.996 --> 0:23:00.716
<v Speaker 2>would need to grab one thing at a time, like we're.

0:23:00.596 --> 0:23:03.156
<v Speaker 1>Is like a tote bag with a bunch of different stuff.

0:23:02.836 --> 0:23:05.316
<v Speaker 2>And a bunch of different stuff in it, and like

0:23:05.396 --> 0:23:08.676
<v Speaker 2>because like these stuffs are all laid out in a chaotic

0:23:09.516 --> 0:23:12.156
<v Speaker 2>way and like they're overlapping with each other. If you're

0:23:12.156 --> 0:23:14.356
<v Speaker 2>not careful, you might drag out multiple items at the

0:23:14.356 --> 0:23:17.316
<v Speaker 2>same time. And these items all have different shapes like

0:23:17.356 --> 0:23:19.596
<v Speaker 2>they might be transparent, they might be reflective, they might

0:23:19.636 --> 0:23:20.316
<v Speaker 2>be hard to see.

0:23:20.476 --> 0:23:23.116
<v Speaker 1>This is hard, like to go back to our old

0:23:23.276 --> 0:23:27.196
<v Speaker 1>Like riveting parts on a car is easy, Folding a

0:23:27.236 --> 0:23:30.596
<v Speaker 1>towel is hard. This is hard because it's heterogeneous. Things

0:23:30.636 --> 0:23:33.236
<v Speaker 1>look different, They come differently every time. This is hard

0:23:33.276 --> 0:23:36.436
<v Speaker 1>for robots. Hard for a sort of classical robot to do.

0:23:37.836 --> 0:23:40.556
<v Speaker 2>Impossible for a classical impossible.

0:23:39.996 --> 0:23:44.236
<v Speaker 1>Not just hard, impossible. Can your robots do it?

0:23:44.916 --> 0:23:47.036
<v Speaker 2>Yeah, they can do it extremely well.

0:23:47.196 --> 0:23:48.996
<v Speaker 1>How did you solve How did you solve it?

0:23:49.036 --> 0:23:50.836
<v Speaker 2>How does it work in the end of the day?

0:23:50.916 --> 0:23:53.276
<v Speaker 2>Like the way that it operates is very similar

0:23:53.356 --> 0:23:57.116
<v Speaker 2>to how like the human vision system works. Like we have

0:23:57.196 --> 0:23:59.836
<v Speaker 2>two eyes and then like by two eyes looking at

0:23:59.876 --> 0:24:03.756
<v Speaker 2>something like we like we can figure out what's the

0:24:03.796 --> 0:24:06.516
<v Speaker 2>depth of a certain items, like because our two eyes

0:24:06.556 --> 0:24:10.356
<v Speaker 2>can triangulate a single point in the 3D world, and

0:24:10.396 --> 0:24:12.516
<v Speaker 2>it's the same kind of mechanisms and so like you

0:24:12.556 --> 0:24:15.876
<v Speaker 2>can just use multiple regular cameras, just like the one

0:24:15.916 --> 0:24:18.836
<v Speaker 2>that you have on your iPhone, and by having multiple

0:24:18.876 --> 0:24:22.396
<v Speaker 2>ones of those like you give the neural net the

0:24:22.476 --> 0:24:24.836
<v Speaker 2>ability to triangulate what's happening.

0:24:24.956 --> 0:24:27.196
<v Speaker 1>Just the way our two eyes allow us to see

0:24:27.236 --> 0:24:31.756
<v Speaker 1>depth essentially exactly right, And are there other things like

0:24:31.756 --> 0:24:33.676
<v Speaker 1>like weight, I mean the arm is going to be

0:24:33.676 --> 0:24:36.756
<v Speaker 1>picking things up like weight or whether you know, presumably

0:24:36.796 --> 0:24:39.316
<v Speaker 1>there's like could be a shirt in a plastic bag,

0:24:39.396 --> 0:24:41.756
<v Speaker 1>could be a box, some things are rigid, some things

0:24:41.756 --> 0:24:42.596
<v Speaker 1>are deformable.

0:24:42.836 --> 0:24:45.956
<v Speaker 2>Yeah, So what we have found is that if you

0:24:46.036 --> 0:24:50.396
<v Speaker 2>just have a visual understanding of the world that is

0:24:50.476 --> 0:24:54.476
<v Speaker 2>as robust as human, you go a really long way. Right,

0:24:54.556 --> 0:24:58.356
<v Speaker 2>Like so when I when I pick up a cup, right, like,

0:24:58.476 --> 0:25:02.356
<v Speaker 2>I'm not doing a lot of calculations on how

0:25:02.436 --> 0:25:05.756
<v Speaker 2>my fingers' force, like, exactly gets translated to

0:25:05.796 --> 0:25:07.476
<v Speaker 2>the cup and making sure it holds right.

0:25:07.516 --> 0:25:09.876
<v Speaker 1>It's part of the miracle of being a person though, Right,

0:25:09.916 --> 0:25:11.356
<v Speaker 1>it's like a really hard problem to pick.

0:25:11.596 --> 0:25:13.676
<v Speaker 2>It's a very hard problem. But then like your brains

0:25:13.676 --> 0:25:16.476
<v Speaker 2>subconsciously solve it for you, like your System 1

0:25:16.596 --> 0:25:18.516
<v Speaker 2>thinking somehow solves that.

0:25:18.396 --> 0:25:20.996
<v Speaker 1>For you. You don't have to think about it. Yeah.

0:25:20.636 --> 0:25:24.116
<v Speaker 2>Mm-hmm, exactly. And you can like imagine like when

0:25:24.156 --> 0:25:26.356
<v Speaker 2>you do this, like in fact, like even if my

0:25:26.436 --> 0:25:29.236
<v Speaker 2>fingers are numb, I can still do this perfectly like

0:25:29.476 --> 0:25:34.676
<v Speaker 2>just because like so it acquires this intuitive understanding of

0:25:35.156 --> 0:25:37.756
<v Speaker 2>interaction with the physical world so well that you can do it.

0:25:38.196 --> 0:25:41.716
<v Speaker 1>So basically vision, vision gets you most of the way there.

0:25:41.836 --> 0:25:45.236
<v Speaker 2>I would say, vision and then the ability to intuit

0:25:45.636 --> 0:25:48.916
<v Speaker 2>physics from your visual input that you'll get.

0:25:48.996 --> 0:25:51.476
<v Speaker 1>Tell me that second one. That is wild though. Like

0:25:51.676 --> 0:25:54.476
<v Speaker 1>I mean intuit, I mean intuit in the

0:25:54.516 --> 0:25:57.676
<v Speaker 1>in the context of it, I mean sort of make inferences.

0:25:57.956 --> 0:26:00.796
<v Speaker 1>I mean intuit is yeah, okay, yeah.

0:26:00.876 --> 0:26:03.076
<v Speaker 2>But like by intuit, I mean like it's

0:26:03.116 --> 0:26:07.236
<v Speaker 2>not doing some kind of detailed physical calculation.

0:26:07.276 --> 0:26:09.116
<v Speaker 1>It's not doing math. It's not doing math.

0:26:09.516 --> 0:26:11.796
<v Speaker 2>Yeah, it's doing a kind of high level pattern matching

0:26:11.796 --> 0:26:15.556
<v Speaker 2>of well, like based on how these things look, this

0:26:15.716 --> 0:26:18.516
<v Speaker 2>is likely going to be a successful way to approach

0:26:18.596 --> 0:26:20.716
<v Speaker 2>the item and interact with it.

0:26:21.036 --> 0:26:23.356
<v Speaker 1>What's next?

0:26:23.916 --> 0:26:28.876
<v Speaker 2>So what is next immediately is very exciting? Right. We

0:26:28.956 --> 0:26:31.836
<v Speaker 2>are now getting to a place that by we, I

0:26:31.876 --> 0:26:36.796
<v Speaker 2>mean we as in the AI community, have gotten to a

0:26:36.876 --> 0:26:43.116
<v Speaker 2>place that we have enough computation power and algorithmic and

0:26:43.236 --> 0:26:47.796
<v Speaker 2>modeling understanding that can allow us to extract a lot

0:26:48.356 --> 0:26:48.996
<v Speaker 2>out of data.

0:26:49.356 --> 0:26:54.596
<v Speaker 1>Right, So from any given amount of data, you can get.

0:26:54.396 --> 0:26:57.196
<v Speaker 2>More, you can get more out of it. Right.

0:26:57.276 --> 0:27:00.676
<v Speaker 1>It's exciting for you because data is such a constraint

0:27:00.716 --> 0:27:02.516
<v Speaker 1>on what you're trying to do exactly.

0:27:02.556 --> 0:27:05.876
<v Speaker 2>And then like we're building up this large robotic data

0:27:05.876 --> 0:27:09.916
<v Speaker 2>set. By tapping into a lot of these advances,

0:27:11.156 --> 0:27:13.876
<v Speaker 2>it gives us the ability to even get more out

0:27:13.916 --> 0:27:16.156
<v Speaker 2>of the data sets that we're building and allow us

0:27:16.156 --> 0:27:20.956
<v Speaker 2>to build smarter and better robotics foundation models that perform

0:27:21.036 --> 0:27:23.556
<v Speaker 2>better at the current tasks that they're supposed to do

0:27:23.996 --> 0:27:26.436
<v Speaker 2>and also power more robots.

0:27:26.476 --> 0:27:29.076
<v Speaker 1>When people talk about concerns around AI, they often kind

0:27:29.076 --> 0:27:32.316
<v Speaker 1>of half-jokingly use the phrase killer robots, which is

0:27:32.396 --> 0:27:35.836
<v Speaker 1>usually like a metaphor or something. But in your instance,

0:27:35.916 --> 0:27:38.516
<v Speaker 1>because you are building robots, and because you are building

0:27:38.756 --> 0:27:41.436
<v Speaker 1>by design a model that is supposed to be used

0:27:41.476 --> 0:27:44.516
<v Speaker 1>for lots of different purposes, I can in fact very

0:27:44.516 --> 0:27:47.796
<v Speaker 1>easily imagine killer robot applications of your work. Like that

0:27:47.796 --> 0:27:50.636
<v Speaker 1>seems like a very plausible thing someone could do with it, Like,

0:27:51.196 --> 0:27:52.996
<v Speaker 1>is that something you think about worry about?

0:27:54.436 --> 0:27:59.316
<v Speaker 2>I would say, very fortunately, in the very near to

0:27:59.476 --> 0:28:03.076
<v Speaker 2>medium term use cases, we are very safe because like

0:28:03.196 --> 0:28:06.836
<v Speaker 2>all of these industrial robots are very much confined by

0:28:06.956 --> 0:28:10.796
<v Speaker 2>the stations that they're designed in. Because, like, industrial robots

0:28:11.356 --> 0:28:14.956
<v Speaker 2>are heavy machinery that are subject to regulations and are very

0:28:14.996 --> 0:28:20.676
<v Speaker 2>carefully... There are like careful design guidelines and compliance requirements

0:28:20.716 --> 0:28:23.396
<v Speaker 2>for them. They are already by design safe.

0:28:23.516 --> 0:28:28.756
<v Speaker 1>You're saying your model is built for robots basically for

0:28:28.876 --> 0:28:31.316
<v Speaker 1>robot arms. I mean, is that essentially what the model

0:28:31.316 --> 0:28:34.036
<v Speaker 1>you're building is really a foundation model for robot arms

0:28:34.036 --> 0:28:36.956
<v Speaker 1>that are built to just be in one place, pick

0:28:36.996 --> 0:28:39.076
<v Speaker 1>things up, put them down, that sort of thing. It's

0:28:39.116 --> 0:28:42.276
<v Speaker 1>not a foundation model that you could have map onto

0:28:42.476 --> 0:28:44.556
<v Speaker 1>a car or something, or even to a robot that

0:28:44.676 --> 0:28:46.716
<v Speaker 1>walks around. It wouldn't work for that.

0:28:46.716 --> 0:28:49.196
<v Speaker 2>That would not be the near term use cases. Like

0:28:49.276 --> 0:28:52.196
<v Speaker 2>so like because the near term use cases are more

0:28:52.236 --> 0:28:56.876
<v Speaker 2>in this safe by construction setting, it allows us to

0:28:57.316 --> 0:29:00.076
<v Speaker 2>not worry about that problem and in fact, like have

0:29:00.756 --> 0:29:04.196
<v Speaker 2>basically no way to misuse the technology in the way

0:29:04.196 --> 0:29:06.196
<v Speaker 2>we need to. But I do agree with you, like

0:29:06.236 --> 0:29:09.596
<v Speaker 2>as we actually unleash this model to a set of

0:29:09.676 --> 0:29:12.316
<v Speaker 2>use cases, like when these robots can actually interact with

0:29:12.356 --> 0:29:15.756
<v Speaker 2>the world in a lot more freeform way, those are the cases

0:29:15.836 --> 0:29:20.596
<v Speaker 2>that the safety considerations become a lot more important and

0:29:20.636 --> 0:29:22.596
<v Speaker 2>there's definitely a lot more work that need to be

0:29:22.636 --> 0:29:26.076
<v Speaker 2>done for that to be reliable.

0:29:29.836 --> 0:29:31.996
<v Speaker 1>We'll be back in a minute with the lightning round.

0:29:33.596 --> 0:29:44.236
<v Speaker 1>There is a lightning round that we're gonna do

0:29:44.316 --> 0:29:48.516
<v Speaker 1>now for the end of the interview, what household chore

0:29:48.596 --> 0:29:50.356
<v Speaker 1>do you wish that a robot could do?

0:29:52.996 --> 0:29:57.556
<v Speaker 2>Cleaning up kitchen? Yeah, but I don't like the cleanup

0:29:57.596 --> 0:29:57.836
<v Speaker 2>of it.

0:29:59.556 --> 0:30:02.876
<v Speaker 1>That seems like a really hard one, like putting stuff away.

0:30:03.076 --> 0:30:05.716
<v Speaker 1>Basically, I mean, wiping the counter, maybe less hard, but like

0:30:06.156 --> 0:30:08.556
<v Speaker 1>putting stuff away seems like a really hard job for

0:30:08.556 --> 0:30:09.076
<v Speaker 1>a robot.

0:30:09.796 --> 0:30:11.916
<v Speaker 2>It is. And these are the type of jobs that

0:30:12.036 --> 0:30:15.636
<v Speaker 2>start to get to hardware limitations. Like, these

0:30:15.636 --> 0:30:17.556
<v Speaker 2>are the type of jobs that start to get to

0:30:18.276 --> 0:30:20.996
<v Speaker 2>like you probably do want a humanoid robot, like you

0:30:20.996 --> 0:30:26.156
<v Speaker 2>probably do want something that kind of moves and conform

0:30:26.316 --> 0:30:29.196
<v Speaker 2>to the human standard of interacting with the world.

0:30:29.156 --> 0:30:31.796
<v Speaker 1>Because the kitchen is optimized, right, or you would have

0:30:31.836 --> 0:30:34.556
<v Speaker 1>to redesign the kitchen for a robot, and but then

0:30:34.756 --> 0:30:37.396
<v Speaker 1>that would suck because then you couldn't get your plates

0:30:37.396 --> 0:30:40.516
<v Speaker 1>because they'd be in some random spot or whatever. Okay,

0:30:40.556 --> 0:30:44.476
<v Speaker 1>so you left OpenAI in twenty seventeen. In the

0:30:44.516 --> 0:30:48.436
<v Speaker 1>past year, OpenAI became like this household word, GPT

0:30:48.596 --> 0:30:52.876
<v Speaker 1>became a household word. Were you surprised as a you know,

0:30:53.076 --> 0:30:56.516
<v Speaker 1>old school former OpenAI guy, were you surprised by

0:30:57.036 --> 0:31:00.236
<v Speaker 1>how wild the world went for GPT or by

0:31:00.276 --> 0:31:01.516
<v Speaker 1>how good it was how soon?

0:31:03.516 --> 0:31:07.116
<v Speaker 2>I was definitely surprised by the speed of it. I

0:31:07.196 --> 0:31:10.076
<v Speaker 2>was surprised by the speed of both the technology

0:31:10.116 --> 0:31:14.716
<v Speaker 2>development and the speed of adoption. But I was not

0:31:14.836 --> 0:31:18.156
<v Speaker 2>surprised by the fact that it could be this big, and

0:31:18.356 --> 0:31:19.236
<v Speaker 2>it could be bigger.

0:31:19.676 --> 0:31:22.676
<v Speaker 1>You know, when you were talking about sort of warehouse

0:31:22.796 --> 0:31:27.516
<v Speaker 1>and getting data from you know, picking and packing basically,

0:31:27.836 --> 0:31:31.116
<v Speaker 1>I thought, of course, as anyone would, of Amazon, and

0:31:31.156 --> 0:31:34.676
<v Speaker 1>I've read that they're working on some kind of robot

0:31:34.756 --> 0:31:36.876
<v Speaker 1>arm I feel like they would just have so much

0:31:37.836 --> 0:31:40.236
<v Speaker 1>data that they could gather if they wanted, just because

0:31:40.236 --> 0:31:41.916
<v Speaker 1>they're so big, they have so many warehouses. I mean

0:31:41.916 --> 0:31:44.276
<v Speaker 1>the same way that say Google just gets tons of

0:31:44.436 --> 0:31:46.476
<v Speaker 1>data every day with every Google search and the way

0:31:46.516 --> 0:31:48.796
<v Speaker 1>people like I feel like that would be very hard

0:31:48.796 --> 0:31:50.436
<v Speaker 1>to compete with, But.

0:31:50.476 --> 0:31:52.196
<v Speaker 2>We also don't need to compete with them. There is

0:31:52.276 --> 0:31:56.516
<v Speaker 2>also a very large world that Amazon is not serving,

0:31:56.996 --> 0:31:59.676
<v Speaker 2>and there are a lot of customers that don't have

0:31:59.796 --> 0:32:05.276
<v Speaker 2>the same degree of engineering team data access as Amazon, and.

0:32:05.676 --> 0:32:07.596
<v Speaker 1>You could be the Shopify of, you could be

0:32:07.636 --> 0:32:09.116
<v Speaker 1>the Shopify of warehouse robots.

0:32:09.476 --> 0:32:12.596
<v Speaker 2>There are all of these people that still need help,

0:32:12.636 --> 0:32:14.356
<v Speaker 2>and we very gladly help them.

0:32:14.716 --> 0:32:17.276
<v Speaker 1>What was the first robot you personally built.

0:32:19.756 --> 0:32:21.356
<v Speaker 2>I think it's probably one of the first pick and

0:32:21.436 --> 0:32:23.116
<v Speaker 2>place robots at Covariant.

0:32:23.596 --> 0:32:25.396
<v Speaker 1>You didn't build them when you were a kid?

0:32:25.596 --> 0:32:26.956
<v Speaker 2>I didn't build them when I was a kid.

0:32:27.916 --> 0:32:29.356
<v Speaker 1>What made you get into robots?

0:32:31.876 --> 0:32:36.276
<v Speaker 2>I would say I'm an AI person first and robot

0:32:36.316 --> 0:32:40.956
<v Speaker 2>person second, and a big part of the interest in

0:32:41.076 --> 0:32:44.516
<v Speaker 2>robotics is probably driven by my interests in AI. Like

0:32:44.596 --> 0:32:49.676
<v Speaker 2>it just like we have just not made as much

0:32:49.716 --> 0:32:52.796
<v Speaker 2>progress for AI in the physical world as AI in

0:32:52.836 --> 0:32:56.196
<v Speaker 2>the digital world, and to a large degree, like I

0:32:56.196 --> 0:32:59.436
<v Speaker 2>think we have to make progress there because like ultimately

0:32:59.436 --> 0:33:01.716
<v Speaker 2>we live in the physical world. Like you're creating all

0:33:01.756 --> 0:33:04.716
<v Speaker 2>these intelligence and amazing things in the digital world. That

0:33:04.836 --> 0:33:07.436
<v Speaker 2>is all great, but where's AI in the physical world?

0:33:07.436 --> 0:33:11.996
<v Speaker 2>Like there's remarkably little progress there despite how much AI

0:33:12.036 --> 0:33:14.636
<v Speaker 2>has moved forward. And so to some degree, like it's

0:33:14.636 --> 0:33:20.156
<v Speaker 2>a it's driven by a conviction that like AI has

0:33:20.236 --> 0:33:24.436
<v Speaker 2>to progress forward and AI will have a large impact

0:33:24.436 --> 0:33:25.276
<v Speaker 2>in the physical world.

0:33:25.756 --> 0:33:28.076
<v Speaker 1>What do you understand about robots that most people don't.

0:33:28.156 --> 0:33:34.036
<v Speaker 2>I think the most interesting thing about robots that I

0:33:34.076 --> 0:33:37.596
<v Speaker 2>would say, like we understand at Covariant that maybe outside

0:33:37.596 --> 0:33:46.036
<v Speaker 2>of the company don't is making one robot work is

0:33:47.476 --> 0:33:50.636
<v Speaker 2>obviously hard and fun, but making a lot of robots

0:33:50.756 --> 0:33:53.876
<v Speaker 2>work at scale for a lot of customers takes a

0:33:53.916 --> 0:33:59.836
<v Speaker 2>lot of operational discipline. Like it's about like doing many

0:33:59.876 --> 0:34:03.076
<v Speaker 2>many things right, like before robots go into a facility,

0:34:03.196 --> 0:34:05.996
<v Speaker 2>like how should they prep the site so the robots

0:34:06.036 --> 0:34:13.716
<v Speaker 2>actually work. To ship robots at scale, it's a competency

0:34:13.836 --> 0:34:17.516
<v Speaker 2>that requires a lot of operational excellence, and that is

0:34:17.556 --> 0:34:20.316
<v Speaker 2>something that I would say, like when most people think

0:34:20.316 --> 0:34:22.916
<v Speaker 2>about robots, like, what they think about is like this sexy,

0:34:22.956 --> 0:34:26.436
<v Speaker 2>interesting technology, they don't think about having to nail a

0:34:26.516 --> 0:34:29.996
<v Speaker 2>thousand steps, a thousand small steps well in order to

0:34:30.036 --> 0:34:33.436
<v Speaker 2>have robots actually have an impact in the world at scale.

0:34:33.956 --> 0:34:36.356
<v Speaker 1>That's the leap from the academic lab to being a

0:34:36.396 --> 0:34:41.036
<v Speaker 1>real company selling real products in the world. Great, anything

0:34:41.036 --> 0:34:42.076
<v Speaker 1>else you want to talk about?

0:34:42.676 --> 0:34:44.316
<v Speaker 2>No? I enjoyed this conversation.

0:34:44.836 --> 0:34:55.436
<v Speaker 1>Yeah, likewise, thank you. Peter Chen is the co-founder

0:34:55.476 --> 0:35:00.316
<v Speaker 1>and CEO of Covariant. Today's show was produced by Edith

0:35:00.356 --> 0:35:04.316
<v Speaker 1>Russolo and Gabriel Hunter Chang. It was edited by Karen

0:35:04.396 --> 0:35:08.356
<v Speaker 1>Shakerji and engineered by Sarah Bruguiere. You can email us

0:35:08.476 --> 0:35:12.596
<v Speaker 1>at at Pushkin dot FM. I'm Jacob Goldstein, and

0:35:12.636 --> 0:35:14.916
<v Speaker 1>we'll be back next week with another episode of What's

0:35:14.916 --> 0:35:22.396
<v Speaker 1>Your Problem