Speaker 1: Pushkin.

Speaker 1: For a long time now, we've had a lot of technological innovation in virtual things, in bits, you know: the Internet, digital images, large language models, etc. We have had noticeably less innovation in actual things, things made of atoms, things that would hurt if you dropped them on your foot. Now that seems to be changing. People are using innovations in bits, improvements in computing and communications and AI, to drive innovation in actual things, everything from batteries to garbage cans to airplanes. Next up: robots. I'm Jacob Goldstein, and this is What's Your Problem, the show where I talk to people who are trying to make technological progress. My guest today is Peter Chen. He's the co-founder and CEO of Covariant. Peter's work at Covariant was partly inspired by the work of Fei-Fei Li, who, coincidentally, is the AI researcher I interviewed just last week on the show. Peter's problem is this: how do you take the AI breakthroughs of the past decade or so and make them work in robots?

Speaker 2: To really tell the story of robotics, we have to tell the story of robotics even without AI. Robotics, for a very long time, is a field that you would actually find in mechanical engineering departments of universities. It's largely a hardware problem. It's a control problem: how can you design the motor well? How can you design the gearbox well?

Speaker 1: Yeah, right.

Speaker 2: Can you design the control algorithm so that you can get the robot to an exact XYZ location in the 3D physical world, without oscillating around?

Speaker 1: Making the thing move. How do you build the parts that make the thing move the way we...

Speaker 2: Want it to move, exactly.
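To make the "control problem" concrete, here is a minimal sketch of the kind of feedback loop being described: a toy PID controller driving a single axis to a setpoint without oscillating. The gains, dynamics, and numbers are hypothetical illustrations, not anyone's production code.

```python
# A toy sketch of the classical control problem: drive a single axis toward
# a target position without overshooting and oscillating.

def pid_step(target, position, velocity, integral, dt,
             kp=8.0, ki=0.5, kd=6.0):
    """One PID update; returns (control_force, updated_integral)."""
    error = target - position
    integral += error * dt
    # The derivative term (here, measured velocity) is what damps oscillation.
    force = kp * error + ki * integral - kd * velocity
    return force, integral

# Simulate a unit-mass axis being driven from x = 0 to x = 1.
x, v, integral, dt = 0.0, 0.0, 0.0, 0.01
for _ in range(1000):
    force, integral = pid_step(1.0, x, v, integral, dt)
    v += force * dt   # toy dynamics: F = m * a with m = 1
    x += v * dt
print(f"final position: {x:.4f}")  # settles near 1.0 with well-tuned gains
```

With badly tuned gains, the simulated axis rings around the target instead of settling, which is exactly the oscillation problem decades of control engineering went into solving.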
Speaker 2: It's all about telling this piece of machinery that we call a robot to do exactly the thing that we tell it to do, which turned out to be, obviously, a fairly difficult engineering problem, and that's why people have worked on it for many decades. But it has gotten really good.

Speaker 1: And so this is like the classic kind of image you see from a car assembly line, of a robot arm, you know, welding a part onto the body of a car, again and again and again, all day long.

Speaker 2: Yeah, exactly right.

Speaker 1: So robots are clearly good at welding the same part onto the same car a million times. What are the limits of that approach? What were the problems people were bumping up against?

Speaker 2: Yeah, so the problem is that in order to use that kind of robotics, it puts really big limitations on your environment, right? You basically need to be able to reduce your task to be solvable by repeated motion. And so if you look at how these kinds of assembly lines that use classical robots work, they always feed the material into exactly the same place. So no matter how things came in from their suppliers and whatnot, you always need to load them up in exactly the same way, because there's really no adaptivity at all in these robots. They're just executing the same thing again and again.

Speaker 1: It just has to be... they're very precise, but their whole environment has to be super homogeneous, the same every time.

Speaker 2: Exactly. So that's problem one, and that's very difficult. Not everything can be reduced that way.
Speaker 2: The second problem with it is, even in the case that you can reduce the problem to that kind of pure mechanical repeated motion, it's still very expensive, because you still need to program a robot to do that one specific task, and if you change your task slightly, you need to reprogram everything, typically from scratch. And that means robots are not just extremely rigid, which limits the range of capabilities that they can reach and do; they're also very expensive, even on that very fixed, rigid capability.

Speaker 1: Right. And so you need something that's the same every time, and you need to be doing a lot of it, because otherwise the economy of scale just doesn't work out. It's too expensive to try and get the robot to do something else.

Speaker 2: Exactly.

Speaker 1: So I know that as a student, as an undergrad and a grad student, if I have it right, you worked in the lab of this professor at Berkeley who for a long time had been trying to teach robots to fold towels.

Speaker 2: Yes.

Speaker 1: Which is an amazing problem, because it's one of those ones that seems so simple, right? It seems way easier than riveting parts onto a car or whatever, but it turned out to be in fact much harder for robots. And I feel like that's telling. Why was that so hard for robots? And what do we learn from that?

Speaker 2: When you do welding on a car body, as we have discussed, you can reduce the problem to just simple mechanical repeated motion. But because a piece of fabric is flexible, is deformable, it can come in many, many different kinds of shapes. It has many different possibilities.

Speaker 1: It's much more complex than a car body, weirdly. Not intuitively, but when you think about it, it's like, oh, it could have a vast number of possibilities.

Speaker 2: It has a lot more possibilities in how it can present itself to you.
Speaker 2: And exactly because of that, recall the first limitation of traditional robotics, which is that it can only work with problems that can be reduced to repeated motion. Towel folding, and folding apparel items, is exactly one of those things that cannot be reduced.

Speaker 1: Because it's just a little bit different every time. The towel is a little bit of a different shape. It might be sitting on the table and it'll be folded over some weird...

Speaker 2: Exactly. If it's folded onto itself, how much is it folded onto itself? How much wrinkle does it have? All of those make a big difference in terms of what the robot should do with it. So it's a really good example of something that traditional robotics cannot solve, and you really need AI to solve it.

Speaker 1: And when your co-founder started working on the problem, it was sort of before the kind of modern era of AI that we're living in now, right? And I read that there was this big moment in kind of the origin of the company, which was when this database, essentially, of labeled images called ImageNet was released. Which was interesting to me, because I just talked to Fei-Fei Li for this.

Speaker 2: Fei-Fei is actually one of the advisors and investors.

Speaker 1: Just a coincidence. For the record, she didn't put me in touch with you. But tell me about sort of why ImageNet was meaningful in the birth of your company.

Speaker 2: Yeah, there was actually a bigger lesson here than just robotics. The bigger lesson is that artificial intelligence is actually becoming simpler and simpler. When you look at the field of artificial intelligence fifteen, twenty years ago, it used to be many different subfields.
Speaker 2: The people that worked on robotics had a completely different tool set than the people that worked on computer vision, and the people that were working on computer vision had a different tool set from the people that were working on natural language processing.

Speaker 1: Well, and they feel different, right? Teaching computers to understand an image, you know, to sort of deal with an image of the world and understand what it means, feels quite different than teaching a computer to have a conversation. It's not obvious that you would use the same tools to do those.

Speaker 2: Exactly. And that definitely was the consensus. Like, why would it be the same?

Speaker 1: Why would it be the same?

Speaker 2: Yeah, why would it be the same? It feels very different. It feels like you need different kinds of data. And I would just say it: essentially, the field of AI is becoming more and more unified. The methodology, the model that you would use, is becoming more and more similar, and sometimes it's even the same, across these very different fields of robotics, computer vision, language.

Speaker 1: So it's basically, you build a neural network, and then you just train it on a bunch of images, or train it on a bunch of documents, and what you train it on is what determines sort of what it's good for. Is it like that, basically?

Speaker 2: It's like, instead of each subfield coming up with something different... think about it as hand-programmed intelligence, right? Like, let's try to break down what a sentence means, break it into different parts. Or when you approach a computer vision problem, let's try to come up with different features: some features represent an edge, some features represent the background.
168 00:09:24,236 --> 00:09:27,956 Speaker 2: Instead of like trying to manually program the kind of 169 00:09:28,036 --> 00:09:32,316 Speaker 2: quite human intelligence into AI, you're basically taking a step 170 00:09:32,356 --> 00:09:34,876 Speaker 2: back and say, I'm just going to create a very 171 00:09:34,916 --> 00:09:39,516 Speaker 2: flexible learning mechanism, which is an artificial new net, and 172 00:09:39,556 --> 00:09:41,436 Speaker 2: then we're just going to feed it a lot of data. 173 00:09:41,916 --> 00:09:44,796 Speaker 2: And if you have different types of problems that you're solving, 174 00:09:45,036 --> 00:09:48,316 Speaker 2: you're just feeding like this new net different types of data, 175 00:09:49,156 --> 00:09:51,476 Speaker 2: but they're still like really the same kind of mechanism. 176 00:09:51,516 --> 00:09:55,036 Speaker 2: Like then that is a drastic departure from how artificial 177 00:09:55,076 --> 00:09:56,836 Speaker 2: intelligence used to be done. 178 00:09:57,196 --> 00:10:00,316 Speaker 1: And so in that world, then the differentiation is just 179 00:10:00,436 --> 00:10:03,916 Speaker 1: in the data set that you are feeding the totally model. 180 00:10:03,916 --> 00:10:07,076 Speaker 2: Totally and in fact, like I mean, this is like 181 00:10:07,196 --> 00:10:10,156 Speaker 2: jumping way forward in time. We were still talking about 182 00:10:10,156 --> 00:10:13,156 Speaker 2: image net. But if you look at really the most 183 00:10:13,236 --> 00:10:17,196 Speaker 2: popular technologies today, these large language models. When you use 184 00:10:17,276 --> 00:10:22,236 Speaker 2: different companies large language models, it's really what you're saying, 185 00:10:22,276 --> 00:10:26,596 Speaker 2: like as if I'm using GPT chat, gipt four versus 186 00:10:26,676 --> 00:10:32,836 Speaker 2: using Google's bought backed by Germini versus Anthpics, Claude or 187 00:10:33,036 --> 00:10:34,836 Speaker 2: coheres command model. 188 00:10:34,636 --> 00:10:37,836 Speaker 1: You're just naming all the different big language large language models. Now. 189 00:10:37,916 --> 00:10:39,836 Speaker 2: Yeah, and like when you think about these different big 190 00:10:39,916 --> 00:10:43,236 Speaker 2: language models, you're just really referring to the different data 191 00:10:43,236 --> 00:10:44,956 Speaker 2: sets that's see behind. 192 00:10:44,636 --> 00:10:47,236 Speaker 1: That they were trained on. So this is interesting kind 193 00:10:47,276 --> 00:10:50,316 Speaker 1: of abstract big picture talk. I want to kind of 194 00:10:50,436 --> 00:10:54,596 Speaker 1: map this now onto the story of Covariate, the story 195 00:10:54,676 --> 00:10:57,756 Speaker 1: of your company. Right, so we're going back in time. Now, 196 00:10:58,956 --> 00:11:01,196 Speaker 1: tell me about the moment when you decided to start 197 00:11:01,196 --> 00:11:02,316 Speaker 1: the company. What's going on? 198 00:11:02,916 --> 00:11:04,876 Speaker 2: Yeah, so the moment that we decide to start a 199 00:11:04,956 --> 00:11:09,636 Speaker 2: company is you'll refer to this image net mode like 200 00:11:09,716 --> 00:11:14,036 Speaker 2: this like large data set that actually first time like 201 00:11:14,156 --> 00:11:18,796 Speaker 2: really taught people that you can train a network to 202 00:11:18,916 --> 00:11:21,116 Speaker 2: solve one specific task really work. 203 00:11:21,636 --> 00:11:27,916 Speaker 1: So this is twenty twelve. 
Speaker 1: So this is interesting, kind of abstract, big-picture talk. I want to map this now onto the story of Covariant, the story of your company. So we're going back in time now. Tell me about the moment when you decided to start the company. What's going on?

Speaker 2: Yeah, so the moment that we decided to start the company is, you referred to this ImageNet moment, this large data set that for the first time really taught people that you can train a network to solve one specific task really well.

Speaker 1: So this is twenty twelve. That's when a neural net trains on ImageNet and does a really good job, way better than any model has ever done, of identifying objects...

Speaker 2: In images, exactly right. That was really significant, because it means if you can collect a lot of data for a single task, and if you can get a group of PhDs to work on a model for that single task, you basically have AI. You can solve that task really well. I mean, you might need to iterate on your data, you might need to iterate on your algorithm, but ultimately you can solve that task really well.

Speaker 1: You're basically saying, if you can gather the data, if you can gather a shitload of data of whatever kind, then you can get AI around...

Speaker 2: For that kind, right. And that is why you saw artificial intelligence really start working after twenty twelve. Google, Facebook, all of these companies have a lot of AI-based applications. But they are largely not democratized, because in order to get any single AI working, you still need a lot of data and you still need a team of PhDs to work on it. So it was a huge breakthrough in AI, but it was not sufficient for really widespread usage, just because the barrier to create one AI is so high. Okay, and then comes the second inflection point in the history of AI, which is really the start of foundation models. I'm talking about the most initial version of GPT, right? These large language models that are trained on multiple tasks so that they're incredibly generalizable: you can ask one to do something new and it can do it really well. And also, it performs better at a single task than specialized models.
Speaker 1: And so just to be clear: until a few years ago, people thought, reasonably, that if you want to build AI to, whatever, translate language, you would work really hard on that. You would try and build an AI specifically designed to be really good at translating text from one language to another. But this really surprising result, this really surprising thing that emerged from just work people were doing, was that in fact, that's not the best way to get AI to translate language. It's: just throw everything you can, all the words on the internet, at an AI model, and just say, figure out everything about language, figure out how to answer questions about history, and figure out how to translate, and figure out how to give me a recipe for, you know, pasta. And it turns out that that technique gets you better results at each specific thing than trying to build a specialized model.

Speaker 2: Exactly. And that is really the magic of foundation models. And that's the thing that was not obvious to people outside of OpenAI for a very long time. And because we came from OpenAI, a lot of the founding team at Covariant came from OpenAI, we saw that insight earlier, and that insight allowed us to start Covariant to build foundation models for robotics way before other people even believed in the approach.

Speaker 1: Even people in the field.

Speaker 2: Even people in the field.

Speaker 1: So, yeah, when did you go to OpenAI? You went to work at OpenAI.

Speaker 2: We went to OpenAI when it was about ten-ish people, sometime in twenty sixteen.

Speaker 1: Okay. And when do you sort of personally have this realization? You're not the only one to have it, but when do you see the power of foundation models?
Speaker 2: There are two things to it. The first thing is that early on at OpenAI, we believed in the idea of scaling, really scaling up the model and scaling up the data sets, and you actually see models getting increasingly smarter as you scale them up. So one is that. And then the other one is, I would say we had conviction in foundation models for robotics probably earlier than foundation models for language. And this is the one key thing: if you think about building a large language model that tries to compress the whole internet of knowledge, you still need to compress many things that are not quite related to each other. Maybe you're browsing Wikipedia and you have to recite the composition of materials of soil on the moon, and you also need to learn how to play chess. There's really nothing in common between these two things, these two parts of the knowledge, but you are asking one AI model to learn all of them. And the thing that made a lot of sense to us is that there's only one physical world. Even when you have many different robots that need to do different things in different factories, different warehouses, they are still interacting with the same physical world. And so building a foundation model for robotics has this amazing property of grounding: no matter what kind of task you're asking this foundation model to learn, it's learning with the same set of physics, the same surroundings.

Speaker 1: The literal ground is what grounds the models.

Speaker 2: The literal ground, exactly.

Speaker 1: And the models have to understand just how the physical world works, that if you drop a thing, it will fall, etc.
Speaker 2: And if something is deformable, when you push it, it moves a little bit. If something is rigid, it slides. If something is rollable, it rolls away. These are the types of things that, no matter where you are on Earth and what type of robot body you're using, are the same. And so if you can have one single foundation model that can learn from all of these different data, it would be incredibly powerful.

Speaker 1: So just to state it clearly: you're at OpenAI, you're seeing the power of foundation models, you decide to leave and start the company that is now Covariant. What are you setting out to do when you start the company?

Speaker 2: Yeah. So when we started Covariant, we had this really strong conviction that there should be a future that has a lot of autonomous robots doing all the things that are repetitive, injury-prone, dangerous, and so that can really revolutionize the physical world, make it a lot more abundant. And to really enable that future of autonomous robots, you need really smart AI. And because of the insight that we just talked about, we believed that AI had to be a foundation model. We believed you should have a single model that learns from all these different robots together and becomes smarter together.

Speaker 1: So the basic idea, the dream, is to build one AI foundation model for robots, basically.

Speaker 2: Yeah.

Speaker 1: In the same way that you can ask ChatGPT anything in language and it can answer you in language about any different thing, you have a model where you just sort of make it be the brain of any robot, and that robot can sort of see the world and move and pick things up and behave in the world.
Speaker 2: Exactly. And there's one key problem. Unlike foundation models for language, where you can scrape the whole internet of text as your pre-training data, there's nothing equivalent in the case of robotics. I mean, there are some images online, there are some YouTube videos online, but by and large, they don't really give you the same type of data in the form of robots interacting with the world. And the big problem is that there are just not that many robots doing interesting things in the world. And so a big chunk of what we set out to build as a company is recognizing that we need to build a foundation model for robotics, and in order to build a foundation model for robotics, you need to have large data sets, and in order to create large data sets, you have to have robots that are actually creating value for customers, in production, at scale, because if you're only collecting data in your own lab, there's only so much data. And so I would say the last six years of Covariant have largely been focused on building autonomous robot systems that work really well for customers, that are doing interesting things at a level of autonomy and reliability that has not been hit before.

Speaker 1: In other words, Peter and his colleagues at Covariant built robot arms that businesses are paying to use out in the world. But to some extent, those robot arms are just a means to an end, because they're not just doing warehouse work. They're collecting more and more data that Peter and his colleagues are feeding back into their AI model to try and make it get better and better. In a minute: what those robot arms are actually doing out in the world, and what they're learning.
Speaker 1: So let's talk about what the robots you have built are doing out in the world. Besides generating the data that you will use to train the next generation of robots, what are they actually doing in the world right now, today?

Speaker 2: Yeah. So when we started the company, we surveyed the landscape pretty carefully, and we selected warehousing and logistics as the primary sector that we focus on today. When you are shopping online, when you click a button and something shows up the next day, there's a tremendous amount of complexity behind that, the back-end logistics of getting things to you. It's typically estimated that each item is touched fifteen to twenty times between when you click a button to buy and when it shows up.

Speaker 1: And that fifteen to twenty touches of getting a thing from the warehouse to your door is your opportunity.

Speaker 2: Exactly right. And combined with that, people don't want to drive in the middle of the night two hours into a suburb to work in a warehouse. It's the kind of job that has an extremely high turnover rate, not the kind of job that people stay in for a very long time. And so there's a tremendous amount of desire for more robotics and more automation in those environments, to do picking up objects, sorting them into the right compartments of boxes, and then packing them nicely and shipping them out to you as a customer.

Speaker 1: Tell me a little bit more specifically. I mean, what's one thing one of your robots is doing today in a warehouse somewhere on the earth?

Speaker 2: Yeah, so we'll get pretty detailed and geeky here, and I'll actually tell you a little bit of the nitty-gritty details of a warehouse. Let me describe what the robot is doing: a tote full of items comes up to the robot, and then it needs to grab one thing at a time...

Speaker 1: A tote is like a tote bag, with a bunch of different stuff...
Speaker 2: With a bunch of different stuff in it. And because this stuff is all laid out in a chaotic way, and it's overlapping with each other, if you're not careful, you might drag out multiple items at the same time. And these items all have different shapes. They might be transparent, they might be reflective, they might be hard to see.

Speaker 1: This is hard. To go back to our earlier discussion: riveting parts onto a car is easy, folding a towel is hard. This is hard because it's heterogeneous. Things look different, they come in differently every time. This is hard for a sort of classical robot to do.

Speaker 2: Impossible for a classical robot. Impossible.

Speaker 1: Not just hard, impossible. Can your robots do it?

Speaker 2: Yeah, they can do it, extremely well.

Speaker 1: How did you solve it?

Speaker 2: How does it work? At the end of the day, the way that it operates is very similar to how the human vision system works. We have two eyes, and with two eyes looking at something, we can figure out the depth of a certain item, because our two eyes can triangulate a single point in the 3D world. It's the same kind of mechanism. You can just use multiple regular cameras, just like the one that you have on your iPhone, and by having multiple ones of those, you give the neural net the ability to triangulate what's happening.

Speaker 1: Just the way our two eyes allow us to see depth, essentially.

Speaker 2: Exactly right.
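The two-camera geometry being described can be sketched in a few lines. For a rectified stereo pair, depth follows from the disparity between where each camera sees the same point; a learned system exploits this relationship implicitly rather than evaluating a formula, and every number below is made up for illustration.

```python
# Toy rectified-stereo triangulation: two cameras a known baseline apart see
# the same point at different pixel columns; that difference (the disparity)
# determines depth. Focal length, baseline, and pixels are hypothetical.

def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Depth in meters of a point seen at pixel columns x_left and x_right."""
    disparity = x_left - x_right  # in pixels; bigger disparity = closer point
    return focal_px * baseline_m / disparity

# Two cameras 10 cm apart with an 800-pixel focal length observe one item.
print(depth_from_disparity(412.0, 380.0, focal_px=800.0, baseline_m=0.10))
# -> 2.5 (the item is about 2.5 meters from the cameras)
```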
Speaker 1: And are there other things, like weight? I mean, the arm is going to be picking things up. Like weight, or whether, you know, presumably there could be a shirt in a plastic bag, could be a box. Some things are rigid, some things are deformable.

Speaker 2: Yeah. So what we have found is that if you just have a visual understanding of the world that is as robust as a human's, you go a really long way. When I pick up a cup, I'm not doing a lot of calculations on how my fingers are placed, exactly what force gets translated to the cup to make sure it holds, right?

Speaker 1: It's part of the miracle of being a person, though, right? It's a really hard problem to pick something up.

Speaker 2: It's a very hard problem, but your brain subconsciously solves it for you. Your system-one thinking somewhat solves that.

Speaker 1: Without thinking. You don't have to think about it.

Speaker 2: Exactly. And you can imagine, when you do this... in fact, even if my fingers are numb, I can still do this perfectly, just because the brain acquires this intuitive understanding of interaction with the physical world so well that you can do it.

Speaker 1: So basically, vision gets you most of the way there.

Speaker 2: I would say vision, and then the ability to intuit physics from the visual input that you get.

Speaker 1: Tell me about that second one. "Intuit" is a wild word, though. I mean, "intuit" in this context means, sort of, make inferences? I mean... yeah, okay, yeah.

Speaker 2: By "intuit," I mean it's not doing some kind of detailed physics calculation.

Speaker 1: It's not doing math.

Speaker 2: It's not doing math. It's doing a kind of high-level pattern matching of: well, based on how these things look, this is likely going to be a successful way to approach the item and interact with it.
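Here is what "pattern matching, not physics" might look like in code: a hedged sketch of a small network that scores candidate grasp poses straight from pixels, with no force or contact calculation anywhere. This is purely illustrative, not Covariant's model; every layer size and name is an assumption.

```python
# A small convolutional net maps an image crop of an item to a score for
# each candidate grasp pose: pure pattern matching, no physics anywhere.
import torch
import torch.nn as nn

class GraspScorer(nn.Module):
    def __init__(self, num_grasp_candidates=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One predicted success probability per candidate grasp pose.
        self.head = nn.Linear(32, num_grasp_candidates)

    def forward(self, image):
        return torch.sigmoid(self.head(self.features(image)))

scorer = GraspScorer()
camera_crop = torch.rand(1, 3, 128, 128)  # stand-in for a real camera image
scores = scorer(camera_crop)
best = scores.argmax(dim=1)               # pick the most promising grasp
print(scores.shape, best)                 # torch.Size([1, 32]) and an index
```

Trained on outcomes (did the grasp succeed?), a model like this learns "how these things look" to "how to approach them," with the physics left implicit in the data.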
Speaker 1: What's next?

Speaker 2: What is next, immediately, is very exciting. We are now getting to a place, and by "we" I mean we as in the AI community, where we have enough computation power and algorithmic and modeling understanding to allow us to extract a lot out of data.

Speaker 1: So from any given amount of data, you can get more.

Speaker 2: You can get more out of it, right.

Speaker 1: It's exciting for you, because data is such a constraint on what you're trying to do.

Speaker 2: Exactly. And as we're building up this large robotic data set by tapping into a lot of these deployments, it gives us the ability to get even more out of the data sets that we're building, and allows us to build smarter and better robotics foundation models that perform better at the current tasks that they're supposed to do, and also power more robots.

Speaker 1: When people talk about concerns around AI, they often kind of jokingly use the phrase "killer robots," which is usually like a metaphor or something. But in your instance, because you are building robots, and because you are building, by design, a model that is supposed to be used for lots of different purposes, I can in fact very easily imagine killer-robot applications of your work. That seems like a very plausible thing someone could do with it. Is that something you think about, worry about?

Speaker 2: I would say, very fortunately, in the very near to medium term use cases, we are very safe, because all of these industrial robots are very much confined to the stations that they're designed into. Industrial robots are heavy machinery that is subject to regulations, and there are very careful design guidelines and compliance requirements for them. They are already safe by design.
Speaker 1: You're saying your model is built for robots, basically for robot arms. Is that essentially what the model you're building is, really a foundation model for robot arms that are built to just be in one place, pick things up, put them down, that sort of thing? It's not a foundation model that you could map onto a car or something, or even onto a robot that walks around. It wouldn't work for that.

Speaker 2: That would not be the near-term use case. Because the near-term use cases are more in this safe-by-construction setting, it allows us to not worry about that problem, and in fact, there's basically no way to misuse the technology in the way we deploy it. But I do agree with you: as we actually unleash this model to a set of use cases where these robots can actually interact with the world in a lot more freeform way, the safety considerations become a lot more important, and there's definitely a lot more work that needs to be done for that to be reliable.

Speaker 1: We'll be back in a minute with the lightning round.

Speaker 1: There is a lightning round that we're going to do now, for the end of the interview. What household chore do you wish that a robot could do?

Speaker 2: Cleaning up the kitchen. Yeah, I don't like the cleanup of it.

Speaker 1: That seems like a really hard one, like putting stuff away, basically. Wiping the counter, maybe less hard, but putting stuff away seems like a really hard job for a robot.

Speaker 2: It is. And these are the types of jobs that start to get to hardware limitations. These are the types of jobs where you probably do want a humanoid robot, something that kind of moves and conforms to the human standard of interacting with the world.
542 00:30:29,156 --> 00:30:31,796 Speaker 1: Because the kitchen is optimized, right, or you would have 543 00:30:31,836 --> 00:30:34,556 Speaker 1: to redesign the kitchen for a robot, and but then 544 00:30:34,756 --> 00:30:37,396 Speaker 1: that would suck because then you couldn't get your plates 545 00:30:37,396 --> 00:30:40,516 Speaker 1: because they'd be in some random spot or whatever. Okay, 546 00:30:40,556 --> 00:30:44,476 Speaker 1: so you left open ai in twenty seventeen. In the 547 00:30:44,516 --> 00:30:48,436 Speaker 1: past year, open ai became like this household word, GPT 548 00:30:48,596 --> 00:30:52,876 Speaker 1: became a household word. Were you surprised as a you know, 549 00:30:53,076 --> 00:30:56,516 Speaker 1: old school former open AI guy, were you surprised by 550 00:30:57,036 --> 00:31:00,236 Speaker 1: how how wild the world went for GPT or by 551 00:31:00,276 --> 00:31:01,516 Speaker 1: how good it was how soon? 552 00:31:03,516 --> 00:31:07,116 Speaker 2: I was definitely surprised by the speed of it. I 553 00:31:07,196 --> 00:31:10,076 Speaker 2: was surprised by the speed of bull of the technologists 554 00:31:10,116 --> 00:31:14,716 Speaker 2: development and the speed of adoption. But I was not 555 00:31:14,836 --> 00:31:18,156 Speaker 2: surprised by the fact that it could be dispig and 556 00:31:18,356 --> 00:31:19,236 Speaker 2: it could be bigger. 557 00:31:19,676 --> 00:31:22,676 Speaker 1: You know, when you were talking about sort of warehouse 558 00:31:22,796 --> 00:31:27,516 Speaker 1: and getting data from you know, picking and packing basically, 559 00:31:27,836 --> 00:31:31,116 Speaker 1: I thought, of course, as anyone would, of Amazon, and 560 00:31:31,156 --> 00:31:34,676 Speaker 1: I've read that they're working on some kind of robot 561 00:31:34,756 --> 00:31:36,876 Speaker 1: arm I feel like they would just have so much 562 00:31:37,836 --> 00:31:40,236 Speaker 1: data that they could gather if they wanted, just because 563 00:31:40,236 --> 00:31:41,916 Speaker 1: they're so big, they have so many warehouse I mean 564 00:31:41,916 --> 00:31:44,276 Speaker 1: the same way that say Google just gets tons of 565 00:31:44,436 --> 00:31:46,476 Speaker 1: data every day with every Google search and the way 566 00:31:46,516 --> 00:31:48,796 Speaker 1: people like I feel like that would be very hard 567 00:31:48,796 --> 00:31:50,436 Speaker 1: to compete with, But. 568 00:31:50,476 --> 00:31:52,196 Speaker 2: We also don't need to compete with them. They are 569 00:31:52,276 --> 00:31:56,516 Speaker 2: also a very large role that Amazon is not serving, 570 00:31:56,996 --> 00:31:59,676 Speaker 2: and there are a lot of customers that don't have 571 00:31:59,796 --> 00:32:05,276 Speaker 2: the same degree of engineering team data access as Amazon, and. 572 00:32:05,676 --> 00:32:07,596 Speaker 1: You could be the shop by of you could be 573 00:32:07,636 --> 00:32:09,116 Speaker 1: the Shopify of warehouse robots. 574 00:32:09,476 --> 00:32:12,596 Speaker 2: There are all of these people that still need help, 575 00:32:12,636 --> 00:32:14,356 Speaker 2: and we very gladly help them. 576 00:32:14,716 --> 00:32:17,276 Speaker 1: What was the first robot you personally built. 577 00:32:19,756 --> 00:32:21,356 Speaker 2: I think it's probably one of the first pick and 578 00:32:21,436 --> 00:32:23,116 Speaker 2: place robot e Covariant. 579 00:32:23,596 --> 00:32:25,396 Speaker 1: You didn't build them when you were kid there. 
Speaker 2: I didn't build them when I was a kid.

Speaker 1: What made you get into robots?

Speaker 2: I would say I'm an AI person first and a robot person second, and a big part of the interest in robotics is probably driven by my interest in AI. We have just not made as much progress for AI in the physical world as AI in the digital world, and to a large degree, I think we have to make progress there, because ultimately we live in the physical world. You're creating all this intelligence and these amazing things in the digital world; that is all great, but where's AI in the physical world? There's remarkably little progress there, despite how much AI has moved forward. And so to some degree, it's driven by a conviction that AI has to progress forward, and AI will have a large impact in the physical world.

Speaker 1: What do you understand about robots that most people don't?

Speaker 2: I think the most interesting thing about robots, that I would say we understand at Covariant and that maybe people outside of the company don't, is: making one robot work is obviously hard and fun, but making a lot of robots work at scale, for a lot of customers, takes a lot of operational discipline. It's about doing many, many things right. Before robots go into a facility, how should you prep the site so the robots actually work? Shipping robots at scale is a competency that requires a lot of operational excellence. And that is something that, when most people think about robots, they don't think about; they think about the sexy, interesting technology. They don't think about having to nail a thousand small steps well in order to have robots actually have an impact in the world at scale.

Speaker 1: That's the leap from the academic lab to being a real company selling real products in the world. Great. Anything else you want to talk about?
Speaker 2: No, I enjoyed this conversation.

Speaker 1: Yeah, likewise. Thank you. Peter Chen is the co-founder and CEO of Covariant. Today's show was produced by Edith Russolo and Gabriel Hunter Chang. It was edited by Karen Chakerji and engineered by Sarah Bugueer. You can email us at problem@pushkin.fm. I'm Jacob Goldstein, and we'll be back next week with another episode of What's Your Problem.