1 00:00:15,356 --> 00:00:24,916 Speaker 1: Pushkin. Humans are made of proteins. Proteins are key components 2 00:00:24,956 --> 00:00:28,716 Speaker 1: of our cells and of our muscles. Proteins regulate gene 3 00:00:28,756 --> 00:00:33,676 Speaker 1: expression and the immune system. And yet forever we had 4 00:00:34,036 --> 00:00:37,956 Speaker 1: no idea what most proteins look like, and this was 5 00:00:37,956 --> 00:00:41,756 Speaker 1: a problem. Every protein has a different shape, and to 6 00:00:41,916 --> 00:00:46,156 Speaker 1: understand how any particular protein works, how it interacts with 7 00:00:46,196 --> 00:00:49,636 Speaker 1: other molecules, how it keeps us healthy or causes disease, 8 00:00:50,236 --> 00:00:55,356 Speaker 1: it is very helpful to understand what that particular shape is, 9 00:00:56,276 --> 00:01:01,276 Speaker 1: but determining that complicated three dimensional shape was really hard. 10 00:01:01,436 --> 00:01:05,076 Speaker 1: Scientists sometimes spent years trying to determine the shape of 11 00:01:05,236 --> 00:01:10,316 Speaker 1: a single protein. So people started to dream. They thought, 12 00:01:10,796 --> 00:01:12,716 Speaker 1: what if we could come up with some kind of 13 00:01:12,756 --> 00:01:16,676 Speaker 1: a system, some way to use a protein's sequence of 14 00:01:16,716 --> 00:01:22,676 Speaker 1: amino acids to reliably predict that protein's unique three dimensional shape. 15 00:01:23,356 --> 00:01:25,836 Speaker 1: It would be a huge leap forward that could lead 16 00:01:25,876 --> 00:01:28,916 Speaker 1: to a much deeper understanding of biology and a new 17 00:01:28,916 --> 00:01:32,436 Speaker 1: wave of treatments for disease. This idea was called the 18 00:01:32,436 --> 00:01:36,876 Speaker 1: protein folding problem. But after decades of work, the best 19 00:01:37,036 --> 00:01:40,236 Speaker 1: protein folding models were nowhere near good enough to be 20 00:01:40,356 --> 00:01:45,156 Speaker 1: scientifically useful. And then in twenty twenty a group of 21 00:01:45,196 --> 00:01:48,156 Speaker 1: researchers built an AI model to try to tackle the 22 00:01:48,196 --> 00:01:51,836 Speaker 1: protein folding problem, and their model was so much better 23 00:01:51,916 --> 00:01:54,556 Speaker 1: than what had come before that some people thought they 24 00:01:54,556 --> 00:01:58,116 Speaker 1: were cheating. In fact, they were not cheating. They had 25 00:01:58,196 --> 00:02:02,676 Speaker 1: solved the protein folding problem. I'm Jacob Goldstein, and this 26 00:02:02,756 --> 00:02:04,796 Speaker 1: is What's Your Problem, the show where I talk to 27 00:02:04,876 --> 00:02:08,636 Speaker 1: people who are trying to make technological progress. My guest 28 00:02:08,636 --> 00:02:11,876 Speaker 1: today is Pushmeet Kohli. He's vice president of research 29 00:02:11,916 --> 00:02:15,116 Speaker 1: at DeepMind, an AI research group that's part of Google. 30 00:02:15,836 --> 00:02:17,916 Speaker 1: Pushmeet was part of the DeepMind team that 31 00:02:18,076 --> 00:02:21,636 Speaker 1: solved the protein folding problem.
They built an AI model 32 00:02:21,676 --> 00:02:25,196 Speaker 1: called AlphaFold, and AlphaFold is one of the 33 00:02:25,236 --> 00:02:28,916 Speaker 1: most impressive real world AI success stories that we have 34 00:02:28,996 --> 00:02:31,876 Speaker 1: seen so far, and as you'll hear in our conversation, 35 00:02:32,396 --> 00:02:37,156 Speaker 1: AlphaFold holds lessons for AI that go beyond protein folding. 36 00:02:37,876 --> 00:02:40,476 Speaker 1: One other thing you may hear in our conversation, by 37 00:02:40,516 --> 00:02:43,996 Speaker 1: the way, is the occasional background beep from Pushmeet's 38 00:02:44,076 --> 00:02:52,116 Speaker 1: smoke detector. Have you failed to change the battery in 39 00:02:52,196 --> 00:02:55,556 Speaker 1: your smoke detector? Is that perhaps what that little chirp was? 40 00:02:55,716 --> 00:02:57,876 Speaker 1: You really should do that for yourself as well. 41 00:02:57,756 --> 00:03:02,156 Speaker 2: I know, I mean, I was trying to 42 00:03:02,156 --> 00:03:05,916 Speaker 2: fix it before the recording, and like it's just so finicky, 43 00:03:05,996 --> 00:03:06,716 Speaker 2: it doesn't come out. 44 00:03:06,596 --> 00:03:11,556 Speaker 1: Okay, fair, fair, we'll live with it. It makes you real. 45 00:03:11,956 --> 00:03:15,476 Speaker 1: And so it is the case with protein folding that 46 00:03:15,556 --> 00:03:17,796 Speaker 1: it's not just like people publishing papers. Right, There was 47 00:03:17,836 --> 00:03:21,836 Speaker 1: actually this contest that was held, was it every couple 48 00:03:21,876 --> 00:03:26,356 Speaker 1: of years, of people trying to solve the protein folding problem? 49 00:03:26,396 --> 00:03:28,316 Speaker 1: And this had been going on for a long time. 50 00:03:28,836 --> 00:03:31,236 Speaker 1: And so is there a first moment when you compete 51 00:03:31,276 --> 00:03:32,156 Speaker 1: in this contest? 52 00:03:33,076 --> 00:03:35,516 Speaker 2: Yeah. So this is like an amazing thing about protein 53 00:03:35,516 --> 00:03:38,956 Speaker 2: structure prediction and how visionary the community was. Like they 54 00:03:38,956 --> 00:03:41,876 Speaker 2: had set up this amazing sort of foolproof way 55 00:03:41,916 --> 00:03:44,636 Speaker 2: to evaluate progress, because it's very easy sometimes 56 00:03:44,636 --> 00:03:47,636 Speaker 2: for scientists to fool themselves saying, oh, that's progress, 57 00:03:47,836 --> 00:03:50,476 Speaker 2: when there is no progress, right. So they had set up 58 00:03:50,516 --> 00:03:54,196 Speaker 2: this contest in a very remarkable way where they had said, well, 59 00:03:54,916 --> 00:03:59,876 Speaker 2: for a specific duration of time, any sort of scientists 60 00:03:59,916 --> 00:04:02,756 Speaker 2: all over the world who are discovering new structures would 61 00:04:02,836 --> 00:04:06,276 Speaker 2: not share them with the world. Instead, they will 62 00:04:06,276 --> 00:04:09,756 Speaker 2: basically send them to a secret vault. 63 00:04:09,876 --> 00:04:11,036 Speaker 1: I love a secret vault. 64 00:04:11,156 --> 00:04:13,676 Speaker 2: Yeah, yeah, I mean, not like a secret... 65 00:04:15,876 --> 00:04:17,516 Speaker 1: I wanted it to be a big vault with the wheel 66 00:04:17,596 --> 00:04:18,916 Speaker 1: you turn, but yeah, I understand.
67 00:04:19,836 --> 00:04:25,276 Speaker 2: Yeah, And so the idea there was that nobody except 68 00:04:25,396 --> 00:04:28,076 Speaker 2: the scientists who discovered the structure knew what is the 69 00:04:28,076 --> 00:04:29,476 Speaker 2: structure of this protein. 70 00:04:29,756 --> 00:04:33,236 Speaker 1: So it's a perfect way to test these models doing 71 00:04:33,276 --> 00:04:36,916 Speaker 1: the prediction, because somebody knows the answer. But the people 72 00:04:36,916 --> 00:04:39,276 Speaker 1: building the models, people like you, people like DeepMind, 73 00:04:39,356 --> 00:04:41,996 Speaker 1: don't actually know the answer. So you can't cheat, you 74 00:04:42,036 --> 00:04:43,996 Speaker 1: can't backfit it or anything like that. 75 00:04:43,916 --> 00:04:46,476 Speaker 2: Exactly, right. So you don't know how good you 76 00:04:46,516 --> 00:04:49,796 Speaker 2: are, because like you've been training on known examples and 77 00:04:49,836 --> 00:04:52,876 Speaker 2: you've been evaluating them on known examples. But when you 78 00:04:52,996 --> 00:04:55,796 Speaker 2: are tested, you are tested on these amazingly new things 79 00:04:55,876 --> 00:04:57,356 Speaker 2: that nobody has seen before. 80 00:04:57,556 --> 00:05:00,836 Speaker 1: Yeah. Yeah, so okay. And they actually have like a 81 00:05:00,956 --> 00:05:05,996 Speaker 1: numeric score that they assign to everybody's model, right, yeah, 82 00:05:06,036 --> 00:05:09,276 Speaker 1: Like it's very quantitative and it's not just like good 83 00:05:09,396 --> 00:05:12,316 Speaker 1: or pretty good. It's a number, right, And what is it? 84 00:05:12,476 --> 00:05:13,436 Speaker 1: Zero to one hundred? Is that the scale? 85 00:05:13,476 --> 00:05:15,956 Speaker 2: Yeah, it's zero to one hundred, Like 86 00:05:15,996 --> 00:05:17,676 Speaker 2: that's the sort of scale. And if you look at 87 00:05:17,996 --> 00:05:20,596 Speaker 2: progress in the last sort of twenty years before 88 00:05:20,596 --> 00:05:23,356 Speaker 2: AlphaFold 1 was launched. I mean it was somewhere sort 89 00:05:23,396 --> 00:05:27,076 Speaker 2: of between the twenty five to forty sort of GDT 90 00:05:27,236 --> 00:05:27,556 Speaker 2: sort of. 91 00:05:27,556 --> 00:05:30,516 Speaker 1: Twenty five to forty. Was it getting better slowly or was 92 00:05:30,556 --> 00:05:33,276 Speaker 1: it just kind of stuck in the thirties more or less? 93 00:05:33,556 --> 00:05:35,996 Speaker 2: Yeah, it was stagnating. It was like sometimes it would 94 00:05:36,036 --> 00:05:38,596 Speaker 2: go to thirty eight, sometimes thirty five, and it was 95 00:05:38,636 --> 00:05:40,676 Speaker 2: like in that, so it was going up and down, 96 00:05:40,756 --> 00:05:42,596 Speaker 2: up and down. There was no sort of remarkable sort 97 00:05:42,596 --> 00:05:43,116 Speaker 2: of breakthrough. 98 00:05:43,596 --> 00:05:46,116 Speaker 1: And was AI the only way 99 00:05:46,156 --> 00:05:48,316 Speaker 1: people were trying to solve it? Were there whole 100 00:05:48,396 --> 00:05:51,356 Speaker 1: other sorts of things people were thinking about trying to do?
101 00:05:51,996 --> 00:05:56,036 Speaker 2: Yeah, so these were mostly not AI based solutions, right. 102 00:05:56,076 --> 00:06:01,516 Speaker 2: These were sort of very well designed, hand designed systems 103 00:06:01,596 --> 00:06:05,276 Speaker 2: that were carefully tuned to the problem over many, many 104 00:06:05,356 --> 00:06:08,756 Speaker 2: decades, with large teams working together and so on. But 105 00:06:09,116 --> 00:06:10,396 Speaker 2: there was a little machine learning. 106 00:06:11,116 --> 00:06:14,236 Speaker 1: So they'd been scoring in the thirties, more or less. 107 00:06:14,796 --> 00:06:17,596 Speaker 1: And then what year is it that you 108 00:06:17,676 --> 00:06:19,516 Speaker 1: and DeepMind show up with AlphaFold 1? 109 00:06:20,036 --> 00:06:25,276 Speaker 2: So twenty eighteen. So the contest, actually it runs in 110 00:06:25,316 --> 00:06:26,356 Speaker 2: the summer, and 111 00:06:26,276 --> 00:06:28,396 Speaker 1: then at the end of the summer they sort of 112 00:06:28,516 --> 00:06:30,436 Speaker 1: give you the results, or what? 113 00:06:30,436 --> 00:06:31,996 Speaker 2: So at the end of the summer, I mean, like by, 114 00:06:32,076 --> 00:06:33,956 Speaker 2: I think, July or August, they have sent the 115 00:06:34,036 --> 00:06:37,796 Speaker 2: last files and then you have sent them the results, 116 00:06:38,116 --> 00:06:40,156 Speaker 2: and then you wait, right, and you don't know what 117 00:06:40,636 --> 00:06:44,716 Speaker 2: has happened. Then they invite you to a conference which 118 00:06:45,076 --> 00:06:48,236 Speaker 2: happens in December, so you are like eagerly waiting, what's 119 00:06:48,276 --> 00:06:51,316 Speaker 2: what has happened? And we're like, oh, maybe we came last, 120 00:06:51,836 --> 00:06:55,196 Speaker 2: maybe we're in the middle. And then they actually 121 00:06:55,236 --> 00:06:58,756 Speaker 2: reveal the leaderboard or the scores at the conference. 122 00:06:59,196 --> 00:07:00,796 Speaker 1: Where were you at the time? 123 00:07:01,996 --> 00:07:03,716 Speaker 2: I was in London, I was in the office. I 124 00:07:03,756 --> 00:07:06,196 Speaker 2: was like really waiting, like trying to figure out, like, 125 00:07:06,236 --> 00:07:08,796 Speaker 2: where were we, right, in terms of the ranking? 126 00:07:09,276 --> 00:07:12,996 Speaker 2: How did we perform? And we get an email from 127 00:07:13,396 --> 00:07:16,516 Speaker 2: the organizers one day before the results are about to 128 00:07:16,596 --> 00:07:20,556 Speaker 2: be announced, and they say, well, you are first, and by 129 00:07:20,596 --> 00:07:23,276 Speaker 2: a big margin. So like from thirty to forty we 130 00:07:23,316 --> 00:07:25,076 Speaker 2: had gone to more than sixty. 131 00:07:25,636 --> 00:07:28,356 Speaker 1: You did way better at predicting the structure of proteins 132 00:07:28,396 --> 00:07:30,236 Speaker 1: than anyone had ever done, and you won by a lot. 133 00:07:30,236 --> 00:07:32,796 Speaker 2: Yes, we won by a lot, yes, we won by 134 00:07:32,836 --> 00:07:33,596 Speaker 2: a lot. 135 00:07:33,796 --> 00:07:36,996 Speaker 1: So that's way better than anyone has ever done. But 136 00:07:37,076 --> 00:07:39,916 Speaker 1: does it mean that you're basically half right? Is that 137 00:07:39,956 --> 00:07:41,116 Speaker 1: what that number means?
138 00:07:41,396 --> 00:07:44,316 Speaker 2: Yeah, so I think, I mean, we 139 00:07:44,396 --> 00:07:46,156 Speaker 2: are, you're the best in the world, but still your 140 00:07:46,156 --> 00:07:50,796 Speaker 2: predictions are pretty much sort of not very useful. 141 00:07:51,636 --> 00:07:53,756 Speaker 2: Like, if you're trying to figure out whether a 142 00:07:53,876 --> 00:07:57,556 Speaker 2: drug binds to this particular protein, the error 143 00:07:57,636 --> 00:08:01,396 Speaker 2: is so much, right, that you wouldn't get a complete 144 00:08:01,436 --> 00:08:02,436 Speaker 2: picture of the protein. 145 00:08:02,836 --> 00:08:07,156 Speaker 1: So this sixty number, it's like good in that it's 146 00:08:07,156 --> 00:08:09,236 Speaker 1: better than anyone has ever done. It's bad in that 147 00:08:09,276 --> 00:08:12,516 Speaker 1: it's not scientifically useful. What number do you have to 148 00:08:12,556 --> 00:08:14,356 Speaker 1: get to to be scientifically useful? 149 00:08:14,836 --> 00:08:18,836 Speaker 2: Like, between eighty five to ninety. That's what people told 150 00:08:18,916 --> 00:08:21,716 Speaker 2: us, that if you get beyond eighty five to ninety, 151 00:08:22,276 --> 00:08:23,636 Speaker 2: then the problem is solved. 152 00:08:23,836 --> 00:08:27,836 Speaker 1: So what do you decide when you get 153 00:08:27,836 --> 00:08:28,356 Speaker 1: this result? 154 00:08:29,756 --> 00:08:32,276 Speaker 2: So we get this result and we're like, yeah, this 155 00:08:32,396 --> 00:08:35,276 Speaker 2: is amazing, right, that we are the best in the world, 156 00:08:35,356 --> 00:08:37,556 Speaker 2: right, by a big margin, right. So like the thesis 157 00:08:37,596 --> 00:08:40,716 Speaker 2: that machine learning sort of will advance science, oh, that's great, 158 00:08:41,036 --> 00:08:43,956 Speaker 2: but the problem is not solved. And let's go back 159 00:08:43,996 --> 00:08:47,996 Speaker 2: to the drawing board. And now, with the information that 160 00:08:48,036 --> 00:08:49,636 Speaker 2: we have and the amount of time we have spent 161 00:08:49,756 --> 00:08:52,356 Speaker 2: on this previous architecture, do we still think that this 162 00:08:52,396 --> 00:08:56,396 Speaker 2: will lead us to where we want to go? And 163 00:08:56,556 --> 00:09:00,396 Speaker 2: the team thought, no, we need to, yeah, we 164 00:09:00,636 --> 00:09:02,356 Speaker 2: need to completely start from scratch. 165 00:09:02,876 --> 00:09:06,356 Speaker 1: So your reaction to winning this contest and doing 166 00:09:06,396 --> 00:09:09,596 Speaker 1: better than anyone has ever done at predicting protein folding 167 00:09:09,676 --> 00:09:12,436 Speaker 1: is, let's blow up this thing that just won the contest. 168 00:09:13,036 --> 00:09:17,796 Speaker 2: Yeah, throw it away. Yeah, we were like, the basic 169 00:09:17,876 --> 00:09:21,356 Speaker 2: premise was proven, that machine learning has a role to play, right. 170 00:09:21,396 --> 00:09:23,596 Speaker 2: So that gave us a lot of confidence. But at 171 00:09:23,596 --> 00:09:25,636 Speaker 2: the same time we thought, well, this is not 172 00:09:25,676 --> 00:09:28,756 Speaker 2: an elegant solution, right. This 173 00:09:28,836 --> 00:09:31,916 Speaker 2: is like two modules. Like, there's 174 00:09:32,076 --> 00:09:34,556 Speaker 2: a machine learning module.
It is making these sort 175 00:09:34,596 --> 00:09:37,036 Speaker 2: of predictions which this other module is sort of trying 176 00:09:37,076 --> 00:09:39,436 Speaker 2: to use. If you believe in the power of machine learning, 177 00:09:39,516 --> 00:09:41,476 Speaker 2: let's do end to end, right? Let's do end to 178 00:09:41,596 --> 00:09:45,996 Speaker 2: end and basically do everything so that the model takes 179 00:09:45,996 --> 00:09:47,276 Speaker 2: care of it, right? 180 00:09:47,116 --> 00:09:50,516 Speaker 1: Just to be clear, what was happening with that initial model that 181 00:09:50,596 --> 00:09:53,076 Speaker 1: you were deciding to abandon? 182 00:09:53,316 --> 00:09:56,996 Speaker 2: That was basically using the machine learning model together with 183 00:09:57,636 --> 00:10:01,556 Speaker 2: sort of a known framework, right. There is a 184 00:10:01,636 --> 00:10:04,316 Speaker 2: second step that was a conventional sort of step. 185 00:10:04,996 --> 00:10:07,676 Speaker 1: Oh, I see. So it's like you weren't all in 186 00:10:07,716 --> 00:10:10,196 Speaker 1: on machine learning. You were like, well, we're gonna use 187 00:10:10,236 --> 00:10:12,356 Speaker 1: machine learning, but we're still gonna do this kind of 188 00:10:12,396 --> 00:10:14,276 Speaker 1: the way other people have been doing it. And your 189 00:10:14,276 --> 00:10:18,716 Speaker 1: response to that first result was, screw it, let's not 190 00:10:18,756 --> 00:10:21,156 Speaker 1: do anything like everybody has done before. Let's just go all 191 00:10:21,196 --> 00:10:26,836 Speaker 1: in on machine learning, beginning to end. Exactly. Interesting. And 192 00:10:26,996 --> 00:10:29,556 Speaker 1: so you do that, and you spend, what, two years? 193 00:10:29,636 --> 00:10:31,436 Speaker 1: Is it two years between? Do I have that right? 194 00:10:31,516 --> 00:10:31,716 Speaker 2: Yeah. 195 00:10:31,916 --> 00:10:36,436 Speaker 1: Yeah, and then you come back in twenty twenty. You 196 00:10:36,476 --> 00:10:40,396 Speaker 1: come back in twenty twenty, there's another one of these contests. Yeah, 197 00:10:40,516 --> 00:10:42,796 Speaker 1: you got your new end-to-end machine learning model. 198 00:10:42,956 --> 00:10:46,436 Speaker 2: Yeah. So it was the pandemic. So this is twenty twenty, right. 199 00:10:46,276 --> 00:10:47,916 Speaker 1: So nobody's gone anywhere. 200 00:10:48,036 --> 00:10:51,716 Speaker 2: Yeah, nobody's going anywhere, right. You know, twenty nineteen 201 00:10:52,476 --> 00:10:55,556 Speaker 2: was basically where we started working with this new model, 202 00:10:55,756 --> 00:10:58,756 Speaker 2: and it was really tough going, because we were 203 00:10:58,836 --> 00:11:02,476 Speaker 2: starting from twenty, right? So we were 204 00:11:02,516 --> 00:11:05,836 Speaker 2: at sixty, and now we are starting from twenty, and then 205 00:11:06,036 --> 00:11:08,956 Speaker 2: thirty, forty. And sometimes you would stagnate at forty five, 206 00:11:09,236 --> 00:11:12,396 Speaker 2: fifty, and you were like, really, should I, should I have 207 00:11:12,436 --> 00:11:13,276 Speaker 2: thrown away that earlier model? 208 00:11:13,756 --> 00:11:14,076 Speaker 1: Yeah.
209 00:11:14,236 --> 00:11:17,396 Speaker 2: Yeah. So by the end of twenty nineteen 210 00:11:17,636 --> 00:11:20,236 Speaker 2: we started getting some really, really cool results and we thought, okay, 211 00:11:20,276 --> 00:11:23,236 Speaker 2: now we have definitely surpassed the previous model. 212 00:11:23,756 --> 00:11:27,876 Speaker 2: We're in good territory. And we were very excited. Like 213 00:11:27,916 --> 00:11:31,036 Speaker 2: at the start of twenty twenty, we were like, yeah, 214 00:11:31,436 --> 00:11:34,516 Speaker 2: making progress, and then the pandemic hit. 215 00:11:38,436 --> 00:11:42,036 Speaker 1: In a minute, the model gets an unanticipated test in 216 00:11:42,076 --> 00:11:42,836 Speaker 1: the real world. 217 00:11:47,036 --> 00:11:51,036 Speaker 2: There was this new virus that was reported, SARS-CoV-2, 218 00:11:51,556 --> 00:11:54,276 Speaker 2: and one of the first things, somebody sort of 219 00:11:54,556 --> 00:11:57,436 Speaker 2: figured out the structure of the spike protein. It was 220 00:11:57,476 --> 00:12:00,116 Speaker 2: all over the newspapers, like, here's the spike protein of 221 00:12:00,116 --> 00:12:03,716 Speaker 2: this new virus. But all the other proteins of the virus, 222 00:12:03,876 --> 00:12:09,036 Speaker 2: the accessory proteins, nobody knew the structure. So the first 223 00:12:09,356 --> 00:12:11,836 Speaker 2: thing we did, we thought, we think we have 224 00:12:11,916 --> 00:12:14,516 Speaker 2: the best model in the world. We should be making 225 00:12:14,556 --> 00:12:17,316 Speaker 2: these predictions and sharing them with the world. But is 226 00:12:17,316 --> 00:12:20,196 Speaker 2: this the right thing to do? So we spent a 227 00:12:20,196 --> 00:12:24,276 Speaker 2: lot of time reaching out to biologists who looked at 228 00:12:24,316 --> 00:12:27,036 Speaker 2: the prediction and said, well, you need to share this, 229 00:12:27,716 --> 00:12:29,556 Speaker 2: you need to share this with the world. So the 230 00:12:29,596 --> 00:12:32,756 Speaker 2: start of twenty twenty was us sort of sharing the 231 00:12:32,796 --> 00:12:36,756 Speaker 2: predictions from this untested model with the world, because we 232 00:12:37,036 --> 00:12:41,396 Speaker 2: thought they were quite good. And then throughout twenty twenty 233 00:12:41,476 --> 00:12:44,396 Speaker 2: we took part in the assessment, right, which ran in 234 00:12:44,396 --> 00:12:46,116 Speaker 2: the summer of twenty twenty, the contest. 235 00:12:46,196 --> 00:12:47,356 Speaker 1: In the contest. 236 00:12:47,476 --> 00:12:50,996 Speaker 2: Exactly. Normally, right, the organizers don't come back to you. 237 00:12:51,396 --> 00:12:55,436 Speaker 2: They just release the results at the end, in December. 238 00:12:56,196 --> 00:12:58,596 Speaker 2: And at the end of the summer we get this 239 00:12:58,636 --> 00:13:01,796 Speaker 2: funny email saying, we want to talk to you, and 240 00:13:01,916 --> 00:13:06,076 Speaker 2: so we were like, yeah, like, did we do anything bad? 241 00:13:06,476 --> 00:13:09,956 Speaker 2: What happened? And a few of them really had sort 242 00:13:09,956 --> 00:13:14,276 Speaker 2: of suspicions. They were like, you must have cheated, right? 243 00:13:14,556 --> 00:13:19,116 Speaker 2: Like, your level of performance is 244 00:13:19,156 --> 00:13:22,316 Speaker 2: nowhere close to anything that we have seen ever.
Right. 245 00:13:23,036 --> 00:13:27,276 Speaker 2: But a few scientists in that contest had submitted a 246 00:13:27,316 --> 00:13:31,396 Speaker 2: sequence, a protein whose structure was not known. They were 247 00:13:31,676 --> 00:13:34,476 Speaker 2: expecting that the structure would be known by the time 248 00:13:34,516 --> 00:13:37,556 Speaker 2: the contest ended, so they'd be able to evaluate the predictions. 249 00:13:37,636 --> 00:13:39,676 Speaker 2: But that structure was not known; in fact, they 250 00:13:39,716 --> 00:13:41,116 Speaker 2: couldn't find the structure. 251 00:13:41,356 --> 00:13:44,036 Speaker 1: So you're saying it would be impossible to cheat because 252 00:13:44,076 --> 00:13:47,556 Speaker 1: literally no human knew the structure. No way to cheat, 253 00:13:47,836 --> 00:13:48,956 Speaker 1: nobody knows the answer. 254 00:13:49,156 --> 00:13:53,836 Speaker 2: Yeah, yeah. So they used the prediction of AlphaFold 255 00:13:54,676 --> 00:13:59,516 Speaker 2: and then tried to explain their experimental data, and it matched. 256 00:14:00,556 --> 00:14:03,076 Speaker 2: And they were like, this model has been able to 257 00:14:03,116 --> 00:14:07,516 Speaker 2: discover something that nobody knew, that no scientist knew. 258 00:14:09,276 --> 00:14:12,996 Speaker 2: In a sense, the model had already made new biological discoveries even 259 00:14:13,276 --> 00:14:14,196 Speaker 2: before we knew it. 260 00:14:14,716 --> 00:14:18,996 Speaker 1: Yeah, yeah, okay, so that's good. You're not in trouble anymore. 261 00:14:20,076 --> 00:14:22,956 Speaker 1: It's clear you didn't cheat. Do they say the number? 262 00:14:22,996 --> 00:14:25,596 Speaker 1: What's the number? I'm waiting for the number. How'd you do? 263 00:14:26,116 --> 00:14:28,716 Speaker 2: Yeah, so we were beyond eighty five to ninety, right, 264 00:14:28,756 --> 00:14:31,636 Speaker 2: and then they basically said, okay, we have to announce 265 00:14:31,676 --> 00:14:35,796 Speaker 2: it to the world. And so come December, that was 266 00:14:35,876 --> 00:14:38,796 Speaker 2: the announcement that was made by the organizers, that 267 00:14:39,196 --> 00:14:43,156 Speaker 2: AlphaFold 2 had solved the protein structure prediction problem. 268 00:14:43,556 --> 00:14:46,156 Speaker 1: So is that contest done now? Did you just end 269 00:14:46,236 --> 00:14:48,556 Speaker 1: that contest? Is nobody doing that anymore? 270 00:14:48,956 --> 00:14:52,636 Speaker 2: No, the contest is sort of alive, right. It has changed, 271 00:14:52,836 --> 00:14:56,396 Speaker 2: its focus has changed. So what AlphaFold 2 272 00:14:56,476 --> 00:15:00,596 Speaker 2: did was find the structure of these single proteins. But 273 00:15:00,636 --> 00:15:03,236 Speaker 2: there are many other problems that remain, right? How do 274 00:15:03,556 --> 00:15:09,116 Speaker 2: multiple proteins interact, for instance? Right, so there are other structure prediction 275 00:15:09,156 --> 00:15:12,796 Speaker 2: problems that the contest has sort of evolved to, right. 276 00:15:12,796 --> 00:15:15,516 Speaker 2: It is sort of focusing on other types of problems 277 00:15:15,556 --> 00:15:18,116 Speaker 2: that AlphaFold 2 did not address.
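A quick aside for readers on the number this conversation keeps returning to: the zero to one hundred score in this contest (CASP, the Critical Assessment of Structure Prediction) is a GDT score, which rewards how many of a protein's backbone atoms land close to their experimentally determined positions. Here is a minimal Python sketch of the GDT_TS variant, offered as an illustration only: it assumes the predicted and experimental structures are already superposed, whereas the real assessment searches over superpositions to maximize each term, so this is a simplification rather than the assessors' actual code.

```python
import numpy as np

def gdt_ts(pred_ca: np.ndarray, true_ca: np.ndarray) -> float:
    """Simplified GDT_TS on a 0-100 scale.

    Both arguments are (N, 3) arrays of C-alpha coordinates in angstroms,
    assumed already superposed; the real CASP metric maximizes each term
    over rigid-body superpositions.
    """
    # Per-residue distance between predicted and experimental positions.
    dists = np.linalg.norm(pred_ca - true_ca, axis=1)
    # Fraction of residues within each distance cutoff, then the average.
    fractions = [(dists <= cutoff).mean() for cutoff in (1.0, 2.0, 4.0, 8.0)]
    return 100.0 * float(np.mean(fractions))

# Toy usage: a 100-residue chain with a couple of angstroms of error.
rng = np.random.default_rng(0)
true_ca = rng.normal(size=(100, 3)) * 10.0
pred_ca = true_ca + rng.normal(scale=1.5, size=(100, 3))
print(f"GDT_TS ~ {gdt_ts(pred_ca, true_ca):.1f}")
```

On this scale, the pre-AlphaFold methods hovered around twenty five to forty, AlphaFold 1 scored more than sixty, and the eighty five to ninety threshold mentioned above is where predictions become reliable enough to be scientifically useful.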
278 00:15:18,516 --> 00:15:21,796 Speaker 1: If we zoom out and think about what you have done, 279 00:15:21,836 --> 00:15:24,556 Speaker 1: what the team has done in, you know, using machine 280 00:15:24,636 --> 00:15:28,236 Speaker 1: learning to solve this scientific problem that people had been 281 00:15:28,236 --> 00:15:30,836 Speaker 1: working on for a long time, like, what are the 282 00:15:30,916 --> 00:15:34,316 Speaker 1: broader lessons? Like, if we think about other domains, what 283 00:15:34,796 --> 00:15:37,396 Speaker 1: can we infer, what can we take from this? 284 00:15:38,636 --> 00:15:40,476 Speaker 2: Yeah, so I think the thing that we can take 285 00:15:40,516 --> 00:15:44,596 Speaker 2: from this is basically, science is sort of generating a 286 00:15:44,636 --> 00:15:47,356 Speaker 2: lot of data across any domain that you see, right, 287 00:15:47,396 --> 00:15:51,476 Speaker 2: whether it's genomics or high-energy physics. The amount of data 288 00:15:51,516 --> 00:15:54,236 Speaker 2: that we are gathering about the world is much more 289 00:15:54,276 --> 00:15:56,156 Speaker 2: than any single human mind can comprehend. 290 00:15:56,396 --> 00:15:56,516 Speaker 1: Right. 291 00:15:56,556 --> 00:15:58,316 Speaker 2: You can have the best scientists and they will not 292 00:15:58,356 --> 00:16:00,436 Speaker 2: be able to sort of go through all the data 293 00:16:00,476 --> 00:16:03,716 Speaker 2: that we are collecting about our world. So machine learning 294 00:16:03,756 --> 00:16:06,116 Speaker 2: is this remarkable sort of tool which gives us the 295 00:16:06,196 --> 00:16:09,396 Speaker 2: ability to make sense of and leverage this data, right, and 296 00:16:10,036 --> 00:16:12,596 Speaker 2: really sort of puts us on the path of accelerating our 297 00:16:12,676 --> 00:16:15,356 Speaker 2: understanding of the problems that we're dealing with. 298 00:16:16,076 --> 00:16:18,476 Speaker 1: In the case of AlphaFold, was the sort of 299 00:16:18,556 --> 00:16:24,156 Speaker 1: input data the known protein structures and amino acid sequences? 300 00:16:24,156 --> 00:16:27,236 Speaker 1: Was that the basic training data? 301 00:16:27,276 --> 00:16:30,676 Speaker 2: Exactly right. So it was the PDB, the Protein Data Bank, 302 00:16:31,196 --> 00:16:34,756 Speaker 2: and that had been collected by the community for many, 303 00:16:34,796 --> 00:16:39,356 Speaker 2: many years, right, over many decades. They have meticulously, carefully 304 00:16:39,396 --> 00:16:43,356 Speaker 2: deposited all the protein sequences and the corresponding structures that 305 00:16:43,396 --> 00:16:46,316 Speaker 2: were discovered, right. And it had one hundred and fifty 306 00:16:46,396 --> 00:16:49,836 Speaker 2: thousand examples at that time, right, sequences as well as structures, 307 00:16:50,196 --> 00:16:53,756 Speaker 2: and everyone had access to the same data, right. All 308 00:16:53,756 --> 00:16:57,436 Speaker 2: the teams were training on that data. 309 00:16:57,636 --> 00:17:00,636 Speaker 1: Is it right that AlphaFold itself is open sourced, 310 00:17:00,716 --> 00:17:04,716 Speaker 1: and that there's this open source database of protein structures 311 00:17:04,756 --> 00:17:06,916 Speaker 1: that have been discovered with AlphaFold? Is that right? 312 00:17:07,716 --> 00:17:10,996 Speaker 2: Yeah. So when we sort of developed AlphaFold, we 313 00:17:11,116 --> 00:17:14,196 Speaker 2: made it available to the world.
But we then said, well, 314 00:17:14,556 --> 00:17:17,276 Speaker 2: it's so accurate, but it's also so fast, that we 315 00:17:17,356 --> 00:17:20,876 Speaker 2: will use it to find the structure for every sort 316 00:17:20,916 --> 00:17:24,396 Speaker 2: of known protein. And then we made all those structures 317 00:17:24,436 --> 00:17:25,796 Speaker 2: available to the world. 318 00:17:30,076 --> 00:17:33,036 Speaker 1: AlphaFold has now made the structures of roughly two 319 00:17:33,196 --> 00:17:38,836 Speaker 1: hundred and fifty million different proteins publicly available. We'll be 320 00:17:38,876 --> 00:17:45,076 Speaker 1: back in a minute with the lightning round. Last thing 321 00:17:45,316 --> 00:17:48,956 Speaker 1: is a lightning round. Just some fast questions, okay, and 322 00:17:48,996 --> 00:17:51,276 Speaker 1: then we'll be done. What's your favorite protein? 323 00:17:51,836 --> 00:17:52,516 Speaker 2: Hemoglobin. 324 00:17:53,396 --> 00:17:54,436 Speaker 1: Why? 325 00:17:55,156 --> 00:17:57,036 Speaker 2: It is sort of very pleasant to look at. It 326 00:17:56,876 --> 00:17:58,916 Speaker 2: is very symmetric, and you can see 327 00:17:58,996 --> 00:18:01,996 Speaker 2: its purpose, right, where the oxygen binds into 328 00:18:02,036 --> 00:18:03,476 Speaker 2: it. It's a very clean protein. 329 00:18:04,116 --> 00:18:07,396 Speaker 1: It's so easy to understand. It's the little thing that 330 00:18:07,516 --> 00:18:10,916 Speaker 1: carries oxygen around your body. If everything goes well, what 331 00:18:10,996 --> 00:18:13,596 Speaker 1: problem will you be trying to solve in, say, five years? 332 00:18:14,756 --> 00:18:19,556 Speaker 2: Really sort of thinking about the two big challenges sort 333 00:18:19,596 --> 00:18:21,876 Speaker 2: of that humanity is facing. One is the pandemic, the 334 00:18:21,956 --> 00:18:24,596 Speaker 2: other is climate change. And I think material science and 335 00:18:24,676 --> 00:18:28,956 Speaker 2: quantum chemistry can impact both, but especially climate change. And 336 00:18:28,996 --> 00:18:31,516 Speaker 2: I think this is something that requires a lot of work. 337 00:18:32,676 --> 00:18:37,036 Speaker 1: Is there some particular problem in that domain that is 338 00:18:37,076 --> 00:18:40,556 Speaker 1: analogous to protein folding? Is there some hard thing that 339 00:18:40,596 --> 00:18:41,676 Speaker 1: you want to figure out? 340 00:18:41,396 --> 00:18:45,516 Speaker 2: Rational material design. We are very far from there. 341 00:18:45,676 --> 00:18:49,476 Speaker 2: We are still basically doing experimental stuff when we think 342 00:18:49,516 --> 00:18:51,916 Speaker 2: about discovering new materials. 343 00:18:53,236 --> 00:18:56,316 Speaker 1: What do you understand about AI or machine learning that 344 00:18:56,436 --> 00:18:58,236 Speaker 1: most people don't understand? 345 00:18:59,596 --> 00:19:03,316 Speaker 2: I think, sort of, AI is not magic, right. 346 00:19:03,356 --> 00:19:07,676 Speaker 2: Essentially, it's a series of techniques which are 347 00:19:07,756 --> 00:19:12,716 Speaker 2: able to extract intelligence. But you extract intelligence from the 348 00:19:12,836 --> 00:19:16,996 Speaker 2: raw material, right? So garbage in, garbage out. So 349 00:19:17,676 --> 00:19:21,356 Speaker 2: what is really important is that the experience needs to 350 00:19:21,396 --> 00:19:25,036 Speaker 2: be rich enough.
Right. We don't become 351 00:19:25,076 --> 00:19:27,676 Speaker 2: intelligent by sitting in a room, right. We become intelligent 352 00:19:27,676 --> 00:19:32,356 Speaker 2: because we have amazing experiences. So it's not big data, right, 353 00:19:32,396 --> 00:19:34,876 Speaker 2: it's not the bigness of the experience, but it's like 354 00:19:35,116 --> 00:19:38,516 Speaker 2: the goodness of the experience, like the wide variety of 355 00:19:38,796 --> 00:19:41,196 Speaker 2: sort of things that you train on and the things 356 00:19:41,196 --> 00:19:44,596 Speaker 2: that you see. So I think that's 357 00:19:44,636 --> 00:19:45,436 Speaker 2: really important. 358 00:19:46,316 --> 00:19:51,116 Speaker 1: That thought leads you to, like, the optimal training data. 359 00:19:51,196 --> 00:19:53,316 Speaker 1: So is the worry that people are making a 360 00:19:53,356 --> 00:19:55,836 Speaker 1: mistake by just using a lot of the same kind 361 00:19:55,916 --> 00:19:56,716 Speaker 1: of training data? 362 00:19:57,676 --> 00:20:01,156 Speaker 2: Yeah, exactly, exactly right. So if you just take one example 363 00:20:01,236 --> 00:20:04,916 Speaker 2: and repeat it multiple times, right, that's not great. 364 00:20:05,156 --> 00:20:07,436 Speaker 2: You don't become wise doing the 365 00:20:07,476 --> 00:20:09,036 Speaker 2: same thing again and again and again. 366 00:20:09,196 --> 00:20:11,996 Speaker 1: Right. What are you actually working on right now? Like, 367 00:20:11,996 --> 00:20:14,156 Speaker 1: what are you going to go work on today or 368 00:20:14,276 --> 00:20:14,876 Speaker 1: next week? 369 00:20:16,116 --> 00:20:21,716 Speaker 2: So there is a system that my team developed called SynthID, 370 00:20:21,956 --> 00:20:26,036 Speaker 2: which is a system for watermarking AI generated content. So 371 00:20:26,076 --> 00:20:28,876 Speaker 2: we want to be able to detect it. When you 372 00:20:28,916 --> 00:20:31,556 Speaker 2: have AI generated content, users should be able to detect 373 00:20:31,636 --> 00:20:33,156 Speaker 2: that this is AI generated. 374 00:20:33,316 --> 00:20:39,276 Speaker 1: AI generated content, whether it's images or words or whatever, text, video. 375 00:20:39,716 --> 00:20:46,716 Speaker 2: Exactly, exactly. You embed this imperceptible thing within the thing 376 00:20:46,756 --> 00:20:49,196 Speaker 2: that is generated, that a human might not see. 377 00:20:49,436 --> 00:20:52,716 Speaker 1: So the builder of the AI model, OpenAI, could 378 00:20:52,796 --> 00:20:57,076 Speaker 1: choose to embed a watermark in GPT, so that anybody 379 00:20:57,076 --> 00:21:00,356 Speaker 1: who made a thing with GPT, that document would have 380 00:21:00,436 --> 00:21:03,716 Speaker 1: some hidden sign that it was AI generated. It's sort 381 00:21:03,716 --> 00:21:07,396 Speaker 1: of the choice of the model builders. Yeah, thank 382 00:21:07,436 --> 00:21:09,236 Speaker 1: you very much for your time. It was great to 383 00:21:09,276 --> 00:21:09,676 Speaker 1: talk with you. 384 00:21:10,516 --> 00:21:12,556 Speaker 2: Yeah, thank you. It was a pleasure. 385 00:21:18,916 --> 00:21:22,916 Speaker 1: Pushmeet Kohli is vice president of research at Google DeepMind. 386 00:21:23,596 --> 00:21:26,916 Speaker 1: Today's show was produced by Edith Russello and edited by 387 00:21:27,036 --> 00:21:31,676 Speaker 1: Karen Chakerje.
You can email us at problem at Pushkin 388 00:21:31,916 --> 00:21:34,436 Speaker 1: dot FM. I'm Jacob Goldstein.
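For readers who want a concrete picture of the text-watermarking idea Pushmeet describes near the end, here is a toy Python sketch in the spirit of published "greenlist" watermarking schemes: a secret key pseudorandomly marks some tokens as green, generation slightly favors green tokens, and detection checks whether a piece of text contains more green tokens than chance would allow. This is an illustration of the general concept only, not SynthID's actual algorithm, and the key and function names are invented for the example.

```python
import hashlib

SECRET_KEY = b"model-builder-key"  # hypothetical key held by the model builder

def is_green(prev_token: str, candidate: str) -> bool:
    """A keyed hash pseudorandomly marks about half of all candidate tokens
    'green' given the preceding token; a watermarking sampler would slightly
    upweight these tokens during generation."""
    digest = hashlib.sha256(
        SECRET_KEY + prev_token.encode() + b"|" + candidate.encode()
    ).digest()
    return digest[0] % 2 == 0

def green_fraction(tokens: list[str]) -> float:
    """Detection side: the fraction of tokens that are green given their
    predecessor. Unwatermarked text hovers near 0.5; text generated with
    the green-token bias sits well above it."""
    if len(tokens) < 2:
        return 0.0
    hits = sum(is_green(prev, cur) for prev, cur in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

# Toy usage: score a piece of text. A real detector would turn this
# fraction into a statistical confidence rather than eyeballing it.
print(f"green fraction: {green_fraction('the cat sat on the mat'.split()):.2f}")
```

Because the signal lives in which tokens were chosen rather than in any visible marking, it is invisible to a reader, which matches the "imperceptible thing within the thing that is generated" idea from the conversation.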