1 00:00:15,356 --> 00:00:24,916 Speaker 1: Pushkin. Humans are made of proteins. Proteins are key components 2 00:00:24,956 --> 00:00:28,716 Speaker 1: of our cells and of our muscles. Proteins regulate gene 3 00:00:28,756 --> 00:00:33,676 Speaker 1: expression and the immune system. And yet forever we had 4 00:00:34,036 --> 00:00:37,956 Speaker 1: no idea what most proteins look like, and this was 5 00:00:37,956 --> 00:00:41,756 Speaker 1: a problem. Every protein has a different shape, and to 6 00:00:41,916 --> 00:00:46,156 Speaker 1: understand how any particular protein works, how it interacts with 7 00:00:46,196 --> 00:00:49,636 Speaker 1: other molecules, how it keeps us healthy or causes disease, 8 00:00:50,236 --> 00:00:55,356 Speaker 1: it is very helpful to understand what that particular shape is, 9 00:00:56,276 --> 00:01:01,276 Speaker 1: but determining that complicated three dimensional shape was really hard. 10 00:01:01,436 --> 00:01:05,076 Speaker 1: Scientists sometimes spent years trying to determine the shape of 11 00:01:05,236 --> 00:01:10,316 Speaker 1: a single protein. So people started to dream. They thought, 12 00:01:10,796 --> 00:01:12,716 Speaker 1: what if we could come up with some kind of 13 00:01:12,756 --> 00:01:16,676 Speaker 1: a system, some way to use a protein's sequence of 14 00:01:16,716 --> 00:01:22,676 Speaker 1: amino acids to reliably predict that protein's unique three dimensional shape. 15 00:01:23,356 --> 00:01:25,836 Speaker 1: It would be a huge leap forward that could lead 16 00:01:25,876 --> 00:01:28,916 Speaker 1: to a much deeper understanding of biology and a new 17 00:01:28,916 --> 00:01:32,436 Speaker 1: wave of treatments for disease. This idea was called the 18 00:01:32,436 --> 00:01:36,876 Speaker 1: protein folding problem. But after decades of work, the best 19 00:01:37,036 --> 00:01:40,236 Speaker 1: protein folding models were nowhere near good enough to be 20 00:01:40,356 --> 00:01:45,156 Speaker 1: scientifically useful. And then in twenty twenty a group of 21 00:01:45,196 --> 00:01:48,156 Speaker 1: researchers built an AI model to try to tackle the 22 00:01:48,196 --> 00:01:51,836 Speaker 1: protein folding problem, and their model was so much better 23 00:01:51,916 --> 00:01:54,556 Speaker 1: than what had come before that some people thought they 24 00:01:54,556 --> 00:01:58,116 Speaker 1: were cheating. In fact, they were not cheating. They had 25 00:01:58,196 --> 00:02:02,676 Speaker 1: solved the protein folding problem. I'm Jacob Goldstein, and this 26 00:02:02,756 --> 00:02:04,796 Speaker 1: is What's Your Problem, the show where I talk to 27 00:02:04,876 --> 00:02:08,636 Speaker 1: people who are trying to make technological progress. My guest 28 00:02:08,636 --> 00:02:11,876 Speaker 1: today is Pushmeet Kohli. He's vice president of research 29 00:02:11,916 --> 00:02:15,116 Speaker 1: at DeepMind, an AI research group that's part of Google. 30 00:02:15,836 --> 00:02:17,916 Speaker 1: Pushmeet was part of the DeepMind team that 31 00:02:18,076 --> 00:02:21,636 Speaker 1: solved the protein folding problem.
They built an AI model 32 00:02:21,676 --> 00:02:25,196 Speaker 1: called AlphaFold, and AlphaFold is one of the 33 00:02:25,236 --> 00:02:28,916 Speaker 1: most impressive real world AI success stories that we have 34 00:02:28,996 --> 00:02:31,876 Speaker 1: seen so far, and as you'll hear in our conversation, 35 00:02:32,396 --> 00:02:37,156 Speaker 1: AlphaFold holds lessons for AI that go beyond protein folding. 36 00:02:37,876 --> 00:02:40,476 Speaker 1: One other thing you may hear in our conversation, by 37 00:02:40,516 --> 00:02:43,996 Speaker 1: the way, is the occasional background beep from Pushmeet's 38 00:02:44,076 --> 00:02:52,116 Speaker 1: smoke detector. Have you failed to change the battery in 39 00:02:52,196 --> 00:02:55,556 Speaker 1: your smoke detector? Is that perhaps what that little chirp was? 40 00:02:55,716 --> 00:02:57,876 Speaker 1: You really should do that for yourself as well. 41 00:02:57,756 --> 00:03:02,156 Speaker 2: I know, I mean, I was trying to 42 00:03:02,156 --> 00:03:05,916 Speaker 2: fix it before the recording, and like it's just so finicky, 43 00:03:05,996 --> 00:03:06,716 Speaker 2: it doesn't come out. 44 00:03:06,596 --> 00:03:11,556 Speaker 1: Okay, fair, fair, we'll live with it. It makes you real. 45 00:03:11,956 --> 00:03:15,476 Speaker 1: And so it is the case with protein folding that 46 00:03:15,556 --> 00:03:17,796 Speaker 1: it's not just like people publishing papers. Right, There was 47 00:03:17,836 --> 00:03:21,836 Speaker 1: actually this contest that was held, was it every couple 48 00:03:21,876 --> 00:03:26,356 Speaker 1: of years, of people trying to solve the protein folding problem? 49 00:03:26,396 --> 00:03:28,316 Speaker 1: And this had been going on for a long time. 50 00:03:28,836 --> 00:03:31,236 Speaker 1: And so is there a first moment when you compete 51 00:03:31,276 --> 00:03:32,156 Speaker 1: in this contest? 52 00:03:33,076 --> 00:03:35,516 Speaker 2: Yeah. So this is like an amazing thing about protein 53 00:03:35,516 --> 00:03:38,956 Speaker 2: structure prediction and how visionary the community was. Like they 54 00:03:38,956 --> 00:03:41,876 Speaker 2: had set up this amazing sort of foolproof way 55 00:03:41,916 --> 00:03:44,636 Speaker 2: to evaluate progress, because it's very easy sometimes 56 00:03:44,636 --> 00:03:47,636 Speaker 2: for scientists to fool themselves saying, oh, that's progress, 57 00:03:47,836 --> 00:03:50,476 Speaker 2: when there is no progress, right. So they had set up 58 00:03:50,516 --> 00:03:54,196 Speaker 2: this contest in a very remarkable way where they had said, well, 59 00:03:54,916 --> 00:03:59,876 Speaker 2: for a specific duration of time, any sort of scientists 60 00:03:59,916 --> 00:04:02,756 Speaker 2: all over the world who are discovering new structures would 61 00:04:02,836 --> 00:04:06,276 Speaker 2: not share them with the world. Instead, they will 62 00:04:06,276 --> 00:04:09,756 Speaker 2: basically send them to a secret vault. 63 00:04:09,876 --> 00:04:11,036 Speaker 1: I love a secret vault. 64 00:04:11,156 --> 00:04:13,676 Speaker 2: Yeah, yeah, I mean, not like a secret... 65 00:04:15,876 --> 00:04:17,516 Speaker 1: I wanted it to be a big vault with the wheel 66 00:04:17,596 --> 00:04:18,916 Speaker 1: you turn, but yeah, I understand.
67 00:04:19,836 --> 00:04:25,276 Speaker 2: Yeah, And so the idea there was that nobody except 68 00:04:25,396 --> 00:04:28,076 Speaker 2: the scientists who discovered the structure knew what is the 69 00:04:28,076 --> 00:04:29,476 Speaker 2: structure of this protein. 70 00:04:29,756 --> 00:04:33,236 Speaker 1: So it's a perfect way to test these models doing 71 00:04:33,276 --> 00:04:36,916 Speaker 1: the prediction, because somebody knows the answer. But the people 72 00:04:36,916 --> 00:04:39,276 Speaker 1: building the models, people like you, people like DeepMind, 73 00:04:39,356 --> 00:04:41,996 Speaker 1: don't actually know the answer. So you can't cheat, you 74 00:04:42,036 --> 00:04:43,996 Speaker 1: can't backfit it or anything like that. 75 00:04:43,916 --> 00:04:46,476 Speaker 2: Exactly, right. So you don't know how good you 76 00:04:46,516 --> 00:04:49,796 Speaker 2: are, because like you've been training on known examples and 77 00:04:49,836 --> 00:04:52,876 Speaker 2: you've been evaluating them on known examples. But when you 78 00:04:52,996 --> 00:04:55,796 Speaker 2: are tested, you are tested on these amazingly new things 79 00:04:55,876 --> 00:04:57,356 Speaker 2: that nobody has seen before. 80 00:04:57,556 --> 00:05:00,836 Speaker 1: Yeah. Yeah, so okay. And they actually have like a 81 00:05:00,956 --> 00:05:05,996 Speaker 1: numeric score that they assign to everybody's model, right, yeah, 82 00:05:06,036 --> 00:05:09,276 Speaker 1: Like it's very quantitative and it's not just like good 83 00:05:09,396 --> 00:05:12,316 Speaker 1: or pretty good. It's a number, right, And what is it? 84 00:05:12,476 --> 00:05:13,436 Speaker 1: Zero to one hundred? Is that the scale? 85 00:05:13,476 --> 00:05:15,956 Speaker 2: Yeah, it's zero to one hundred, Like 86 00:05:15,996 --> 00:05:17,676 Speaker 2: that's the sort of scale. And if you look at 87 00:05:17,996 --> 00:05:20,596 Speaker 2: progress in the last sort of twenty years before 88 00:05:20,596 --> 00:05:23,356 Speaker 2: AlphaFold 1 was launched. I mean it was somewhere sort 89 00:05:23,396 --> 00:05:27,076 Speaker 2: of between the twenty five to forty sort of GDT 90 00:05:27,236 --> 00:05:27,556 Speaker 2: sort of. 91 00:05:27,556 --> 00:05:30,516 Speaker 1: Twenty five to forty. Was it getting better slowly or was 92 00:05:30,556 --> 00:05:33,276 Speaker 1: it just kind of stuck in the thirties more or less? 93 00:05:33,556 --> 00:05:35,996 Speaker 2: Yeah, it was stagnating. It was like sometimes it would 94 00:05:36,036 --> 00:05:38,596 Speaker 2: go to thirty eight, sometimes thirty five, and it was 95 00:05:38,636 --> 00:05:40,676 Speaker 2: like in that, so it was going up and down, 96 00:05:40,756 --> 00:05:42,596 Speaker 2: up and down. There was no sort of remarkable sort 97 00:05:42,596 --> 00:05:43,116 Speaker 2: of breakthrough. 98 00:05:43,596 --> 00:05:46,116 Speaker 1: And was AI the only way 99 00:05:46,156 --> 00:05:48,316 Speaker 1: people were trying to solve it? Were there whole 100 00:05:48,396 --> 00:05:51,356 Speaker 1: other sorts of things people were thinking about trying to do?
101 00:05:51,996 --> 00:05:56,036 Speaker 2: Yeah, so these were mostly not AI based solutions, right. 102 00:05:56,076 --> 00:06:01,516 Speaker 2: These were sort of very well designed, hand designed systems 103 00:06:01,596 --> 00:06:05,276 Speaker 2: that were carefully tuned to the problem over many, many 104 00:06:05,356 --> 00:06:08,756 Speaker 2: decades, with large teams working together and so on. But 105 00:06:09,116 --> 00:06:10,396 Speaker 2: there was a little machine learning. 106 00:06:11,116 --> 00:06:14,236 Speaker 1: So they'd been scoring in the thirties, more or less. 107 00:06:14,796 --> 00:06:17,596 Speaker 1: And then what year is it that you 108 00:06:17,676 --> 00:06:19,516 Speaker 1: and DeepMind show up with AlphaFold 1? 109 00:06:20,036 --> 00:06:25,276 Speaker 2: So twenty eighteen. So the contest, actually it runs in 110 00:06:25,316 --> 00:06:26,356 Speaker 2: the summer, and 111 00:06:26,276 --> 00:06:28,396 Speaker 1: then at the end of the summer they sort of 112 00:06:28,516 --> 00:06:30,436 Speaker 1: give you the results, or what? 113 00:06:30,436 --> 00:06:31,996 Speaker 2: So at the end of the summer, I mean, like by, 114 00:06:32,076 --> 00:06:33,956 Speaker 2: I think, July or August, they have sent the 115 00:06:34,036 --> 00:06:37,796 Speaker 2: last files and then you have sent them the results, 116 00:06:38,116 --> 00:06:40,156 Speaker 2: and then you wait, right, and you don't know what 117 00:06:40,636 --> 00:06:44,716 Speaker 2: has happened. Then they invite you to a conference which 118 00:06:45,076 --> 00:06:48,236 Speaker 2: happens in December, so you are like eagerly waiting, what's 119 00:06:48,276 --> 00:06:51,316 Speaker 2: what has happened? And we're like, oh, maybe we came last, 120 00:06:51,836 --> 00:06:55,196 Speaker 2: maybe we're in the middle. And then they actually 121 00:06:55,236 --> 00:06:58,756 Speaker 2: reveal the leaderboard or the scores at the conference. 122 00:06:59,196 --> 00:07:00,796 Speaker 1: Where were you at the time? 123 00:07:01,996 --> 00:07:03,716 Speaker 2: I was in London, I was in the office. I 124 00:07:03,756 --> 00:07:06,196 Speaker 2: was like really waiting, like trying to figure out, like, 125 00:07:06,236 --> 00:07:08,796 Speaker 2: where were we, right, in terms of the ranking? 126 00:07:09,276 --> 00:07:12,996 Speaker 2: How did we perform? And we get an email from 127 00:07:13,396 --> 00:07:16,516 Speaker 2: the organizers one day before the results are about to 128 00:07:16,596 --> 00:07:20,556 Speaker 2: be announced, and they say, well, you are first, and by 129 00:07:20,596 --> 00:07:23,276 Speaker 2: a big margin. So like from thirty to forty we 130 00:07:23,316 --> 00:07:25,076 Speaker 2: had gone to more than sixty. 131 00:07:25,636 --> 00:07:28,356 Speaker 1: You did way better at predicting the structure of proteins 132 00:07:28,396 --> 00:07:30,236 Speaker 1: than anyone had ever done, and you won by a lot. 133 00:07:30,236 --> 00:07:32,796 Speaker 2: Yes, we won by a lot, yes, we won by 134 00:07:32,836 --> 00:07:33,596 Speaker 2: a lot. 135 00:07:33,796 --> 00:07:36,996 Speaker 1: So that's way better than anyone has ever done. But 136 00:07:37,076 --> 00:07:39,916 Speaker 1: does it mean that you're basically half right? Is that 137 00:07:39,956 --> 00:07:41,116 Speaker 1: what that number means?
138 00:07:41,396 --> 00:07:44,316 Speaker 2: Yeah, so I think, I mean, we 139 00:07:44,396 --> 00:07:46,156 Speaker 2: are, you're the best in the world, but still your 140 00:07:46,156 --> 00:07:50,796 Speaker 2: predictions are pretty much sort of not very useful. 141 00:07:51,636 --> 00:07:53,756 Speaker 2: Like, if you're trying to figure out whether a 142 00:07:53,876 --> 00:07:57,556 Speaker 2: drug binds to this particular protein, the error 143 00:07:57,636 --> 00:08:01,396 Speaker 2: is so much, right, that you wouldn't get a complete 144 00:08:01,436 --> 00:08:02,436 Speaker 2: picture of the protein. 145 00:08:02,836 --> 00:08:07,156 Speaker 1: So this sixty number, it's like good in that it's 146 00:08:07,156 --> 00:08:09,236 Speaker 1: better than anyone has ever done. It's bad in that 147 00:08:09,276 --> 00:08:12,516 Speaker 1: it's not scientifically useful. What number do you have to 148 00:08:12,556 --> 00:08:14,356 Speaker 1: get to to be scientifically useful? 149 00:08:14,836 --> 00:08:18,836 Speaker 2: Like, between eighty five to ninety. That's what people told 150 00:08:18,916 --> 00:08:21,716 Speaker 2: us, that if you get beyond eighty five to ninety, 151 00:08:22,276 --> 00:08:23,636 Speaker 2: then the problem is solved. 152 00:08:23,836 --> 00:08:27,836 Speaker 1: So what do you decide when you get 153 00:08:27,836 --> 00:08:28,356 Speaker 1: this result? 154 00:08:29,756 --> 00:08:32,276 Speaker 2: So we get this result and we're like, yeah, this 155 00:08:32,396 --> 00:08:35,276 Speaker 2: is amazing, right, that we are the best in the world, 156 00:08:35,356 --> 00:08:37,556 Speaker 2: right, by a big margin, right. So like the thesis 157 00:08:37,596 --> 00:08:40,716 Speaker 2: that machine learning sort of will advance science, oh, that's great, 158 00:08:41,036 --> 00:08:43,956 Speaker 2: but the problem is not solved. And let's go back 159 00:08:43,996 --> 00:08:47,996 Speaker 2: to the drawing board. And now, with the information that 160 00:08:48,036 --> 00:08:49,636 Speaker 2: we have and the amount of time we have spent 161 00:08:49,756 --> 00:08:52,356 Speaker 2: on this previous architecture, do we still think that this 162 00:08:52,396 --> 00:08:56,396 Speaker 2: will lead us to where we want to go? And 163 00:08:56,556 --> 00:09:00,396 Speaker 2: the team thought, no, we need to, yeah, we 164 00:09:00,636 --> 00:09:02,356 Speaker 2: need to completely start from scratch. 165 00:09:02,876 --> 00:09:06,356 Speaker 1: So your reaction to winning this contest and doing 166 00:09:06,396 --> 00:09:09,596 Speaker 1: better than anyone has ever done at predicting protein folding 167 00:09:09,676 --> 00:09:12,436 Speaker 1: is, let's blow up this thing that just won the contest. 168 00:09:13,036 --> 00:09:17,796 Speaker 2: Yeah, throw it away. Yeah, we were like, the basic 169 00:09:17,876 --> 00:09:21,356 Speaker 2: premise was proven, that machine learning has a role to play, right. 170 00:09:21,396 --> 00:09:23,596 Speaker 2: So that gave us a lot of confidence. But at 171 00:09:23,596 --> 00:09:25,636 Speaker 2: the same time we thought, well, this is not 172 00:09:25,676 --> 00:09:28,756 Speaker 2: an elegant solution, right. This 173 00:09:28,836 --> 00:09:31,916 Speaker 2: is like two modules. Like, there's 174 00:09:32,076 --> 00:09:34,556 Speaker 2: a machine learning module.
It is making these sort 175 00:09:34,596 --> 00:09:37,036 Speaker 2: of predictions which this other module is sort of trying 176 00:09:37,076 --> 00:09:39,436 Speaker 2: to use. If you believe in the power of machine learning, 177 00:09:39,516 --> 00:09:41,476 Speaker 2: let's do end to end, right? Let's do end to 178 00:09:41,596 --> 00:09:45,996 Speaker 2: end and basically do everything so that the model takes 179 00:09:45,996 --> 00:09:47,276 Speaker 2: care of it, right? 180 00:09:47,116 --> 00:09:50,516 Speaker 1: Just to be clear, what was happening with that initial model that 181 00:09:50,596 --> 00:09:53,076 Speaker 1: you were deciding to abandon? 182 00:09:53,316 --> 00:09:56,996 Speaker 2: That was basically using the machine learning model together with 183 00:09:57,636 --> 00:10:01,556 Speaker 2: sort of a known framework, right. There is a 184 00:10:01,636 --> 00:10:04,316 Speaker 2: second step that was a conventional sort of step. 185 00:10:04,996 --> 00:10:07,676 Speaker 1: Oh, I see. So it's like you weren't all in 186 00:10:07,716 --> 00:10:10,196 Speaker 1: on machine learning. You were like, well, we're gonna use 187 00:10:10,236 --> 00:10:12,356 Speaker 1: machine learning, but we're still gonna do this kind of 188 00:10:12,396 --> 00:10:14,276 Speaker 1: the way other people have been doing it. And your 189 00:10:14,276 --> 00:10:18,716 Speaker 1: response to that first result was, screw it, let's not 190 00:10:18,756 --> 00:10:21,156 Speaker 1: do anything like everybody has done before. Let's just go all 191 00:10:21,196 --> 00:10:26,836 Speaker 1: in on machine learning, beginning to end. Exactly. Interesting. And 192 00:10:26,996 --> 00:10:29,556 Speaker 1: so you do that, and you spend, what, two years? 193 00:10:29,636 --> 00:10:31,436 Speaker 1: Is it two years between? Do I have that right? 194 00:10:31,516 --> 00:10:31,716 Speaker 2: Yeah. 195 00:10:31,916 --> 00:10:36,436 Speaker 1: Yeah, and then you come back in twenty twenty. You 196 00:10:36,476 --> 00:10:40,396 Speaker 1: come back in twenty twenty, there's another one of these contests. Yeah, 197 00:10:40,516 --> 00:10:42,796 Speaker 1: you got your new end-to-end machine learning model. 198 00:10:42,956 --> 00:10:46,436 Speaker 2: Yeah. So it was the pandemic. So this is twenty twenty, right. 199 00:10:46,276 --> 00:10:47,916 Speaker 1: So nobody's gone anywhere. 200 00:10:48,036 --> 00:10:51,716 Speaker 2: Yeah, nobody's going anywhere, right. You know, twenty nineteen 201 00:10:52,476 --> 00:10:55,556 Speaker 2: was basically where we started working with this new model, 202 00:10:55,756 --> 00:10:58,756 Speaker 2: and it was really tough going, because we were 203 00:10:58,836 --> 00:11:02,476 Speaker 2: starting from twenty, right? So we were 204 00:11:02,516 --> 00:11:05,836 Speaker 2: at sixty, and now we are starting from twenty, and then 205 00:11:06,036 --> 00:11:08,956 Speaker 2: thirty, forty. And sometimes you would stagnate at forty five, 206 00:11:09,236 --> 00:11:12,396 Speaker 2: fifty, and you were like, really, should I, should I have 207 00:11:12,436 --> 00:11:13,276 Speaker 2: thrown away that earlier model? 208 00:11:13,756 --> 00:11:14,076 Speaker 1: Yeah.
209 00:11:14,236 --> 00:11:17,396 Speaker 2: Yeah. So by the end of twenty nineteen 210 00:11:17,636 --> 00:11:20,236 Speaker 2: we started getting some really, really cool results and we thought, okay, 211 00:11:20,276 --> 00:11:23,236 Speaker 2: now we have definitely surpassed the previous model. 212 00:11:23,756 --> 00:11:27,876 Speaker 2: We're in good territory. And we were very excited. Like 213 00:11:27,916 --> 00:11:31,036 Speaker 2: at the start of twenty twenty, we were like, yeah, 214 00:11:31,436 --> 00:11:34,516 Speaker 2: making progress, and then the pandemic hit. 215 00:11:38,436 --> 00:11:42,036 Speaker 1: In a minute, the model gets an unanticipated test in 216 00:11:42,076 --> 00:11:42,836 Speaker 1: the real world. 217 00:11:47,036 --> 00:11:51,036 Speaker 2: There was this new virus that was reported, SARS-CoV-2, 218 00:11:51,556 --> 00:11:54,276 Speaker 2: and one of the first things, somebody sort of 219 00:11:54,556 --> 00:11:57,436 Speaker 2: figured out the structure of the spike protein. It was 220 00:11:57,476 --> 00:12:00,116 Speaker 2: all over the newspapers, like, here's the spike protein of 221 00:12:00,116 --> 00:12:03,716 Speaker 2: this new virus. But all the other proteins of the virus, 222 00:12:03,876 --> 00:12:09,036 Speaker 2: the accessory proteins, nobody knew the structure. So the first 223 00:12:09,356 --> 00:12:11,836 Speaker 2: thing we did, we thought, we think we have 224 00:12:11,916 --> 00:12:14,516 Speaker 2: the best model in the world. We should be making 225 00:12:14,556 --> 00:12:17,316 Speaker 2: these predictions and sharing them with the world. But is 226 00:12:17,316 --> 00:12:20,196 Speaker 2: this the right thing to do? So we spent a 227 00:12:20,196 --> 00:12:24,276 Speaker 2: lot of time reaching out to biologists who looked at 228 00:12:24,316 --> 00:12:27,036 Speaker 2: the prediction and said, well, you need to share this, 229 00:12:27,716 --> 00:12:29,556 Speaker 2: you need to share this with the world. So the 230 00:12:29,596 --> 00:12:32,756 Speaker 2: start of twenty twenty was us sort of sharing the 231 00:12:32,796 --> 00:12:36,756 Speaker 2: predictions from this untested model with the world, because we 232 00:12:37,036 --> 00:12:41,396 Speaker 2: thought they were quite good. And then throughout twenty twenty 233 00:12:41,476 --> 00:12:44,396 Speaker 2: we took part in the assessment, right, which ran in 234 00:12:44,396 --> 00:12:46,116 Speaker 2: the summer of twenty twenty, the contest. 235 00:12:46,196 --> 00:12:47,356 Speaker 1: In the contest. 236 00:12:47,476 --> 00:12:50,996 Speaker 2: Exactly. Normally, right, the organizers don't come back to you. 237 00:12:51,396 --> 00:12:55,436 Speaker 2: They just release the results at the end, in December. 238 00:12:56,196 --> 00:12:58,596 Speaker 2: And at the end of the summer we get this 239 00:12:58,636 --> 00:13:01,796 Speaker 2: funny email saying, we want to talk to you, and 240 00:13:01,916 --> 00:13:06,076 Speaker 2: so we were like, yeah, like, did we do anything bad? 241 00:13:06,476 --> 00:13:09,956 Speaker 2: What happened? And a few of them really had sort 242 00:13:09,956 --> 00:13:14,276 Speaker 2: of suspicions. They were like, you must have cheated, right? 243 00:13:14,556 --> 00:13:19,116 Speaker 2: Like, your level of performance is 244 00:13:19,156 --> 00:13:22,316 Speaker 2: nowhere close to anything that we have seen ever.
Right. 245 00:13:23,036 --> 00:13:27,276 Speaker 2: But a few scientists in that contest had submitted a 246 00:13:27,316 --> 00:13:31,396 Speaker 2: sequence, a protein whose structure was not known. They were 247 00:13:31,676 --> 00:13:34,476 Speaker 2: expecting that the structure would be known by the time 248 00:13:34,516 --> 00:13:37,556 Speaker 2: the contest ended, so they'd be able to evaluate the predictions. 249 00:13:37,636 --> 00:13:39,676 Speaker 2: But that structure was not known; in fact, they 250 00:13:39,716 --> 00:13:41,116 Speaker 2: couldn't find the structure. 251 00:13:41,356 --> 00:13:44,036 Speaker 1: So you're saying it would be impossible to cheat because 252 00:13:44,076 --> 00:13:47,556 Speaker 1: literally no human knew the structure. No way to cheat, 253 00:13:47,836 --> 00:13:48,956 Speaker 1: nobody knows the answer. 254 00:13:49,156 --> 00:13:53,836 Speaker 2: Yeah, yeah. So they used the prediction of AlphaFold 255 00:13:54,676 --> 00:13:59,516 Speaker 2: and then tried to explain their experimental data, and it matched. 256 00:14:00,556 --> 00:14:03,076 Speaker 2: And they were like, this model has been able to 257 00:14:03,116 --> 00:14:07,516 Speaker 2: discover something that nobody knew, that no scientist knew. 258 00:14:09,276 --> 00:14:12,996 Speaker 2: In a sense, the model had already made new biological discoveries even 259 00:14:13,276 --> 00:14:14,196 Speaker 2: before we knew it. 260 00:14:14,716 --> 00:14:18,996 Speaker 1: Yeah, yeah, okay, so that's good. You're not in trouble anymore. 261 00:14:20,076 --> 00:14:22,956 Speaker 1: It's clear you didn't cheat. Do they say the number? 262 00:14:22,996 --> 00:14:25,596 Speaker 1: What's the number? I'm waiting for the number. How'd you do? 263 00:14:26,116 --> 00:14:28,716 Speaker 2: Yeah, so we were beyond eighty five to ninety, right, 264 00:14:28,756 --> 00:14:31,636 Speaker 2: and then they basically said, okay, we have to announce 265 00:14:31,676 --> 00:14:35,796 Speaker 2: it to the world. And so come December, that was 266 00:14:35,876 --> 00:14:38,796 Speaker 2: the announcement that was made by the organizers, that 267 00:14:39,196 --> 00:14:43,156 Speaker 2: AlphaFold 2 had solved the protein structure prediction problem. 268 00:14:43,556 --> 00:14:46,156 Speaker 1: So is that contest done now? Did you just end 269 00:14:46,236 --> 00:14:48,556 Speaker 1: that contest? Is nobody doing that anymore? 270 00:14:48,956 --> 00:14:52,636 Speaker 2: No, the contest is sort of alive, right. It has changed, 271 00:14:52,836 --> 00:14:56,396 Speaker 2: its focus has changed. So what AlphaFold 2 272 00:14:56,476 --> 00:15:00,596 Speaker 2: did was find the structure of these single proteins. But 273 00:15:00,636 --> 00:15:03,236 Speaker 2: there are many other problems that remain, right? How do 274 00:15:03,556 --> 00:15:09,116 Speaker 2: multiple proteins interact, for instance? Right, so there are other structure prediction 275 00:15:09,156 --> 00:15:12,796 Speaker 2: problems that the contest has sort of evolved to, right. 276 00:15:12,796 --> 00:15:15,516 Speaker 2: It is sort of focusing on other types of problems 277 00:15:15,556 --> 00:15:18,116 Speaker 2: that AlphaFold 2 did not address.
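A quick aside for readers on the number this conversation keeps returning to: the zero to one hundred score in this contest (CASP, the Critical Assessment of Structure Prediction) is a GDT score, which rewards how many of a protein's backbone atoms land close to their experimentally determined positions. Here is a minimal Python sketch of the GDT_TS variant, offered as an illustration only: it assumes the predicted and experimental structures are already superposed, whereas the real assessment searches over superpositions to maximize each term, so this is a simplification rather than the assessors' actual code.

```python
import numpy as np

def gdt_ts(pred_ca: np.ndarray, true_ca: np.ndarray) -> float:
    """Simplified GDT_TS on a 0-100 scale.

    Both arguments are (N, 3) arrays of C-alpha coordinates in angstroms,
    assumed already superposed; the real CASP metric maximizes each term
    over rigid-body superpositions.
    """
    # Per-residue distance between predicted and experimental positions.
    dists = np.linalg.norm(pred_ca - true_ca, axis=1)
    # Fraction of residues within each distance cutoff, then the average.
    fractions = [(dists <= cutoff).mean() for cutoff in (1.0, 2.0, 4.0, 8.0)]
    return 100.0 * float(np.mean(fractions))

# Toy usage: a 100-residue chain with a couple of angstroms of error.
rng = np.random.default_rng(0)
true_ca = rng.normal(size=(100, 3)) * 10.0
pred_ca = true_ca + rng.normal(scale=1.5, size=(100, 3))
print(f"GDT_TS ~ {gdt_ts(pred_ca, true_ca):.1f}")
```

On this scale, the pre-AlphaFold methods hovered around twenty five to forty, AlphaFold 1 scored more than sixty, and the eighty five to ninety threshold mentioned above is where predictions become reliable enough to be scientifically useful.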
278 00:15:18,516 --> 00:15:21,796 Speaker 1: If we zoom out and think about what you have done, 279 00:15:21,836 --> 00:15:24,556 Speaker 1: what the team has done in, you know, using machine 280 00:15:24,636 --> 00:15:28,236 Speaker 1: learning to solve this scientific problem that people had been 281 00:15:28,236 --> 00:15:30,836 Speaker 1: working on for a long time, like, what are the 282 00:15:30,916 --> 00:15:34,316 Speaker 1: broader lessons? Like, if we think about other domains, what 283 00:15:34,796 --> 00:15:37,396 Speaker 1: can we infer, what can we take from this? 284 00:15:38,636 --> 00:15:40,476 Speaker 2: Yeah, so I think the thing that we can take 285 00:15:40,516 --> 00:15:44,596 Speaker 2: from this is basically, science is sort of generating a 286 00:15:44,636 --> 00:15:47,356 Speaker 2: lot of data across any domain that you see, right, 287 00:15:47,396 --> 00:15:51,476 Speaker 2: whether it's genomics or high-energy physics. The amount of data 288 00:15:51,516 --> 00:15:54,236 Speaker 2: that we are gathering about the world is much more 289 00:15:54,276 --> 00:15:56,156 Speaker 2: than any single human mind can comprehend. 290 00:15:56,396 --> 00:15:56,516 Speaker 1: Right. 291 00:15:56,556 --> 00:15:58,316 Speaker 2: You can have the best scientists and they will not 292 00:15:58,356 --> 00:16:00,436 Speaker 2: be able to sort of go through all the data 293 00:16:00,476 --> 00:16:03,716 Speaker 2: that we are collecting about our world. So machine learning 294 00:16:03,756 --> 00:16:06,116 Speaker 2: is this remarkable sort of tool which gives us the 295 00:16:06,196 --> 00:16:09,396 Speaker 2: ability to make sense of and leverage this data, right, and 296 00:16:10,036 --> 00:16:12,596 Speaker 2: really sort of puts us on the path of accelerating our 297 00:16:12,676 --> 00:16:15,356 Speaker 2: understanding of the problems that we're dealing with. 298 00:16:16,076 --> 00:16:18,476 Speaker 1: In the case of AlphaFold, was the sort of 299 00:16:18,556 --> 00:16:24,156 Speaker 1: input data the known protein structures and amino acid sequences? 300 00:16:24,156 --> 00:16:27,236 Speaker 1: Was that the basic training data? 301 00:16:27,276 --> 00:16:30,676 Speaker 2: Exactly right. So it was the PDB, the Protein Data Bank, 302 00:16:31,196 --> 00:16:34,756 Speaker 2: and that had been collected by the community for many, 303 00:16:34,796 --> 00:16:39,356 Speaker 2: many years, right, over many decades. They have meticulously, carefully 304 00:16:39,396 --> 00:16:43,356 Speaker 2: deposited all the protein sequences and the corresponding structures that 305 00:16:43,396 --> 00:16:46,316 Speaker 2: were discovered, right. And it had one hundred and fifty 306 00:16:46,396 --> 00:16:49,836 Speaker 2: thousand examples at that time, right, sequences as well as structures, 307 00:16:50,196 --> 00:16:53,756 Speaker 2: and everyone had access to the same data, right. All 308 00:16:53,756 --> 00:16:57,436 Speaker 2: the teams were training on that data. 309 00:16:57,636 --> 00:17:00,636 Speaker 1: Is it right that AlphaFold itself is open sourced, 310 00:17:00,716 --> 00:17:04,716 Speaker 1: and that there's this open source database of protein structures 311 00:17:04,756 --> 00:17:06,916 Speaker 1: that have been discovered with AlphaFold? Is that right? 312 00:17:07,716 --> 00:17:10,996 Speaker 2: Yeah. So when we sort of developed AlphaFold, we 313 00:17:11,116 --> 00:17:14,196 Speaker 2: made it available to the world.
But we then said, well, 314 00:17:14,556 --> 00:17:17,276 Speaker 2: it's so accurate, but it's also so fast, that we 315 00:17:17,356 --> 00:17:20,876 Speaker 2: will use it to find the structure for every sort 316 00:17:20,916 --> 00:17:24,396 Speaker 2: of known protein. And then we made all those structures 317 00:17:24,436 --> 00:17:25,796 Speaker 2: available to the world. 318 00:17:30,076 --> 00:17:33,036 Speaker 1: AlphaFold has now made the structures of roughly two 319 00:17:33,196 --> 00:17:38,836 Speaker 1: hundred and fifty million different proteins publicly available. We'll be 320 00:17:38,876 --> 00:17:45,076 Speaker 1: back in a minute with the lightning round. Last thing 321 00:17:45,316 --> 00:17:48,956 Speaker 1: is a lightning round. Just some fast questions, okay, and 322 00:17:48,996 --> 00:17:51,276 Speaker 1: then we'll be done. What's your favorite protein? 323 00:17:51,836 --> 00:17:52,516 Speaker 2: Hemoglobin. 324 00:17:53,396 --> 00:17:54,436 Speaker 1: Why? 325 00:17:55,156 --> 00:17:57,036 Speaker 2: It is sort of very pleasant to look at. It 326 00:17:56,876 --> 00:17:58,916 Speaker 2: is very symmetric, and you can see 327 00:17:58,996 --> 00:18:01,996 Speaker 2: its purpose, right, where the oxygen binds into 328 00:18:02,036 --> 00:18:03,476 Speaker 2: it. It's a very clean protein. 329 00:18:04,116 --> 00:18:07,396 Speaker 1: It's so easy to understand. It's the little thing that 330 00:18:07,516 --> 00:18:10,916 Speaker 1: carries oxygen around your body. If everything goes well, what 331 00:18:10,996 --> 00:18:13,596 Speaker 1: problem will you be trying to solve in, say, five years? 332 00:18:14,756 --> 00:18:19,556 Speaker 2: Really sort of thinking about the two big challenges sort 333 00:18:19,596 --> 00:18:21,876 Speaker 2: of that humanity is facing. One is the pandemic, the 334 00:18:21,956 --> 00:18:24,596 Speaker 2: other is climate change. And I think material science and 335 00:18:24,676 --> 00:18:28,956 Speaker 2: quantum chemistry can impact both, but especially climate change. And 336 00:18:28,996 --> 00:18:31,516 Speaker 2: I think this is something that requires a lot of work. 337 00:18:32,676 --> 00:18:37,036 Speaker 1: Is there some particular problem in that domain that is 338 00:18:37,076 --> 00:18:40,556 Speaker 1: analogous to protein folding? Is there some hard thing that 339 00:18:40,596 --> 00:18:41,676 Speaker 1: you want to figure out? 340 00:18:41,396 --> 00:18:45,516 Speaker 2: Rational material design. We are very far from there. 341 00:18:45,676 --> 00:18:49,476 Speaker 2: We are still basically doing experimental stuff when we think 342 00:18:49,516 --> 00:18:51,916 Speaker 2: about discovering new materials. 343 00:18:53,236 --> 00:18:56,316 Speaker 1: What do you understand about AI or machine learning that 344 00:18:56,436 --> 00:18:58,236 Speaker 1: most people don't understand? 345 00:18:59,596 --> 00:19:03,316 Speaker 2: I think, sort of, AI is not magic, right. 346 00:19:03,356 --> 00:19:07,676 Speaker 2: Essentially, it's a series of techniques which are 347 00:19:07,756 --> 00:19:12,716 Speaker 2: able to extract intelligence. But you extract intelligence from the 348 00:19:12,836 --> 00:19:16,996 Speaker 2: raw material, right? So garbage in, garbage out. So 349 00:19:17,676 --> 00:19:21,356 Speaker 2: what is really important is that the experience needs to 350 00:19:21,396 --> 00:19:25,036 Speaker 2: be rich enough.
Right. We don't become 351 00:19:25,076 --> 00:19:27,676 Speaker 2: intelligent by sitting in a room, right. We become intelligent 352 00:19:27,676 --> 00:19:32,356 Speaker 2: because we have amazing experiences. So it's not big data, right, 353 00:19:32,396 --> 00:19:34,876 Speaker 2: it's not the bigness of the experience, but it's like 354 00:19:35,116 --> 00:19:38,516 Speaker 2: the goodness of the experience, like the wide variety of 355 00:19:38,796 --> 00:19:41,196 Speaker 2: sort of things that you train on and the things 356 00:19:41,196 --> 00:19:44,596 Speaker 2: that you see. So I think that's 357 00:19:44,636 --> 00:19:45,436 Speaker 2: really important. 358 00:19:46,316 --> 00:19:51,116 Speaker 1: That thought leads you to, like, the optimal training data. 359 00:19:51,196 --> 00:19:53,316 Speaker 1: So is the worry that people are making a 360 00:19:53,356 --> 00:19:55,836 Speaker 1: mistake by just using a lot of the same kind 361 00:19:55,916 --> 00:19:56,716 Speaker 1: of training data? 362 00:19:57,676 --> 00:20:01,156 Speaker 2: Yeah, exactly, exactly right. So if you just take one example 363 00:20:01,236 --> 00:20:04,916 Speaker 2: and repeat it multiple times, right, that's not great. 364 00:20:05,156 --> 00:20:07,436 Speaker 2: You don't become wise doing the 365 00:20:07,476 --> 00:20:09,036 Speaker 2: same thing again and again and again. 366 00:20:09,196 --> 00:20:11,996 Speaker 1: Right. What are you actually working on right now? Like, 367 00:20:11,996 --> 00:20:14,156 Speaker 1: what are you going to go work on today or 368 00:20:14,276 --> 00:20:14,876 Speaker 1: next week? 369 00:20:16,116 --> 00:20:21,716 Speaker 2: So there is a system that my team developed called SynthID, 370 00:20:21,956 --> 00:20:26,036 Speaker 2: which is a system for watermarking AI generated content. So 371 00:20:26,076 --> 00:20:28,876 Speaker 2: we want to be able to detect it. When you 372 00:20:28,916 --> 00:20:31,556 Speaker 2: have AI generated content, users should be able to detect 373 00:20:31,636 --> 00:20:33,156 Speaker 2: that this is AI generated. 374 00:20:33,316 --> 00:20:39,276 Speaker 1: AI generated content, whether it's images or words or whatever, text, video. 375 00:20:39,716 --> 00:20:46,716 Speaker 2: Exactly, exactly. You embed this imperceptible thing within the thing 376 00:20:46,756 --> 00:20:49,196 Speaker 2: that is generated, that a human might not see. 377 00:20:49,436 --> 00:20:52,716 Speaker 1: So the builder of the AI model, OpenAI, could 378 00:20:52,796 --> 00:20:57,076 Speaker 1: choose to embed a watermark in GPT, so that anybody 379 00:20:57,076 --> 00:21:00,356 Speaker 1: who made a thing with GPT, that document would have 380 00:21:00,436 --> 00:21:03,716 Speaker 1: some hidden sign that it was AI generated. It's sort 381 00:21:03,716 --> 00:21:07,396 Speaker 1: of the choice of the model builders. Yeah, thank 382 00:21:07,436 --> 00:21:09,236 Speaker 1: you very much for your time. It was great to 383 00:21:09,276 --> 00:21:09,676 Speaker 1: talk with you. 384 00:21:10,516 --> 00:21:12,556 Speaker 2: Yeah, thank you. It was a pleasure. 385 00:21:18,916 --> 00:21:22,916 Speaker 1: Pushmeet Kohli is vice president of research at Google DeepMind. 386 00:21:23,596 --> 00:21:26,916 Speaker 1: Today's show was produced by Edith Russello and edited by 387 00:21:27,036 --> 00:21:31,676 Speaker 1: Karen Chakerje.
You can email us at problem at Pushkin 388 00:21:31,916 --> 00:21:34,436 Speaker 1: dot FM. I'm Jacob Goldstein.
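For readers who want a concrete picture of the text-watermarking idea Pushmeet describes near the end, here is a toy Python sketch in the spirit of published "greenlist" watermarking schemes: a secret key pseudorandomly marks some tokens as green, generation slightly favors green tokens, and detection checks whether a piece of text contains more green tokens than chance would allow. This is an illustration of the general concept only, not SynthID's actual algorithm, and the key and function names are invented for the example.

```python
import hashlib

SECRET_KEY = b"model-builder-key"  # hypothetical key held by the model builder

def is_green(prev_token: str, candidate: str) -> bool:
    """A keyed hash pseudorandomly marks about half of all candidate tokens
    'green' given the preceding token; a watermarking sampler would slightly
    upweight these tokens during generation."""
    digest = hashlib.sha256(
        SECRET_KEY + prev_token.encode() + b"|" + candidate.encode()
    ).digest()
    return digest[0] % 2 == 0

def green_fraction(tokens: list[str]) -> float:
    """Detection side: the fraction of tokens that are green given their
    predecessor. Unwatermarked text hovers near 0.5; text generated with
    the green-token bias sits well above it."""
    if len(tokens) < 2:
        return 0.0
    hits = sum(is_green(prev, cur) for prev, cur in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

# Toy usage: score a piece of text. A real detector would turn this
# fraction into a statistical confidence rather than eyeballing it.
print(f"green fraction: {green_fraction('the cat sat on the mat'.split()):.2f}")
```

Because the signal lives in which tokens were chosen rather than in any visible marking, it is invisible to a reader, which matches the "imperceptible thing within the thing that is generated" idea from the conversation.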