WEBVTT - How AI Solved a Biological Mystery 0:00:15.356 --> 0:00:24.916 Pushkin. Humans are made of proteins. Proteins are key components 0:00:24.956 --> 0:00:28.716 of our cells and of our muscles. Proteins regulate gene 0:00:28.756 --> 0:00:33.676 expression and the immune system. And yet forever we had 0:00:34.036 --> 0:00:37.956 no idea what most proteins look like, and this was 0:00:37.956 --> 0:00:41.756 a problem. Every protein has a different shape, and to 0:00:41.916 --> 0:00:46.156 understand how any particular protein works, how it interacts with 0:00:46.196 --> 0:00:49.636 other molecules, how it keeps us healthy or causes disease, 0:00:50.236 --> 0:00:55.356 it is very helpful to understand what that particular shape is, 0:00:56.276 --> 0:01:01.276 but determining that complicated three dimensional shape was really hard. 0:01:01.436 --> 0:01:05.076 Scientists sometimes spent years trying to determine the shape of 0:01:05.236 --> 0:01:10.316 a single protein. So people started to dream this. They thought, 0:01:10.796 --> 0:01:12.716 what if we could come up with some kind of 0:01:12.756 --> 0:01:16.676 a system, some way to use a protein sequence of 0:01:16.716 --> 0:01:22.676 amino acids to reliably predict that protein's unique three dimensional shape. 0:01:23.356 --> 0:01:25.836 It would be a huge leap forward that could lead 0:01:25.876 --> 0:01:28.916 to a much deeper understanding of biology and a new 0:01:28.916 --> 0:01:32.436 wave of treatments for disease. This idea was called the 0:01:32.436 --> 0:01:36.876 protein folding problem. But after decades of work, the best 0:01:37.036 --> 0:01:40.236 protein folding models were nowhere near good enough to be 0:01:40.356 --> 0:01:45.156 scientifically useful. 
And then in twenty twenty a group of 0:01:45.196 --> 0:01:48.156 researchers built an AI model to try to tackle the 0:01:48.196 --> 0:01:51.836 protein folding problem, and their model was so much better 0:01:51.916 --> 0:01:54.556 than what had come before that some people thought they 0:01:54.556 --> 0:01:58.116 were cheating. In fact, they were not cheating. They had 0:01:58.196 --> 0:02:02.676 solved the protein folding problem. I'm Jacob Goldstein, and this 0:02:02.756 --> 0:02:04.796 is What's Your Problem, the show where I talk to 0:02:04.876 --> 0:02:08.636 people who are trying to make technological progress. My guest 0:02:08.636 --> 0:02:11.876 today is Pushmeet Kohli. He's vice president of research 0:02:11.916 --> 0:02:15.116 at DeepMind, an AI research group that's part of Google. 0:02:15.836 --> 0:02:17.916 Pushmeet was part of the DeepMind team that 0:02:18.076 --> 0:02:21.636 solved the protein folding problem. They built an AI model 0:02:21.676 --> 0:02:25.196 called AlphaFold, and AlphaFold is one of the 0:02:25.236 --> 0:02:28.916 most impressive real world AI success stories that we have 0:02:28.996 --> 0:02:31.876 seen so far, and as you'll hear in our conversation, 0:02:32.396 --> 0:02:37.156 AlphaFold holds lessons for AI that go beyond protein folding. 0:02:37.876 --> 0:02:40.476 One other thing you may hear in our conversation, by 0:02:40.516 --> 0:02:43.996 the way, is the occasional background beep from Pushmeet's 0:02:44.076 --> 0:02:52.116 smoke detector. Have you failed to change the battery in 0:02:52.196 --> 0:02:55.556 your smoke detector? Is that perhaps what that little chirp was? 0:02:55.716 --> 0:02:57.876 You really should do that for yourself as 0:02:57.756 --> 0:03:02.156 well. I know, I mean, I was trying to 0:03:02.156 --> 0:03:05.916 fix it before the recording, and like it's just so finicky, 0:03:05.996 --> 0:03:06.716 it doesn't come 0:03:06.596 --> 0:03:11.556 out. Okay, fair, fair, well with it. 
It makes you real. 0:03:11.956 --> 0:03:15.476 And so it is the case with protein folding that 0:03:15.556 --> 0:03:17.796 it's not just like people publishing papers. Right, there was 0:03:17.836 --> 0:03:21.836 actually this contest that was held, was it every couple 0:03:21.876 --> 0:03:26.356 of years, exactly, trying to solve the protein folding problem? 0:03:26.396 --> 0:03:28.316 And this had been going on for a long time. 0:03:28.836 --> 0:03:31.236 And so is there a first moment when you compete 0:03:31.276 --> 0:03:32.156 in this contest? 0:03:33.076 --> 0:03:35.516 Yeah. So this is like an amazing thing about protein 0:03:35.516 --> 0:03:38.956 structure prediction and how visionary the community was. Like they 0:03:38.956 --> 0:03:41.876 had set up this amazing sort of foolproof way 0:03:41.916 --> 0:03:44.636 to evaluate progress, because it's very easy, sort of, 0:03:44.636 --> 0:03:47.636 for scientists to sometimes fool themselves, saying, oh, that's progress, 0:03:47.836 --> 0:03:50.476 there is no progress, right. So they had set up 0:03:50.516 --> 0:03:54.196 this contest in a very remarkable way where they had said, well, 0:03:54.916 --> 0:03:59.876 for a specific duration of time, any sort of scientists 0:03:59.916 --> 0:04:02.756 all over the world who are discovering new structures would 0:04:02.836 --> 0:04:06.276 not share them with the world. Instead, they will 0:04:06.276 --> 0:04:09.756 basically send it to a secret sort of vault. 0:04:09.876 --> 0:04:11.036 I love a secret vault. 0:04:11.156 --> 0:04:13.676 Yeah, yeah, I mean not like secret. 0:04:15.876 --> 0:04:17.516 I want it to be a big vault with the wheel 0:04:17.596 --> 0:04:18.916 you turn, but yeah, I understand. 0:04:19.836 --> 0:04:25.276 Yeah. And so the idea there was that nobody except 0:04:25.396 --> 0:04:28.076 the scientists who discovered the structure knew what is the 0:04:28.076 --> 0:04:29.476 structure of this protein. 
0:04:29.756 --> 0:04:33.236 So it's a perfect way to test these models doing 0:04:33.276 --> 0:04:36.916 the prediction because somebody knows the answer. But the people 0:04:36.916 --> 0:04:39.276 building the models, people like you, people like DeepMind, 0:04:39.356 --> 0:04:41.996 don't actually know the answer. So you can't cheat, you 0:04:42.036 --> 0:04:43.996 can't backfit it or anything like 0:04:43.916 --> 0:04:46.476 that. Exactly, right. So you don't know how good you 0:04:46.516 --> 0:04:49.796 are because like you've been training on known examples and 0:04:49.836 --> 0:04:52.876 you've been evaluating them on known examples. But when you 0:04:52.996 --> 0:04:55.796 are tested, you are tested on these amazingly new things 0:04:55.876 --> 0:04:57.356 that nobody has seen before. 0:04:57.556 --> 0:05:00.836 Yeah. Yeah, so okay. And they actually have like a 0:05:00.956 --> 0:05:05.996 numeric score that they assigned to everybody's model, right? Yeah. 0:05:06.036 --> 0:05:09.276 Like it's very quantitative and it's not just like good 0:05:09.396 --> 0:05:12.316 or pretty good. It's a number, right? And what is it? 0:05:12.476 --> 0:05:13.436 Zero to one hundred? 0:05:13.476 --> 0:05:15.956 Is that the scale? Yeah, it's zero to hundred. Like 0:05:15.996 --> 0:05:17.676 that's the sort of scale. And if you look at 0:05:17.996 --> 0:05:20.596 progress in the last sort of twenty years before AlphaFold 0:05:20.596 --> 0:05:23.356 one was launched, I mean it was somewhere sort 0:05:23.396 --> 0:05:27.076 of between the twenty five to forty sort of GDT 0:05:27.236 --> 0:05:27.556 sort of. 0:05:27.556 --> 0:05:30.516 Five to forty. Was it getting better slowly or was 0:05:30.556 --> 0:05:33.276 it just kind of stuck in the thirties more or less? 0:05:33.556 --> 0:05:35.996 Yeah, it was stagnating. 
It was like sometimes it would 0:05:36.036 --> 0:05:38.596 go to thirty eight, sometimes thirty five, and it was 0:05:38.636 --> 0:05:40.676 like in that, so it was going up and down, 0:05:40.756 --> 0:05:42.596 up and down. There was no sort of remarkable sort 0:05:42.596 --> 0:05:43.116 of breakthrough. 0:05:43.596 --> 0:05:46.116 And was it all AI? Was AI like the only way 0:05:46.156 --> 0:05:48.316 people were trying to solve it? Were there whole 0:05:48.396 --> 0:05:51.356 other sort of things people were thinking about trying to do? 0:05:51.996 --> 0:05:56.036 Yeah, so this is mostly not AI based solutions, right. 0:05:56.076 --> 0:06:01.516 These were sort of very well designed, hand designed systems 0:06:01.596 --> 0:06:05.276 that were carefully tuned to the problem over many, many 0:06:05.356 --> 0:06:08.756 decades with large teams working together and so on. But 0:06:09.116 --> 0:06:10.396 there was a little machine learning. 0:06:11.116 --> 0:06:14.236 So they'd been scoring in the thirties more or less. 0:06:14.796 --> 0:06:17.596 And then what year is it that DeepMind, you 0:06:17.676 --> 0:06:19.516 and DeepMind show up with AlphaFold one? 0:06:20.036 --> 0:06:25.276 So twenty eighteen. So the contest actually, it runs in 0:06:25.316 --> 0:06:26.356 the summer and 0:06:26.276 --> 0:06:28.396 then at the end of the summer they sort of 0:06:28.516 --> 0:06:30.436 give you the results or what? So 0:06:30.436 --> 0:06:31.996 at the end of the summer, I mean like 0:06:32.076 --> 0:06:33.956 by I think July or August, they have sent the 0:06:34.036 --> 0:06:37.796 last files and then you have sent them the results 0:06:38.116 --> 0:06:40.156 and then you wait, right, and you don't know what 0:06:40.636 --> 0:06:44.716 has happened. Then they invite you to a conference which 0:06:45.076 --> 0:06:48.236 happens in December, so you are like eagerly waiting, 0:06:48.276 --> 0:06:51.316 what has happened? 
And you're like, oh, maybe we came last, 0:06:51.836 --> 0:06:55.196 maybe we're in the middle. And then they actually 0:06:55.236 --> 0:06:58.756 reveal the leaderboard or the scores in the conference. 0:06:59.196 --> 0:07:00.796 Where were you at the time? 0:07:01.996 --> 0:07:03.716 I was in London, I was in the office. I 0:07:03.756 --> 0:07:06.196 was like really waiting, like trying to figure out like 0:07:06.236 --> 0:07:08.796 where were we, right, in terms of the career? 0:07:09.276 --> 0:07:12.996 How did we perform? And we get an email from 0:07:13.396 --> 0:07:16.516 the organizers one day, before the results are about to be 0:07:16.596 --> 0:07:20.556 announced, and they say, well, you are first and by 0:07:20.596 --> 0:07:23.276 a big margin, so like from thirty to forty we 0:07:23.316 --> 0:07:25.076 had gone to more than sixty. 0:07:25.636 --> 0:07:28.356 You did way better at predicting the structure of protein 0:07:28.396 --> 0:07:30.236 than anyone had ever done, including that... 0:07:30.236 --> 0:07:32.796 Yes. You won by a lot. Yes, we won by 0:07:32.836 --> 0:07:33.596 a lot. 0:07:33.796 --> 0:07:36.996 So that's way better than anyone has ever done. But 0:07:37.076 --> 0:07:39.916 does it mean that you're basically half right, is that 0:07:39.956 --> 0:07:41.116 what that number means? 0:07:41.396 --> 0:07:44.316 Yeah, so I think the way we, I mean, we 0:07:44.396 --> 0:07:46.156 are, you're the best in the world, but still your 0:07:46.156 --> 0:07:50.796 predictions are pretty much sort of not very useful for 0:07:51.636 --> 0:07:53.756 any, like if you're trying to figure out what a 0:07:53.876 --> 0:07:57.556 drug binding to this particular protein or like, the error 0:07:57.636 --> 0:08:01.396 is so much, right, that you wouldn't get a complete 0:08:01.436 --> 0:08:02.436 picture of the protein. 0:08:02.836 --> 0:08:07.156 So this sixty number, it's like good in that it's 0:08:07.156 --> 0:08:09.236 better than anyone has ever done. 
It's bad in that 0:08:09.276 --> 0:08:12.516 it's not scientifically useful. What number do you have to 0:08:12.556 --> 0:08:14.356 get to to be scientifically useful? 0:08:14.836 --> 0:08:18.836 Like between eighty five to ninety, that's what people told 0:08:18.916 --> 0:08:21.716 us, that if you get beyond eighty five to ninety, 0:08:22.276 --> 0:08:23.636 then the problem is solved. 0:08:23.836 --> 0:08:27.836 So what do you decide when you get 0:08:27.836 --> 0:08:28.356 this result? 0:08:29.756 --> 0:08:32.276 So we get this result and we're like, yeah, this 0:08:32.396 --> 0:08:35.276 is amazing, right, that we are the best in the world, 0:08:35.356 --> 0:08:37.556 right, by a big margin. Right, so like the thesis 0:08:37.596 --> 0:08:40.716 that machine learning sort of will advance science. Oh that's great, 0:08:41.036 --> 0:08:43.956 but the problem is not solved. And let's go back 0:08:43.996 --> 0:08:47.996 to the drawing board. And now with the information that 0:08:48.036 --> 0:08:49.636 we have and the amount of time we have spent 0:08:49.756 --> 0:08:52.356 on this previous architecture, do we still think that this 0:08:52.396 --> 0:08:56.396 will lead us to where we want to go? And 0:08:56.556 --> 0:09:00.396 the team thought, no, we need to completely, yeah, we 0:09:00.636 --> 0:09:02.356 need to completely start from scratch. 0:09:02.876 --> 0:09:06.356 So your reaction to winning this contest and doing 0:09:06.396 --> 0:09:09.596 better than anyone has ever done at predicting protein folding 0:09:09.676 --> 0:09:12.436 is let's blow up this thing that just won the contest. 0:09:13.036 --> 0:09:17.796 Yeah, throw it away. Yeah, we were like, the basic 0:09:17.876 --> 0:09:21.356 premise was proved that machine learning has a role to play, right. 0:09:21.396 --> 0:09:23.596 So that gave us a lot of confidence. 
But at 0:09:23.596 --> 0:09:25.636 the same time we thought, well, this is not 0:09:25.676 --> 0:09:28.756 an elegant solution, right. This 0:09:28.836 --> 0:09:31.916 is like two modules. Like, 0:09:32.076 --> 0:09:34.556 there's a machine learning module. It is making these sort 0:09:34.596 --> 0:09:37.036 of predictions which this other module is sort of trying 0:09:37.076 --> 0:09:39.436 to use. If you believe in the power of machine learning, 0:09:39.516 --> 0:09:41.476 let's do end to end, right. Let's do end to 0:09:41.596 --> 0:09:45.996 end and basically do everything so that the model takes 0:09:45.996 --> 0:09:47.276 care of it, right, rather 0:09:47.116 --> 0:09:50.516 than... To be clear, what was happening with that initial model that 0:09:50.596 --> 0:09:53.076 you were deciding to abandon? That 0:09:53.316 --> 0:09:56.996 was basically using the machine learning model together with 0:09:57.636 --> 0:10:01.556 sort of a known framework. Right, that there is a 0:10:01.636 --> 0:10:04.316 second step that was a conventional sort of step. 0:10:04.996 --> 0:10:07.676 Oh I see. So it's like you weren't all in 0:10:07.716 --> 0:10:10.196 on machine learning. You were like, well, we're gonna use 0:10:10.236 --> 0:10:12.356 machine learning, but we're gonna still do this kind of 0:10:12.396 --> 0:10:14.276 the way other people have been doing it. And your 0:10:14.276 --> 0:10:18.716 response to that first result was, screw it, let's not 0:10:18.756 --> 0:10:21.156 do anything like everybody's done before. Let's just go all 0:10:21.196 --> 0:10:26.836 in on machine learning, beginning to end. Exactly. Interesting. And 0:10:26.996 --> 0:10:29.556 so you do that, and you spend what, two years? 0:10:29.636 --> 0:10:31.436 Is it two years between? Do I have that right? 0:10:31.516 --> 0:10:31.716 Yeah. 0:10:31.916 --> 0:10:36.436 Yeah, and then you come back in twenty twenty. 
You 0:10:36.476 --> 0:10:40.396 come back in twenty twenty, there's another one of these contests. Yeah, 0:10:40.516 --> 0:10:42.796 you got your new end-to-end machine learning model. 0:10:42.956 --> 0:10:46.436 Yeah. So it was the pandemic. So this is twenty twenty, right. 0:10:46.276 --> 0:10:47.916 So nobody's going anywhere. 0:10:48.036 --> 0:10:51.716 Yeah, nobody's going anywhere. Right. You know, like twenty nineteen 0:10:52.476 --> 0:10:55.556 was basically where we started working with this new model, 0:10:55.756 --> 0:10:58.756 and it was really tough going because we were like 0:10:58.836 --> 0:11:02.476 starting from, we're starting from twenty, right, so we were 0:11:02.516 --> 0:11:05.836 at sixty. Now we are starting from twenty, and twenty, 0:11:06.036 --> 0:11:08.956 thirty, forty. And sometimes you would stagnate at forty five, 0:11:09.236 --> 0:11:12.396 fifty. You were like, really? Should I, should I had 0:11:12.436 --> 0:11:13.276 that figure model? 0:11:13.756 --> 0:11:14.076 Yeah. 0:11:14.236 --> 0:11:17.396 Yeah. So by the end of twenty nineteen, though, 0:11:17.636 --> 0:11:20.236 we started getting some really really cool results and we thought, okay, 0:11:20.276 --> 0:11:23.236 now we have surpassed, we have definitely surpassed the previous model. 0:11:23.756 --> 0:11:27.876 We're in good territory. And we were very excited. Like 0:11:27.916 --> 0:11:31.036 at the start of twenty twenty, we were like, yeah, 0:11:31.436 --> 0:11:34.516 making progress, and then the pandemic hit. 0:11:38.436 --> 0:11:42.036 In a minute, the model gets an unanticipated test in 0:11:42.076 --> 0:11:42.836 the real world. 0:11:47.036 --> 0:11:51.036 There was this new virus that was reported, SARS-CoV-2, 0:11:51.556 --> 0:11:54.276 and one of the first things, so somebody sort of 0:11:54.556 --> 0:11:57.436 figured out the structure of the spike protein. 
It was 0:11:57.476 --> 0:12:00.116 all over the newspapers, like, here's the spike protein of 0:12:00.116 --> 0:12:03.716 this new virus, but all the other proteins of the virus, 0:12:03.876 --> 0:12:09.036 the accessory proteins, nobody knew the structure. So the first thing 0:12:09.356 --> 0:12:11.836 we did, we thought, I think we have 0:12:11.916 --> 0:12:14.516 the best model in the world. We should be making 0:12:14.556 --> 0:12:17.316 these predictions and sharing it with the world, but is 0:12:17.316 --> 0:12:20.196 this the right thing to do? So we spent a 0:12:20.196 --> 0:12:24.276 lot of time reaching out to biologists who looked at 0:12:24.316 --> 0:12:27.036 the prediction and said, well, you need to share this, 0:12:27.716 --> 0:12:29.556 you need to share this with the world. So the 0:12:29.596 --> 0:12:32.756 start of twenty twenty was us sort of sharing the 0:12:32.796 --> 0:12:36.756 predictions from this untested model with the world because we 0:12:37.036 --> 0:12:41.396 thought they were quite good. And then throughout twenty twenty 0:12:41.476 --> 0:12:44.396 we took part in the assessment, right, which ran in 0:12:44.396 --> 0:12:46.116 the summer of twenty twenty. The contest. 0:12:46.196 --> 0:12:47.356 In the contest. 0:12:47.476 --> 0:12:50.996 Exactly. Normally, right, the organizers don't come back to you. 0:12:51.396 --> 0:12:55.436 They just release the results at the end in December. 0:12:56.196 --> 0:12:58.596 And at the end of the summer we get this 0:12:58.636 --> 0:13:01.796 funny email saying we want to talk to you, and 0:13:01.916 --> 0:13:06.076 so we were like, yeah, like, did we do anything bad? 0:13:06.476 --> 0:13:09.956 What happened? And a few of them really had sort 0:13:09.956 --> 0:13:14.276 of suspicions. They were like, you must have cheated, right, 0:13:14.556 --> 0:13:19.116 like that your predictions are, your level of performance is 0:13:19.156 --> 0:13:22.316 nowhere close to anything that we have seen ever. 
Right. 0:13:23.036 --> 0:13:27.276 But a few scientists in that contest had submitted a 0:13:27.316 --> 0:13:31.396 sequence, a protein whose structure was not known. They were 0:13:31.676 --> 0:13:34.476 expecting that the structure would be known by the time 0:13:34.516 --> 0:13:37.556 the contest ended, so we'll be able to evaluate the predictions. 0:13:37.636 --> 0:13:39.676 But that structure was not known, and in fact, the 0:13:39.716 --> 0:13:41.116 structure, they couldn't find the structure. 0:13:41.356 --> 0:13:44.036 So you're saying it would be impossible to cheat because 0:13:44.076 --> 0:13:47.556 literally no human knew the structure. No way to cheat, 0:13:47.836 --> 0:13:48.956 nobody knows the answer. 0:13:49.156 --> 0:13:53.836 Yeah, yeah. So they used the prediction of AlphaFold 0:13:54.676 --> 0:13:59.516 and then tried to explain their experimental data and it matched. 0:14:00.556 --> 0:14:03.076 And they are like, this model has been able to 0:14:03.116 --> 0:14:07.516 discover something that nobody knew, not even, no scientists knew. 0:14:09.276 --> 0:14:12.996 In a sense, the model had already made new biological discoveries even 0:14:13.276 --> 0:14:14.196 before we knew it. 0:14:14.716 --> 0:14:18.996 Yeah, yeah, okay, so that's good. You're not in trouble anymore. 0:14:20.076 --> 0:14:22.956 It's clear you didn't cheat. Do they say the number? 0:14:22.996 --> 0:14:25.596 What's the number? I'm waiting for the number. How'd you do? 0:14:26.116 --> 0:14:28.716 Yeah, so we were beyond eighty five and ninety, right, 0:14:28.756 --> 0:14:31.636 and then they basically said, okay, we have to announce 0:14:31.676 --> 0:14:35.796 it to the world. And so come December that was 0:14:35.876 --> 0:14:38.796 the announcement that was made by the organizers, that this 0:14:39.196 --> 0:14:43.156 AlphaFold two had solved the protein structure prediction problem. 0:14:43.556 --> 0:14:46.156 So is that contest done now? 
Did you just end 0:14:46.236 --> 0:14:48.556 that contest? Is nobody doing that anymore? 0:14:48.956 --> 0:14:52.636 No, the contest sort of is alive, right, it has changed, 0:14:52.836 --> 0:14:56.396 its focus has changed. So what AlphaFold two 0:14:56.476 --> 0:15:00.596 did was find the structure of these single proteins. But 0:15:00.636 --> 0:15:03.236 there are many other problems that remain, right, how do 0:15:03.556 --> 0:15:09.116 multiple proteins interact, for instance, right. So there are other structure prediction 0:15:09.156 --> 0:15:12.796 problems that now the contest sort of has evolved to, right, 0:15:12.796 --> 0:15:15.516 it is sort of focusing on other types of problems 0:15:15.556 --> 0:15:18.116 that AlphaFold two did not address. 0:15:18.516 --> 0:15:21.796 If we zoom out and think about what you have done, 0:15:21.836 --> 0:15:24.556 what the team has done in, you know, using machine 0:15:24.636 --> 0:15:28.236 learning to solve this scientific problem that people had been 0:15:28.236 --> 0:15:30.836 working on for a long time, like what are the 0:15:30.916 --> 0:15:34.316 broader lessons? Like if we think about other domains, what 0:15:34.796 --> 0:15:37.396 can we infer, what can we take from this? 0:15:38.636 --> 0:15:40.476 Yeah, so I think the thing that we can take 0:15:40.516 --> 0:15:44.596 from this is basically science is sort of generating a 0:15:44.636 --> 0:15:47.356 lot of data across any domain that you see, right, 0:15:47.396 --> 0:15:51.476 whether it's genomics, high energy physics, weather, the amount of data 0:15:51.516 --> 0:15:54.236 that we are gathering about the world is much more 0:15:54.276 --> 0:15:56.156 than any single human mind can comprehend. 0:15:56.396 --> 0:15:56.516 Right. 0:15:56.556 --> 0:15:58.316 You can have the best scientists and they will not 0:15:58.356 --> 0:16:00.436 be able to sort of go through all the data 0:16:00.476 --> 0:16:03.716 that we are collecting about our world. 
So machine learning 0:16:03.756 --> 0:16:06.116 is this remarkable sort of tool which gives us the 0:16:06.196 --> 0:16:09.396 ability to make sense and leverage this data, right, and 0:16:10.036 --> 0:16:12.596 really sort of on the path of really accelerating our 0:16:12.676 --> 0:16:15.356 understanding of the problems that we're dealing with. 0:16:16.076 --> 0:16:18.476 In the case of AlphaFold, was the sort of 0:16:18.556 --> 0:16:24.156 input data the known protein structures and amino acid sequence, 0:16:24.156 --> 0:16:27.236 and was that the basic training data? Exactly, right. 0:16:27.276 --> 0:16:30.676 So it was the PDB, which was the protein database, 0:16:31.196 --> 0:16:34.756 and that had been collected by the community for many, 0:16:34.796 --> 0:16:39.356 many years, right, over many decades. They have meticulously, carefully 0:16:39.396 --> 0:16:43.356 deposited all the protein sequences and the corresponding structures that 0:16:43.396 --> 0:16:46.316 were discovered, right. And it had one hundred and fifty 0:16:46.396 --> 0:16:49.836 thousand examples at that time, right, sequences as well as structures, 0:16:50.196 --> 0:16:53.756 and everyone had access to the same data. Right, all 0:16:53.756 --> 0:16:57.436 the teams were training on that data. 0:16:57.636 --> 0:17:00.636 Is it right that AlphaFold itself is open sourced 0:17:00.716 --> 0:17:04.716 and that there's this open source database of protein structures 0:17:04.756 --> 0:17:06.916 that have been discovered with AlphaFold? Is that right? 0:17:07.716 --> 0:17:10.996 Yeah. So when we sort of developed AlphaFold, we 0:17:11.116 --> 0:17:14.196 made it available to the world. But we then said, well, 0:17:14.556 --> 0:17:17.276 it's so accurate, but it's also so fast that we 0:17:17.356 --> 0:17:20.876 will use it to find the structure for every sort 0:17:20.916 --> 0:17:24.396 of known protein. And then we made all those structures 0:17:24.436 --> 0:17:25.796 available to the world. 
0:17:30.076 --> 0:17:33.036 AlphaFold has now made the structures of roughly two 0:17:33.196 --> 0:17:38.836 hundred and fifty million different proteins publicly available. We'll be 0:17:38.876 --> 0:17:45.076 back in a minute with the lightning round. Last thing 0:17:45.316 --> 0:17:48.956 is a lightning round. Just some fast questions, okay, and 0:17:48.996 --> 0:17:51.276 then we'll be done. What's your favorite protein? 0:17:51.836 --> 0:17:52.516 Hemoglobin. 0:17:53.396 --> 0:17:54.436 Why? 0:17:55.156 --> 0:17:57.036 It is sort of very pleasant to look at. It 0:17:56.876 --> 0:17:58.916 is very symmetric, it has there and you can see 0:17:58.996 --> 0:18:01.996 its purpose, right, that, and the oxygen binds into that 0:18:02.036 --> 0:18:03.476 thing, right. A very clean protein. 0:18:04.116 --> 0:18:07.396 It's so easy to understand. It's the little thing that 0:18:07.516 --> 0:18:10.916 carries oxygen around your body. If everything goes well, what 0:18:10.996 --> 0:18:13.596 problem will you be trying to solve in, say, five years? 0:18:14.756 --> 0:18:19.556 Really sort of thinking about the two big challenges sort 0:18:19.596 --> 0:18:21.876 of that humanity is facing. One is the pandemic, the 0:18:21.956 --> 0:18:24.596 other is climate change. And I think material science and 0:18:24.676 --> 0:18:28.956 quantum chemistry can impact both, but especially climate change. And 0:18:28.996 --> 0:18:31.516 I think this is something that requires a lot of work. 0:18:32.676 --> 0:18:37.036 Is there some particular problem in that domain that is 0:18:37.076 --> 0:18:40.556 analogous to protein folding? Is there some hard thing that 0:18:40.596 --> 0:18:41.676 you want to figure 0:18:41.396 --> 0:18:45.516 out? Rational material design. We are very far from there. 0:18:45.676 --> 0:18:49.476 We are still basically doing experimental stuff when we think 0:18:49.516 --> 0:18:51.916 about discovering new materials. 
0:18:53.236 --> 0:18:56.316 What do you understand about AI or machine learning that 0:18:56.436 --> 0:18:58.236 most people don't understand? 0:18:59.596 --> 0:19:03.316 I think sort of AI is not magic, right, it's 0:19:03.356 --> 0:19:07.676 sort of essentially, it's a series of techniques which is 0:19:07.756 --> 0:19:12.716 able to extract intelligence. But you extract intelligence from the 0:19:12.836 --> 0:19:16.996 raw material, right. So garbage in, garbage out. So 0:19:17.676 --> 0:19:21.356 what is really important is that experience needs to 0:19:21.396 --> 0:19:25.036 be rich enough. Right, we can't just, we don't become 0:19:25.076 --> 0:19:27.676 intelligent by sitting in the room, right. We become intelligent 0:19:27.676 --> 0:19:32.356 because we have amazing experiences. So it's not big data, right, 0:19:32.396 --> 0:19:34.876 it's not the bigness of the experience, but it's like 0:19:35.116 --> 0:19:38.516 the goodness of the experience, like the wide variety of 0:19:38.796 --> 0:19:41.196 sort of things that you train on and the things 0:19:41.196 --> 0:19:44.596 that you see. So I think that's very, really, that's 0:19:44.636 --> 0:19:45.436 really important. 0:19:46.316 --> 0:19:51.116 That thought leads you to like the optimal training data. 0:19:51.196 --> 0:19:53.316 So is the worry that like people are making a 0:19:53.356 --> 0:19:55.836 mistake by just doing a lot of the same kind 0:19:55.916 --> 0:19:56.716 of training data? 0:19:57.676 --> 0:20:01.156 Yeah, exactly, exactly right. So if you just take one example, 0:20:01.236 --> 0:20:04.916 you repeat it multiple times, right, that's not great. Again, 0:20:05.156 --> 0:20:07.436 you don't become, yeah, you don't become wise doing the 0:20:07.476 --> 0:20:09.036 same thing again and again and again. 0:20:09.196 --> 0:20:11.996 Right. What are you actually working on right now? 
Like 0:20:11.996 --> 0:20:14.156 what are you going to go work on today or 0:20:14.276 --> 0:20:14.876 next week? 0:20:16.116 --> 0:20:21.716 So there is a system that my team developed called SynthID, 0:20:21.956 --> 0:20:26.036 which is a system for watermarking AI generated content. So 0:20:26.076 --> 0:20:28.876 we want to be able to detect it. When you 0:20:28.916 --> 0:20:31.556 have AI generated content, users should be able to detect 0:20:31.636 --> 0:20:33.156 that this is AI 0:20:33.316 --> 0:20:39.276 generated content, whether it's images or words or whatever, text, video. 0:20:39.716 --> 0:20:46.716 Exactly, exactly. You embed this imperceptible thing within the thing 0:20:46.756 --> 0:20:49.196 that is generated that a human might not see. 0:20:49.436 --> 0:20:52.716 So the builder of the AI model, OpenAI, could 0:20:52.796 --> 0:20:57.076 choose to embed a watermark in GPT, so that anybody 0:20:57.076 --> 0:21:00.356 who made a thing with GPT, that document would have 0:21:00.436 --> 0:21:03.716 some hidden sign that it was AI generated. It's sort 0:21:03.716 --> 0:21:07.396 of the choice of the model builders. Yeah. Thank 0:21:07.436 --> 0:21:09.236 you very much for your time. It was great to 0:21:09.276 --> 0:21:09.676 talk with you. 0:21:10.516 --> 0:21:12.556 Yeah, thank you. Good. It was a pleasure. 0:21:18.916 --> 0:21:22.916 Pushmeet Kohli is vice president of research at Google DeepMind. 0:21:23.596 --> 0:21:26.916 Today's show was produced by Edith Russello and edited by 0:21:27.036 --> 0:21:31.676 Karen Chakerje. You can email us at problem at Pushkin 0:21:31.916 --> 0:21:34.436 dot FM. I'm Jacob Goldstein.