WEBVTT - Understanding the Most Viral Chart in Artificial Intelligence 0:00:02.720 --> 0:00:14.000 Bloomberg Audio Studios, Podcasts, Radio News. 0:00:18.560 --> 0:00:21.919 Hello and welcome to another episode of the Odd Lots podcast. 0:00:22.000 --> 0:00:24.959 I'm Joe Weisenthal. And I'm Tracy Alloway. Tracy, 0:00:25.200 --> 0:00:28.560 one thing about AI is that there are lots of lines that 0:00:28.640 --> 0:00:29.000 go up. 0:00:30.640 --> 0:00:34.919 Yes, famously, there is perhaps one line that has captured 0:00:34.920 --> 0:00:37.000 the attention more than others when it comes to lines 0:00:37.080 --> 0:00:37.479 going up. 0:00:37.720 --> 0:00:40.560 Yes, but we're recording as of April seventh. Did you see 0:00:40.600 --> 0:00:43.640 the Anthropic revenue chart, by the way? Oh. 0:00:43.320 --> 0:00:44.440 It's just extreme. 0:00:44.720 --> 0:00:47.519 Okay, it's just the number of lines going up. I 0:00:47.560 --> 0:00:49.440 mean, there are some really. 0:00:49.520 --> 0:00:53.240 Let me caveat that. Up until recently, there was one 0:00:53.400 --> 0:00:56.840 chart of a line going up exponentially that became, I 0:00:56.840 --> 0:00:59.880 think it's fair to say, the most viral chart in AI. 0:01:00.240 --> 0:01:02.680 Right, yes, I would absolutely agree with that. 0:01:02.800 --> 0:01:05.600 So one of the many lines that go up, or 0:01:05.600 --> 0:01:08.399 there are various lines that sort of capture this. These are 0:01:08.840 --> 0:01:11.240 essentially just measures of AI progress, of what they could do, 0:01:11.280 --> 0:01:14.360 what the models are capable of, and so forth. And 0:01:15.000 --> 0:01:18.800 you know, there's all different benchmarks out there and hobbyist 0:01:18.920 --> 0:01:22.120 benchmark creators, et cetera, all kinds of benchmarks out there.
0:01:22.920 --> 0:01:26.880 There's an organization called METR, based out in San Francisco, and they 0:01:26.920 --> 0:01:30.280 measure how well AI models are doing at various sort 0:01:30.280 --> 0:01:34.000 of engineering tasks, et cetera. And they have these charts 0:01:34.000 --> 0:01:37.440 showing how long, you know, certain tasks, how long it would 0:01:37.440 --> 0:01:40.240 take a human to do them, and then whether AI 0:01:40.319 --> 0:01:43.040 could do them. And yes, the line's just almost vertical. 0:01:43.040 --> 0:01:44.920 I think there was one of the ones that 0:01:44.959 --> 0:01:47.720 came out maybe very early this year or late last year, 0:01:47.800 --> 0:01:49.280 showing the latest Claude model. 0:01:49.360 --> 0:01:50.920 It's, yes, like, this is crazy. 0:01:51.040 --> 0:01:54.360 When I look at these charts, they're called time horizon charts. 0:01:54.720 --> 0:01:57.880 When I look at them, like, intuitively I kind 0:01:57.880 --> 0:02:00.840 of understand what they're saying, and you can kind of 0:02:00.880 --> 0:02:04.080 see the leap in progress between some of the previous 0:02:04.120 --> 0:02:07.560 models and Claude, right, the latest Claude model. And that's 0:02:07.560 --> 0:02:10.440 what got everyone excited, was you had this big exponential 0:02:10.520 --> 0:02:14.400 shift up in the capability of that particular AI model. 0:02:14.680 --> 0:02:17.600 But then when I start, like, diving into what it 0:02:17.600 --> 0:02:21.320 actually says on METR's website about what these charts represent, 0:02:21.440 --> 0:02:24.040 I start getting really confused.
I know everyone wants to 0:02:24.040 --> 0:02:26.520 get excited about AI and charts going up in general, 0:02:26.680 --> 0:02:28.280 but I think there's a lot of nuance here and 0:02:28.320 --> 0:02:30.200 we should probably talk about it, because the other thing 0:02:30.240 --> 0:02:33.000 going on with METR right now is they've become sort 0:02:33.000 --> 0:02:35.920 of the industry standard benchmark, and so a lot of 0:02:36.080 --> 0:02:39.720 investment decisions are being based on these charts. And if 0:02:39.720 --> 0:02:42.400 you oversimplify them as just, like, okay, lines going up 0:02:42.760 --> 0:02:46.440 and then suddenly it goes up even more, obviously, people 0:02:46.480 --> 0:02:48.280 are going to start to get, like, maybe a little 0:02:48.280 --> 0:02:48.959 overexcited. 0:02:49.040 --> 0:02:51.160 Can I say one other thing too that I'm very 0:02:51.160 --> 0:02:54.040 curious about? Like, I'm really glad that there are people 0:02:54.240 --> 0:02:58.320 designing various benchmarks for measuring AI progress. Seems like an 0:02:58.360 --> 0:03:01.119 important thing to get a handle on. But, like, if 0:03:01.200 --> 0:03:05.480 I were, like, say, talented or smart enough to 0:03:05.520 --> 0:03:07.440 be, like, doing these things, I would go work for 0:03:07.480 --> 0:03:09.560 one of the labs and make ten million dollars a 0:03:09.639 --> 0:03:11.520 year or something like that. And so I'm actually curious 0:03:11.560 --> 0:03:13.440 about a lot of the nonprofits, et cetera. It's like, 0:03:13.960 --> 0:03:16.120 do you really want to be, like, working at the 0:03:16.160 --> 0:03:19.480 cutting edge of AI in a nonprofit? I mean, I 0:03:19.480 --> 0:03:22.040 guess OpenAI is owned by a nonprofit, weirdly enough, but 0:03:22.080 --> 0:03:23.960 you know what I'm saying. Like, I would want the money. 0:03:24.160 --> 0:03:26.240 We should talk about it with our guests, who are 0:03:26.240 --> 0:03:27.480 currently sitting right here.
0:03:28.480 --> 0:03:29.320 That's exactly right. 0:03:29.400 --> 0:03:31.799 I'm very excited to say we have the two perfect 0:03:31.840 --> 0:03:34.280 guests to talk about the most viral and maybe 0:03:34.320 --> 0:03:36.640 important chart in AI right now. We're going to be 0:03:36.640 --> 0:03:38.560 speaking with Joel Becker. He is a member of the 0:03:38.600 --> 0:03:41.160 technical staff at METR, and we're also going to be 0:03:41.160 --> 0:03:44.160 speaking with Chris Painter, the president of METR. So, uh, 0:03:44.600 --> 0:03:46.920 Joel and Chris, thank you so much for coming on Odd 0:03:47.000 --> 0:03:50.960 Lots. Thanks for having us. Yeah, really excited to chat with 0:03:51.080 --> 0:03:54.000 both of you. Chris, you're the president. I'll start 0:03:54.040 --> 0:03:55.920 with you. Like, what is METR? How long has it 0:03:55.920 --> 0:03:58.640 been around? What is this organization? What's its goal? Just 0:03:58.640 --> 0:04:02.000 give us the sort of sixty second synopsis of METR. 0:04:02.160 --> 0:04:02.640 Yeah, totally. 0:04:02.720 --> 0:04:04.840 I can try and, you know, sometimes I give a 0:04:04.840 --> 0:04:07.040 long version. I can try and do a short version here. 0:04:07.120 --> 0:04:11.080 So METR is a research nonprofit based in the Bay Area, 0:04:11.160 --> 0:04:15.160 like you said, dedicated to advancing the science of measuring 0:04:15.200 --> 0:04:19.159 whether and when AI systems might pose catastrophic risks to 0:04:19.320 --> 0:04:23.680 humanity as a whole, focused specifically on threats that come 0:04:23.720 --> 0:04:27.400 from AI autonomy or AI systems themselves. So when you 0:04:27.440 --> 0:04:29.559 talk about, there's kind of this whole field in AI 0:04:30.240 --> 0:04:34.360 of dangerous capability evaluations. People asking, can this AI system 0:04:34.360 --> 0:04:37.960 assist with a chemical or biological weapon attack, can it advance,
0:04:38.360 --> 0:04:40.920 kind of like, bad actors' ability to execute cyber attacks 0:04:40.960 --> 0:04:43.880 on a really large scale. METR is sort of specialized 0:04:43.960 --> 0:04:48.880 in specifically assessing how autonomous are AI systems, what is 0:04:48.920 --> 0:04:51.840 the scale and, like, length and difficulty of tasks that 0:04:51.839 --> 0:04:55.000 they're able to do by themselves, partially because we think 0:04:55.000 --> 0:04:59.080 it sets the stakes for conversations about AI misalignment. So 0:04:59.120 --> 0:05:01.200 we sort of see ourselves being on the hook for, 0:05:01.240 --> 0:05:04.559 at any given point in time, giving humanity the bits 0:05:04.600 --> 0:05:08.480 of evidence that are most informative for establishing the stakes 0:05:08.520 --> 0:05:12.400 of, are we reliant on AI systems as a society 0:05:12.440 --> 0:05:14.960 in a way that could make it really bad if 0:05:15.000 --> 0:05:16.200 they are misaligned. 0:05:16.480 --> 0:05:18.360 I'm going to let Joe ask the question about why 0:05:18.360 --> 0:05:20.280 you're both working in a nonprofit instead of one of 0:05:20.320 --> 0:05:23.440 the labs later. But one question I do have is, 0:05:23.480 --> 0:05:26.120 when I think of METR, you guys always come up 0:05:26.120 --> 0:05:28.760 in the context of these time horizon charts. And I 0:05:29.560 --> 0:05:31.240 don't mean this as an insult or anything, but I 0:05:31.240 --> 0:05:34.600 hardly ever hear anyone talk about the actual safety aspect 0:05:34.760 --> 0:05:36.440 of your mission. Why do you think that is? 0:05:36.720 --> 0:05:39.599 Yeah, so I think there's some distinction between our motive 0:05:39.680 --> 0:05:42.800 for assessing time horizons and the kind of, how it 0:05:42.839 --> 0:05:45.040 gets used then by the rest of the world, or 0:05:45.120 --> 0:05:47.040 kind of, like, what the origin of the rest of 0:05:47.080 --> 0:05:49.520 the world's interest in it is. For METR,
I think the 0:05:49.920 --> 0:05:52.719 reason that we work on things like the time horizon 0:05:52.800 --> 0:05:55.680 charts is because we're trying to establish the stakes 0:05:55.720 --> 0:05:59.120 for talking about, could AI systems go rogue, or one 0:05:59.200 --> 0:06:01.599 day could they, like, try to take over and subvert 0:06:01.680 --> 0:06:04.760 human control? Three years ago, if you went back to 0:06:04.760 --> 0:06:07.320 around when METR started, about four-ish years ago, and 0:06:07.400 --> 0:06:09.760 it was started by Beth Barnes, Paul Christiano, 0:06:09.800 --> 0:06:12.720 and this was kind of the initial motive, is if 0:06:12.720 --> 0:06:14.880 you went back then and you said, why don't I 0:06:14.960 --> 0:06:17.240 think that AI systems are going to go rogue and, 0:06:17.320 --> 0:06:20.680 like, take over or overthrow humanity today, the kind of 0:06:20.720 --> 0:06:22.400 most intuitive, you know, you can come up with a 0:06:22.400 --> 0:06:25.520 lot of abstract reasons, debates about the goals AI systems 0:06:25.560 --> 0:06:27.680 might or might not eventually have, but the kind of 0:06:27.720 --> 0:06:30.920 most damning in-the-moment reason is the AI system 0:06:31.040 --> 0:06:33.280 just can't do much, right? It doesn't make sense to 0:06:33.320 --> 0:06:36.560 talk about a question answering system that, like, can't even 0:06:36.600 --> 0:06:40.000 reliably answer programming questions, saying, like, is it going to 0:06:40.200 --> 0:06:42.480 hack my systems or, like, backdoor me in some way. 0:06:42.520 --> 0:06:44.120 It just doesn't make any sense to talk about. 0:06:44.000 --> 0:06:46.040 It's going to write you a poem that you asked for. 0:06:46.240 --> 0:06:48.240 Right, or won't even. At the time they couldn't do 0:06:48.279 --> 0:06:50.880 anything for themselves.
And so if you're like, kind of, 0:06:51.080 --> 0:06:54.159 being able to subvert human control depends on agency. And 0:06:54.200 --> 0:06:55.960 so we wanted to come up with a measure that 0:06:56.040 --> 0:06:58.400 kind of tracks agency over time to kind of say, 0:06:58.400 --> 0:07:01.520 when would this argument no longer hold? When are AI systems 0:07:01.560 --> 0:07:04.840 now able to kind of do long, complex enough actions 0:07:04.839 --> 0:07:08.320 by themselves that the argument, kind of, the goalposts almost 0:07:08.360 --> 0:07:10.920 move somewhere else, to like, well, we would catch the 0:07:10.960 --> 0:07:13.880 AIs, or the AIs don't want to subvert human control. 0:07:14.240 --> 0:07:17.000 And so I agree that there is a distinction between, 0:07:17.040 --> 0:07:19.600 like, how, I think partially the exercise of trying to 0:07:19.600 --> 0:07:22.320 come up with these measures throws off things that are 0:07:22.440 --> 0:07:26.000 very, like, grounded and intuitive measures of AI progress that 0:07:26.080 --> 0:07:28.960 might be more intuitive than just benchmarks. Right, so 0:07:29.000 --> 0:07:30.600 a lot of people are in the game of 0:07:30.600 --> 0:07:33.480 making just benchmarks, where you say, like, here's my harm 0:07:33.480 --> 0:07:36.400 bench or something, the AI gets seventy percent. That's much 0:07:36.480 --> 0:07:39.400 less of a kind of grounded or long lasting metric. 0:07:39.480 --> 0:07:41.560 Like, it's hard to say what that means or how 0:07:41.560 --> 0:07:44.240 that generalizes, but the idea with time horizon is, like, 0:07:44.280 --> 0:07:46.880 maybe it's more intuitive, and I think that helps both 0:07:46.880 --> 0:07:49.280 for safety and for, like, business understanding. 0:07:49.600 --> 0:07:52.240 So let's talk about this chart.
0:07:52.240 --> 0:07:55.240 I go to the main chart here at metr.org, 0:07:55.360 --> 0:07:58.320 right on the front page, it's this time horizon chart, 0:07:58.360 --> 0:08:01.360 and it shows Claude Opus four point six as of 0:08:01.400 --> 0:08:05.240 February twenty twenty six able to complete a task length 0:08:05.360 --> 0:08:08.520 of eleven hours and fifty nine minutes with a fifty 0:08:08.560 --> 0:08:13.280 percent success rate. I have to admit, the first time 0:08:13.680 --> 0:08:16.160 I saw this chart or a version of this chart, what 0:08:16.280 --> 0:08:19.200 I assumed, and I suspect others assumed, is that it 0:08:19.320 --> 0:08:21.000 was able to go off and work on a task 0:08:21.040 --> 0:08:24.200 for eleven hours and fifty nine minutes, then come back 0:08:24.240 --> 0:08:27.200 with an answer. But apparently it's not that. Why don't 0:08:27.240 --> 0:08:30.160 you walk us through what's really being measured here? By 0:08:30.160 --> 0:08:34.240 the way, the previous high was GPT five 0:08:34.280 --> 0:08:36.559 point three Codex. That was five hours and fifty minutes. 0:08:36.679 --> 0:08:38.280 So I guess part of the reason this chart 0:08:38.320 --> 0:08:40.720 blew people's minds is because literally that's basically a double. 0:08:40.760 --> 0:08:42.640 But why don't you talk to us about what's really 0:08:42.679 --> 0:08:43.400 being measured here? 0:08:43.520 --> 0:08:46.520 Yeah, so fundamentally, you know, in simpler terms, we are 0:08:46.559 --> 0:08:49.880 plotting the difficulty of tasks the AIs are able to 0:08:49.920 --> 0:08:52.439 complete over time. And you know, the particular way that 0:08:52.480 --> 0:08:54.959 we measure the difficulty of tasks is in how long 0:08:55.000 --> 0:08:57.679 it takes humans to complete those same tasks that we're 0:08:57.720 --> 0:08:59.880 asking the AIs to do.
So in this case, you know, 0:09:00.080 --> 0:09:02.400 talking about, for Opus four point six, something like, 0:09:02.520 --> 0:09:05.520 tasks that take humans twelve hours to do, we predict 0:09:05.559 --> 0:09:08.199 that it will succeed at those tasks around fifty percent 0:09:08.240 --> 0:09:11.120 of the time. And yeah, you know, it turns out 0:09:11.200 --> 0:09:14.280 that when you plot, using this particular difficulty measure, how 0:09:14.280 --> 0:09:17.440 performant AIs are relative to how long it takes humans 0:09:17.440 --> 0:09:21.280 to complete these tasks, we see an exponential increase in 0:09:21.440 --> 0:09:24.640 capabilities for AIs. And what that ends up meaning is 0:09:24.640 --> 0:09:27.640 that you keep on having these doublings of capabilities every, 0:09:27.760 --> 0:09:28.880 let's say, four months, 0:09:28.920 --> 0:09:30.640 it seems, on recent trends, 0:09:30.480 --> 0:09:32.600 where, you know, the next model is not merely going 0:09:32.640 --> 0:09:35.280 to have necessarily, you know, an hour longer time horizon, 0:09:35.320 --> 0:09:38.240 but perhaps be having some multiple of the 0:09:38.160 --> 0:09:40.319 time horizon of the previous model that's come out. 0:09:40.520 --> 0:09:44.239 So then explain how the number there, twelve hours, 0:09:44.360 --> 0:09:49.000 is established. So there is some engineering task and you say, okay, 0:09:49.040 --> 0:09:51.760 this is a task that would require twelve hours, but 0:09:52.080 --> 0:09:55.959 humans are of all different types of talent, capabilities. How do 0:09:56.000 --> 0:09:58.920 you establish that, okay, this was a twelve hour task, 0:09:58.960 --> 0:10:00.679 this was a six hour task, whatever? 0:10:00.960 --> 0:10:03.600 Yeah, so the simple answer is, literally, we get humans 0:10:03.840 --> 0:10:05.800 to sit down and complete the tasks that we give 0:10:05.800 --> 0:10:09.080 to AIs in as close to identical conditions as possible.
0:10:09.400 --> 0:10:11.400 So first we come up with the tasks, and that's, 0:10:11.480 --> 0:10:13.240 you know, that's a whole other kettle of fish. We can 0:10:13.240 --> 0:10:16.240 talk about exactly how we do that. And then, using 0:10:16.440 --> 0:10:18.960 essentially the same tools that we're about to give the AIs, 0:10:19.240 --> 0:10:21.880 we take talented humans, you know, not people who have 0:10:21.880 --> 0:10:24.080 seen this particular type of task before, but people who 0:10:24.120 --> 0:10:26.959 have relevant expertise. So if it's a software engineering task, 0:10:27.120 --> 0:10:29.760 you know, they have software engineering expertise. Machine learning task, 0:10:29.840 --> 0:10:33.280 they have machine learning expertise. And then we time them, 0:10:33.360 --> 0:10:35.360 we see how long it takes for them to complete 0:10:35.400 --> 0:10:39.439 those tasks successfully, and then roughly we call the difficulty 0:10:39.440 --> 0:10:41.720 of the task, as measured in human time to complete, 0:10:41.800 --> 0:10:44.200 the average time it took these humans to complete 0:10:44.200 --> 0:10:47.080 the task. Then we'll run the AIs on this same 0:10:47.120 --> 0:10:49.960 set of tasks. Typically today, for the very easiest tasks, 0:10:49.960 --> 0:10:52.240 they're more or less always going to succeed. There's 0:10:52.280 --> 0:10:54.800 some mid range of tasks where, you know, perhaps they 0:10:54.800 --> 0:10:57.600 succeed fifty percent of the time, or perhaps for some 0:10:57.720 --> 0:11:00.480 tasks in that range they succeed zero percent of the 0:11:00.480 --> 0:11:02.079 time and for others one hundred percent of the time. 0:11:02.080 --> 0:11:04.120 And so they're getting fifty percent on average, let's say, 0:11:04.400 --> 0:11:06.920 and then for the much harder tasks, perhaps they're getting 0:11:06.920 --> 0:11:10.720 closer to zero percent.
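[Editor's sketch, not part of the recording] The baselining procedure described above can be illustrated with a small Python example. Everything here is an assumption made for illustration: the baseline times and success rates are invented, the function names are hypothetical, and the crossing point is found by simple interpolation over log task length, since the conversation does not say how METR actually fits it.

```python
import math
from statistics import mean


def task_difficulty(baseline_minutes):
    """Difficulty of one task: the average time its human baseliners took."""
    return mean(baseline_minutes)


def fifty_percent_horizon(tasks):
    """Estimate the task length at which AI success crosses 50%.

    tasks: list of (human_minutes, ai_success_rate) pairs.
    Assumption: linear interpolation in log2(minutes) between neighbouring
    tasks; this is an illustrative stand-in, not METR's fitting method.
    """
    ordered = sorted(tasks)
    for (t0, s0), (t1, s1) in zip(ordered, ordered[1:]):
        if s0 >= 0.5 >= s1 and s0 != s1:
            frac = (s0 - 0.5) / (s0 - s1)
            log_t = math.log2(t0) + frac * (math.log2(t1) - math.log2(t0))
            return 2 ** log_t
    return None  # success never crosses 50% in the measured range


# Made-up numbers: roughly three human baselines per task, as in the interview.
tasks = [
    (task_difficulty([50, 60, 70]), 1.00),        # easy: AI always succeeds
    (task_difficulty([200, 240, 280]), 0.75),
    (task_difficulty([900, 960, 1020]), 0.25),
    (task_difficulty([3600, 3840, 4080]), 0.00),  # hard: AI never succeeds
]
horizon = fifty_percent_horizon(tasks)
print(f"50% time horizon: about {horizon / 60:.1f} human-hours")
```

With these invented inputs the crossing falls between the four hour and sixteen hour tasks, at about eight human-hours. The point of the sketch is only that the headline number is a task length measured in human time, not how long the model itself ran.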
And then the point at which 0:11:10.760 --> 0:11:12.760 we predict, you know, in the middle of all these 0:11:12.880 --> 0:11:15.000 zero percents and one hundred percents by task, the point 0:11:15.040 --> 0:11:17.160 at which we predict that they'd have a fifty percent 0:11:17.240 --> 0:11:20.560 chance of succeeding, that is, either a fifty percent chance 0:11:20.600 --> 0:11:22.920 of succeeding on some task, or fifty percent of the 0:11:22.960 --> 0:11:25.920 tasks of that difficulty that we think they would 0:11:25.960 --> 0:11:27.800 succeed on, that's what we're going to call the time 0:11:27.840 --> 0:11:28.880 horizon of these models. 0:11:29.120 --> 0:11:31.320 I think one thing also that could be good to explain 0:11:31.360 --> 0:11:33.400 here is the task distribution. I mean, this is not, 0:11:33.640 --> 0:11:36.320 this is not all activities that humans do. We are 0:11:36.360 --> 0:11:39.400 specifically here interested in, or the, like, there's some question 0:11:39.559 --> 0:11:42.440 in what tasks are, you know, like Joel mentioned, we're 0:11:42.480 --> 0:11:44.240 having people come into our office, do the task, to 0:11:44.240 --> 0:11:46.120 get a sense of how long it takes. We're not 0:11:46.200 --> 0:11:48.800 having them come in and, like, you know, paint paintings 0:11:48.920 --> 0:11:51.800 or write novels or, you know. We're focused here specifically 0:11:51.800 --> 0:11:54.320 on things that are in the distribution of work that 0:11:54.440 --> 0:11:57.559 an engineer at a, like, we like to think of 0:11:57.559 --> 0:11:59.920 it as, like, a frontier AI lab, the tasks that 0:12:00.000 --> 0:12:02.520 they might be doing. So this is things like software engineering. 0:12:02.559 --> 0:12:05.800 It's fine tuning AI models, it is, like, software, 0:12:05.920 --> 0:12:07.680 machine learning, that kind of task. 0:12:07.720 --> 0:12:09.360 Wait, can I just ask why did you decide to 0:12:09.400 --> 0:12:11.960 focus on engineering?
Because you could have widened out to, 0:12:12.080 --> 0:12:14.600 you know, if we're talking about AI being capable of, 0:12:14.800 --> 0:12:17.120 you know, taking over the world, there are all sorts 0:12:17.120 --> 0:12:20.640 of substantive tasks that would fall under that category. So 0:12:20.679 --> 0:12:21.840 why just do engineering? 0:12:22.080 --> 0:12:24.400 Yeah, I think that, for one thing, maybe other people 0:12:24.440 --> 0:12:26.200 on the team, or maybe Joel has thoughts about this, but 0:12:26.600 --> 0:12:29.120 I think my particular motive in being interested in the 0:12:29.120 --> 0:12:31.959 time horizon on software tasks is that, first of all, 0:12:31.960 --> 0:12:34.960 it's the thing that the industry is very, like, already, 0:12:34.960 --> 0:12:37.240 even before we started working on this, is very focused on. 0:12:37.320 --> 0:12:39.120 So it's one of the capabilities that you should expect 0:12:39.160 --> 0:12:41.439 to come along for the ride earliest. It's the thing 0:12:41.480 --> 0:12:43.760 that, like, a lot of optimization pressure is being exerted on. 0:12:44.040 --> 0:12:46.040 And then I think that it is kind of the, 0:12:46.240 --> 0:12:48.280 like, thing that you would expect as an early warning 0:12:48.400 --> 0:12:51.480 kind of sign of this AI R&D automation. So 0:12:51.600 --> 0:12:54.680 to some extent, METR thinks of itself as trying to 0:12:54.679 --> 0:12:57.800 build, you know, science, or advance science, that can 0:12:58.040 --> 0:13:00.240 say, when are we getting to the point that AI systems 0:13:00.520 --> 0:13:03.559 could improve themselves or speed up the pace of AI development? 0:13:03.559 --> 0:13:06.080 When will AI research kind of feed on itself? And 0:13:06.120 --> 0:13:09.319 the kind of core capability for that might be software 0:13:09.360 --> 0:13:12.920 engineering and machine learning research ability. There are other 0:13:13.000 --> 0:13:15.240 skills that could be relevant to taking over the world.
0:13:15.800 --> 0:13:17.440 I think other people have done time horizons in, like, 0:13:17.480 --> 0:13:18.640 a cybersecurity sense. 0:13:19.200 --> 0:13:21.280 But I suppose it is true, like, the basilisk isn't 0:13:21.280 --> 0:13:23.720 going to paint its way into, like, power or something 0:13:23.760 --> 0:13:25.400 like that. Okay, it might. 0:13:25.400 --> 0:13:28.960 Deceive you, it might be very convincing or cunning in 0:13:29.000 --> 0:13:31.400 some way, and hand over the keys. 0:13:32.240 --> 0:13:33.920 I always say for your mental models. 0:13:33.920 --> 0:13:36.360 You know, we don't have perfect evidence of this whatsoever, 0:13:36.720 --> 0:13:39.840 but my rough sense, sort of colloquially, or, you know, 0:13:39.880 --> 0:13:42.800 my prior before evidence comes in, is that if we 0:13:42.880 --> 0:13:45.520 did study tasks on these very different distributions, you know, 0:13:45.520 --> 0:13:47.920 not machine learning, not software engineering, I'm not sure about 0:13:47.920 --> 0:13:50.400 painting exactly, but, you know, perhaps, or other kinds of 0:13:50.440 --> 0:13:53.320 task distributions that we could enumerate, that basically we would 0:13:53.400 --> 0:13:57.880 see this similarly shaped exponential progress over time, where every, 0:13:57.960 --> 0:14:00.199 I'm not sure exactly, but let's say, you know, four months, 0:14:00.240 --> 0:14:03.680 six months, something like that, the level of capabilities as 0:14:03.720 --> 0:14:06.320 measured in time horizon would be doubling at something like 0:14:06.320 --> 0:14:08.920 that pace, maybe from a much lower level.
So you know, 0:14:09.000 --> 0:14:11.400 one example that we do have better evidence of is 0:14:11.440 --> 0:14:14.360 that the AIs today are much less performant at, you know, 0:14:14.400 --> 0:14:17.199 anything that requires vision capabilities, seeing what's on a screen, 0:14:17.320 --> 0:14:20.040 clicking around at a computer, but they're getting, you know, 0:14:20.120 --> 0:14:22.520 tremendously better at that sort of thing over time. 0:14:22.840 --> 0:14:23.920 I'll just mention quickly, 0:14:23.920 --> 0:14:26.479 we did actually do a very kind of brief investigation 0:14:26.600 --> 0:14:29.640 of this in other task distributions. That's on our website somewhere, 0:14:29.880 --> 0:14:31.920 like, cross domain time horizons. I think we looked at 0:14:32.000 --> 0:14:35.000 data that Tesla shared on self driving, and I'm forgetting 0:14:35.000 --> 0:14:37.200 the other, there's, like, OSWorld. Maybe some of these 0:14:37.240 --> 0:14:39.920 are, like, somewhat similar, still kind of in the distribution 0:14:39.920 --> 0:14:42.680 of software tasks, but trying to get further afield into 0:14:42.720 --> 0:14:43.440 things like vision. 0:14:59.240 --> 0:15:02.080 How big is the sample size on the humans who 0:15:02.120 --> 0:15:05.480 are actually doing work? And also, is it getting harder 0:15:05.680 --> 0:15:08.520 getting, like, human engineers into the room to compete with, 0:15:08.600 --> 0:15:11.240 like, Claude Opus four point six? Versus, say, if I 0:15:11.280 --> 0:15:14.320 was a mediocre engineer, and I'm not, I'm a non 0:15:14.360 --> 0:15:16.920 existent engineer, but if I was a mediocre one, I 0:15:16.920 --> 0:15:19.000 would, like, maybe I would feel good about going up 0:15:19.040 --> 0:15:22.000 against, like, GPT three or something, and maybe I would 0:15:22.040 --> 0:15:25.520 feel a lot worse about myself going up against, like, Claude.
0:15:25.640 --> 0:15:28.000 Yeah, you know, on these tasks, I'm in a pretty 0:15:28.040 --> 0:15:31.760 similar position myself to you. So we have approximately three, 0:15:31.880 --> 0:15:35.239 although it varies quite a lot across tasks, human baselines 0:15:35.440 --> 0:15:37.840 per task, so, you know, typically we're averaging over 0:15:37.880 --> 0:15:40.640 something like three. I think the final numbers, it's my 0:15:40.680 --> 0:15:42.800 impression that they're not going to be so sensitive to 0:15:42.840 --> 0:15:44.280 the particular baselines that we use. 0:15:44.680 --> 0:15:47.320 Aren't the longer tasks more weakly baselined? 0:15:47.720 --> 0:15:50.360 Yeah, so indeed, I think it will get a lot 0:15:50.360 --> 0:15:52.720 harder to baseline these tasks as the length of task 0:15:52.800 --> 0:15:55.640 AIs are able to successfully complete gets longer and longer. 0:15:55.760 --> 0:15:57.440 You know, you might think at some point the length 0:15:57.440 --> 0:15:59.840 of task that they can complete is longer than the 0:16:00.320 --> 0:16:02.440 time, in four months' time, they're going to be able 0:16:02.440 --> 0:16:04.720 to complete tasks of more than four months, and then 0:16:04.760 --> 0:16:06.920 it, you know, kind of becomes close to impossible 0:16:06.960 --> 0:16:09.320 to get these four months long baselines. Of course we're 0:16:09.360 --> 0:16:11.560 not at that point yet, but, you know, it definitely has 0:16:11.600 --> 0:16:14.280 become more difficult to get these baselines as time has 0:16:14.320 --> 0:16:14.640 gone on. 0:16:15.000 --> 0:16:17.640 At the moment, not impossible, but very challenging. Joe, 0:16:17.640 --> 0:16:21.200 these are the future jobs for displaced engineers, right? It's 0:16:21.320 --> 0:16:25.360 competing against the Claudes for benchmark, for benchmark evaluation. We 0:16:25.400 --> 0:16:26.120 found the jobs.
0:16:26.480 --> 0:16:29.600 So we mentioned at the beginning, the most viral chart in AI 0:16:30.320 --> 0:16:32.360 is this chart that you have on the front of 0:16:32.360 --> 0:16:35.760 your website. Your website defaults to this, and it shows, 0:16:36.000 --> 0:16:39.040 you know, this doubling. So if we actually go back 0:16:39.080 --> 0:16:43.280 to November, let's say November twenty twenty five, Gemini three 0:16:43.480 --> 0:16:46.320 Pro, three hours and forty four minutes. Claude Opus 0:16:46.320 --> 0:16:48.920 four point six, twelve hours. Those are the fifty percent 0:16:49.280 --> 0:16:53.040 success benchmark. If we go to the eighty percent benchmark, 0:16:53.120 --> 0:16:57.880 which the website doesn't default to, the pace of 0:16:57.920 --> 0:17:02.240 improvement looks a little less impressive to me. So okay, 0:17:02.800 --> 0:17:06.320 now it's like, it does not have the same gap, 0:17:06.480 --> 0:17:09.959 pretty clearly. Now, eighty percent is still not one hundred percent. 0:17:10.400 --> 0:17:14.160 And I know that this is, METR's goal is about, 0:17:14.200 --> 0:17:17.000 like, you know, human safety and all this stuff. But 0:17:17.040 --> 0:17:19.199 when we think about, people look at this and they 0:17:19.560 --> 0:17:22.600 use it as a stand in for how performant are 0:17:22.640 --> 0:17:27.600 these models. Even eighty percent, you know, certainly for, like, 0:17:27.640 --> 0:17:31.080 any business application. I understand you're not, like, serving business 0:17:31.119 --> 0:17:34.920 here per se, but probably businesses care about this. Even 0:17:34.960 --> 0:17:37.560 eighty percent may not be good enough. And it 0:17:37.600 --> 0:17:40.080 does not look as crazy when you look at the 0:17:40.119 --> 0:17:43.240 eighty percent chart as it does at the fifty percent chart. 0:17:43.800 --> 0:17:47.080 Why the focus on the fifty percent chart?
And given, like, 0:17:47.720 --> 0:17:50.280 why not look at the chart that just does not 0:17:50.480 --> 0:17:51.840 look as impressive? 0:17:51.960 --> 0:17:53.840 Yeah, maybe two central things to say. 0:17:54.119 --> 0:17:56.240 One, to my eyes, the eighty percent chart basically 0:17:56.359 --> 0:17:59.480 does look as impressive. Well, the doubling time is about 0:17:59.480 --> 0:18:00.680 the same slope. 0:18:00.480 --> 0:18:02.520 It's the same. 0:18:02.640 --> 0:18:05.879 It's the same, it's, say, an offset of it, it's the 0:18:05.880 --> 0:18:06.960 same pace of progress. 0:18:07.040 --> 0:18:09.040 You know, it's something like five times smaller than the 0:18:09.480 --> 0:18:11.360 fifty percent number. 0:18:11.440 --> 0:18:13.120 But, you know, that only takes you two doublings. 0:18:13.160 --> 0:18:15.240 And if each doubling takes around four months, that means 0:18:15.280 --> 0:18:16.919 that in eight months' time you're going to have the 0:18:16.920 --> 0:18:19.560 same eighty percent success rate, roughly, as you do fifty 0:18:19.560 --> 0:18:20.679 percent success rate today. 0:18:20.880 --> 0:18:21.840 That's one thing to say. 0:18:21.960 --> 0:18:24.080 Maybe a second thing to say is, you know, remember 0:18:24.119 --> 0:18:26.520 at the beginning I said, essentially what we're doing is 0:18:26.560 --> 0:18:29.800 plotting the difficulty of tasks that these AIs can complete 0:18:29.800 --> 0:18:32.000 over time, just with this particular measure that ends up 0:18:32.000 --> 0:18:35.600 showing this clean exponential trend. And we've picked a particular 0:18:35.640 --> 0:18:37.760 number as our difficulty number, and, you know, that is 0:18:37.800 --> 0:18:40.520 this fifty percent reliability threshold. We could have picked a 0:18:40.520 --> 0:18:42.840 different one. I think there are reasons for picking the 0:18:42.960 --> 0:18:46.120 fifty percent one.
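[Editor's sketch, not part of the recording] The catch-up arithmetic in this answer can be checked in a few lines of Python. The inputs are the speaker's own rough figures (a gap of about five times, a doubling roughly every four months); the function names are hypothetical and the steady doubling rate is an assumption, not a forecast.

```python
import math


def doublings_needed(ratio):
    """Number of doublings required to close a multiplicative gap."""
    return math.log2(ratio)


def months_to_catch_up(ratio, doubling_months):
    """Months until the lower curve reaches today's level of the higher one,
    assuming a constant doubling time (an assumption, not a measured fact)."""
    return doublings_needed(ratio) * doubling_months


gap = 5.0              # eighty percent horizon is "something like five times smaller"
doubling_months = 4.0  # "each doubling takes around four months"

print(round(doublings_needed(gap), 2))                     # 2.32
print(round(months_to_catch_up(gap, doubling_months), 1))  # 9.3
print(months_to_catch_up(4.0, doubling_months))            # 8.0
```

Two doublings cover a four-fold gap exactly, which gives the eight months quoted; a five-fold gap needs about 2.3 doublings, or a little over nine months at the same pace. Either way the rough conclusion in the answer holds.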
In particular, it's the one that statistically 0:18:46.160 --> 0:18:49.239 we're better able to measure. For some technical reasons, it's 0:18:49.280 --> 0:18:51.719 the one that shows up in previously. It's show that 0:18:51.840 --> 0:18:53.439