1 00:00:00,520 --> 00:00:04,080 Speaker 1: Already and this is the Daily This is the Daily 2 00:00:04,120 --> 00:00:06,840 Speaker 1: Aus. Oh, now it makes sense. 3 00:00:14,720 --> 00:00:17,000 Speaker 2: Good morning, and welcome to the Daily Aus. It's Thursday, 4 00:00:17,040 --> 00:00:18,960 Speaker 2: the fourteenth of August. I'm Sam Koslowski. 5 00:00:19,160 --> 00:00:20,360 Speaker 1: I'm Emma Gillespie. 6 00:00:20,640 --> 00:00:24,040 Speaker 2: This month, the tech company behind ChatGPT released what 7 00:00:24,120 --> 00:00:28,280 Speaker 2: they claim is their smartest AI model yet. Now, according 8 00:00:28,280 --> 00:00:31,400 Speaker 2: to OpenAI, GPT-5 operates at the level of 9 00:00:31,440 --> 00:00:34,840 Speaker 2: a PhD student. But experts are warning that the AI 10 00:00:34,960 --> 00:00:37,720 Speaker 2: race has become a bit of a marketing battle, as 11 00:00:37,760 --> 00:00:41,960 Speaker 2: companies manipulate test results to claim their product is the best. 12 00:00:42,440 --> 00:00:45,559 Speaker 2: On today's podcast, we're going to unpack how AI companies 13 00:00:45,760 --> 00:00:52,159 Speaker 2: measure intelligence and why that's become a problem. 14 00:00:52,280 --> 00:00:52,680 Speaker 3: Sam. 15 00:00:52,800 --> 00:00:56,400 Speaker 1: I was originally skeptical about having this conversation with you 16 00:00:56,520 --> 00:01:01,400 Speaker 1: because I, like maybe some listeners, hear AI and I 17 00:01:01,520 --> 00:01:03,400 Speaker 1: kind of roll my eyes a little 18 00:01:03,080 --> 00:01:04,160 Speaker 3: bit and switch off. 
19 00:01:04,240 --> 00:01:06,920 Speaker 1: But if you are that person hearing this right now, 20 00:01:07,040 --> 00:01:10,679 Speaker 1: hang in there, because this is actually a fascinating conversation, 21 00:01:11,360 --> 00:01:14,600 Speaker 1: this idea that we're sort of being marketed to about 22 00:01:14,959 --> 00:01:18,600 Speaker 1: this arms race of who is the smartest, which AI model 23 00:01:18,400 --> 00:01:19,040 Speaker 3: is the best. 24 00:01:19,880 --> 00:01:23,160 Speaker 1: Let's start with the basics here, though. When we're talking 25 00:01:23,240 --> 00:01:24,640 Speaker 1: about AI models, 26 00:01:24,800 --> 00:01:26,160 Speaker 3: what exactly does that mean? 27 00:01:26,640 --> 00:01:30,120 Speaker 2: There is a certain brand of satisfaction that is reserved 28 00:01:30,200 --> 00:01:31,959 Speaker 2: for when I can change your mind on whether a 29 00:01:32,000 --> 00:01:33,280 Speaker 2: story is going to be interesting or 30 00:01:33,280 --> 00:01:35,240 Speaker 3: not, especially if it's a tech story. 31 00:01:35,400 --> 00:01:38,399 Speaker 2: This is, this is going to be awesome. So AI 32 00:01:38,480 --> 00:01:42,760 Speaker 2: models are computer programs that can understand and generate language, 33 00:01:42,880 --> 00:01:46,880 Speaker 2: human language. Just think of them as very advanced 34 00:01:46,920 --> 00:01:49,800 Speaker 2: autocomplete systems, like the ones that could fill in 35 00:01:49,800 --> 00:01:53,720 Speaker 2: a form for you or, you know, password-remembering little 36 00:01:53,880 --> 00:01:56,880 Speaker 2: widgets in your browser, anything that presumes that what you're 37 00:01:56,920 --> 00:01:59,160 Speaker 2: about to do or want it can kind of fill 38 00:01:59,200 --> 00:02:00,000 Speaker 2: in those gaps for you. 39 00:02:00,280 --> 00:02:01,800 Speaker 3: That's actually a really good way to think of it. 40 00:02:02,000 --> 00:02:04,680 Speaker 2: See, we're off to a flyer. 
You type in a 41 00:02:04,800 --> 00:02:08,000 Speaker 2: question or a request, those responses are generated. The most 42 00:02:08,000 --> 00:02:10,400 Speaker 2: famous ones you might have heard of include ChatGPT 43 00:02:10,520 --> 00:02:14,800 Speaker 2: from OpenAI, you've got Claude from Anthropic, and Gemini 44 00:02:14,880 --> 00:02:15,440 Speaker 2: from Google. 45 00:02:15,760 --> 00:02:20,400 Speaker 1: Okay, now, it does seem like every AI company out 46 00:02:20,400 --> 00:02:24,000 Speaker 1: there claims that its model is the smartest or the 47 00:02:24,040 --> 00:02:26,840 Speaker 1: most capable or better than the best one that we've 48 00:02:26,880 --> 00:02:30,160 Speaker 1: ever seen. And one of the biggest players in this space, 49 00:02:30,280 --> 00:02:35,240 Speaker 1: OpenAI, has just released GPT-5 this week. What 50 00:02:35,280 --> 00:02:37,440 Speaker 1: are their claims about this new model? 51 00:02:37,720 --> 00:02:40,400 Speaker 2: So they're making some big statements here. They're saying that 52 00:02:40,520 --> 00:02:44,560 Speaker 2: GPT-5 scored ninety four point six percent on a 53 00:02:44,639 --> 00:02:48,480 Speaker 2: test that measures its ability to solve advanced maths problems, 54 00:02:48,960 --> 00:02:52,119 Speaker 2: seventy four point nine percent on real world coding tasks, 55 00:02:52,240 --> 00:02:55,880 Speaker 2: and produces forty five percent fewer factual errors than their 56 00:02:55,919 --> 00:02:59,160 Speaker 2: previous models. 
The CEO of the company, Sam Altman, 57 00:02:59,240 --> 00:03:01,720 Speaker 2: he called it the best model in the world, which kind 58 00:03:01,720 --> 00:03:04,080 Speaker 2: of sounds like those phrases you were saying before, and 59 00:03:04,120 --> 00:03:08,000 Speaker 2: said it represents a significant step towards what's called artificial 60 00:03:08,120 --> 00:03:12,040 Speaker 2: general intelligence, AGI, which is basically the idea that AI 61 00:03:12,120 --> 00:03:15,919 Speaker 2: can actually perform an intellectual task better than humans can. 62 00:03:16,080 --> 00:03:18,400 Speaker 1: Okay, so that's when we start to imagine like the 63 00:03:18,440 --> 00:03:19,919 Speaker 1: I, Robot future. 64 00:03:20,240 --> 00:03:22,080 Speaker 2: Yeah, and it's when we get into those examples of 65 00:03:22,120 --> 00:03:25,320 Speaker 2: things like AI blackmailing you if you decide to stop 66 00:03:25,400 --> 00:03:27,440 Speaker 2: using it and kind of taking on a life of 67 00:03:27,480 --> 00:03:28,120 Speaker 2: its own. 68 00:03:28,560 --> 00:03:32,160 Speaker 1: So those numbers from OpenAI about this new model 69 00:03:32,240 --> 00:03:35,720 Speaker 1: sound pretty impressive, like ninety five percent on advanced maths. 70 00:03:35,760 --> 00:03:40,600 Speaker 1: Particularly interesting this kind of idea of producing fewer factual errors, 71 00:03:40,640 --> 00:03:43,560 Speaker 1: because that's always kind of in the spotlight around the 72 00:03:43,600 --> 00:03:47,240 Speaker 1: skepticism towards AI. But I'm interested in how these companies 73 00:03:47,360 --> 00:03:51,800 Speaker 1: are actually measuring the intelligence of these products. You mentioned 74 00:03:51,840 --> 00:03:54,640 Speaker 1: in the intro, Sam, that this is becoming a bit 75 00:03:54,640 --> 00:03:57,240 Speaker 1: of an issue. Yeah, So what exactly is the concern? 
76 00:03:57,880 --> 00:04:00,720 Speaker 2: Well, ultimately it's the idea that AI companies are 77 00:04:00,760 --> 00:04:04,240 Speaker 2: all using different tests to prove that their model is 78 00:04:04,280 --> 00:04:08,480 Speaker 2: the best. It's like if all car companies all claimed 79 00:04:08,560 --> 00:04:11,839 Speaker 2: to make the fastest car ever or the safest car ever, 80 00:04:12,400 --> 00:04:14,840 Speaker 2: but one tested on a highway, the other tested on 81 00:04:14,880 --> 00:04:17,400 Speaker 2: a racetrack, and the other one went downhill on a 82 00:04:17,440 --> 00:04:21,200 Speaker 2: windy day. A major study published earlier this year into 83 00:04:21,320 --> 00:04:25,920 Speaker 2: AI models actually compared the situation to Volkswagen, who were 84 00:04:25,960 --> 00:04:29,640 Speaker 2: found guilty of lying about the emissions or the lack 85 00:04:29,680 --> 00:04:32,840 Speaker 2: of emissions that their cars were producing when it basically 86 00:04:32,920 --> 00:04:37,320 Speaker 2: cheated on pollution tests. The researchers noted that when companies 87 00:04:37,360 --> 00:04:41,560 Speaker 2: manipulated car testing, people were going to jail, but similar 88 00:04:41,680 --> 00:04:45,719 Speaker 2: manipulation in AI isn't really coming to our attention. 89 00:04:46,440 --> 00:04:48,000 Speaker 3: Wow, it's fascinating. 90 00:04:48,000 --> 00:04:51,279 Speaker 1: I remember that Volkswagen emissions scandal, so a good comparison, 91 00:04:51,440 --> 00:04:54,080 Speaker 1: and how the tick for Sam? So, how can these 92 00:04:54,240 --> 00:04:58,559 Speaker 1: AI models then be tested in a fair way? 93 00:04:58,680 --> 00:05:00,080 Speaker 3: What does testing 94 00:04:59,760 --> 00:05:03,200 Speaker 1: artificial intelligence kind of transparently and consistently look like? 
95 00:05:03,440 --> 00:05:05,120 Speaker 2: Well, naturally, the first thing to do would be to 96 00:05:05,200 --> 00:05:08,640 Speaker 2: standardize the same test across every model, and that would 97 00:05:08,640 --> 00:05:11,560 Speaker 2: be described as a benchmark, a new global benchmark for 98 00:05:11,600 --> 00:05:14,320 Speaker 2: how these models are performing. And that could be to 99 00:05:14,400 --> 00:05:17,840 Speaker 2: measure a specific ability, say in maths, you could give 100 00:05:17,960 --> 00:05:20,760 Speaker 2: all of them the same advanced maths problem and then 101 00:05:20,800 --> 00:05:23,440 Speaker 2: measure not only the output, but how long it takes 102 00:05:23,480 --> 00:05:26,880 Speaker 2: for them to get there, what processes it undertook to 103 00:05:27,000 --> 00:05:29,760 Speaker 2: reach that final destination of the answer. You could give 104 00:05:29,800 --> 00:05:32,240 Speaker 2: that a score and then actually compare like for like 105 00:05:32,320 --> 00:05:32,960 Speaker 2: these models. 106 00:05:33,360 --> 00:05:35,800 Speaker 3: It kind of sounds pretty straightforward. 107 00:05:35,880 --> 00:05:39,960 Speaker 1: That to me seems like the obvious path towards getting 108 00:05:40,000 --> 00:05:44,440 Speaker 1: consistent testing. So where does the manipulation come from? 109 00:05:44,560 --> 00:05:46,279 Speaker 2: Well, I think the first thing to acknowledge is that 110 00:05:46,320 --> 00:05:50,560 Speaker 2: there is no centralized global body that has the respect 111 00:05:50,680 --> 00:05:54,320 Speaker 2: or the ability to actually execute that sort of standardized testing. 112 00:05:54,400 --> 00:05:58,760 Speaker 2: There is no, say, TGA for drugs, there's no government 113 00:05:59,040 --> 00:06:02,080 Speaker 2: sponsored hub that can execute that kind of stuff. So 114 00:06:02,480 --> 00:06:04,919 Speaker 2: reason A, there's nobody to do it. 
But reason B 115 00:06:05,120 --> 00:06:09,240 Speaker 2: would be that these models are still in this accelerating 116 00:06:09,640 --> 00:06:12,960 Speaker 2: period of marketing where they're cherry picking tests that would 117 00:06:13,040 --> 00:06:17,599 Speaker 2: favor their models' strengths while hiding poor performance in other areas. 118 00:06:18,240 --> 00:06:21,560 Speaker 2: And one other problem that has come up is that 119 00:06:21,640 --> 00:06:24,599 Speaker 2: if AI knows the problem is coming, because it's AI 120 00:06:24,760 --> 00:06:27,720 Speaker 2: and it knows how tests are done, then it can 121 00:06:27,760 --> 00:06:30,960 Speaker 2: actually almost train itself for the test, and so there's 122 00:06:30,960 --> 00:06:33,120 Speaker 2: a bit of a data contamination problem. You'd have to 123 00:06:33,200 --> 00:06:36,760 Speaker 2: keep these tests almost offline entirely for the models to 124 00:06:36,760 --> 00:06:39,719 Speaker 2: see them for the first time. One study found, for example, 125 00:06:39,760 --> 00:06:43,200 Speaker 2: that GPT-4, which is the one older model from 126 00:06:43,560 --> 00:06:47,680 Speaker 2: OpenAI, it could solve coding problems from before twenty 127 00:06:47,760 --> 00:06:50,559 Speaker 2: twenty one that were published online, but it couldn't solve 128 00:06:50,720 --> 00:06:53,440 Speaker 2: new problems. And so then you get a sense of 129 00:06:53,520 --> 00:06:55,919 Speaker 2: kind of in the great big world of its brain, 130 00:06:56,000 --> 00:06:58,520 Speaker 2: which is the Internet, if those answers are somewhere out there, 131 00:06:58,520 --> 00:06:59,800 Speaker 2: it could just regurgitate them. 132 00:07:00,120 --> 00:07:02,599 Speaker 1: So it's like if you've got an advanced copy of 133 00:07:02,640 --> 00:07:05,880 Speaker 1: an exam or a test at uni or in school, you 134 00:07:06,000 --> 00:07:09,960 Speaker 1: can train for the test. 
That doesn't necessarily mean that 135 00:07:10,240 --> 00:07:14,040 Speaker 1: you have the comprehension levels to speak to a certain 136 00:07:14,120 --> 00:07:17,880 Speaker 1: topic or question in the same subject outside of the 137 00:07:17,880 --> 00:07:19,160 Speaker 1: confines of that context. 138 00:07:19,160 --> 00:07:21,040 Speaker 2: And if we think about what all of this is for, 139 00:07:21,160 --> 00:07:23,400 Speaker 2: it's about trying to work out if these models are 140 00:07:23,400 --> 00:07:25,600 Speaker 2: going to be good in practice for us to spend 141 00:07:25,680 --> 00:07:27,440 Speaker 2: twenty bucks a month on them. I mean, let's get 142 00:07:27,480 --> 00:07:29,600 Speaker 2: back to the real core problem here. We're trying to 143 00:07:29,600 --> 00:07:32,560 Speaker 2: work out if it's worth our money. And there was 144 00:07:32,560 --> 00:07:35,400 Speaker 2: a great quote from the British Prime Minister, former British 145 00:07:35,440 --> 00:07:39,200 Speaker 2: Prime Minister Rishi Sunak. He said AI models shouldn't be 146 00:07:39,240 --> 00:07:41,640 Speaker 2: trusted to mark their own homework. And I think that 147 00:07:41,680 --> 00:07:43,480 Speaker 2: we can all relate to that. Yeah, and it kind 148 00:07:43,520 --> 00:07:47,880 Speaker 2: of encapsulates what's the problem with this independent benchmarking framework. 149 00:07:48,160 --> 00:07:52,960 Speaker 1: You also mentioned that companies are testing multiple versions, or 150 00:07:53,000 --> 00:07:56,720 Speaker 1: that they're cherry picking their data and choosing the kind 151 00:07:56,760 --> 00:07:59,440 Speaker 1: of findings that favor their models the most. 152 00:08:00,040 --> 00:08:00,840 Speaker 3: What's happening there? 153 00:08:00,720 --> 00:08:03,200 Speaker 2: Tell us a bit more well. 
Some research found that 154 00:08:03,240 --> 00:08:07,040 Speaker 2: major companies, we're talking Meta, OpenAI and Google, have 155 00:08:07,120 --> 00:08:11,440 Speaker 2: been privately testing dozens of different model versions on popular tests. 156 00:08:12,240 --> 00:08:15,840 Speaker 2: They're only revealing the scores from their best performing versions. 157 00:08:16,200 --> 00:08:17,920 Speaker 2: So and it's like, you know, you're on a night out, 158 00:08:17,920 --> 00:08:19,960 Speaker 2: you take twenty selfies, you put up the best one. Yeah, 159 00:08:20,000 --> 00:08:22,680 Speaker 2: of course, and I think at some stage you have 160 00:08:22,760 --> 00:08:25,800 Speaker 2: to admit that all businesses would do that. Yeah, you know, TDA, 161 00:08:25,920 --> 00:08:28,840 Speaker 2: if we had to report results to the stock market, 162 00:08:28,960 --> 00:08:31,240 Speaker 2: you know, we would probably highlight more the pieces that 163 00:08:31,240 --> 00:08:34,319 Speaker 2: did really, really well. Not that there's ever any pieces 164 00:08:34,440 --> 00:08:35,960 Speaker 2: that don't, but you know. 165 00:08:36,040 --> 00:08:38,520 Speaker 3: A flawless company that never makes mistakes. 166 00:08:38,559 --> 00:08:40,800 Speaker 2: Obviously, but we have to. I think it's good to 167 00:08:40,840 --> 00:08:43,800 Speaker 2: acknowledge this bit of kind of business reality there. But 168 00:08:44,040 --> 00:08:47,439 Speaker 2: I do think that in this case it's different because 169 00:08:48,040 --> 00:08:51,120 Speaker 2: there's no transparency at all in terms of the testing process. 170 00:08:51,120 --> 00:08:55,800 Speaker 2: It's, to continue with our university kind of example, it's 171 00:08:55,840 --> 00:08:58,600 Speaker 2: like a student taking the same exam twenty seven times 172 00:08:59,000 --> 00:09:00,920 Speaker 2: and then only reporting the best score. Yep. 
173 00:09:01,480 --> 00:09:06,000 Speaker 1: So without that transparency, there's that issue around trust, and 174 00:09:06,080 --> 00:09:09,040 Speaker 1: I think we see that really playing out in real 175 00:09:09,120 --> 00:09:12,200 Speaker 1: time right now, that there is a lack of trust 176 00:09:12,440 --> 00:09:16,839 Speaker 1: in the broader community about AI models because we don't 177 00:09:16,880 --> 00:09:19,080 Speaker 1: know how they come to these answers. What are some 178 00:09:19,160 --> 00:09:22,120 Speaker 1: of the other consequences of this manipulation. How does this 179 00:09:22,240 --> 00:09:24,560 Speaker 1: play out in the real world every day? 180 00:09:24,800 --> 00:09:28,439 Speaker 2: Well, there's definitely that marketing angle of misleading consumers and 181 00:09:28,760 --> 00:09:31,360 Speaker 2: you and I signing up to an AI platform because 182 00:09:31,360 --> 00:09:33,880 Speaker 2: we think it's ninety six percent going to be great, 183 00:09:33,920 --> 00:09:36,320 Speaker 2: and in fact it might be eighty one percent great, 184 00:09:36,360 --> 00:09:39,640 Speaker 2: which is still an incredible feat of technology. But then 185 00:09:39,640 --> 00:09:43,359 Speaker 2: from a government perspective, governments are looking at these benchmarks 186 00:09:43,600 --> 00:09:47,640 Speaker 2: for the way that they're thinking about regulation or policy decisions. 187 00:09:48,200 --> 00:09:52,120 Speaker 2: So the European Union's AI Act it uses benchmarks to 188 00:09:52,320 --> 00:09:56,920 Speaker 2: determine whether new AI models pose systemic risk. Can they 189 00:09:56,920 --> 00:09:59,640 Speaker 2: be used by extremists? Can they be used to spread 190 00:09:59,760 --> 00:10:03,679 Speaker 2: race online? Can they be used to mislead and deliberately 191 00:10:03,720 --> 00:10:08,319 Speaker 2: spread misinformation? 
And if companies are manipulating those scores, it 192 00:10:08,360 --> 00:10:12,079 Speaker 2: could affect how these powerful technologies are indeed regulated. 193 00:10:12,160 --> 00:10:15,960 Speaker 1: Okay, because if these scores say that eighty percent of 194 00:10:16,000 --> 00:10:18,880 Speaker 1: the content is factual, or that there are these really 195 00:10:18,920 --> 00:10:22,480 Speaker 1: great systems in place to catch mis- and disinformation or 196 00:10:22,520 --> 00:10:25,600 Speaker 1: hate speech, then that might not concern leaders to the 197 00:10:25,600 --> 00:10:28,120 Speaker 1: point where they think there needs to be certain levels 198 00:10:28,120 --> 00:10:29,080 Speaker 1: of regulation. 199 00:10:28,840 --> 00:10:29,559 Speaker 2: One hundred percent. 200 00:10:30,000 --> 00:10:34,480 Speaker 1: You mentioned this idea of artificial general intelligence earlier. We 201 00:10:34,640 --> 00:10:37,760 Speaker 1: used the I, Robot example, one of Will Smith's best. 202 00:10:38,520 --> 00:10:41,680 Speaker 1: OpenAI is claiming that GPT-5 is a step 203 00:10:41,720 --> 00:10:46,080 Speaker 1: forward in AGI, but what does that actually mean in 204 00:10:46,120 --> 00:10:48,560 Speaker 1: a not Hollywood kind of fantasy world? 205 00:10:48,960 --> 00:10:53,319 Speaker 2: Well, I gave the example before of outperforming humans. That's 206 00:10:53,400 --> 00:10:56,080 Speaker 2: a very broad definition, and the problem is that I 207 00:10:56,080 --> 00:10:58,640 Speaker 2: can't really give you a more specific definition because even 208 00:10:58,720 --> 00:11:01,560 Speaker 2: OpenAI can't really do that right. One OpenAI 209 00:11:01,679 --> 00:11:04,960 Speaker 2: statement said AGI is still a weakly defined term and 210 00:11:05,040 --> 00:11:07,800 Speaker 2: means different things to different people. We don't really know 211 00:11:07,840 --> 00:11:08,600 Speaker 2: what we don't know. 
212 00:11:08,800 --> 00:11:11,480 Speaker 3: So how can GPT-5 be a step forward, 213 00:11:11,559 --> 00:11:11,719 Speaker 2: then, 214 00:11:11,760 --> 00:11:13,959 Speaker 3: if the company itself isn't sure? 215 00:11:14,559 --> 00:11:19,080 Speaker 2: Interesting question. Very much raises some questions about how do 216 00:11:19,120 --> 00:11:21,319 Speaker 2: we know when we got there, even? Yeah, I mean 217 00:11:21,720 --> 00:11:25,439 Speaker 2: this is the exciting and terrifying part of living through 218 00:11:26,240 --> 00:11:29,800 Speaker 2: rapidly emerging technology is that we're learning as we go 219 00:11:30,040 --> 00:11:33,000 Speaker 2: as a society, and that is not always pretty. 220 00:11:33,520 --> 00:11:37,199 Speaker 1: So for people listening who might be using AI kind 221 00:11:37,240 --> 00:11:41,520 Speaker 1: of casually or infrequently in their maybe work or uni life, 222 00:11:42,000 --> 00:11:45,120 Speaker 1: maybe they're building up their understanding of the different platforms 223 00:11:45,160 --> 00:11:49,079 Speaker 1: out there, what should we make of all these competing claims? 224 00:11:49,160 --> 00:11:51,839 Speaker 1: You know, how do we make better decisions about which 225 00:11:51,920 --> 00:11:55,319 Speaker 1: AI model is actually the good one, or the right one, 226 00:11:55,400 --> 00:11:56,480 Speaker 1: or the best one for us? 227 00:11:56,559 --> 00:11:59,720 Speaker 2: I'm constantly asked as somebody who is known now in 228 00:11:59,720 --> 00:12:02,160 Speaker 2: my friend group and in the workplace as somebody who's 229 00:12:02,400 --> 00:12:05,320 Speaker 2: really interested in AI. 
I'm constantly asked which one should 230 00:12:05,360 --> 00:12:07,720 Speaker 2: I use, what's the best one, And the answer is, 231 00:12:07,880 --> 00:12:10,400 Speaker 2: it's about what you're trying to do, essentially. So one 232 00:12:10,440 --> 00:12:13,079 Speaker 2: model might be better for creative writing, but another might 233 00:12:13,120 --> 00:12:16,840 Speaker 2: excel more at data analysis and crunching some numbers. Studies 234 00:12:16,880 --> 00:12:19,920 Speaker 2: are showing, though, that AI models often fail when you 235 00:12:20,000 --> 00:12:23,480 Speaker 2: move from those controlled test conditions or those use cases 236 00:12:23,600 --> 00:12:26,840 Speaker 2: or features that are rolled out by these platforms as 237 00:12:26,840 --> 00:12:30,200 Speaker 2: part of marketing campaigns to the messy real world use 238 00:12:30,320 --> 00:12:32,640 Speaker 2: that humans actually use these tools for. 239 00:12:32,840 --> 00:12:35,360 Speaker 1: It actually reminds me of, and I'm not even sure 240 00:12:35,400 --> 00:12:38,720 Speaker 1: if this is the same thing, but when Siri was 241 00:12:38,720 --> 00:12:41,840 Speaker 1: first rolled out and Apple kind of in their big 242 00:12:41,880 --> 00:12:44,000 Speaker 1: announcements, it's like, you can ask her this, or you 243 00:12:44,000 --> 00:12:46,079 Speaker 1: can ask her that, or if you want to know what 244 00:12:46,040 --> 00:12:47,840 Speaker 3: the weather's like, should you take an umbrella? 245 00:12:48,280 --> 00:12:51,080 Speaker 1: And I found when I first started using Siri, like, yeah, 246 00:12:51,120 --> 00:12:53,760 Speaker 1: you could definitely answer those sorts of questions, but not 247 00:12:53,840 --> 00:12:56,120 Speaker 1: a whole lot else outside of the almost like a 248 00:12:56,160 --> 00:12:59,480 Speaker 1: prescribed text from Apple about how to use Siri. 
249 00:13:00,120 --> 00:13:03,360 Speaker 2: When you get into the world of trying to engage 250 00:13:03,360 --> 00:13:05,520 Speaker 2: with the user no matter what they're about to say. 251 00:13:06,000 --> 00:13:07,480 Speaker 2: It can take a little bit of time for the 252 00:13:07,559 --> 00:13:11,440 Speaker 2: technology to be refined and to keep learning from what 253 00:13:11,520 --> 00:13:12,560 Speaker 2: users actually want. 254 00:13:12,760 --> 00:13:15,600 Speaker 3: So, Sam, what is the way forward in all of this? 255 00:13:16,280 --> 00:13:20,360 Speaker 1: Is there a conversation happening at a more global scale 256 00:13:20,520 --> 00:13:21,880 Speaker 1: about this regulation? 257 00:13:22,360 --> 00:13:25,600 Speaker 2: Definitely, and there's no clear leader here. I mentioned the 258 00:13:25,880 --> 00:13:28,760 Speaker 2: work being done by the European Union before. There's a 259 00:13:28,800 --> 00:13:32,600 Speaker 2: coalition of countries including Australia that signed on to kind 260 00:13:32,640 --> 00:13:35,600 Speaker 2: of key principles of how to keep AI safe. That 261 00:13:35,720 --> 00:13:38,280 Speaker 2: was in mid twenty twenty three, so there's a bit 262 00:13:38,280 --> 00:13:41,200 Speaker 2: of a global movement there. From a government perspective, there's 263 00:13:41,200 --> 00:13:44,480 Speaker 2: some really interesting work being done out of universities, particularly 264 00:13:44,640 --> 00:13:48,920 Speaker 2: Stanford University. They developed an AI Index report which does 265 00:13:49,000 --> 00:13:52,480 Speaker 2: try to compare the models like for like. But I 266 00:13:52,520 --> 00:13:54,920 Speaker 2: think we first need to determine who the authority is 267 00:13:54,960 --> 00:13:57,280 Speaker 2: going to be in this space before we can kind 268 00:13:57,280 --> 00:13:59,959 Speaker 2: of put the burden on them to roll out this 269 00:14:00,000 --> 00:14:03,280 Speaker 2: standardized testing. 
And I do think in a few decades, 270 00:14:03,320 --> 00:14:05,160 Speaker 2: it will take a while, I do think we'll get there. 271 00:14:05,360 --> 00:14:08,320 Speaker 2: I mean, we have the TGA to regulate medicine. We 272 00:14:08,480 --> 00:14:11,840 Speaker 2: have a central aviation authority to regulate what a plane 273 00:14:11,880 --> 00:14:15,079 Speaker 2: that's airworthy looks like. Yep. I do think that we're 274 00:14:15,120 --> 00:14:17,640 Speaker 2: going to see a central AI authority in Australia and 275 00:14:17,679 --> 00:14:20,800 Speaker 2: maybe around the world someday. But we are very early 276 00:14:20,840 --> 00:14:23,840 Speaker 2: in this story. We're like one percent through in the 277 00:14:23,960 --> 00:14:27,840 Speaker 2: AI story, if that, and that's really exciting. But it's 278 00:14:27,920 --> 00:14:31,920 Speaker 2: also really important to continuously discuss the potential flaws and 279 00:14:32,240 --> 00:14:35,040 Speaker 2: the gaps that exist in this big, new, scary world. 280 00:14:35,360 --> 00:14:38,800 Speaker 1: Yeah, I think that healthy dose of skepticism is what 281 00:14:38,840 --> 00:14:41,040 Speaker 1: we will be carrying forward. But I look forward to 282 00:14:41,120 --> 00:14:43,360 Speaker 1: many more conversations like this with you, Sam. 283 00:14:43,440 --> 00:14:44,440 Speaker 2: Well, we don't have a choice. 284 00:14:44,520 --> 00:14:46,000 Speaker 3: Help me understand it all. 285 00:14:46,200 --> 00:14:49,000 Speaker 1: Thank you so much for breaking that down for us, Sam, 286 00:14:49,480 --> 00:14:52,200 Speaker 1: and thank you for listening to today's deep dive. We'll 287 00:14:52,200 --> 00:14:54,840 Speaker 1: be back a little later on with your news headlines, 288 00:14:54,880 --> 00:15:00,160 Speaker 1: but until then, have a great day. 
289 00:15:00,880 --> 00:15:03,240 Speaker 2: My name is Lily Madden and I'm a proud Arrernte 290 00:15:03,440 --> 00:15:08,240 Speaker 2: Bundjalung Kalkadoon woman from Gadigal Country. The Daily Aus acknowledges 291 00:15:08,320 --> 00:15:10,440 Speaker 2: that this podcast is recorded on the lands of the 292 00:15:10,480 --> 00:15:14,080 Speaker 2: Gadigal people and pays respect to all Aboriginal and Torres 293 00:15:14,160 --> 00:15:17,000 Speaker 2: Strait Islander nations. We pay our respects to the 294 00:15:17,000 --> 00:15:19,800 Speaker 2: first peoples of these countries, both past and present.