WEBVTT - Are we being lied to about how smart AI is?

0:00:00.520 --> 0:00:04.080
<v Speaker 1>Already, and this is The Daily... This is The Daily

0:00:04.120 --> 0:00:06.840
<v Speaker 1>Aus. Oh, now it makes sense.

0:00:14.720 --> 0:00:17.000
<v Speaker 2>Good morning, and welcome to The Daily Aus. It's Thursday,

0:00:17.040 --> 0:00:18.960
<v Speaker 2>the fourteenth of August. I'm Sam Koslowski.

0:00:19.160 --> 0:00:20.360
<v Speaker 1>I'm Emma Gillespie.

0:00:20.640 --> 0:00:24.040
<v Speaker 2>This month, the tech company behind ChatGPT released what

0:00:24.120 --> 0:00:28.280
<v Speaker 2>they claim is their smartest AI model yet. Now, according

0:00:28.280 --> 0:00:31.400
<v Speaker 2>to OpenAI, GPT-5 operates at the level of

0:00:31.440 --> 0:00:34.840
<v Speaker 2>a PhD student. But experts are warning that the AI

0:00:34.960 --> 0:00:37.720
<v Speaker 2>race has become a bit of a marketing battle, as

0:00:37.760 --> 0:00:41.960
<v Speaker 2>companies manipulate test results to claim their product is the best.

0:00:42.440 --> 0:00:45.559
<v Speaker 2>On today's podcast, we're going to unpack how AI companies

0:00:45.760 --> 0:00:52.159
<v Speaker 2>measure intelligence and why that's become a problem.

0:00:52.280 --> 0:00:52.680
<v Speaker 3>Sam.

0:00:52.800 --> 0:00:56.400
<v Speaker 1>I was originally skeptical about having this conversation with you

0:00:56.520 --> 0:01:01.400
<v Speaker 1>because I, like maybe some listeners, hear AI and I

0:01:01.520 --> 0:01:03.400
<v Speaker 1>kind of roll my eyes a little

0:01:03.080 --> 0:01:04.160
<v Speaker 3>bit and switch off.

0:01:04.240 --> 0:01:06.920
<v Speaker 1>But if you are that person hearing this right now,

0:01:07.040 --> 0:01:10.679
<v Speaker 1>hang in there, because this is actually a fascinating conversation,

0:01:11.360 --> 0:01:14.600
<v Speaker 1>this idea that we're sort of being marketed to about

0:01:14.959 --> 0:01:18.600
<v Speaker 1>this arms race of who is the smartest, which AI model

0:01:18.400 --> 0:01:19.040
<v Speaker 3>is the best.

0:01:19.880 --> 0:01:23.160
<v Speaker 1>Let's start with the basics here, though, when we're talking

0:01:23.240 --> 0:01:24.640
<v Speaker 1>about AI models,

0:01:24.800 --> 0:01:26.160
<v Speaker 3>what exactly does that mean?

0:01:26.640 --> 0:01:30.120
<v Speaker 2>There is a certain brand of satisfaction that is reserved

0:01:30.200 --> 0:01:31.959
<v Speaker 2>for when I can change your mind on whether a

0:01:32.000 --> 0:01:33.280
<v Speaker 2>story is going to be interesting or

0:01:33.280 --> 0:01:35.240
<v Speaker 3>not, especially if it's a tech story.

0:01:35.400 --> 0:01:38.399
<v Speaker 2>This is this is going to be awesome. So AI

0:01:38.480 --> 0:01:42.760
<v Speaker 2>models are computer programs that can understand and generate language,

0:01:42.880 --> 0:01:46.880
<v Speaker 2>human language. Just think of them as very advanced

0:01:46.920 --> 0:01:49.800
<v Speaker 2>autocomplete systems, like the ones that could fill in

0:01:49.800 --> 0:01:53.720
<v Speaker 2>a form for you or, you know, password-remembering little

0:01:53.880 --> 0:01:56.880
<v Speaker 2>widgets in your browser, anything that presumes what you're

0:01:56.920 --> 0:01:59.160
<v Speaker 2>about to do or want, so it can kind of fill

0:01:59.200 --> 0:02:00.000
<v Speaker 2>in those gaps for you.

0:02:00.280 --> 0:02:01.800
<v Speaker 3>That's actually a really good way to think of it.

0:02:02.000 --> 0:02:04.680
<v Speaker 2>See, we're off to a flyer. You type in a

0:02:04.800 --> 0:02:08.000
<v Speaker 2>question or a request, those responses are generated. The most

0:02:08.000 --> 0:02:10.400
<v Speaker 2>famous ones you might have heard of include ChatGPT

0:02:10.520 --> 0:02:14.800
<v Speaker 2>from OpenAI, you've got Claude from Anthropic, and Gemini

0:02:14.880 --> 0:02:15.440
<v Speaker 2>from Google.

0:02:15.760 --> 0:02:20.400
<v Speaker 1>Okay, now, it does seem like every AI company out

0:02:20.400 --> 0:02:24.000
<v Speaker 1>there claims that its model is the smartest or the

0:02:24.040 --> 0:02:26.840
<v Speaker 1>most capable or better than the best one that we've

0:02:26.880 --> 0:02:30.160
<v Speaker 1>ever seen. And one of the biggest players in this space,

0:02:30.280 --> 0:02:35.240
<v Speaker 1>OpenAI, has just released GPT-5 this week. What

0:02:35.280 --> 0:02:37.440
<v Speaker 1>are their claims about this new model?

0:02:37.720 --> 0:02:40.400
<v Speaker 2>So they're making some big statements here. They're saying that

0:02:40.520 --> 0:02:44.560
<v Speaker 2>GPT-5 scored ninety four point six percent on a

0:02:44.639 --> 0:02:48.480
<v Speaker 2>test that measures its ability to solve advanced maths problems,

0:02:48.960 --> 0:02:52.119
<v Speaker 2>seventy four point nine percent on real world coding tasks,

0:02:52.240 --> 0:02:55.880
<v Speaker 2>and produces forty five percent fewer factual errors than their

0:02:55.919 --> 0:02:59.160
<v Speaker 2>previous models. The CEO of the company, Sam Altman,

0:02:59.240 --> 0:03:01.720
<v Speaker 2>called it the best model in the world, which kind

0:03:01.720 --> 0:03:04.080
<v Speaker 2>of sounds like those phrases you were saying before, and

0:03:04.120 --> 0:03:08.000
<v Speaker 2>said it represents a significant step towards what's called artificial

0:03:08.120 --> 0:03:12.040
<v Speaker 2>general intelligence, AGI, which is basically the idea that AI

0:03:12.120 --> 0:03:15.919
<v Speaker 2>can actually perform an intellectual task better than humans can.

0:03:16.080 --> 0:03:18.400
<v Speaker 1>Okay, so that's when we start to imagine like the

0:03:18.440 --> 0:03:19.919
<v Speaker 1>I, Robot future.

0:03:20.240 --> 0:03:22.080
<v Speaker 2>Yeah, and it's when we get into those examples of

0:03:22.120 --> 0:03:25.320
<v Speaker 2>things like AI blackmailing you if you decide to stop

0:03:25.400 --> 0:03:27.440
<v Speaker 2>using it and kind of taking on a life of

0:03:27.480 --> 0:03:28.120
<v Speaker 2>its own.

0:03:28.560 --> 0:03:32.160
<v Speaker 1>So those numbers from OpenAI about this new model

0:03:32.240 --> 0:03:35.720
<v Speaker 1>sound pretty impressive, like ninety five percent on advanced maths.

0:03:35.760 --> 0:03:40.600
<v Speaker 1>Particularly interesting this kind of idea of producing fewer factual errors,

0:03:40.640 --> 0:03:43.560
<v Speaker 1>because that's always kind of in the spotlight around the

0:03:43.600 --> 0:03:47.240
<v Speaker 1>skepticism towards AI. But I'm interested in how these companies

0:03:47.360 --> 0:03:51.800
<v Speaker 1>are actually measuring the intelligence of these products. You mentioned

0:03:51.840 --> 0:03:54.640
<v Speaker 1>in the intro, Sam, that this is becoming a bit

0:03:54.640 --> 0:03:57.240
<v Speaker 1>of an issue. Yeah. So what exactly is the concern?

0:03:57.880 --> 0:04:00.720
<v Speaker 2>Well, ultimately it's the idea that AI companies are

0:04:00.760 --> 0:04:04.240
<v Speaker 2>all using different tests to prove that their model is

0:04:04.280 --> 0:04:08.480
<v Speaker 2>the best. It's like if car companies all claimed

0:04:08.560 --> 0:04:11.839
<v Speaker 2>to make the fastest car ever or the safest car ever,

0:04:12.400 --> 0:04:14.840
<v Speaker 2>but one tested on a highway, the other tested on

0:04:14.880 --> 0:04:17.400
<v Speaker 2>a racetrack, and the other one went downhill on a

0:04:17.440 --> 0:04:21.200
<v Speaker 2>windy day. A major study published earlier this year into

0:04:21.320 --> 0:04:25.920
<v Speaker 2>AI models actually compared the situation to Volkswagen, who were

0:04:25.960 --> 0:04:29.640
<v Speaker 2>found guilty of lying about the emissions or the lack

0:04:29.680 --> 0:04:32.840
<v Speaker 2>of emissions that their cars were producing when they basically

0:04:32.920 --> 0:04:37.320
<v Speaker 2>cheated on pollution tests. The researchers noted that when companies

0:04:37.360 --> 0:04:41.560
<v Speaker 2>manipulated car testing, people were going to jail, but similar

0:04:41.680 --> 0:04:45.719
<v Speaker 2>manipulation in AI isn't really coming to our attention.

0:04:46.440 --> 0:04:48.000
<v Speaker 3>Wow, it's fascinating.

0:04:48.000 --> 0:04:51.279
<v Speaker 1>I remember that Volkswagen emissions scandal. So a good comparison,

0:04:51.440 --> 0:04:54.080
<v Speaker 1>and how the tick for Sam. So, how can these

0:04:54.240 --> 0:04:58.559
<v Speaker 1>AI models then be tested in a fair way.

0:04:58.680 --> 0:05:00.080
<v Speaker 3>What does testing

0:04:59.760 --> 0:05:03.200
<v Speaker 1>artificial intelligence kind of transparently and consistently look like?

0:05:03.440 --> 0:05:05.120
<v Speaker 2>Well, naturally, the first thing to do would be to

0:05:05.200 --> 0:05:08.640
<v Speaker 2>standardize the same test across every model, and that would

0:05:08.640 --> 0:05:11.560
<v Speaker 2>be described as a benchmark, a new global benchmark for

0:05:11.600 --> 0:05:14.320
<v Speaker 2>how these models are performing. And that could be to

0:05:14.400 --> 0:05:17.840
<v Speaker 2>measure a specific ability, say in maths, you could give

0:05:17.960 --> 0:05:20.760
<v Speaker 2>all of them the same advanced maths problem and then

0:05:20.800 --> 0:05:23.440
<v Speaker 2>measure not only the output, but how long it takes

0:05:23.480 --> 0:05:26.880
<v Speaker 2>for them to get there, what processes they undertook to

0:05:27.000 --> 0:05:29.760
<v Speaker 2>reach that final destination of the answer. You could give

0:05:29.800 --> 0:05:32.240
<v Speaker 2>that a score and then actually compare like for like

0:05:32.320 --> 0:05:32.960
<v Speaker 2>these models.

0:05:33.360 --> 0:05:35.800
<v Speaker 3>It kind of sounds pretty straightforward.

0:05:35.880 --> 0:05:39.960
<v Speaker 1>That to me seems like the obvious path towards getting

0:05:40.000 --> 0:05:44.440
<v Speaker 1>consistent testing. So where does the manipulation come from?

0:05:44.560 --> 0:05:46.279
<v Speaker 2>Well, I think the first thing to acknowledge is that

0:05:46.320 --> 0:05:50.560
<v Speaker 2>there is no centralized global body that has the respect

0:05:50.680 --> 0:05:54.320
<v Speaker 2>or the ability to actually execute that sort of standardized testing.

0:05:54.400 --> 0:05:58.760
<v Speaker 2>There is no, say, TGA for drugs, there's no government

0:05:59.040 --> 0:06:02.080
<v Speaker 2>sponsored hub that can execute that kind of stuff. So

0:06:02.480 --> 0:06:04.919
<v Speaker 2>reason A, there's nobody to do it. But reason B

0:06:05.120 --> 0:06:09.240
<v Speaker 2>would be that these models are still in this accelerating

0:06:09.640 --> 0:06:12.960
<v Speaker 2>period of marketing where they're cherry picking tests that would

0:06:13.040 --> 0:06:17.599
<v Speaker 2>favor their models' strengths while hiding poor performance in other areas.

0:06:18.240 --> 0:06:21.560
<v Speaker 2>And one other problem that has come up is that

0:06:21.640 --> 0:06:24.599
<v Speaker 2>if AI knows the problem is coming, because it's AI

0:06:24.760 --> 0:06:27.720
<v Speaker 2>and it knows how tests are done, then it can

0:06:27.760 --> 0:06:30.960
<v Speaker 2>actually almost train itself for the test, and so there's

0:06:30.960 --> 0:06:33.120
<v Speaker 2>a bit of a data contamination problem. You'd have to

0:06:33.200 --> 0:06:36.760
<v Speaker 2>keep these tests almost offline entirely for the models to

0:06:36.760 --> 0:06:39.719
<v Speaker 2>see them for the first time. One study found, for example,

0:06:39.760 --> 0:06:43.200
<v Speaker 2>that GPT-4, which is an older model from

0:06:43.560 --> 0:06:47.680
<v Speaker 2>OpenAI, could solve coding problems from before twenty

0:06:47.760 --> 0:06:50.559
<v Speaker 2>twenty one that were published online, but it couldn't solve

0:06:50.720 --> 0:06:53.440
<v Speaker 2>new problems. And so then you get a sense of

0:06:53.520 --> 0:06:55.919
<v Speaker 2>kind of in the great big world of its brain,

0:06:56.000 --> 0:06:58.520
<v Speaker 2>which is the Internet, if those answers are somewhere out there,

0:06:58.520 --> 0:06:59.800
<v Speaker 2>it could just regurgitate them.

0:07:00.120 --> 0:07:02.599
<v Speaker 1>So it's like if you've got an advance copy of

0:07:02.640 --> 0:07:05.880
<v Speaker 1>an exam or a test at uni or in school, you

0:07:06.000 --> 0:07:09.960
<v Speaker 1>can train for the test. That doesn't necessarily mean that

0:07:10.240 --> 0:07:14.040
<v Speaker 1>you have the comprehension levels to speak to a certain

0:07:14.120 --> 0:07:17.880
<v Speaker 1>topic or question in the same subject outside of the

0:07:17.880 --> 0:07:19.160
<v Speaker 1>confines of that context.

0:07:19.160 --> 0:07:21.040
<v Speaker 2>And if we think about what all of this is for,

0:07:21.160 --> 0:07:23.400
<v Speaker 2>it's about trying to work out if these models are

0:07:23.400 --> 0:07:25.600
<v Speaker 2>going to be good in practice for us to spend

0:07:25.680 --> 0:07:27.440
<v Speaker 2>twenty bucks a month on them. I mean, let's get

0:07:27.480 --> 0:07:29.600
<v Speaker 2>back to the real core problem here. We're trying to

0:07:29.600 --> 0:07:32.560
<v Speaker 2>work out if it's worth our money. And there was

0:07:32.560 --> 0:07:35.400
<v Speaker 2>a great quote from the British Prime Minister, former British

0:07:35.440 --> 0:07:39.200
<v Speaker 2>Prime Minister Rishi Sunak. He said AI models shouldn't be

0:07:39.240 --> 0:07:41.640
<v Speaker 2>trusted to mark their own homework. And I think that

0:07:41.680 --> 0:07:43.480
<v Speaker 2>we can all relate to that. Yeah, and it kind

0:07:43.520 --> 0:07:47.880
<v Speaker 2>of encapsulates what's the problem with this independent benchmarking framework.

0:07:48.160 --> 0:07:52.960
<v Speaker 1>You also mentioned that companies are testing multiple versions, or

0:07:53.000 --> 0:07:56.720
<v Speaker 1>that they're cherry picking their data and choosing the kind

0:07:56.760 --> 0:07:59.440
<v Speaker 1>of findings that favor their models the most.

0:08:00.040 --> 0:08:00.840
<v Speaker 3>What's happening there.

0:08:00.720 --> 0:08:03.200
<v Speaker 2>Tell us a bit more. Well, some research found that

0:08:03.240 --> 0:08:07.040
<v Speaker 2>major companies, we're talking Meta, OpenAI and Google, have

0:08:07.120 --> 0:08:11.440
<v Speaker 2>been privately testing dozens of different model versions on popular tests.

0:08:12.240 --> 0:08:15.840
<v Speaker 2>They're only revealing the scores from their best performing versions.

0:08:16.200 --> 0:08:17.920
<v Speaker 2>So it's like, you know, you're on a night out,

0:08:17.920 --> 0:08:19.960
<v Speaker 2>you take twenty selfies, you put up the best one. Yeah,

0:08:20.000 --> 0:08:22.680
<v Speaker 2>of course, and I think at some stage you have

0:08:22.760 --> 0:08:25.800
<v Speaker 2>to admit that all businesses would do that. Yeah, you know, TDA,

0:08:25.920 --> 0:08:28.840
<v Speaker 2>if we had to report results to the stock market,

0:08:28.960 --> 0:08:31.240
<v Speaker 2>you know, we would probably highlight more the pieces that

0:08:31.240 --> 0:08:34.319
<v Speaker 2>did really, really well. Not that there's ever any pieces

0:08:34.440 --> 0:08:35.960
<v Speaker 2>that don't, but you know.

0:08:36.040 --> 0:08:38.520
<v Speaker 3>A flawless company that never makes mistakes.

0:08:38.559 --> 0:08:40.800
<v Speaker 2>Obviously, but we have to. I think it's good to

0:08:40.840 --> 0:08:43.800
<v Speaker 2>acknowledge this bit of kind of business reality there. But

0:08:44.040 --> 0:08:47.439
<v Speaker 2>I do think that in this case it's different because

0:08:48.040 --> 0:08:51.120
<v Speaker 2>there's no transparency at all in terms of the testing process.

0:08:51.120 --> 0:08:55.800
<v Speaker 2>It's to continue with our university kind of example. It's

0:08:55.840 --> 0:08:58.600
<v Speaker 2>like a student taking the same exam twenty seven times

0:08:59.000 --> 0:09:00.920
<v Speaker 2>and then only reporting the best score. Yep.

0:09:01.480 --> 0:09:06.000
<v Speaker 1>So without that transparency, there's that issue around trust, and

0:09:06.080 --> 0:09:09.040
<v Speaker 1>I think we see that really playing out in real

0:09:09.120 --> 0:09:12.200
<v Speaker 1>time right now, that there is a lack of trust

0:09:12.440 --> 0:09:16.839
<v Speaker 1>in the broader community about AI models because we don't

0:09:16.880 --> 0:09:19.080
<v Speaker 1>know how they come to these answers. What are some

0:09:19.160 --> 0:09:22.120
<v Speaker 1>of the other consequences of this manipulation. How does this

0:09:22.240 --> 0:09:24.560
<v Speaker 1>play out in the real world every day?

0:09:24.800 --> 0:09:28.439
<v Speaker 2>Well, there's definitely that marketing angle of misleading consumers and

0:09:28.760 --> 0:09:31.360
<v Speaker 2>you and I signing up to an AI platform because

0:09:31.360 --> 0:09:33.880
<v Speaker 2>we think it's ninety six percent going to be great,

0:09:33.920 --> 0:09:36.320
<v Speaker 2>and in fact it might be eighty one percent great,

0:09:36.360 --> 0:09:39.640
<v Speaker 2>which is still an incredible feat of technology. But then

0:09:39.640 --> 0:09:43.359
<v Speaker 2>from a government perspective, governments are looking at these benchmarks

0:09:43.600 --> 0:09:47.640
<v Speaker 2>for the way that they're thinking about regulation or policy decisions.

0:09:48.200 --> 0:09:52.120
<v Speaker 2>So the European Union's AI Act it uses benchmarks to

0:09:52.320 --> 0:09:56.920
<v Speaker 2>determine whether new AI models pose systemic risk. Can they

0:09:56.920 --> 0:09:59.640
<v Speaker 2>be used by extremists? Can they be used to spread

0:09:59.760 --> 0:10:03.679
<v Speaker 2>racism online? Can they be used to mislead and deliberately

0:10:03.720 --> 0:10:08.319
<v Speaker 2>spread misinformation? And if companies are manipulating those scores, it

0:10:08.360 --> 0:10:12.079
<v Speaker 2>could affect how these powerful technologies are indeed regulated.

0:10:12.160 --> 0:10:15.960
<v Speaker 1>Okay, because if these scores say that eighty percent of

0:10:16.000 --> 0:10:18.880
<v Speaker 1>the content is factual, or that there are these really

0:10:18.920 --> 0:10:22.480
<v Speaker 1>great systems in place to catch mis- and disinformation or

0:10:22.520 --> 0:10:25.600
<v Speaker 1>hate speech, then that might not concern leaders to the

0:10:25.600 --> 0:10:28.120
<v Speaker 1>point where they think there needs to be certain levels

0:10:28.120 --> 0:10:29.080
<v Speaker 1>of regulation.

0:10:28.840 --> 0:10:29.559
<v Speaker 2>One hundred percent.

0:10:30.000 --> 0:10:34.480
<v Speaker 1>You mentioned this idea of artificial general intelligence earlier. We

0:10:34.640 --> 0:10:37.760
<v Speaker 1>used the I, Robot example, one of Will Smith's best.

0:10:38.520 --> 0:10:41.680
<v Speaker 1>OpenAI is claiming that GPT-5 is a step

0:10:41.720 --> 0:10:46.080
<v Speaker 1>forward in AGI, But what does that actually mean in

0:10:46.120 --> 0:10:48.560
<v Speaker 1>a not Hollywood kind of fantasy world.

0:10:48.960 --> 0:10:53.319
<v Speaker 2>Well, I gave the example before of outperforming humans. That's

0:10:53.400 --> 0:10:56.080
<v Speaker 2>a very broad definition, and the problem is that I

0:10:56.080 --> 0:10:58.640
<v Speaker 2>can't really give you a more specific definition because even

0:10:58.720 --> 0:11:01.560
<v Speaker 2>OpenAI can't really do that, right? One OpenAI

0:11:01.679 --> 0:11:04.960
<v Speaker 2>statement said AGI is still a weakly defined term and

0:11:05.040 --> 0:11:07.800
<v Speaker 2>means different things to different people. We don't really know

0:11:07.840 --> 0:11:08.600
<v Speaker 2>what we don't know.

0:11:08.800 --> 0:11:11.480
<v Speaker 3>So how can GPT-5 be a step forward,

0:11:11.559 --> 0:11:11.719
<v Speaker 2>then,

0:11:11.760 --> 0:11:13.959
<v Speaker 3>if the company itself isn't sure?

0:11:14.559 --> 0:11:19.080
<v Speaker 2>Interesting question. It very much raises some questions about how do

0:11:19.120 --> 0:11:21.319
<v Speaker 2>we know when we got there, even? Yeah, I mean

0:11:21.720 --> 0:11:25.439
<v Speaker 2>this is the exciting and terrifying part of living through

0:11:26.240 --> 0:11:29.800
<v Speaker 2>rapidly emerging technology is that we're learning as we go

0:11:30.040 --> 0:11:33.000
<v Speaker 2>as a society, and that is not always pretty.

0:11:33.520 --> 0:11:37.199
<v Speaker 1>So for people listening who might be using AI kind

0:11:37.240 --> 0:11:41.520
<v Speaker 1>of casually or infrequently in their maybe work or uni life,

0:11:42.000 --> 0:11:45.120
<v Speaker 1>maybe they're building up their understanding of the different platforms

0:11:45.160 --> 0:11:49.079
<v Speaker 1>out there. What should we make of all these competing claims?

0:11:49.160 --> 0:11:51.839
<v Speaker 1>You know, how do we make better decisions about which

0:11:51.920 --> 0:11:55.319
<v Speaker 1>AI model is actually the good one, or the right one,

0:11:55.400 --> 0:11:56.480
<v Speaker 1>or the best one for us?

0:11:56.559 --> 0:11:59.720
<v Speaker 2>I'm constantly asked as somebody who is known now in

0:11:59.720 --> 0:12:02.160
<v Speaker 2>my friend group and in the workplace as somebody who's

0:12:02.400 --> 0:12:05.320
<v Speaker 2>really interested in AI. I'm constantly asked which one should

0:12:05.360 --> 0:12:07.720
<v Speaker 2>I use, what's the best one, And the answer is,

0:12:07.880 --> 0:12:10.400
<v Speaker 2>it's about what you're trying to do, essentially, So one

0:12:10.440 --> 0:12:13.079
<v Speaker 2>model might be better for creative writing, but another might

0:12:13.120 --> 0:12:16.840
<v Speaker 2>excel more at data analysis and crunching some numbers. Studies

0:12:16.880 --> 0:12:19.920
<v Speaker 2>are showing though, that AI models often fail when you

0:12:20.000 --> 0:12:23.480
<v Speaker 2>move from those controlled test conditions or those use cases

0:12:23.600 --> 0:12:26.840
<v Speaker 2>or features that are rolled out by these platforms as

0:12:26.840 --> 0:12:30.200
<v Speaker 2>part of marketing campaigns to the messy real world use

0:12:30.320 --> 0:12:32.640
<v Speaker 2>that humans actually use these tools for.

0:12:32.840 --> 0:12:35.360
<v Speaker 1>It actually reminds me of and I'm not even sure

0:12:35.400 --> 0:12:38.720
<v Speaker 1>if this is the same thing, but when Siri was

0:12:38.720 --> 0:12:41.840
<v Speaker 1>first rolled out and Apple kind of in their big

0:12:41.880 --> 0:12:44.000
<v Speaker 1>announcements it's like, you can ask her this, or you

0:12:44.000 --> 0:12:46.079
<v Speaker 1>can ask her that, or if you want to know what.

0:12:46.040 --> 0:12:47.840
<v Speaker 3>The weather's like, should you take an umbrella?

0:12:48.280 --> 0:12:51.080
<v Speaker 1>And I found when I first started using Siri, like, yeah,

0:12:51.120 --> 0:12:53.760
<v Speaker 1>you could definitely answer those sorts of questions, but not

0:12:53.840 --> 0:12:56.120
<v Speaker 1>a whole lot else outside of the almost like a

0:12:56.160 --> 0:12:59.480
<v Speaker 1>prescribed text from Apple about how to use Siri.

0:13:00.120 --> 0:13:03.360
<v Speaker 2>When you get into the world of trying to engage

0:13:03.360 --> 0:13:05.520
<v Speaker 2>with the user no matter what they're about to say,

0:13:06.000 --> 0:13:07.480
<v Speaker 2>it can take a little bit of time for the

0:13:07.559 --> 0:13:11.440
<v Speaker 2>technology to be refined and to keep learning from what

0:13:11.520 --> 0:13:12.560
<v Speaker 2>users actually want.

0:13:12.760 --> 0:13:15.600
<v Speaker 3>So, Sam, what is the way forward in all of this?

0:13:16.280 --> 0:13:20.360
<v Speaker 1>Is there a conversation happening at a more global scale

0:13:20.520 --> 0:13:21.880
<v Speaker 1>about this regulation?

0:13:22.360 --> 0:13:25.600
<v Speaker 2>Definitely, and there's no clear leader here. I mentioned the

0:13:25.880 --> 0:13:28.760
<v Speaker 2>work being done by the European Union before. There's a

0:13:28.800 --> 0:13:32.600
<v Speaker 2>coalition of countries including Australia that signed on to kind

0:13:32.640 --> 0:13:35.600
<v Speaker 2>of key principles of how to keep AI safe. That

0:13:35.720 --> 0:13:38.280
<v Speaker 2>was in mid twenty twenty three, so there's a bit

0:13:38.280 --> 0:13:41.200
<v Speaker 2>of a global movement there. From a government perspective, there's

0:13:41.200 --> 0:13:44.480
<v Speaker 2>some really interesting work being done out of universities, particularly

0:13:44.640 --> 0:13:48.920
<v Speaker 2>Stanford University. They developed an AI Index report which does

0:13:49.000 --> 0:13:52.480
<v Speaker 2>try to compare the models like for like. But I

0:13:52.520 --> 0:13:54.920
<v Speaker 2>think we first need to determine who the authority is

0:13:54.960 --> 0:13:57.280
<v Speaker 2>going to be in this space before we can kind

0:13:57.280 --> 0:13:59.959
<v Speaker 2>of put the burden on them to roll out this

0:14:00.000 --> 0:14:03.280
<v Speaker 2>standardized testing. And I do think in a few decades

0:14:03.320 --> 0:14:05.160
<v Speaker 2>it will take a while. I do think we'll get there.

0:14:05.360 --> 0:14:08.320
<v Speaker 2>I mean, we have the TGA to regulate medicine. We

0:14:08.480 --> 0:14:11.840
<v Speaker 2>have a central aviation authority to regulate what a plane

0:14:11.880 --> 0:14:15.079
<v Speaker 2>that's airworthy looks like. Yep. I do think that we're

0:14:15.120 --> 0:14:17.640
<v Speaker 2>going to see a central AI authority in Australia and

0:14:17.679 --> 0:14:20.800
<v Speaker 2>maybe around the world someday. But we are very early

0:14:20.840 --> 0:14:23.840
<v Speaker 2>in this story. We're like one percent through in the

0:14:23.960 --> 0:14:27.840
<v Speaker 2>AI story if that, and that's really exciting. But it's

0:14:27.920 --> 0:14:31.920
<v Speaker 2>also really important to continuously discuss the potential flaws and

0:14:32.240 --> 0:14:35.040
<v Speaker 2>the gaps that exist in this big, new, scary world.

0:14:35.360 --> 0:14:38.800
<v Speaker 1>Yeah, I think that healthy dose of skepticism is what

0:14:38.840 --> 0:14:41.040
<v Speaker 1>we will be carrying forward. But I look forward to

0:14:41.120 --> 0:14:43.360
<v Speaker 1>many more conversations like this with you, Sam.

0:14:43.440 --> 0:14:44.440
<v Speaker 2>Well, we don't have a choice.

0:14:44.520 --> 0:14:46.000
<v Speaker 3>Help me understand it all.

0:14:46.200 --> 0:14:49.000
<v Speaker 1>Thank you so much for breaking that down for us, Sam,

0:14:49.480 --> 0:14:52.200
<v Speaker 1>and thank you for listening to today's deep dive. We'll

0:14:52.200 --> 0:14:54.840
<v Speaker 1>be back a little later on with your news headlines,

0:14:54.880 --> 0:15:00.160
<v Speaker 1>but until then, have a great day.

0:15:00.880 --> 0:15:03.240
<v Speaker 2>My name is Lily Maddon and I'm a proud Arrernte

0:15:03.440 --> 0:15:08.240
<v Speaker 2>Bundjalung Kalkadoon woman from Gadigal Country. The Daily Aus acknowledges

0:15:08.320 --> 0:15:10.440
<v Speaker 2>that this podcast is recorded on the lands of the

0:15:10.480 --> 0:15:14.080
<v Speaker 2>Gadigal people and pays respect to all Aboriginal and Torres

0:15:14.160 --> 0:15:17.000
<v Speaker 2>Strait Islander nations. We pay our respects to the

0:15:17.000 --> 0:15:19.800
<v Speaker 2>first peoples of these countries, both past and present.