1
00:00:00,520 --> 00:00:04,080
Speaker 1: Alright, and this is The Daily... This is The Daily

2
00:00:04,120 --> 00:00:06,840
Speaker 1: Aus. Oh, now it makes sense.

3
00:00:14,720 --> 00:00:17,000
Speaker 2: Good morning, and welcome to The Daily Aus. It's Thursday,

4
00:00:17,040 --> 00:00:18,960
Speaker 2: the fourteenth of August. I'm Sam Koslowski.

5
00:00:19,160 --> 00:00:20,360
Speaker 1: I'm Emma Gillespie.

6
00:00:20,640 --> 00:00:24,040
Speaker 2: This month, the tech company behind ChatGPT released what

7
00:00:24,120 --> 00:00:28,280
Speaker 2: they claim is their smartest AI model yet. Now, according

8
00:00:28,280 --> 00:00:31,400
Speaker 2: to OpenAI, GPT-5 operates at the level of

9
00:00:31,440 --> 00:00:34,840
Speaker 2: a PhD student. But experts are warning that the AI

10
00:00:34,960 --> 00:00:37,720
Speaker 2: race has become a bit of a marketing battle, as

11
00:00:37,760 --> 00:00:41,960
Speaker 2: companies manipulate test results to claim their product is the best.

12
00:00:42,440 --> 00:00:45,559
Speaker 2: On today's podcast, we're going to unpack how AI companies

13
00:00:45,760 --> 00:00:52,159
Speaker 2: measure intelligence and why that's become a problem.

14
00:00:52,280 --> 00:00:52,680
Speaker 3: Sam.

15
00:00:52,800 --> 00:00:56,400
Speaker 1: I was originally skeptical about having this conversation with you

16
00:00:56,520 --> 00:01:01,400
Speaker 1: because I, like maybe some listeners, hear AI and I

17
00:01:01,520 --> 00:01:03,400
Speaker 1: kind of roll my eyes a little

18
00:01:03,080 --> 00:01:04,160
Speaker 3: bit and switch off.

19
00:01:04,240 --> 00:01:06,920
Speaker 1: But if you are that person hearing this right now,

20
00:01:07,040 --> 00:01:10,679
Speaker 1: hang in there, because this is actually a fascinating conversation,

21
00:01:11,360 --> 00:01:14,600
Speaker 1: this idea that we're sort of being marketed to about

22
00:01:14,959 --> 00:01:18,600
Speaker 1: this arms race of who is the smartest, which AI model

23
00:01:18,400 --> 00:01:19,040
Speaker 3: is the best.

24
00:01:19,880 --> 00:01:23,160
Speaker 1: Let's start with the basics here, though, when we're talking

25
00:01:23,240 --> 00:01:24,640
Speaker 1: about AI models.

26
00:01:24,800 --> 00:01:26,160
Speaker 3: What exactly does that mean?

27
00:01:26,640 --> 00:01:30,120
Speaker 2: There is a certain brand of satisfaction that is reserved

28
00:01:30,200 --> 00:01:31,959
Speaker 2: for when I can change your mind about whether a

29
00:01:32,000 --> 00:01:33,280
Speaker 2: story is going to be interesting or

30
00:01:33,280 --> 00:01:35,240
Speaker 3: not, especially if it's a tech story.

31
00:01:35,400 --> 00:01:38,399
Speaker 2: This is going to be awesome. So AI

32
00:01:38,480 --> 00:01:42,760
Speaker 2: models are computer programs that can understand and generate language,

33
00:01:42,880 --> 00:01:46,880
Speaker 2: human language. Just think of them as very advanced

34
00:01:46,920 --> 00:01:49,800
Speaker 2: autocomplete systems, like the ones that can fill in

35
00:01:49,800 --> 00:01:53,720
Speaker 2: a form for you or, you know, password-remembering little

36
00:01:53,880 --> 00:01:56,880
Speaker 2: widgets in your browser: anything that presumes what you're

37
00:01:56,920 --> 00:01:59,160
Speaker 2: about to do or want and can kind of fill

38
00:01:59,200 --> 00:02:00,000
Speaker 2: in those gaps for you.

39
00:02:00,280 --> 00:02:01,800
Speaker 3: That's actually a really good way to think of it.

40
00:02:02,000 --> 00:02:04,680
Speaker 2: See, we're off to a flyer. You type in a

41
00:02:04,800 --> 00:02:08,000
Speaker 2: question or a request, and responses are generated. The most

42
00:02:08,000 --> 00:02:10,400
Speaker 2: famous ones you might have heard of include ChatGPT

43
00:02:10,520 --> 00:02:14,800
Speaker 2: from OpenAI, you've got Claude from Anthropic, and Gemini

44
00:02:14,880 --> 00:02:15,440
Speaker 2: from Google.

45
00:02:15,760 --> 00:02:20,400
Speaker 1: Okay, now, it does seem like every AI company out

46
00:02:20,400 --> 00:02:24,000
Speaker 1: there claims that its model is the smartest or the

47
00:02:24,040 --> 00:02:26,840
Speaker 1: most capable or better than the best one that we've

48
00:02:26,880 --> 00:02:30,160
Speaker 1: ever seen. And one of the biggest players in this space,

49
00:02:30,280 --> 00:02:35,240
Speaker 1: OpenAI, has just released GPT-5 this week. What

50
00:02:35,280 --> 00:02:37,440
Speaker 1: are their claims about this new model?

51
00:02:37,720 --> 00:02:40,400
Speaker 2: So they're making some big statements here. They're saying that

52
00:02:40,520 --> 00:02:44,560
Speaker 2: GPT-5 scored ninety-four point six percent on a

53
00:02:44,639 --> 00:02:48,480
Speaker 2: test that measures its ability to solve advanced maths problems,

54
00:02:48,960 --> 00:02:52,119
Speaker 2: seventy four point nine percent on real world coding tasks,

55
00:02:52,240 --> 00:02:55,880
Speaker 2: and produces forty five percent fewer factual errors than their

56
00:02:55,919 --> 00:02:59,160
Speaker 2: previous models. The CEO of the company, Sam Altman,

57
00:02:59,240 --> 00:03:01,720
Speaker 2: called it the best model in the world, which kind

58
00:03:01,720 --> 00:03:04,080
Speaker 2: of sounds like those claims you were mentioning before, and

59
00:03:04,120 --> 00:03:08,000
Speaker 2: said it represents a significant step towards what's called artificial

60
00:03:08,120 --> 00:03:12,040
Speaker 2: general intelligence, or AGI, which is basically the idea that AI

61
00:03:12,120 --> 00:03:15,919
Speaker 2: can actually perform an intellectual task better than humans can.

62
00:03:16,080 --> 00:03:18,400
Speaker 1: Okay, so that's when we start to imagine like the

63
00:03:18,440 --> 00:03:19,919
Speaker 1: I, Robot future.

64
00:03:20,240 --> 00:03:22,080
Speaker 2: Yeah, and it's when we get into those examples of

65
00:03:22,120 --> 00:03:25,320
Speaker 2: things like AI blackmailing you if you decide to stop

66
00:03:25,400 --> 00:03:27,440
Speaker 2: using it and kind of taking on a life of

67
00:03:27,480 --> 00:03:28,120
Speaker 2: its own.

68
00:03:28,560 --> 00:03:32,160
Speaker 1: So those numbers from OpenAI about this new model

69
00:03:32,240 --> 00:03:35,720
Speaker 1: sound pretty impressive, like ninety five percent on advanced maths.

70
00:03:35,760 --> 00:03:40,600
Speaker 1: Particularly interesting is this idea of producing fewer factual errors,

71
00:03:40,640 --> 00:03:43,560
Speaker 1: because that's always kind of in the spotlight around the

72
00:03:43,600 --> 00:03:47,240
Speaker 1: skepticism towards AI. But I'm interested in how these companies

73
00:03:47,360 --> 00:03:51,800
Speaker 1: are actually measuring the intelligence of these products. You mentioned

74
00:03:51,840 --> 00:03:54,640
Speaker 1: in the intro, Sam, that this is becoming a bit

75
00:03:54,640 --> 00:03:57,240
Speaker 1: of an issue. Yeah. So what exactly is the concern?

76
00:03:57,880 --> 00:04:00,720
Speaker 2: Well, ultimately it's the idea that AI companies are

77
00:04:00,760 --> 00:04:04,240
Speaker 2: all using different tests to prove that their model is

78
00:04:04,280 --> 00:04:08,480
Speaker 2: the best. It's like if all car companies claimed

79
00:04:08,560 --> 00:04:11,839
Speaker 2: to make the fastest car ever or the safest car ever,

80
00:04:12,400 --> 00:04:14,840
Speaker 2: but one tested on a highway, another tested on

81
00:04:14,880 --> 00:04:17,400
Speaker 2: a racetrack, and another went downhill on a

82
00:04:17,440 --> 00:04:21,200
Speaker 2: windy day. A major study published earlier this year into

83
00:04:21,320 --> 00:04:25,920
Speaker 2: AI models actually compared the situation to Volkswagen, which was

84
00:04:25,960 --> 00:04:29,640
Speaker 2: found guilty of lying about the emissions or the lack

85
00:04:29,680 --> 00:04:32,840
Speaker 2: of emissions that its cars were producing when it basically

86
00:04:32,920 --> 00:04:37,320
Speaker 2: cheated on pollution tests. The researchers noted that when companies

87
00:04:37,360 --> 00:04:41,560
Speaker 2: manipulated car testing, people were going to jail, but similar

88
00:04:41,680 --> 00:04:45,719
Speaker 2: manipulation in AI isn't really getting the same attention.

89
00:04:46,440 --> 00:04:48,000
Speaker 3: Wow, it's fascinating.

90
00:04:48,000 --> 00:04:51,279
Speaker 1: I remember that Volkswagen emission scandal, So a good comparison,

91
00:04:51,440 --> 00:04:54,080
Speaker 1: and a tick for Sam. So, how can these

92
00:04:54,240 --> 00:04:58,559
Speaker 1: AI models then be tested in a fair way.

93
00:04:58,680 --> 00:05:00,080
Speaker 3: What does testing

94
00:04:59,760 --> 00:05:03,200
Speaker 1: artificial intelligence transparently and consistently look like?

95
00:05:03,440 --> 00:05:05,120
Speaker 2: Well, naturally, the first thing to do would be to

96
00:05:05,200 --> 00:05:08,640
Speaker 2: standardize the same test across every model, and that would

97
00:05:08,640 --> 00:05:11,560
Speaker 2: be described as a benchmark, a global benchmark for

98
00:05:11,600 --> 00:05:14,320
Speaker 2: how these models are performing. And that could be to

99
00:05:14,400 --> 00:05:17,840
Speaker 2: measure a specific ability, say in maths, you could give

100
00:05:17,960 --> 00:05:20,760
Speaker 2: all of them the same advanced maths problem and then

101
00:05:20,800 --> 00:05:23,440
Speaker 2: measure not only the output, but how long it takes

102
00:05:23,480 --> 00:05:26,880
Speaker 2: for them to get there, what processes they undertook to

103
00:05:27,000 --> 00:05:29,760
Speaker 2: reach that final destination of the answer. You could give

104
00:05:29,800 --> 00:05:32,240
Speaker 2: that a score and then actually compare like for like

105
00:05:32,320 --> 00:05:32,960
Speaker 2: these models.

106
00:05:33,360 --> 00:05:35,800
Speaker 3: It kind of sounds pretty straightforward.

107
00:05:35,880 --> 00:05:39,960
Speaker 1: That to me seems like the obvious path towards getting

108
00:05:40,000 --> 00:05:44,440
Speaker 1: consistent testing. So where does the manipulation come from?

109
00:05:44,560 --> 00:05:46,279
Speaker 2: Well, I think the first thing to acknowledge is that

110
00:05:46,320 --> 00:05:50,560
Speaker 2: there is no centralized global body that has the respect

111
00:05:50,680 --> 00:05:54,320
Speaker 2: or the ability to actually execute that sort of standardized testing.

112
00:05:54,400 --> 00:05:58,760
Speaker 2: There's no, say, TGA for drugs; there's no government

113
00:05:59,040 --> 00:06:02,080
Speaker 2: sponsored hub that can execute that kind of stuff. So

114
00:06:02,480 --> 00:06:04,919
Speaker 2: reason A, there's nobody to do it. But reason B

115
00:06:05,120 --> 00:06:09,240
Speaker 2: would be that these models are still in this accelerating

116
00:06:09,640 --> 00:06:12,960
Speaker 2: period of marketing where they're cherry picking tests that would

117
00:06:13,040 --> 00:06:17,599
Speaker 2: favor their models' strengths while hiding poor performance in other areas.

118
00:06:18,240 --> 00:06:21,560
Speaker 2: And one other problem that has come up is that

119
00:06:21,640 --> 00:06:24,599
Speaker 2: if AI knows the problem is coming, because it's AI

120
00:06:24,760 --> 00:06:27,720
Speaker 2: and it knows how tests are done, then it can

121
00:06:27,760 --> 00:06:30,960
Speaker 2: actually almost train itself for the test, and so there's

122
00:06:30,960 --> 00:06:33,120
Speaker 2: a bit of a data contamination problem. You'd have to

123
00:06:33,200 --> 00:06:36,760
Speaker 2: keep these tests almost offline entirely for the models to

124
00:06:36,760 --> 00:06:39,719
Speaker 2: see them for the first time. One study found, for example,

125
00:06:39,760 --> 00:06:43,200
Speaker 2: that GPT-4, which is an older model from

126
00:06:43,560 --> 00:06:47,680
Speaker 2: OpenAI, could solve coding problems from before twenty

127
00:06:47,760 --> 00:06:50,559
Speaker 2: twenty one that were published online, but it couldn't solve

128
00:06:50,720 --> 00:06:53,440
Speaker 2: new problems. And so then you get a sense of

129
00:06:53,520 --> 00:06:55,919
Speaker 2: kind of in the great big world of its brain,

130
00:06:56,000 --> 00:06:58,520
Speaker 2: which is the Internet, if those answers are somewhere out there,

131
00:06:58,520 --> 00:06:59,800
Speaker 2: it could just regurgitate them.

132
00:07:00,120 --> 00:07:02,599
Speaker 1: So it's like if you've got an advanced copy of

133
00:07:02,640 --> 00:07:05,880
Speaker 1: an exam or a test at uni or in school, you

134
00:07:06,000 --> 00:07:09,960
Speaker 1: can train for the test. That doesn't necessarily mean that

135
00:07:10,240 --> 00:07:14,040
Speaker 1: you have the comprehension levels to speak to a certain

136
00:07:14,120 --> 00:07:17,880
Speaker 1: topic or question in the same subject outside of the

137
00:07:17,880 --> 00:07:19,160
Speaker 1: confines of that context.

138
00:07:19,160 --> 00:07:21,040
Speaker 2: And if we think about what all of this is for,

139
00:07:21,160 --> 00:07:23,400
Speaker 2: it's about trying to work out if these models are

140
00:07:23,400 --> 00:07:25,600
Speaker 2: going to be good in practice for us to spend

141
00:07:25,680 --> 00:07:27,440
Speaker 2: twenty bucks a month on them. I mean, let's get

142
00:07:27,480 --> 00:07:29,600
Speaker 2: back to the real core problem here. We're trying to

143
00:07:29,600 --> 00:07:32,560
Speaker 2: work out if it's worth our money. And there was

144
00:07:32,560 --> 00:07:35,400
Speaker 2: a great quote from former British

145
00:07:35,440 --> 00:07:39,200
Speaker 2: Prime Minister Rishi Sunak. He said AI models shouldn't be

146
00:07:39,240 --> 00:07:41,640
Speaker 2: trusted to mark their own homework. And I think that

147
00:07:41,680 --> 00:07:43,480
Speaker 2: we can all relate to that. Yeah, and it kind

148
00:07:43,520 --> 00:07:47,880
Speaker 2: of encapsulates the problem of not having an independent benchmarking framework.

149
00:07:48,160 --> 00:07:52,960
Speaker 1: You also mentioned that companies are testing multiple versions, or

150
00:07:53,000 --> 00:07:56,720
Speaker 1: that they're cherry picking their data and choosing the kind

151
00:07:56,760 --> 00:07:59,440
Speaker 1: of findings that favor their models the most.

152
00:08:00,040 --> 00:08:00,840
Speaker 3: What's happening there?

153
00:08:00,720 --> 00:08:03,200
Speaker 2: Tell us a bit more. Well, some research found that

154
00:08:03,240 --> 00:08:07,040
Speaker 2: major companies, we're talking Meta, OpenAI and Google, have

155
00:08:07,120 --> 00:08:11,440
Speaker 2: been privately testing dozens of different model versions on popular tests.

156
00:08:12,240 --> 00:08:15,840
Speaker 2: They're only revealing the scores from their best performing versions.

157
00:08:16,200 --> 00:08:17,920
Speaker 2: And it's like, you know, you're on a night out,

158
00:08:17,920 --> 00:08:19,960
Speaker 2: you take twenty selfies, you put up the best one. Yeah,

159
00:08:20,000 --> 00:08:22,680
Speaker 2: of course, and I think at some stage you have

160
00:08:22,760 --> 00:08:25,800
Speaker 2: to admit that all businesses would do that. Yeah, you know, TDA,

161
00:08:25,920 --> 00:08:28,840
Speaker 2: if we had to report results to the stock market,

162
00:08:28,960 --> 00:08:31,240
Speaker 2: you know, we would probably highlight more the pieces that

163
00:08:31,240 --> 00:08:34,319
Speaker 2: did really, really well. Not that there's ever any pieces

164
00:08:34,440 --> 00:08:35,960
Speaker 2: that don't, but you know.

165
00:08:36,040 --> 00:08:38,520
Speaker 3: A flawless company that never makes mistakes.

166
00:08:38,559 --> 00:08:40,800
Speaker 2: Obviously, but we have to. I think it's good to

167
00:08:40,840 --> 00:08:43,800
Speaker 2: acknowledge this bit of kind of business reality there. But

168
00:08:44,040 --> 00:08:47,439
Speaker 2: I do think that in this case it's different because

169
00:08:48,040 --> 00:08:51,120
Speaker 2: there's no transparency at all in terms of the testing process.

170
00:08:51,120 --> 00:08:55,800
Speaker 2: To continue with our university kind of example, it's

171
00:08:55,840 --> 00:08:58,600
Speaker 2: like a student taking the same exam twenty seven times

172
00:08:59,000 --> 00:09:00,920
Speaker 2: and then only reporting the best score. Yep.

173
00:09:01,480 --> 00:09:06,000
Speaker 1: So without that transparency, there's that issue around trust, and

174
00:09:06,080 --> 00:09:09,040
Speaker 1: I think we see that really playing out in real

175
00:09:09,120 --> 00:09:12,200
Speaker 1: time right now, that there is a lack of trust

176
00:09:12,440 --> 00:09:16,839
Speaker 1: in the broader community about AI models because we don't

177
00:09:16,880 --> 00:09:19,080
Speaker 1: know how they come to these answers. What are some

178
00:09:19,160 --> 00:09:22,120
Speaker 1: of the other consequences of this manipulation? How does this

179
00:09:22,240 --> 00:09:24,560
Speaker 1: play out in the real world every day?

180
00:09:24,800 --> 00:09:28,439
Speaker 2: Well, there's definitely that marketing angle of misleading consumers and

181
00:09:28,760 --> 00:09:31,360
Speaker 2: you and I signing up to an AI platform because

182
00:09:31,360 --> 00:09:33,880
Speaker 2: we think it's ninety six percent going to be great,

183
00:09:33,920 --> 00:09:36,320
Speaker 2: and in fact it might be eighty one percent great,

184
00:09:36,360 --> 00:09:39,640
Speaker 2: which is still an incredible feat of technology. But then

185
00:09:39,640 --> 00:09:43,359
Speaker 2: from a government perspective, governments are looking at these benchmarks

186
00:09:43,600 --> 00:09:47,640
Speaker 2: for the way that they're thinking about regulation or policy decisions.

187
00:09:48,200 --> 00:09:52,120
Speaker 2: So the European Union's AI Act uses benchmarks to

188
00:09:52,320 --> 00:09:56,920
Speaker 2: determine whether new AI models pose systemic risk. Can they

189
00:09:56,920 --> 00:09:59,640
Speaker 2: be used by extremists? Can they be used to spread

190
00:09:59,760 --> 00:10:03,679
Speaker 2: hate online? Can they be used to mislead and deliberately

191
00:10:03,720 --> 00:10:08,319
Speaker 2: spread misinformation? And if companies are manipulating those scores, it

192
00:10:08,360 --> 00:10:12,079
Speaker 2: could affect how these powerful technologies are indeed regulated.

193
00:10:12,160 --> 00:10:15,960
Speaker 1: Okay, because if these scores say that eighty percent of

194
00:10:16,000 --> 00:10:18,880
Speaker 1: the content is factual, or that there are these really

195
00:10:18,920 --> 00:10:22,480
Speaker 1: great systems in place to catch mis- and disinformation or

196
00:10:22,520 --> 00:10:25,600
Speaker 1: hate speech, then that might not concern leaders to the

197
00:10:25,600 --> 00:10:28,120
Speaker 1: point where they think there needs to be certain levels

198
00:10:28,120 --> 00:10:29,080
Speaker 1: of regulation.

199
00:10:28,840 --> 00:10:29,559
Speaker 2: One hundred percent.

200
00:10:30,000 --> 00:10:34,480
Speaker 1: You mentioned this idea of artificial general intelligence earlier. We

201
00:10:34,640 --> 00:10:37,760
Speaker 1: used the I, Robot example, one of Will Smith's best.

203
00:10:41,720 --> 00:10:46,080
Speaker 1: OpenAI is claiming that GPT-5 is a step

204
00:10:46,120 --> 00:10:48,560
Speaker 1: forward in AGI. But what does that actually mean in

205
00:10:48,960 --> 00:10:53,319
Speaker 1: a world that isn't a Hollywood fantasy?

205
00:10:48,960 --> 00:10:53,319
Speaker 2: Well, I gave the example before of outperforming humans. That's

206
00:10:53,400 --> 00:10:56,080
Speaker 2: a very broad definition, and the problem is that I

207
00:10:56,080 --> 00:10:58,640
Speaker 2: can't really give you a more specific definition because even

208
00:10:58,720 --> 00:11:01,560
Speaker 2: OpenAI can't really do that, right? One OpenAI

209
00:11:01,679 --> 00:11:04,960
Speaker 2: statement said AGI is still a weakly defined term and

210
00:11:05,040 --> 00:11:07,800
Speaker 2: means different things to different people. We don't really know

211
00:11:07,840 --> 00:11:08,600
Speaker 2: what we don't know.

212
00:11:08,800 --> 00:11:11,480
Speaker 3: So how can GPT-5 be a step forward

214
00:11:11,559 --> 00:11:11,719
Speaker 2: then,

215
00:11:11,760 --> 00:11:13,959
Speaker 3: if the company itself isn't sure?

215
00:11:14,559 --> 00:11:19,080
Speaker 2: Interesting question. It very much raises some questions about how

216
00:11:19,120 --> 00:11:21,319
Speaker 2: we'd even know when we got there. Yeah, I mean

217
00:11:21,720 --> 00:11:25,439
Speaker 2: the exciting and terrifying part of living through

218
00:11:26,240 --> 00:11:29,800
Speaker 2: rapidly emerging technology is that we're learning as we go

219
00:11:30,040 --> 00:11:33,000
Speaker 2: as a society, and that is not always pretty.

220
00:11:33,520 --> 00:11:37,199
Speaker 1: So for people listening who might be using AI kind

221
00:11:37,240 --> 00:11:41,520
Speaker 1: of casually or infrequently in their work or uni life,

222
00:11:42,000 --> 00:11:45,120
Speaker 1: maybe they're building up their understanding of the different platforms

223
00:11:45,160 --> 00:11:49,079
Speaker 1: out there. What should we make of all these competing claims?

224
00:11:49,160 --> 00:11:51,839
Speaker 1: You know, how do we make better decisions about which

225
00:11:51,920 --> 00:11:55,319
Speaker 1: AI model is actually the good one, or the right one,

226
00:11:55,400 --> 00:11:56,480
Speaker 1: or the best one for us?

227
00:11:56,559 --> 00:11:59,720
Speaker 2: I'm constantly asked as somebody who is known now in

228
00:11:59,720 --> 00:12:02,160
Speaker 2: my friend group and in the workplace as somebody who's

229
00:12:02,400 --> 00:12:05,320
Speaker 2: really interested in AI. I'm constantly asked which one should

230
00:12:05,360 --> 00:12:07,720
Speaker 2: I use, what's the best one? And the answer is,

231
00:12:07,880 --> 00:12:10,400
Speaker 2: it's about what you're trying to do, essentially. So one

232
00:12:10,440 --> 00:12:13,079
Speaker 2: model might be better for creative writing, but another might

233
00:12:13,120 --> 00:12:16,840
Speaker 2: excel more at data analysis and crunching some numbers. Studies

234
00:12:16,880 --> 00:12:19,920
Speaker 2: are showing though, that AI models often fail when you

235
00:12:20,000 --> 00:12:23,480
Speaker 2: move from those controlled test conditions or those use cases

236
00:12:23,600 --> 00:12:26,840
Speaker 2: or features that are rolled out by these platforms as

237
00:12:26,840 --> 00:12:30,200
Speaker 2: part of marketing campaigns to the messy, real-world tasks

238
00:12:30,320 --> 00:12:32,640
Speaker 2: that humans actually use these tools for.

239
00:12:32,840 --> 00:12:35,360
Speaker 1: It actually reminds me of and I'm not even sure

240
00:12:35,400 --> 00:12:38,720
Speaker 1: if this is the same thing, but when Siri was

241
00:12:38,720 --> 00:12:41,840
Speaker 1: first rolled out and Apple kind of in their big

242
00:12:41,880 --> 00:12:44,000
Speaker 1: announcements it's like, you can ask her this, or you

243
00:12:44,000 --> 00:12:46,079
Speaker 1: can ask her that, or if you want to know what

244
00:12:46,040 --> 00:12:47,840
Speaker 3: the weather's like, should you take an umbrella?

245
00:12:48,280 --> 00:12:51,080
Speaker 1: And I found when I first started using Siri, like, yeah,

246
00:12:51,120 --> 00:12:53,760
Speaker 1: you could definitely answer those sorts of questions, but not

247
00:12:53,840 --> 00:12:56,120
Speaker 1: a whole lot else outside of the almost like a

248
00:12:56,160 --> 00:12:59,480
Speaker 1: prescribed text from Apple about how to use Siri.

249
00:13:00,120 --> 00:13:03,360
Speaker 2: When you get into the world of trying to engage

250
00:13:03,360 --> 00:13:05,520
Speaker 2: with the user no matter what they're about to say.

251
00:13:06,000 --> 00:13:07,480
Speaker 2: It can take a little bit of time for the

252
00:13:07,559 --> 00:13:11,440
Speaker 2: technology to be refined and to keep learning from what

253
00:13:11,520 --> 00:13:12,560
Speaker 2: users actually want.

254
00:13:12,760 --> 00:13:15,600
Speaker 3: So, Sam, what is the way forward in all of this?

255
00:13:16,280 --> 00:13:20,360
Speaker 1: Is there a conversation happening at a more global scale

256
00:13:20,520 --> 00:13:21,880
Speaker 1: about this regulation?

257
00:13:22,360 --> 00:13:25,600
Speaker 2: Definitely, and there's no clear leader here. I mentioned the

258
00:13:25,880 --> 00:13:28,760
Speaker 2: work being done by the European Union before. There's a

259
00:13:28,800 --> 00:13:32,600
Speaker 2: coalition of countries including Australia that signed on to kind

260
00:13:32,640 --> 00:13:35,600
Speaker 2: of key principles of how to keep AI safe. That

261
00:13:35,720 --> 00:13:38,280
Speaker 2: was in mid twenty twenty three, so there's a bit

262
00:13:38,280 --> 00:13:41,200
Speaker 2: of a global movement there from a government perspective. There's

263
00:13:41,200 --> 00:13:44,480
Speaker 2: some really interesting work being done out of universities, particularly

264
00:13:44,640 --> 00:13:48,920
Speaker 2: Stanford University. They developed an AI Index report which does

265
00:13:49,000 --> 00:13:52,480
Speaker 2: try to compare the models like for like. But I

266
00:13:52,520 --> 00:13:54,920
Speaker 2: think we first need to determine who the authority is

267
00:13:54,960 --> 00:13:57,280
Speaker 2: going to be in this space before we can kind

268
00:13:57,280 --> 00:13:59,959
Speaker 2: of put the burden on them to roll out this

269
00:14:00,000 --> 00:14:03,280
Speaker 2: standardized testing. It will take a while, maybe a few

270
00:14:03,320 --> 00:14:05,160
Speaker 2: decades, but I do think we'll get there.

271
00:14:05,360 --> 00:14:08,320
Speaker 2: I mean, we have the TGA to regulate medicine. We

272
00:14:08,480 --> 00:14:11,840
Speaker 2: have a central aviation authority to regulate what a plane

273
00:14:11,880 --> 00:14:15,079
Speaker 2: that's airworthy looks like. Yep. I do think that we're

274
00:14:15,120 --> 00:14:17,640
Speaker 2: going to see a central AI authority in Australia and

275
00:14:17,679 --> 00:14:20,800
Speaker 2: maybe around the world someday. But we are very early

276
00:14:20,840 --> 00:14:23,840
Speaker 2: in this story. We're like one percent through in the

277
00:14:23,960 --> 00:14:27,840
Speaker 2: AI story if that, and that's really exciting. But it's

278
00:14:27,920 --> 00:14:31,920
Speaker 2: also really important to continuously discuss the potential flaws and

279
00:14:32,240 --> 00:14:35,040
Speaker 2: the gaps that exist in this big, new, scary world.

280
00:14:35,360 --> 00:14:38,800
Speaker 1: Yeah, I think that healthy dose of skepticism is what

281
00:14:38,840 --> 00:14:41,040
Speaker 1: we will be carrying forward. But I look forward to

282
00:14:41,120 --> 00:14:43,360
Speaker 1: many more conversations like this with you, Sam.

283
00:14:43,440 --> 00:14:44,440
Speaker 2: Well, we don't have a choice.

284
00:14:44,520 --> 00:14:46,000
Speaker 3: Help me understand it all.

285
00:14:46,200 --> 00:14:49,000
Speaker 1: Thank you so much for breaking that down for us, Sam,

286
00:14:49,480 --> 00:14:52,200
Speaker 1: and thank you for listening to today's Deep Dive. We'll

287
00:14:52,200 --> 00:14:54,840
Speaker 1: be back a little later on with your news headlines,

288
00:14:54,880 --> 00:15:00,160
Speaker 1: but until then, have a great day.

289
00:15:00,880 --> 00:15:03,240
Speaker 2: My name is Lily Madden and I'm a proud Arrernte

290
00:15:03,440 --> 00:15:08,240
Speaker 2: Bundjalung Kalkadoon woman from Gadigal Country. The Daily Aus acknowledges

291
00:15:08,320 --> 00:15:10,440
Speaker 2: that this podcast is recorded on the lands of the

292
00:15:10,480 --> 00:15:14,080
Speaker 2: Gadigal people and pays respect to all Aboriginal and Torres

293
00:15:14,160 --> 00:15:17,000
Speaker 2: Strait Island nations. We pay our respects to the

294
00:15:17,000 --> 00:15:19,800
Speaker 2: first peoples of these countries, both past and present.