Speaker 1: Welcome to TechStuff, a production from iHeartRadio.

Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeart Podcasts. And how the tech are you?

So, imagine for a moment that you are in school. Some of y'all might actually be in school, but others, like me, we have to satisfy ourselves by having that occasional stress dream where we imagine that we're in school and it's time to take a final and we haven't gone to class all year, and also we can't remember our locker combination. I don't know about you, but I still occasionally get those dreams, and I'm almost fifty years old at this point. Anyway, you're in school, you're in English class, and you've been given the dreaded term paper assignment. You're told you need to go to the library, and you have to gather resources and read up and form your thesis and write your paper while making verifiable citations all the way through.

So off you go to the library. However, you discover, horror of horrors, that all the resource books have disappeared. There are none. In their place are other students' term papers. Now, some of those term papers are pretty good, some of them are terrible. Nearly all of them do have a list of references at the end, but the problem is that you don't have access to those references. You only have access to the term papers, which, in a way, you could say are a filtered view of those references. But you have no way of knowing if the student who wrote the term papers you've pulled out did a proper citation. You don't know if the student understood the source material. You don't know if they made a valid reference using that source. You don't know if the student didn't understand the source and thus misconstrued the information, either accidentally or on purpose, or if the student was just outright plagiarizing the source material or making stuff up. So how do you think your own term paper would turn out?
Probably it'd be a challenge to write a good term paper. It definitely would be difficult or almost impossible to support your thesis using citations, because all you would have access to would be other term papers. Chances are you'd have a pretty lousy grade by the end of that assignment.

Now, I started off this episode with that analogy because today we're going to talk about what happens when AI models train on stuff that was generated by other AI models, or sometimes even by earlier versions of the very same model. So when bots make stuff that other bots consume, and then those other bots make new stuff and the cycle goes on, where are the humans in this picture? Maybe they're in an actual library, because the online resources will all have become practically useless, so if we want to actually learn anything, we're gonna need to go back to the basics. So we're going to talk about an idea called model collapse, as in large language models (LLMs) and other types of AI models. We're going to build to that.

However, first up, let's explore the tendency of AI models to produce wrong or misleading results, regardless of whether the material used to train that AI model came from AI or humans. This is something I've talked about in past episodes, but it's an important part to kind of build toward our understanding of what model collapse is.

Now, in past episodes, I've talked about the issue of AI hallucinations, also sometimes called confabulations. Some people prefer confabulations to hallucinations. This is the tendency for generative AI to mistakenly include untrue or misleading information, or to insert stuff that does not belong into whatever it is it's creating, whether that's an image or text or whatnot. One fairly recent example of this was when Google's AI-augmented search tool suggested that you add a non-toxic glue to your pizza ingredients if you want to solve the irritating issue of cheese slip-sliding away off your ding dang dern pizza. Clearly this answer is not acceptable.
Adding glue, non-toxic or otherwise, is not a way of making good eats. I'm pretty sure Alton Brown would agree with me. And actually, I would argue this is one of the less egregious cases of AI providing a bad answer. It's famous because it got a lot of traction; it went viral for how bad the answer was. But in the grand scheme of things, there are other examples that were far more potentially harmful.

So why does AI do this sometimes? Well, there are a few different contributing factors that lead AI to making these mistakes. By the way, the reason why some people prefer confabulations as opposed to hallucinations: hallucination sounds like the AI has somehow been tricked into thinking something is what it isn't, right? Like the idea that when you hallucinate, you're seeing or hearing or experiencing something that's not really there. Confabulation suggests that the AI is inventing something. It is confabulating, it is creating an answer where there was none, and so some people prefer the second one because it puts more of the onus on the AI model itself.

So, one of the factors that contributes to AI making mistakes: large language models and the like are in part focused on pattern recognition, and this can lead to issues. Now, recognizing patterns is what gives these models the ability to form relevant and coherent responses to queries, and obviously pattern recognition is important, otherwise you're just going to perceive everything as being random and meaningless, and then, really, this whole conversation doesn't mean anything either, or if the whole universe is meaningless, then what are we even doing here? But I don't want you to go down that path of existential dread. So sometimes AI will detect a pattern where there really isn't a pattern. And we humans do this too, you know, we sometimes experience pareidolia, for example. That's when we perceive something meaningful within an otherwise meaningless thing, like we see a pattern where there is none.
So if you were to look at the clouds and you say that one of them looks very like a whale, that's pareidolia (it's also a reference to Hamlet). The infamous face on Mars, which was really just a hill with some shadows cast on it because of the angle of the image, that was another example of pareidolia. People began to think that there was actually a big sculpted face on Mars. There's not. It's a hill. The shadows hit the hill in a specific way that made it look kind of like the face of an enormous statue, something like the Sphinx, something along those lines. But in fact it was just a hill, and if you took another image from a different angle, which people have done, the illusion of a face disappears. So again, that was us inventing a pattern where there was none.

Now, much of the time we humans can recognize when the things we see, you know, the shapes of faces or whatever it may be, aren't actually there. Right? We can recognize, oh, that looks like a blah blah blah, but we know it's not actually a real image of that, it just happens to look that way. Now, sometimes we don't recognize this. Sometimes there are times where people will assume that what they're seeing is an actual image made with intent and intelligence, perhaps not by humans but by something. So there are all those stories of people going bonkers because they believe they saw an image of, like, the Virgin Mary in a potato chip or whatever.

And machines don't necessarily have any checks against false hits when it comes to pattern recognition, and then they might act on a perceived pattern, which means the machines produce bad results. What's more, machines can see patterns where we can't. Like, sometimes there are patterns present that we cannot perceive because maybe the data set is far too large or far too complicated, and so we can't perceive where the pattern is. It's just beyond our abilities to do so. But sometimes machines can detect those patterns, and sometimes they are meaningful.
So it can be really tricky. If a machine thinks it's found a pattern, it can be hard for people to verify or discredit that, because it's on a scale that we humans are not really well equipped to handle. With generative AI, this can mean that the AI model correctly identifies that it needs to use a specific syntax to craft a response to whatever query or direction it was given, and it can thus put together a sentence that grammatically makes sense. What's happening is it's essentially statistically analyzing the structure of hundreds of millions of sentences, as well as the role that certain words play within those sentences, so that it quote unquote knows how to write a grammatically correct response, and ultimately it's using statistics to pick what should be the most correct word in each position of that sentence.

So ideally, it's pulling information from various sources that are related to whatever it is you're asking about and pulling the words together in a way that makes logical sense and is accurate, and it's a correct answer to whatever your question is. But that doesn't always happen, right? Sometimes it can't find the right word. Sometimes it finds a different word that it thinks is right, but it's not. And the real problem is it will present this to you authoritatively, as if the AI is absolutely certain this is the right answer, when in fact it's wrong and the AI has no way of knowing it's wrong. It's not purposefully trying to mislead you, at least not necessarily. Maybe it was given direction to try and do that, but that's another matter. It's just trying to complete its task and failing to do so accurately. Sometimes the word or a series of words can be wrong. Therefore, grammatically it could be correct, but factually it could be completely made up. And as for why this all happens, it does get really complicated. It's not necessarily due to just one specific flaw.
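To make that "picking the most statistically likely next word" idea a little more concrete, here's a tiny toy sketch in Python. It is not how a production large language model works internally (real models use neural networks over subword tokens trained on enormous corpora), but it illustrates the basic intuition: the model only knows what tends to follow what, and it will happily produce a fluent-sounding continuation whether or not that continuation is true. The tiny corpus and prompt are made up for illustration.

```python
import random
from collections import Counter, defaultdict

# A made-up corpus; a real model would see billions of sentences.
corpus = [
    "the cheese slides off the pizza",
    "the sauce keeps the cheese on the pizza",
    "the glue holds the poster on the wall",
    "the cheese melts on the pizza",
]

# Count how often each word follows each other word (a simple bigram model).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1

def continue_text(first_word, length=8, seed=0):
    """Pick each next word in proportion to how often it followed the last word."""
    random.seed(seed)
    out = [first_word]
    for _ in range(length):
        options = following.get(out[-1])
        if not options:
            break  # no statistics at all for this word, so stop
        choices, counts = zip(*options.items())
        out.append(random.choices(choices, weights=counts, k=1)[0])
    return " ".join(out)

print(continue_text("the"))
# Whatever comes out reads as plausible English, because every step is
# statistically likely. Nothing in the process checks whether it is true,
# and there is no step where the model says "I don't know."
```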
It's not always the case that, oh, that data point didn't appear in the data set for some reason, and so the computer made something up. There are other issues that could also be at play. So, for example, one possible reason for hallucinations is something that's called overfitting. IBM defines this as what happens quote "when an algorithm fits too closely or even exactly to its training data, resulting in a model that can't make accurate predictions or conclusions from any data other than the training data" end quote. That's from a piece on IBM.com titled "What is overfitting?"

Sometimes models get so complex, or they're trained so closely on a specific data set, that they start to pick up more noise than signal. They give significance to insignificant things. I think of this kind of like the character Drax in the Guardians of the Galaxy movies. Drax takes things literally, so if you use a saying or an idiom on him, he's likely to interpret the literal words as being what you mean. So if you say, oh, that's like throwing the baby out with the bathwater, he would assume you're talking about something you have literally done before in your life, that you have literally thrown out a baby with bathwater, and he would not understand you were using an analogy to describe getting rid of important stuff along with the unimportant stuff you want to get rid of.

If a model has been overfitted, if it's been trained too much on a relatively narrow set of data, it might have trouble taking what it has learned and generalizing those learnings toward something else that's outside the data set. And rather than saying, I'm sorry, I don't know the answer to that, it could produce an answer that follows the statistical rules that the model is set to. In other words, it'll create something that grammatically makes sense, but it won't necessarily be relevant or, you know, make sense in terms of theme or relevance.
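Here's a minimal numeric sketch of overfitting in the sense of that IBM definition: a flexible model that matches its small training set almost exactly ends up doing worse on new data than a simpler one. The toy signal, the noise level, and the polynomial degrees are arbitrary choices made just for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Noisy samples of a simple underlying signal."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)  # true pattern plus noise
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(500)     # fresh data the model never saw

for degree in (3, 12):  # a modest model vs. one flexible enough to memorize noise
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train error {train_mse:.3f}, test error {test_mse:.3f}")

# Typical result: the degree-12 fit hugs the training points (tiny train error)
# but does worse on fresh data than the degree-3 fit. It has learned the noise
# in its narrow training set instead of the pattern that generalizes.
```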
So in this way, an AI model can become like that stereotypical person in the car who absolutely refuses to pull over and ask for directions when they get lost, because that would be showing weakness. No, gosh darn it, we will somehow reason our way out of taking that wrong turn forty-five minutes ago. That'll fix everything. Except it doesn't fix everything, and it can make things worse.

But it's not just pattern recognition that can trip up AI models. Another issue is bias. I've talked about bias in other episodes, but it's really important that we understand what we mean when we're talking bias and how it can happen, because I think a lot of people get tripped up. They think, it's a machine, right? It doesn't possess opinions. How can it have bias? Well, we'll explore that in just a couple of moments, but first let's take a quick break to thank our sponsors.

How can an AI model have bias? Well, the answer is that the machines that AI runs on, the algorithms that AI is built upon, all this stuff, it didn't just pop out of nowhere. Ultimately, this stuff was designed, built, and programmed by human beings. Even if you had a piece of software that was designed by AI, well, the AI that designed it in turn had been designed by humans, at least somewhere down the line once you trace it back far enough. So human beings absolutely do have biases, and those biases can make their way into the routines and processes of machines.

MIT has a great introduction to AI hallucinations and bias on a web page that has the fitting title "When AI Gets It Wrong: Addressing AI Hallucinations and Bias." Now, in that article, the author points out that AI has had issues with bias for years and uses the example of image analysis. The author cites a project called Gender Shades. This was led by Joy Adowaa Buolamwini, and I apologize for my pronunciation of the name.
But the project examined how an AI-powered gender classification tool performed when presented with subjects of varying genders, ethnicities, and skin tones from the IARPA Janus Benchmark A data set, or IJB-A. This is a database of facial images of lots of different people, taken from various angles and lighting conditions. It's used as a government benchmark for testing stuff like facial recognition technologies. Now, the project also used a gender classification benchmark from Adience, and this was in part to try and address shortcomings with the IJB-A benchmark set. Plus, due to the limitations of both of these data sets, which I'll talk about in just a moment, the project also outlines a process to create a better data set for the purposes of training technologies like facial recognition and gender classification.

The project aimed to test several gender classifier programs from companies including Microsoft and IBM, among others, all with regard to quote "gender, skin type, and the intersection of skin type and gender" end quote. So Joy found that the IJB-A data set skewed male and toward lighter skin tones. Skewed heavily, in fact. She said between 79.6 percent and 86.24 percent of all the images in the database were of people with lighter skin tones, and fewer than 25 percent of all the images were of women or female-presenting people. Worse yet, only 4.4 percent of all the images were of female-presenting people who had dark skin. Adience's data set had a better distribution of photos, at least between genders. Female-presenting people made up 52 percent of the images in Adience's data set, but again, lighter skin tones made up the majority of these images. Less than 15 percent of all the images in that data set contained people of darker skin tones. So I'm sure you can already see where this is going.
If you train an AI model on data that has a disproportionate emphasis on certain factors, such as certain genders or certain skin tones, then you would expect the AI to be better at handling cases that fall into those categories, right? Like, if most of the data you've fed to your AI model is of men who have a lighter skin tone, then when you are serving the AI model a picture of someone who's male-presenting and has a lighter skin tone, chances are the tool's going to work better. If you are instead feeding it images of people who fall outside those majority cases, the AI tool is probably not going to work as well with them. And that's exactly what Joy found in her research. She discovered that gender classification tools from all of the providers performed better with lighter-skinned men than with any other group. They performed the worst with darker-skinned women. Thus we have a bias in the system. The data that folks used to train these systems had that bias, and it unsurprisingly affects how the AI does its job.

Now, this isn't just a curiosity for research labs, of course. Around the world, various organizations and companies are making use of facial recognition tools and gender classification tools. There are numerous stories of law enforcement agencies getting into hot water for relying on this kind of technology. So we know that this technology isn't reliable, particularly if someone belongs to a group that's outside of lighter-skinned men, and the data being used to train these tools is limited. That's why we're having these issues, or it's one of the main reasons why we're having these issues. So it stands to reason we should not employ those tools for anything really at all, other than maybe working to make them better. But we definitely shouldn't be using them for things like law enforcement, for example. At least we should not use them until we can address the problem of bias. Generative AI can actually have similar issues with bias.
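Here's a small toy simulation of why that happens. It is not the Gender Shades methodology, and the groups, features, and numbers are all invented; the point is just that a very simple classifier trained on a lopsided data set ends up noticeably less accurate on the underrepresented group.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_faces(group, n):
    """Fake 'face features': each (group, label) pair has its own cluster."""
    centers = {
        ("A", 0): (0.0, 0.0), ("A", 1): (2.0, 2.0),   # majority group
        ("B", 0): (0.5, 3.0), ("B", 1): (3.0, 0.5),   # minority group
    }
    X, y = [], []
    for label in (0, 1):
        X.append(rng.normal(centers[(group, label)], 0.8, size=(n // 2, 2)))
        y += [label] * (n // 2)
    return np.vstack(X), np.array(y)

# Skewed training data: 1000 examples from group A, only 40 from group B.
Xa, ya = sample_faces("A", 1000)
Xb, yb = sample_faces("B", 40)
X_train, y_train = np.vstack([Xa, Xb]), np.concatenate([ya, yb])

# Nearest-centroid "classifier": one centroid per class label, ignoring group.
centroids = {c: X_train[y_train == c].mean(axis=0) for c in (0, 1)}

def predict(X):
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in (0, 1)], axis=1)
    return dists.argmin(axis=1)

for group in ("A", "B"):
    X_test, y_test = sample_faces(group, 400)
    accuracy = (predict(X_test) == y_test).mean()
    print(f"group {group}: test accuracy {accuracy:.1%}")
# The centroids are dominated by group A's clusters, so accuracy on group B
# is much worse: the bias in the data becomes bias in the behavior.
```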
The MIT article that I mentioned earlier in this episode cites another article, by Leonardo Nicoletti and Dina Bass, titled "Humans Are Biased. Generative AI Is Even Worse." This piece appeared in Bloomberg. So this article explores how a generative AI platform called Stable Diffusion had a tendency to make assumptions based on racial and gender stereotypes, thus repeating and even amplifying those stereotypes. Nicoletti and Bass performed an informal test with Stable Diffusion, a pretty thorough one, but still informal. They asked Stable Diffusion to generate images of people who were working one of fourteen different jobs. Now, half of those jobs belonged to what they called high-paying positions, things that you would typically associate with a high-paying job. The other half were typically low-paying jobs, well, actually a little less than half of them were low-paying jobs, because three of them fell into the category of crime, so, like, you know, thief or something like that.

The two had Stable Diffusion generate more than five thousand images in total so that they could really compare. They didn't want to just create, you know, a single image each, that's a terrible test. They wanted to see, all right, is this something that's actually appearing over and over again when we make use of this tool, or is it possible that, you know, you run fourteen tests and it just happens to go along with racial stereotypes? Nope. They classified the generated images based off of the Fitzpatrick skin scale. This is a skin pigmentation metric that's used by dermatologists as well as other researchers, and the scale goes from one to six, so one would be very light-skinned and six would be very dark-skinned. The researchers found that Stable Diffusion was far more likely to create a person with a lighter skin tone for positions that traditionally fall into the higher-paid categories, and that it was more likely to generate someone with a darker skin tone for lower-paid or criminal categories.
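For a sense of what that kind of audit looks like mechanically, here's a minimal sketch: each generated image gets a Fitzpatrick score (1 lightest, 6 darkest) and a perceived gender, and then you compare the distributions across prompt categories. The handful of records below are placeholders I made up, not the Bloomberg data.

```python
from collections import defaultdict
from statistics import mean

# Each record: (prompt_category, fitzpatrick_score, perceived_gender)
records = [
    ("high_paying", 2, "male"), ("high_paying", 1, "male"),
    ("high_paying", 2, "female"), ("high_paying", 3, "male"),
    ("low_paying", 5, "male"), ("low_paying", 4, "female"),
    ("low_paying", 6, "female"), ("low_paying", 3, "male"),
    # ... a real audit would hold thousands of classified images here
]

by_category = defaultdict(list)
for category, skin_type, gender in records:
    by_category[category].append((skin_type, gender))

for category, rows in by_category.items():
    skin_scores = [s for s, _ in rows]
    dark_share = sum(s >= 4 for s in skin_scores) / len(rows)     # Fitzpatrick IV-VI
    female_share = sum(g == "female" for _, g in rows) / len(rows)
    print(f"{category:12s} mean skin type {mean(skin_scores):.1f}  "
          f"darker-skinned {dark_share:.0%}  female-presenting {female_share:.0%}")
```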
What's more, Stable Diffusion generated images of people appearing to be men or male-presenting for most of those higher-paid positions. It was very rare for it to generate the image of a female-presenting person in the role of one of these traditionally higher-paid jobs. So the AI was perpetuating and amplifying these racial and gender stereotypes.

This actually reminds me of a classic riddle that was intended to reveal bias. I'm sure most of you have heard this before, or some variation. So the riddle typically goes something like this: a father and a son are in a terrible car accident, and the father tragically dies at the scene. The son is badly injured. EMTs arrive and they rush the boy to a surgical ward. The surgeon on duty looks at the boy and says, "I can't operate on him, he's my son." Well, how could that be true? Now, the obvious answer is the surgeon is the boy's mother. And I think a lot of people arrive at that conclusion much more easily today than they did when I was a kid. Like, when I was a kid, the sexist stereotype was that all quote unquote real doctors and surgeons were men, and women, they were nurses or administrators, right? That was the stereotype that people kind of believed in. But I'm sure most of y'all understood this answer, or you've been exposed to this riddle numerous times. I mean, it is a meme at this point. But again, back in my day, a lot of folks would likely get stumped by this, or they would say something dumb like, oh, it turns out the surgeon was the real dad and the father who died at the scene had been the adoptive father, he had adopted the boy, or something along those lines, which reveals the bias of the listener. The riddle reminds the listener to think critically and be aware of sexist stereotypes.
So AI can produce the wrong results due to bias built into the underlying model and end up making these same mistakes, right? Like, if you say surgeon, it may mistakenly just believe, ah, you meant man. It has to be a man that I generate in this image, because the user said surgeon, so that means man. That's a real problem. With enough work and attention, we can actually create training materials that minimize bias and can help reverse this trend. But even doing that is not enough to eliminate errors in generative AI. There are other problems we have to look out for.

So what happens when you have an AI model, like a large language model, for example, and part of the massive amount of material that it's training itself on includes data sets that were generated by other AI? When an AI image generator is pulling images that were made by other image generators and then training itself on that, or, you know, even if it's pulling images that an earlier version of that very same generator had created, the mistakes that exist in those AI-generated images, or, if we're not talking images, then in text or whatever, those things can become... well, you would argue, oh, those things are noise, right? Those are mistakes. But AI doesn't know that they're mistakes. It doesn't know that it's noise. If you're training it on the data, it thinks it's significant. And if it thinks it's significant, it's going to incorporate it and perhaps even dial it up quite a bit.

So a great way of illustrating this, in my opinion, is to talk about fingers. I mean, I'm sure all of you out there have experienced seeing AI-generated images that hilariously get the fingers totally wrong.
A lot of AI image generators have real problems with fingers. So you might have folks in images who wind up with way too many fingers, like seven or eight per hand, or maybe they have not enough fingers, or maybe all their fingers are thumbs, or maybe they bend in unnatural ways, or they all look like long strands of spaghetti. These are clearly mistakes. You know, image generators have identified that fingers are appendages, and these appendages attach to hands, but the machines don't really follow the rules when it comes to portraying those fingers. They do the best they can, and sometimes the best they can is hilariously bad.

But if image generator models train on material that was created by AI, those weird fingers are seen as a feature, not a bug. Like, the AI model doesn't know, oh, fingers don't actually look like that, that's wrong. It just says, ah, this is how fingers sometimes look based upon these images I've been trained on, which means the next generation of image generators will stress these features more instead of correcting for them, which means you're going to get some really weird images as a result. And this process can repeat itself, and it gets worse and worse each time. It's like making a copy of a copy of a copy. You eventually reach a point where the copy you have produced is illegible, or doesn't look enough like the original at all for you to even easily say, oh, this is a copy of that. That can be a real problem. And of course this is just one example; the fingers in AI, that's an easy mark to hit, right? But there are countless other examples.

In a paper titled "The Curse of Recursion: Training on Generated Data Makes Models Forget," a group of researchers from the University of Cambridge, Oxford University, Imperial College London, the University of Edinburgh, and the University of Toronto present an argument for a pretty bleak future if AI researchers don't take the proper measures to head it off.
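Here's a tiny toy simulation of that copy-of-a-copy dynamic, in the spirit of the paper's argument (it is not the paper's actual experiment). Each "generation" fits a simple distribution to data, and the next generation trains only on samples drawn from that fitted model rather than on real data. To mimic the way generative models tend to under-represent rare cases, each generation also drops the farthest outliers, which is an assumption added purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

true_mean, true_std = 0.0, 1.0            # the "real world" data distribution
n_samples = 200

data = rng.normal(true_mean, true_std, n_samples)   # generation 0 sees real data

for generation in range(12):
    # "Train": estimate the distribution from whatever data this generation saw.
    mean, std = data.mean(), data.std()
    print(f"gen {generation:2d}: estimated mean {mean:+.2f}, std {std:.2f}")
    # "Publish": the next generation's training data comes from this model's
    # output, not from the real world. Dropping the tails stands in for the
    # model under-representing rare events (an illustrative assumption).
    synthetic = rng.normal(mean, std, n_samples)
    data = synthetic[np.abs(synthetic - mean) < 2 * std]

# Typical run: the estimated spread shrinks generation after generation and the
# mean drifts away from the true value. Small errors become the next
# generation's ground truth, which is the flavor of model collapse.
```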
We're going to talk more about that in just a moment, but first let's take another quick break to thank our sponsors.

Okay, before the break, I mentioned this paper, "The Curse of Recursion: Training on Generated Data Makes Models Forget." It's a great article. It does get very technical at one point, but the researchers did a great job explaining the top-level problem and the potential outcome of that problem in a way that I think anyone could find accessible. When you get to the actual analysis part, that's when it gets really technical, but the summary, the conclusions, all of that, I think is easy to understand. So in that paper, the researchers say, quote, "we discover that learning from data produced by other models causes model collapse, a degenerative process whereby, over time, models forget the true underlying data distribution" end quote. So essentially, these AI models will quote unquote forget information, while simultaneously the set of learned behaviors they have created through synthesizing all this information will begin to converge and lead to a broken model that's no longer really useful. It won't present anything that's of real value. So the researchers argue that, quote, "the use of LLMs at scale to publish content on the Internet will pollute the collection of data to train them" end quote.

That's bad news, and it's definitely going to be an issue, particularly with sites that fall into the content farm category, because it's already happening, right? There are already websites out there that have turned to AI generation to flesh out the articles that they have in their database, and these articles are of varying quality, and all of those are getting scooped up in a future AI model training session and used side by side with articles that were researched, written, and edited by human beings, and therefore, potentially at least, of higher quality. I'm not saying that all human-written and edited articles are great. They're not.
There's some bad stuff out there that human beings have written. But with those steps in place, you have the potential for really great work. With AI, you don't necessarily get that. You hope you get it, but there's no guarantee, and there aren't enough, I would say, safety valves to make sure that things don't go off the rails.

So, getting back to content farms, if you are unfamiliar with that term, well, don't worry. You've almost certainly come across a content farm at some point in the past. These are sites that just churn out an enormous amount of content, typically in an effort to tap into the sweet, sweet waters of SEO, which stands for search engine optimization. So for a lot of websites out there, the majority of traffic coming to the website comes courtesy of a search engine. And when I say a search engine, you might as well fill in the name Google there, because that's the big one. I mean, there are other search engines out there, and some of them do contribute to this too, but Google commands somewhere between eighty and, like, ninety-five percent of the search market. Exactly where it falls is a matter of debate. Like, I looked at a few different Internet analytics sites, right, and they had different percentages, but it was always above eighty percent, and some were as high as, like, ninety-two or ninety-three. So it's safe to say that Google dominates the search space. You know, technically it may not be a monopoly, but effectively it kind of is.

So sites that depend on traffic from search naturally want to find ways for their pages to rank high in search results and to appear in more search results. Now, that's actually easier said than done. Google has changed its page ranking algorithm a few different times, and some search results are dependent upon who is doing the searching. That means that you and I might each search for the exact same thing, maybe we word it the exact same way, but we'll end up getting different results.
Google says, quote, "personalization is only used in your results if it can provide more relevant and helpful information" end quote. So presumably it doesn't happen all the time. That means that in some cases you and I will get identical results, depending upon what it is we're searching for, and in other cases we will get very different search results. I do know this makes SEO a much larger challenge, because it's impossible to be all things to all people. You know, you can only do the best you can to try and show up for any given search query. It is super duper hard if you're dependent upon human writers and editors to generate all the stuff that you're shoving out in an effort to get clicks.

So, most of your traffic is coming from search, we talked about this already. You need to have lots of stuff on your site that people could be searching for so that traffic comes your way, and that way you can make money through web advertising, essentially. You could try to be reactionary, right? You could try to generate new content as things capture the public interest, but you run the danger of getting to the party too late, and that by the time you have something up, no one's talking about it anymore and you're not really seeing any real traffic from that. What if instead you could just kind of open up a fire hose of content using generative AI? Well, you just have AI write a whole bunch of articles in the style that you've established for your company, and maybe, if you're feeling a little cautious, you even employ a couple of human editors to take on the job of reading over these generated articles and correcting any mistakes that were made, and perhaps even tweaking a couple of things here and there to make it sound more human if necessary. But now you can push out way more content without having to wait on human writers to research and write everything.
Plus, AI does not complain if you assign it to write a suite of articles about gluten-free skincare products. By the way, I'm using my real-world life experience with that last example. I once got that writing assignment. It was dumb then and it's dumb now, but I guess people were searching for it, so I got an assignment to write it. Now, I would like to think that the site I was writing for, which was HowStuffWorks.com, wasn't really a content farm. I would love to think that, and I would argue that for many years when I wrote there, it did not qualify as a content farm. We did try to write in-depth, authoritative articles about all sorts of stuff, whether we were talking about technology or society or money or entertainment, whatever it might be. We applied rigor, you know, journalistic rigor, toward the research and writing and editing of those pieces.

Over time, things changed, where we started to cater more toward ad deals, where we would get this big ad deal with a company, like a, you know, cosmetics company, for example, and we would suddenly have hundreds of articles assigned in the field of cosmetics, articles that were incredibly niche, like there was no way that one was going to drive a ton of traffic. But collectively, these articles could get a lot of traffic. Not a single one, but across the board. If someone happened to be searching for this thing, they could find their way to our article, and that would be another click coming our way. It was very much a shotgun approach to writing content. I hated it. There were articles I wrote that I am not at all... it's not that I'm not proud of the work I did, I'm not proud of getting the assignment. Like, it was a joke, in my opinion. But that's what we were trying to do in order to survive.
Because, again, HowStuffWorks was one of those websites where most of the traffic came through a search engine. Someone was looking to learn how something worked and they got sent our way. People weren't, as a rule, just coming to HowStuffWorks to peruse the site. We always wanted that; our goal was to create a destination website that people would visit just to see what was new, but we never really achieved it. It's a really hard thing to do. There are people who pull it off, and it's amazing, but it's not easy to replicate. So instead we wrote tons of articles about stuff that people were searching for, and that just kind of became our M.O. at that point.

Anyway, if you're using AI to create these kinds of articles, it's going to generate a lot of stuff that's just not very good. But then, who cares? You don't necessarily care whether the material is good. If the only traffic your website is really getting comes from search engines, you just need it to show up in the search engines. Now, if the search engine is able to determine, hey, this is low-quality content, and it discourages people from visiting by pushing it further down the search results, then you're going to have a problem, and a lot of content farms ran into exactly that problem. Google downgraded content farms in its search algorithm. Other services like DuckDuckGo removed websites that were considered content farms, because the people running DuckDuckGo realized, hey, these sites aren't offering anything of real value to visitors, so why are we even serving them up? That's not really a good use of anyone's time. But if you're in a space where the jig isn't up yet, you might as well go ahead and create as much garbage as you can, because all you want is the clicks.
You don't care if people actually think the articles are of good quality or whether they're going to learn anything useful. You don't even necessarily care whether the articles are accurate. You care that people are clicking on them. So if that's your perspective, ultimately the goal is to push as much of this stuff out the door as you possibly can, generate it as fast as possible, get it online as quickly as you can, and hope it starts to rank in search so that people flood in to read about whatever it is you're writing about. But it's not just people who are going to your links, is it? There are bots crawling the web. Some of them are crawling in order to index web pages for things like search engines, but other bots are there to scrape data for training the next generation of large language models. Essentially, at this point, bots are reading articles that were written by other bots, and so when the next large language model launches, it does so on a dataset that has been polluted by bot-generated information. That means the next generation will be even worse, and so on, until eventually we arrive at a point where the Internet, this amazing invention that provides access to practically all of human knowledge, becomes absolutely infested with junk that is inaccurate and increasingly nonsensical, and we render this incredible invention useless.
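To make that downward spiral a little more concrete, here is a minimal sketch in Python of the recursive loop just described: each generation is "trained" only on output sampled from the generation before it, and the original distribution slowly gets lost. This is a toy illustration of the general idea under made-up settings, not a reproduction of any published experiment.

import random
import statistics

random.seed(42)

# Generation 0: "human-written" data, drawn from a normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]

for generation in range(15):
    # "Train" this generation's model: here that just means estimating
    # a mean and standard deviation from whatever data it was given.
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    print(f"generation {generation:2d}: mean = {mu:+.3f}, stdev = {sigma:.3f}")

    # The next generation never sees the original data. Its training set
    # is sampled entirely from the model that was just fit, so estimation
    # errors compound from one generation to the next.
    data = [random.gauss(mu, sigma) for _ in range(500)]

Run it and the printed mean and standard deviation wander away from the starting values of 0 and 1 as the generations stack up; the rough analogue for language models is that rare, tail-end knowledge tends to drop out first while errors accumulate.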
This isn't just speculation, either. We already have examples of companies turning to AI to generate articles. CNET famously did this early in the days of generative AI, and CNET rightly got roasted for it: first for not presenting the practice in a transparent way, and then for publishing articles that had outright wrong information in them as if they were vetted pieces that editors had gone through. HowStuffWorks, again, my old employer where I once got that skincare writing assignment, has done this too. They laid off their human writers, they stopped giving assignments to freelancers, and later on they laid off the entire editorial staff after the editors protested the move toward AI-generated content. This trend is happening. Not only are talented people being put out of work, which is bad enough already; these editors and writers believed in what they were doing. Yeah, sometimes the assignments stank, sometimes they were not good, but the writers and editors still believed in doing as good a job as they possibly could. Their replacements, the AI systems, are just making the Internet worse by generating unreliable, terrible content that no one actually wants to read, unless they happen to put that particular set of terms into a search engine and the search engine couldn't find anything better to serve them. Again, it's as if you needed to learn something important, but all you had access to were sloppily written articles by people who had no understanding of or passion for the subject they were writing about, with no editors to steer the writer toward a more accurate or informative piece. It gets pretty darn bleak.

Is it inevitable, though? No, it's not inevitable. This future happens if the people who are training the AI models allow it to happen. With careful stewardship, by guiding the AI models so that they don't pull training data from garbage sites and instead focus on reputable sources, it's possible to avoid these issues, at least in part. Some things, like hallucinations and confabulations, can happen anyway, but you can at least limit them.
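As a rough sketch of what that kind of stewardship could look like inside a data-collection pipeline, here is a tiny Python example that keeps a scraped page for training only if it comes from a vetted source and is not flagged as machine-generated. The domain allowlist, the page fields, and the keep_for_training helper are all hypothetical, made up purely for illustration; real curation pipelines involve far more than this.

from urllib.parse import urlparse

# Hypothetical allowlist of sources judged reputable enough to train on.
TRUSTED_DOMAINS = {"example-encyclopedia.org", "example-journal.com"}

def keep_for_training(page: dict) -> bool:
    """Decide whether a scraped page should go into the training set."""
    domain = urlparse(page["url"]).netloc.lower()
    if domain not in TRUSTED_DOMAINS:
        return False  # unknown or unvetted source
    if page.get("generator") == "ai":
        return False  # self-declared machine-generated content
    return True

scraped_pages = [
    {"url": "https://example-encyclopedia.org/entry/pizza", "generator": "human"},
    {"url": "https://content-farm.example/glue-on-pizza-tips", "generator": "ai"},
]

training_set = [page for page in scraped_pages if keep_for_training(page)]
print(len(training_set))  # keeps only the vetted, human-written page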
That's not really what we're seeing right now, though, because at the moment companies are rushing into the AI space. They are pushing hard to create large language models that dwarf the previous generation's capabilities, and to do that they have to seek out training data from all across the Internet. You have to train these AI models on tons and tons of information to make them useful; the more data you have access to, the better. Social platforms have been a popular source of that information. We know Reddit has struck deals with OpenAI, for example, so that Reddit can be crawled and its posts pulled for training. But you know what? Social platforms are also really popular with bots, not just with people. So even this approach carries the risk of AI training on other AI-generated data, which again leads to model collapse further down the road.

I might one day do a much more in-depth episode about this paper, "The Curse of Recursion: Training on Generated Data Makes Models Forget." I've given a very high-level summary of what the researchers say in it, but it might benefit us to take a much closer look at what they found and the conclusions they drew, so I may revisit this topic in the future. For now, I think it's just good to remember that AI really does have the potential to do great things. It can potentially augment our work and let us accomplish goals more quickly, efficiently, and accurately. But AI also has the potential to make things miserable, churning out content that no one other than other bots wants to see and creating a cynical cycle that ultimately could turn the Internet into a cluttered, practically useless mess. So which way are we going to go? I think my answer depends on how optimistic I'm feeling on any given day, but at the very least, I think knowing about the risks is important.

That's it for today's episode. I hope you are all well. I will try to get away from AI topics; I know I've been covering a lot of them recently, and it'd be nice to branch into other areas of tech, so I'm going to try to do that. It's just that AI stuff keeps on happening, y'all.
But I will talk to you again really soon. Tech Stuff is an iHeartRadio production. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.