Speaker 1: Welcome to TechStuff, a production from iHeartRadio.

Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeart Podcasts. And how the tech are you?

So, imagine for a moment that you are in school. Some of y'all might actually be in school, but others, like me, we have to satisfy ourselves by having that occasional stress dream where we imagine that we're in school and it's time to take a final and we haven't gone to class all year, and also we can't remember our locker combination. I don't know about you, but I still occasionally get those dreams, and I'm almost fifty years old at this point. Anyway, you're in school, you're in English class, and you've been given the dreaded term paper assignment. You're told you need to go to the library, and you have to gather resources and read up and form your thesis and write your paper while making verifiable citations all the way through.

So off you go to the library. However, you discover, horror of horrors, that all the resource books have disappeared. There are none. In their place are other students' term papers. Now, some of those term papers are pretty good, some of them are terrible. Nearly all of them do have a list of references at the end, but the problem is that you don't have access to those references. You only have access to the term papers, which, in a way, you could say are a filtered view of those references. But you have no way of knowing if the student who wrote the term papers you've pulled out did a proper citation. You don't know if the student understood the source material. You don't know if they made a valid reference using that source. You don't know if the student didn't understand the source and thus misconstrued the information, either accidentally or on purpose, or if the student was just outright plagiarizing the source material or making stuff up. So how do you think your own term paper would turn out?
Probably it'd be a challenge to write a good term paper. It definitely would be difficult or almost impossible to support your thesis using citations, because all you would have access to would be other term papers. Chances are you'd have a pretty lousy grade by the end of that assignment.

Now, I started off this episode with that analogy because today we're going to talk about what happens when AI models train on stuff that was generated by other AI models, or sometimes even by earlier versions of the very same model. So when bots make stuff that other bots consume, and then those other bots make new stuff and the cycle goes on, where are the humans in this picture? Maybe they're in an actual library, because the online resources will all have become practically useless, so if we want to actually learn anything, we're gonna need to go back to the basics. So we're going to talk about an idea called model collapse, as in large language models (LLMs) and other types of AI models. We're going to build to that.

However, first up, let's explore the tendency of AI models to produce wrong or misleading results, regardless of whether the material used to train that AI model came from AI or humans. This is something I've talked about in past episodes, but it's an important part to kind of build toward our understanding of what model collapse is.

Now, in past episodes, I've talked about the issue of AI hallucinations, also sometimes called confabulations. Some people prefer confabulations to hallucinations. This is the tendency for generative AI to mistakenly include untrue or misleading information, or to insert stuff that does not belong into whatever it is it's creating, whether that's an image or text or whatnot. One fairly recent example of this was when Google's AI-augmented search tool suggested that you add a non-toxic glue to your pizza ingredients if you want to solve the irritating issue of cheese slip-sliding away off your ding dang dern pizza. Clearly this answer is not acceptable.
Adding glue, non-toxic or otherwise, is not a way of making good eats. I'm pretty sure Alton Brown would agree with me. And actually, I would argue this is one of the less egregious cases of AI providing a bad answer. It's famous because it got a lot of traction; it went viral for how bad the answer was. But in the grand scheme of things, there are other examples that were far more potentially harmful.

So why does AI do this sometimes? Well, there are a few different contributing factors that lead AI to making these mistakes. By the way, the reason why some people prefer confabulations as opposed to hallucinations: hallucination sounds like the AI has somehow been tricked into thinking something is what it isn't, right? Like the idea that when you hallucinate, you're seeing or hearing or experiencing something that's not really there. Confabulation suggests that the AI is inventing something. It is confabulating, it is creating an answer where there was none, and so some people prefer the second one because it puts more of the onus on the AI model itself.

So, one of the factors that contributes to AI making mistakes: large language models and the like are in part focused on pattern recognition, and this can lead to issues. Now, recognizing patterns is what gives these models the ability to form relevant and coherent responses to queries, and obviously pattern recognition is important, otherwise you're just going to perceive everything as being random and meaningless, and then, really, this whole conversation doesn't mean anything either, or if the whole universe is meaningless, then what are we even doing here? But I don't want you to go down that path of existential dread. So sometimes AI will detect a pattern where there really isn't a pattern. And we humans do this too, you know, we sometimes experience pareidolia, for example. That's when we perceive something meaningful within an otherwise meaningless thing, like we see a pattern where there is none.
So if you were to look at the clouds and you say that one of them looks very like a whale, that's pareidolia (it's also a reference to Hamlet). The infamous face on Mars, which was really just a hill with some shadows cast on it because of the angle of the image, that was another example of pareidolia. People began to think that there was actually a big sculpted face on Mars. There's not. It's a hill. The shadows hit the hill in a specific way that made it look kind of like the face of an enormous statue, something like the Sphinx, something along those lines. But in fact it was just a hill, and if you took another image from a different angle, which people have done, the illusion of a face disappears. So again, that was us inventing a pattern where there was none.

Now, much of the time we humans can recognize when the things we see, you know, the shapes of faces or whatever it may be, aren't actually there. Right? We can recognize, oh, that looks like a blah blah blah, but we know it's not actually a real image of that, it just happens to look that way. Now, sometimes we don't recognize this. Sometimes there are times where people will assume that what they're seeing is an actual image made with intent and intelligence, perhaps not by humans but by something. So there are all those stories of people going bonkers because they believe they saw an image of, like, the Virgin Mary in a potato chip or whatever.

And machines don't necessarily have any checks against false hits when it comes to pattern recognition, and then they might act on a perceived pattern, which means the machines produce bad results. What's more, machines can see patterns where we can't. Like, sometimes there are patterns present that we cannot perceive because maybe the data set is far too large or far too complicated, and so we can't perceive where the pattern is. It's just beyond our abilities to do so. But sometimes machines can detect those patterns, and sometimes they are meaningful.
So it can be really tricky. If a machine thinks it's found a pattern, it can be hard for people to verify or discredit that, because it's on a scale that we humans are not really well equipped to handle. With generative AI, this can mean that the AI model correctly identifies that it needs to use a specific syntax to craft a response to whatever query or direction it was given, and it can thus put together a sentence that grammatically makes sense. What's happening is it's essentially statistically analyzing the structure of hundreds of millions of sentences, as well as the role that certain words play within those sentences, so that it quote unquote knows how to write a grammatically correct response, and ultimately it's using statistics to pick what should be the most correct word in each position of that sentence.

So ideally, it's pulling information from various sources that are related to whatever it is you're asking about and pulling the words together in a way that makes logical sense and is accurate, and it's a correct answer to whatever your question is. But that doesn't always happen, right? Sometimes it can't find the right word. Sometimes it finds a different word that it thinks is right, but it's not. And the real problem is it will present this to you authoritatively, as if the AI is absolutely certain this is the right answer, when in fact it's wrong and the AI has no way of knowing it's wrong. It's not purposefully trying to mislead you, at least not necessarily. Maybe it was given direction to try and do that, but that's another matter. It's just trying to complete its task and failing to do so accurately. Sometimes the word or a series of words can be wrong. Therefore, grammatically it could be correct, but factually it could be completely made up. And as for why this all happens, it does get really complicated. It's not necessarily due to just one specific flaw.
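To make that "picking the most statistically likely next word" idea a little more concrete, here's a tiny toy sketch in Python. It is not how a production large language model works internally (real models use neural networks over subword tokens trained on enormous corpora), but it illustrates the basic intuition: the model only knows what tends to follow what, and it will happily produce a fluent-sounding continuation whether or not that continuation is true. The tiny corpus and prompt are made up for illustration.

```python
import random
from collections import Counter, defaultdict

# A made-up corpus; a real model would see billions of sentences.
corpus = [
    "the cheese slides off the pizza",
    "the sauce keeps the cheese on the pizza",
    "the glue holds the poster on the wall",
    "the cheese melts on the pizza",
]

# Count how often each word follows each other word (a simple bigram model).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1

def continue_text(first_word, length=8, seed=0):
    """Pick each next word in proportion to how often it followed the last word."""
    random.seed(seed)
    out = [first_word]
    for _ in range(length):
        options = following.get(out[-1])
        if not options:
            break  # no statistics at all for this word, so stop
        choices, counts = zip(*options.items())
        out.append(random.choices(choices, weights=counts, k=1)[0])
    return " ".join(out)

print(continue_text("the"))
# Whatever comes out reads as plausible English, because every step is
# statistically likely. Nothing in the process checks whether it is true,
# and there is no step where the model says "I don't know."
```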
It's not always the case that, oh, that data point didn't appear in the data set for some reason, and so the computer made something up. There are other issues that could also be at play. So, for example, one possible reason for hallucinations is something that's called overfitting. IBM defines this as what happens quote "when an algorithm fits too closely or even exactly to its training data, resulting in a model that can't make accurate predictions or conclusions from any data other than the training data" end quote. That's from a piece on IBM.com titled "What is overfitting?"

Sometimes models get so complex, or they're trained so closely on a specific data set, that they start to pick up more noise than signal. They give significance to insignificant things. I think of this kind of like the character Drax in the Guardians of the Galaxy movies. Drax takes things literally, so if you use a saying or an idiom on him, he's likely to interpret the literal words as being what you mean. So if you say, oh, that's like throwing the baby out with the bathwater, he would assume you're talking about something you have literally done before in your life, that you have literally thrown out a baby with bathwater, and he would not understand you were using an analogy to describe getting rid of important stuff along with the unimportant stuff you want to get rid of.

If a model has been overfitted, if it's been trained too much on a relatively narrow set of data, it might have trouble taking what it has learned and generalizing those learnings toward something else that's outside the data set. And rather than saying, I'm sorry, I don't know the answer to that, it could produce an answer that follows the statistical rules that the model is set to. In other words, it'll create something that grammatically makes sense, but it won't necessarily be relevant or, you know, make sense in terms of theme or relevance.
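Here's a minimal numeric sketch of overfitting in the sense of that IBM definition: a flexible model that matches its small training set almost exactly ends up doing worse on new data than a simpler one. The toy signal, the noise level, and the polynomial degrees are arbitrary choices made just for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Noisy samples of a simple underlying signal."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)  # true pattern plus noise
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(500)     # fresh data the model never saw

for degree in (3, 12):  # a modest model vs. one flexible enough to memorize noise
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train error {train_mse:.3f}, test error {test_mse:.3f}")

# Typical result: the degree-12 fit hugs the training points (tiny train error)
# but does worse on fresh data than the degree-3 fit. It has learned the noise
# in its narrow training set instead of the pattern that generalizes.
```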
So in this way, an AI model can become like that stereotypical person in the car who absolutely refuses to pull over and ask for directions when they get lost, because that would be showing weakness. No, gosh darn it, we will somehow reason our way out of taking that wrong turn forty-five minutes ago. That'll fix everything. Except it doesn't fix everything, and it can make things worse.

But it's not just pattern recognition that can trip up AI models. Another issue is bias. I've talked about bias in other episodes, but it's really important that we understand what we mean when we're talking bias and how it can happen, because I think a lot of people get tripped up. They think, it's a machine, right? It doesn't possess opinions. How can it have bias? Well, we'll explore that in just a couple of moments, but first let's take a quick break to thank our sponsors.

How can an AI model have bias? Well, the answer is that the machines that AI runs on, the algorithms that AI is built upon, all this stuff, it didn't just pop out of nowhere. Ultimately, this stuff was designed, built, and programmed by human beings. Even if you had a piece of software that was designed by AI, well, the AI that designed it in turn had been designed by humans, at least somewhere down the line once you trace it back far enough. So human beings absolutely do have biases, and those biases can make their way into the routines and processes of machines.

MIT has a great introduction to AI hallucinations and bias on a web page that has the fitting title "When AI Gets It Wrong: Addressing AI Hallucinations and Bias." Now, in that article, the author points out that AI has had issues with bias for years and uses the example of image analysis. The author cites a project called Gender Shades. This was led by Joy Adowaa Buolamwini, and I apologize for my pronunciation of the name.
But the project examined how an AI-powered gender classification tool performed when presented with subjects of varying genders, ethnicities, and skin tones from the IARPA Janus Benchmark A data set, or IJB-A. This is a database of facial images of lots of different people, taken from various angles and lighting conditions. It's used as a government benchmark for testing stuff like facial recognition technologies. Now, the project also used a gender classification benchmark from Adience, and this was in part to try and address shortcomings with the IJB-A benchmark set. Plus, due to the limitations of both of these data sets, which I'll talk about in just a moment, the project also outlines a process to create a better data set for the purposes of training technologies like facial recognition and gender classification.

The project aimed to test several gender classifier programs from companies including Microsoft and IBM, among others, all with regard to quote "gender, skin type, and the intersection of skin type and gender" end quote. So Joy found that the IJB-A data set skewed male and toward lighter skin tones. Skewed heavily, in fact. She said between 79.6 percent and 86.24 percent of all the images in the database were of people with lighter skin tones, and fewer than 25 percent of all the images were of women or female-presenting people. Worse yet, only 4.4 percent of all the images were of female-presenting people who had dark skin. Adience's data set had a better distribution of photos, at least between genders. Female-presenting people made up 52 percent of the images in Adience's data set, but again, lighter skin tones made up the majority of these images. Less than 15 percent of all the images in that data set contained people of darker skin tones. So I'm sure you can already see where this is going.
If you train an AI model on data that has a disproportionate emphasis on certain factors, such as certain genders or certain skin tones, then you would expect the AI to be better at handling cases that fall into those categories, right? Like, if most of the data you've fed to your AI model is of men who have a lighter skin tone, then when you are serving the AI model a picture of someone who's male-presenting and has a lighter skin tone, chances are the tool's going to work better. If you are instead feeding it images of people who fall outside those majority cases, the AI tool is probably not going to work as well with them. And that's exactly what Joy found in her research. She discovered that gender classification tools from all of the providers performed better with lighter-skinned men than with any other group. They performed the worst with darker-skinned women. Thus we have a bias in the system. The data that folks used to train these systems had that bias, and it unsurprisingly affects how the AI does its job.

Now, this isn't just a curiosity for research labs, of course. Around the world, various organizations and companies are making use of facial recognition tools and gender classification tools. There are numerous stories of law enforcement agencies getting into hot water for relying on this kind of technology. So we know that this technology isn't reliable, particularly if someone belongs to a group that's outside of lighter-skinned men, and the data being used to train these tools is limited. That's why we're having these issues, or it's one of the main reasons why we're having these issues. So it stands to reason we should not employ those tools for anything really at all, other than maybe working to make them better. But we definitely shouldn't be using them for things like law enforcement, for example. At least we should not use them until we can address the problem of bias. Generative AI can actually have similar issues with bias.
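Here's a small toy simulation of why that happens. It is not the Gender Shades methodology, and the groups, features, and numbers are all invented; the point is just that a very simple classifier trained on a lopsided data set ends up noticeably less accurate on the underrepresented group.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_faces(group, n):
    """Fake 'face features': each (group, label) pair has its own cluster."""
    centers = {
        ("A", 0): (0.0, 0.0), ("A", 1): (2.0, 2.0),   # majority group
        ("B", 0): (0.5, 3.0), ("B", 1): (3.0, 0.5),   # minority group
    }
    X, y = [], []
    for label in (0, 1):
        X.append(rng.normal(centers[(group, label)], 0.8, size=(n // 2, 2)))
        y += [label] * (n // 2)
    return np.vstack(X), np.array(y)

# Skewed training data: 1000 examples from group A, only 40 from group B.
Xa, ya = sample_faces("A", 1000)
Xb, yb = sample_faces("B", 40)
X_train, y_train = np.vstack([Xa, Xb]), np.concatenate([ya, yb])

# Nearest-centroid "classifier": one centroid per class label, ignoring group.
centroids = {c: X_train[y_train == c].mean(axis=0) for c in (0, 1)}

def predict(X):
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in (0, 1)], axis=1)
    return dists.argmin(axis=1)

for group in ("A", "B"):
    X_test, y_test = sample_faces(group, 400)
    accuracy = (predict(X_test) == y_test).mean()
    print(f"group {group}: test accuracy {accuracy:.1%}")
# The centroids are dominated by group A's clusters, so accuracy on group B
# is much worse: the bias in the data becomes bias in the behavior.
```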
The MIT article that I mentioned earlier in this episode cites another article, by Leonardo Nicoletti and Dina Bass, titled "Humans Are Biased. Generative AI Is Even Worse." This piece appeared in Bloomberg. So this article explores how a generative AI platform called Stable Diffusion had a tendency to make assumptions based on racial and gender stereotypes, thus repeating and even amplifying those stereotypes. Nicoletti and Bass performed an informal test with Stable Diffusion, a pretty thorough one, but still informal. They asked Stable Diffusion to generate images of people who were working one of fourteen different jobs. Now, half of those jobs belonged to what they called high-paying positions, things that you would typically associate with a high-paying job. The other half were typically low-paying jobs, well, actually a little less than half of them were low-paying jobs, because three of them fell into the category of crime, so, like, you know, thief or something like that.

The two had Stable Diffusion generate more than five thousand images in total so that they could really compare. They didn't want to just create, you know, a single image each, that's a terrible test. They wanted to see, all right, is this something that's actually appearing over and over again when we make use of this tool, or is it possible that, you know, you run fourteen tests and it just happens to go along with racial stereotypes? Nope. They classified the generated images based off of the Fitzpatrick skin scale. This is a skin pigmentation metric that's used by dermatologists as well as other researchers, and the scale goes from one to six, so one would be very light-skinned and six would be very dark-skinned. The researchers found that Stable Diffusion was far more likely to create a person with a lighter skin tone for positions that traditionally fall into the higher-paid categories, and that it was more likely to generate someone with a darker skin tone for lower-paid or criminal categories.
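For a sense of what that kind of audit looks like mechanically, here's a minimal sketch: each generated image gets a Fitzpatrick score (1 lightest, 6 darkest) and a perceived gender, and then you compare the distributions across prompt categories. The handful of records below are placeholders I made up, not the Bloomberg data.

```python
from collections import defaultdict
from statistics import mean

# Each record: (prompt_category, fitzpatrick_score, perceived_gender)
records = [
    ("high_paying", 2, "male"), ("high_paying", 1, "male"),
    ("high_paying", 2, "female"), ("high_paying", 3, "male"),
    ("low_paying", 5, "male"), ("low_paying", 4, "female"),
    ("low_paying", 6, "female"), ("low_paying", 3, "male"),
    # ... a real audit would hold thousands of classified images here
]

by_category = defaultdict(list)
for category, skin_type, gender in records:
    by_category[category].append((skin_type, gender))

for category, rows in by_category.items():
    skin_scores = [s for s, _ in rows]
    dark_share = sum(s >= 4 for s in skin_scores) / len(rows)     # Fitzpatrick IV-VI
    female_share = sum(g == "female" for _, g in rows) / len(rows)
    print(f"{category:12s} mean skin type {mean(skin_scores):.1f}  "
          f"darker-skinned {dark_share:.0%}  female-presenting {female_share:.0%}")
```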
What's more, Stable Diffusion generated images of people appearing to be men or male-presenting for most of those higher-paid positions. It was very rare for it to generate the image of a female-presenting person in the role of one of these traditionally higher-paid jobs. So the AI was perpetuating and amplifying these racial and gender stereotypes.

This actually reminds me of a classic riddle that was intended to reveal bias. I'm sure most of you have heard this before, or some variation. So the riddle typically goes something like this: a father and a son are in a terrible car accident, and the father tragically dies at the scene. The son is badly injured. EMTs arrive and they rush the boy to a surgical ward. The surgeon on duty looks at the boy and says, "I can't operate on him, he's my son." Well, how could that be true? Now, the obvious answer is the surgeon is the boy's mother. And I think a lot of people arrive at that conclusion much more easily today than they did when I was a kid. Like, when I was a kid, the sexist stereotype was that all quote unquote real doctors and surgeons were men, and women, they were nurses or administrators, right? That was the stereotype that people kind of believed in. But I'm sure most of y'all understood this answer, or you've been exposed to this riddle numerous times. I mean, it is a meme at this point. But again, back in my day, a lot of folks would likely get stumped by this, or they would say something dumb like, oh, it turns out the surgeon was the real dad and the father who died at the scene had been the adoptive father, he had adopted the boy, or something along those lines, which reveals the bias of the listener. The riddle reminds the listener to think critically and be aware of sexist stereotypes.
So AI can produce the wrong results due to bias built into the underlying model and end up making these same mistakes, right? Like, if you say surgeon, it may mistakenly just believe, ah, you meant man. It has to be a man that I generate in this image, because the user said surgeon, so that means man. That's a real problem. With enough work and attention, we can actually create training materials that minimize bias and can help reverse this trend. But even doing that is not enough to eliminate errors in generative AI. There are other problems we have to look out for.

So what happens when you have an AI model, like a large language model, for example, and part of the massive amount of material that it's training itself on includes data sets that were generated by other AI? When an AI image generator is pulling images that were made by other image generators and then training itself on that, or, you know, even if it's pulling images that an earlier version of that very same generator had created, the mistakes that exist in those AI-generated images, or, if we're not talking images, then in text or whatever, those things can become... well, you would argue, oh, those things are noise, right? Those are mistakes. But AI doesn't know that they're mistakes. It doesn't know that it's noise. If you're training it on the data, it thinks it's significant. And if it thinks it's significant, it's going to incorporate it and perhaps even dial it up quite a bit.

So a great way of illustrating this, in my opinion, is to talk about fingers. I mean, I'm sure all of you out there have experienced seeing AI-generated images that hilariously get the fingers totally wrong.
A lot of AI image generators have real problems with fingers. So you might have folks in images who wind up with way too many fingers, like seven or eight per hand, or maybe they have not enough fingers, or maybe all their fingers are thumbs, or maybe they bend in unnatural ways, or they all look like long strands of spaghetti. These are clearly mistakes. You know, image generators have identified that fingers are appendages, and these appendages attach to hands, but the machines don't really follow the rules when it comes to portraying those fingers. They do the best they can, and sometimes the best they can is hilariously bad.

But if image generator models train on material that was created by AI, those weird fingers are seen as a feature, not a bug. Like, the AI model doesn't know, oh, fingers don't actually look like that, that's wrong. It just says, ah, this is how fingers sometimes look based upon these images I've been trained on, which means the next generation of image generators will stress these features more instead of correcting for them, which means you're going to get some really weird images as a result. And this process can repeat itself, and it gets worse and worse each time. It's like making a copy of a copy of a copy. You eventually reach a point where the copy you have produced is illegible, or doesn't look enough like the original at all for you to even easily say, oh, this is a copy of that. That can be a real problem. And of course this is just one example; the fingers in AI, that's an easy mark to hit, right? But there are countless other examples.

In a paper titled "The Curse of Recursion: Training on Generated Data Makes Models Forget," a group of researchers from the University of Cambridge, Oxford University, Imperial College London, the University of Edinburgh, and the University of Toronto present an argument for a pretty bleak future if AI researchers don't take the proper measures to head it off.
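Here's a tiny toy simulation of that copy-of-a-copy dynamic, in the spirit of the paper's argument (it is not the paper's actual experiment). Each "generation" fits a simple distribution to data, and the next generation trains only on samples drawn from that fitted model rather than on real data. To mimic the way generative models tend to under-represent rare cases, each generation also drops the farthest outliers, which is an assumption added purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

true_mean, true_std = 0.0, 1.0            # the "real world" data distribution
n_samples = 200

data = rng.normal(true_mean, true_std, n_samples)   # generation 0 sees real data

for generation in range(12):
    # "Train": estimate the distribution from whatever data this generation saw.
    mean, std = data.mean(), data.std()
    print(f"gen {generation:2d}: estimated mean {mean:+.2f}, std {std:.2f}")
    # "Publish": the next generation's training data comes from this model's
    # output, not from the real world. Dropping the tails stands in for the
    # model under-representing rare events (an illustrative assumption).
    synthetic = rng.normal(mean, std, n_samples)
    data = synthetic[np.abs(synthetic - mean) < 2 * std]

# Typical run: the estimated spread shrinks generation after generation and the
# mean drifts away from the true value. Small errors become the next
# generation's ground truth, which is the flavor of model collapse.
```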
We're going to talk more about that in just a moment, but first let's take another quick break to thank our sponsors.

Okay, before the break, I mentioned this paper, "The Curse of Recursion: Training on Generated Data Makes Models Forget." It's a great article. It does get very technical at one point, but the researchers did a great job explaining the top-level problem and the potential outcome of that problem in a way that I think anyone could find accessible. When you get to the actual analysis part, that's when it gets really technical, but the summary, the conclusions, all of that, I think is easy to understand. So in that paper, the researchers say, quote, "we discover that learning from data produced by other models causes model collapse, a degenerative process whereby, over time, models forget the true underlying data distribution" end quote. So essentially, these AI models will quote unquote forget information, while simultaneously the set of learned behaviors they have created through synthesizing all this information will begin to converge and lead to a broken model that's no longer really useful. It won't present anything that's of real value. So the researchers argue that, quote, "the use of LLMs at scale to publish content on the Internet will pollute the collection of data to train them" end quote.

That's bad news, and it's definitely going to be an issue, particularly with sites that fall into the content farm category, because it's already happening, right? There are already websites out there that have turned to AI generation to flesh out the articles that they have in their database, and these articles are of varying quality, and all of those are getting scooped up in a future AI model training session and used side by side with articles that were researched, written, and edited by human beings, and therefore, potentially at least, of higher quality. I'm not saying that all human-written and edited articles are great. They're not.
There's some bad stuff out there that human beings have written. But with those steps in place, you have the potential for really great work. With AI, you don't necessarily get that. You hope you get it, but there's no guarantee, and there aren't enough, I would say, safety valves to make sure that things don't go off the rails.

So, getting back to content farms, if you are unfamiliar with that term, well, don't worry. You've almost certainly come across a content farm at some point in the past. These are sites that just churn out an enormous amount of content, typically in an effort to tap into the sweet, sweet waters of SEO, which stands for search engine optimization. So for a lot of websites out there, the majority of traffic coming to the website comes courtesy of a search engine. And when I say a search engine, you might as well fill in the name Google there, because that's the big one. I mean, there are other search engines out there, and some of them do contribute to this too, but Google commands somewhere between eighty and, like, ninety-five percent of the search market. Exactly where it falls is a matter of debate. Like, I looked at a few different Internet analytics sites, right, and they had different percentages, but it was always above eighty percent, and some were as high as, like, ninety-two or ninety-three. So it's safe to say that Google dominates the search space. You know, technically it may not be a monopoly, but effectively it kind of is.

So sites that depend on traffic from search naturally want to find ways for their pages to rank high in search results and to appear in more search results. Now, that's actually easier said than done. Google has changed its page ranking algorithm a few different times, and some search results are dependent upon who is doing the searching. That means that you and I might each search for the exact same thing, maybe we word it the exact same way, but we'll end up getting different results.
Google says, quote, "personalization is only used in your results if it can provide more relevant and helpful information" end quote. So presumably it doesn't happen all the time. That means that in some cases you and I will get identical results, depending upon what it is we're searching for, and in other cases we will get very different search results. I do know this makes SEO a much larger challenge, because it's impossible to be all things to all people. You know, you can only do the best you can to try and show up for any given search query. It is super duper hard if you're dependent upon human writers and editors to generate all the stuff that you're shoving out in an effort to get clicks.

So, most of your traffic is coming from search, we talked about this already. You need to have lots of stuff on your site that people could be searching for so that traffic comes your way, and that way you can make money through web advertising, essentially. You could try to be reactionary, right? You could try to generate new content as things capture the public interest, but you run the danger of getting to the party too late, and that by the time you have something up, no one's talking about it anymore and you're not really seeing any real traffic from that. What if instead you could just kind of open up a fire hose of content using generative AI? Well, you just have AI write a whole bunch of articles in the style that you've established for your company, and maybe, if you're feeling a little cautious, you even employ a couple of human editors to take on the job of reading over these generated articles and correcting any mistakes that were made, and perhaps even tweaking a couple of things here and there to make it sound more human if necessary. But now you can push out way more content without having to wait on human writers to research and write everything.
Plus, AI does not complain if you assign it to write a suite of articles about gluten-free skincare products. By the way, I'm using my real-world life experience with that last example. I once got that writing assignment. It was dumb then and it's dumb now, but I guess people were searching for it, so I got an assignment to write it. Now, I would like to think that the site I was writing for, which was HowStuffWorks.com, wasn't really a content farm. I would love to think that, and I would argue that for many years when I wrote there, it did not qualify as a content farm. We did try to write in-depth, authoritative articles about all sorts of stuff, whether we were talking about technology or society or money or entertainment, whatever it might be. We applied rigor, you know, journalistic rigor, toward the research and writing and editing of those pieces.

Over time, things changed, where we started to cater more toward ad deals, where we would get this big ad deal with a company, like a, you know, cosmetics company, for example, and we would suddenly have hundreds of articles assigned in the field of cosmetics, articles that were incredibly niche, like there was no way that one was going to drive a ton of traffic. But collectively, these articles could get a lot of traffic. Not a single one, but across the board. If someone happened to be searching for this thing, they could find their way to our article, and that would be another click coming our way. It was very much a shotgun approach to writing content. I hated it. There were articles I wrote that I am not at all... it's not that I'm not proud of the work I did, I'm not proud of getting the assignment. Like, it was a joke, in my opinion. But that's what we were trying to do in order to survive.
Because, again, HowStuffWorks was one of those websites where most of the traffic came through a search engine. Someone was looking to learn how something worked and they got sent our way. People weren't, as a rule, just coming to HowStuffWorks to peruse the site. We always wanted that; our goal was to create a destination website that people would visit just to see what was new, but we never really achieved it. It's a really hard thing to do. There are people who pull it off, and it's amazing, but it's not easy to replicate. So instead we wrote tons of articles about stuff that people were searching for, and that just kind of became our M.O. at that point.

Anyway, if you're using AI to create these kinds of articles, it's going to generate a lot of stuff that's just not very good. But then, who cares? You don't necessarily care whether the material is good. If the only traffic your website is really getting comes from search engines, you just need it to show up in the search engines. Now, if the search engine is able to determine, hey, this is low-quality content, and it discourages people from visiting by pushing it further down the search results, then you're going to have a problem, and a lot of content farms ran into exactly that problem. Google downgraded content farms in its search algorithm. Other services like DuckDuckGo removed websites that were considered content farms, because the people running DuckDuckGo realized, hey, these sites aren't offering anything of real value to visitors, so why are we even serving them up? That's not really a good use of anyone's time. But if you're in a space where the jig isn't up yet, you might as well go ahead and create as much garbage as you can, because all you want is the clicks.
You don't care if people actually think the articles are of good quality or whether they're going to learn anything useful. You don't even necessarily care whether the articles are accurate. You care that people are clicking on them. So if that's your perspective, ultimately the goal is to push as much of this stuff out the door as you possibly can, generate it as fast as possible, get it online as quickly as you can, and hope it starts to rank in search so that people flood in to read about whatever it is you're writing about. But it's not just people who are going to your links, is it? There are bots crawling the web. Some of them are crawling in order to index web pages for things like search engines, but other bots are there to scrape data for training the next generation of large language models. Essentially, at this point, bots are reading articles that were written by other bots, and so when the next large language model launches, it does so on a dataset that has been polluted by bot-generated information. That means the next generation will be even worse, and so on, until eventually we arrive at a point where the Internet, this amazing invention that provides access to practically all of human knowledge, becomes absolutely infested with junk that is inaccurate and increasingly nonsensical, and we render this incredible invention useless.
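To make that downward spiral a little more concrete, here is a minimal sketch in Python of the recursive loop just described: each generation is "trained" only on output sampled from the generation before it, and the original distribution slowly gets lost. This is a toy illustration of the general idea under made-up settings, not a reproduction of any published experiment.

import random
import statistics

random.seed(42)

# Generation 0: "human-written" data, drawn from a normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]

for generation in range(15):
    # "Train" this generation's model: here that just means estimating
    # a mean and standard deviation from whatever data it was given.
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    print(f"generation {generation:2d}: mean = {mu:+.3f}, stdev = {sigma:.3f}")

    # The next generation never sees the original data. Its training set
    # is sampled entirely from the model that was just fit, so estimation
    # errors compound from one generation to the next.
    data = [random.gauss(mu, sigma) for _ in range(500)]

Run it and the printed mean and standard deviation wander away from the starting values of 0 and 1 as the generations stack up; the rough analogue for language models is that rare, tail-end knowledge tends to drop out first while errors accumulate.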
This isn't just speculation, either. We already have examples of companies turning to AI to generate articles. CNET famously did this early in the days of generative AI, and CNET rightly got roasted for it: first for not presenting the practice in a transparent way, and then for publishing articles that had outright wrong information in them as if they were vetted pieces that editors had gone through. HowStuffWorks, again, my old employer where I once got that skincare writing assignment, has done this too. They laid off their human writers, they stopped giving assignments to freelancers, and later on they laid off the entire editorial staff after the editors protested the move toward AI-generated content. This trend is happening. Not only are talented people being put out of work, which is bad enough already; these editors and writers believed in what they were doing. Yeah, sometimes the assignments stank, sometimes they were not good, but the writers and editors still believed in doing as good a job as they possibly could. Their replacements, the AI systems, are just making the Internet worse by generating unreliable, terrible content that no one actually wants to read, unless they happen to put that particular set of terms into a search engine and the search engine couldn't find anything better to serve them. Again, it's as if you needed to learn something important, but all you had access to were sloppily written articles by people who had no understanding of or passion for the subject they were writing about, with no editors to steer the writer toward a more accurate or informative piece. It gets pretty darn bleak.

Is it inevitable, though? No, it's not inevitable. This future happens if the people who are training the AI models allow it to happen. With careful stewardship, by guiding the AI models so that they don't pull training data from garbage sites and instead focus on reputable sources, it's possible to avoid these issues, at least in part. Some things, like hallucinations and confabulations, can happen anyway, but you can at least limit them.
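As a rough sketch of what that kind of stewardship could look like inside a data-collection pipeline, here is a tiny Python example that keeps a scraped page for training only if it comes from a vetted source and is not flagged as machine-generated. The domain allowlist, the page fields, and the keep_for_training helper are all hypothetical, made up purely for illustration; real curation pipelines involve far more than this.

from urllib.parse import urlparse

# Hypothetical allowlist of sources judged reputable enough to train on.
TRUSTED_DOMAINS = {"example-encyclopedia.org", "example-journal.com"}

def keep_for_training(page: dict) -> bool:
    """Decide whether a scraped page should go into the training set."""
    domain = urlparse(page["url"]).netloc.lower()
    if domain not in TRUSTED_DOMAINS:
        return False  # unknown or unvetted source
    if page.get("generator") == "ai":
        return False  # self-declared machine-generated content
    return True

scraped_pages = [
    {"url": "https://example-encyclopedia.org/entry/pizza", "generator": "human"},
    {"url": "https://content-farm.example/glue-on-pizza-tips", "generator": "ai"},
]

training_set = [page for page in scraped_pages if keep_for_training(page)]
print(len(training_set))  # keeps only the vetted, human-written page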
That's not really what we're seeing right now, though, because at the moment companies are rushing into the AI space. They are pushing hard to create large language models that dwarf the previous generation's capabilities, and to do that they have to seek out training data from all across the Internet. You have to train these AI models on tons and tons of information to make them useful; the more data you have access to, the better. Social platforms have been a popular source of that information. We know Reddit has struck deals with OpenAI, for example, so that Reddit can be crawled and its posts pulled for training. But you know what? Social platforms are also really popular with bots, not just with people. So even this approach carries the risk of AI training on other AI-generated data, which again leads to model collapse further down the road.

I might one day do a much more in-depth episode about this paper, "The Curse of Recursion: Training on Generated Data Makes Models Forget." I've given a very high-level summary of what the researchers say in it, but it might benefit us to take a much closer look at what they found and the conclusions they drew, so I may revisit this topic in the future. For now, I think it's just good to remember that AI really does have the potential to do great things. It can potentially augment our work and let us accomplish goals more quickly, efficiently, and accurately. But AI also has the potential to make things miserable, churning out content that no one other than other bots wants to see and creating a cynical cycle that ultimately could turn the Internet into a cluttered, practically useless mess. So which way are we going to go? I think my answer depends on how optimistic I'm feeling on any given day, but at the very least, I think knowing about the risks is important.

That's it for today's episode. I hope you are all well. I will try to get away from AI topics; I know I've been covering a lot of them recently, and it'd be nice to branch into other areas of tech, so I'm going to try to do that. It's just that AI stuff keeps on happening, y'all.
But I will talk to you again really soon. Tech Stuff is an iHeartRadio production. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.