WEBVTT - AI Model Collapse and the Dangers of AI-Generated Content

0:00:04.480 --> 0:00:12.319
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio. Hey there,

0:00:12.360 --> 0:00:15.560
<v Speaker 1>and welcome to TechStuff. I'm your host, Jonathan Strickland.

0:00:15.560 --> 0:00:18.919
<v Speaker 1>I'm an executive producer with iHeart Podcasts. And how the

0:00:18.960 --> 0:00:23.520
<v Speaker 1>tech are you? So, imagine for a moment that you

0:00:23.720 --> 0:00:27.200
<v Speaker 1>are in school. Some of y'all might actually be in school,

0:00:27.400 --> 0:00:30.600
<v Speaker 1>but others, like me, we have to satisfy ourselves by

0:00:30.640 --> 0:00:33.919
<v Speaker 1>having that occasional stress dream where we imagine that we're

0:00:33.960 --> 0:00:35.919
<v Speaker 1>in school and it's time to take a final and

0:00:35.960 --> 0:00:38.960
<v Speaker 1>we haven't gone to class all year, and also we

0:00:39.000 --> 0:00:41.479
<v Speaker 1>can't remember our locker combination. I don't know about you,

0:00:41.520 --> 0:00:44.320
<v Speaker 1>but I still occasionally get those dreams. And I'm almost

0:00:44.400 --> 0:00:47.280
<v Speaker 1>fifty years old at this point. Anyway, you're in school,

0:00:47.760 --> 0:00:50.879
<v Speaker 1>you're in English class, and you've been given the dreaded

0:00:51.120 --> 0:00:53.760
<v Speaker 1>term paper assignment. You're told you need to go to

0:00:53.800 --> 0:00:57.040
<v Speaker 1>the library and you have to gather resources and read

0:00:57.120 --> 0:01:01.080
<v Speaker 1>up and form your thesis and write your paper while

0:01:01.080 --> 0:01:06.080
<v Speaker 1>making verifiable citations all the way through. So off you

0:01:06.120 --> 0:01:09.680
<v Speaker 1>go to the library. However, you discover, horror of horrors,

0:01:09.959 --> 0:01:14.120
<v Speaker 1>that all the resource books have disappeared. They're gone. In

0:01:14.160 --> 0:01:17.280
<v Speaker 1>their place are other student term papers. Now, some of

0:01:17.280 --> 0:01:21.040
<v Speaker 1>those term papers are pretty good, some of them are terrible.

0:01:21.440 --> 0:01:23.880
<v Speaker 1>Nearly all of them do have a list of references

0:01:23.920 --> 0:01:25.839
<v Speaker 1>at the end, but the problem is that you don't

0:01:25.840 --> 0:01:29.319
<v Speaker 1>have access to those references. You only have access to

0:01:29.440 --> 0:01:32.640
<v Speaker 1>the term papers, which, in a way, you could say

0:01:32.800 --> 0:01:36.520
<v Speaker 1>is a filtered view of those references. But you have

0:01:36.600 --> 0:01:39.319
<v Speaker 1>no way of knowing if the student who wrote the

0:01:39.440 --> 0:01:42.880
<v Speaker 1>term papers you've pulled out did a proper citation. You

0:01:42.920 --> 0:01:46.319
<v Speaker 1>don't know if the student understood the source material. You

0:01:46.360 --> 0:01:49.520
<v Speaker 1>don't know if they have made a valid reference using

0:01:49.560 --> 0:01:52.400
<v Speaker 1>that source. You don't know if the student didn't understand

0:01:52.400 --> 0:01:56.200
<v Speaker 1>the source and thus misconstrued the information, either accidentally or

0:01:56.240 --> 0:01:59.600
<v Speaker 1>on purpose, or if the student is just outright plagiarizing

0:01:59.640 --> 0:02:03.280
<v Speaker 1>the source material or making stuff up. So how do

0:02:03.360 --> 0:02:06.880
<v Speaker 1>you think your own term paper would turn out? Probably

0:02:07.240 --> 0:02:10.359
<v Speaker 1>it'd be a challenge to write a good term paper.

0:02:10.360 --> 0:02:13.200
<v Speaker 1>It definitely would be difficult or almost impossible to support

0:02:13.200 --> 0:02:16.200
<v Speaker 1>your thesis using citations, because all you would have access

0:02:16.200 --> 0:02:19.880
<v Speaker 1>to would be other term papers. Chances are you'd have

0:02:19.919 --> 0:02:24.600
<v Speaker 1>a pretty lousy grade by the end of that assignment. Now,

0:02:24.680 --> 0:02:28.200
<v Speaker 1>I started off this episode with that analogy because today

0:02:28.200 --> 0:02:31.920
<v Speaker 1>we're going to talk about what happens when AI models

0:02:32.040 --> 0:02:35.600
<v Speaker 1>train off stuff that was generated by other or sometimes

0:02:35.639 --> 0:02:40.040
<v Speaker 1>even the same but earlier versions of AI models. So

0:02:40.120 --> 0:02:44.160
<v Speaker 1>when bots make stuff that other bots consume, and then

0:02:44.200 --> 0:02:47.920
<v Speaker 1>those other bots make new stuff and the cycle goes on.

0:02:48.560 --> 0:02:51.320
<v Speaker 1>Where are the humans in this picture? Maybe they're in

0:02:51.360 --> 0:02:55.720
<v Speaker 1>an actual library, because the online resources will all have

0:02:55.840 --> 0:02:59.160
<v Speaker 1>become practically useless. So if we want to actually learn anything,

0:02:59.160 --> 0:03:02.160
<v Speaker 1>we're gonna need to go back to the basics. So

0:03:02.200 --> 0:03:05.720
<v Speaker 1>we're going to talk about an idea called model collapse,

0:03:05.880 --> 0:03:10.440
<v Speaker 1>as in large language models, or LLMs, and other types of

0:03:10.480 --> 0:03:14.320
<v Speaker 1>AI models. We're going to build to that. However, first up,

0:03:14.480 --> 0:03:18.079
<v Speaker 1>let's explore the tendency of AI models to produce wrong

0:03:18.400 --> 0:03:21.880
<v Speaker 1>or misleading results, regardless of whether the material used to

0:03:21.960 --> 0:03:26.000
<v Speaker 1>train that AI model came from AI or humans. This

0:03:26.040 --> 0:03:28.440
<v Speaker 1>is something I've talked about in past episodes, but it's

0:03:28.480 --> 0:03:32.280
<v Speaker 1>an important part to kind of build toward our understanding

0:03:32.280 --> 0:03:35.680
<v Speaker 1>of what model collapse is. Now, in past episodes, I've

0:03:35.680 --> 0:03:40.640
<v Speaker 1>talked about the issue of AI hallucinations, also sometimes called confabulations.

0:03:40.760 --> 0:03:45.360
<v Speaker 1>Some people prefer confabulations to hallucinations. This is the tendency

0:03:45.640 --> 0:03:51.320
<v Speaker 1>for generative AI to mistakenly include untrue or misleading information,

0:03:51.960 --> 0:03:56.600
<v Speaker 1>or to insert stuff that does not belong into whatever

0:03:56.640 --> 0:03:59.480
<v Speaker 1>it is that it's creating, whether that's an image or text

0:03:59.600 --> 0:04:02.640
<v Speaker 1>or whatnot. So one fairly recent example of this

0:04:03.160 --> 0:04:07.400
<v Speaker 1>was when Google's AI augmented search tool suggested that you

0:04:07.480 --> 0:04:11.240
<v Speaker 1>add a non-toxic glue to your pizza ingredients if

0:04:11.280 --> 0:04:14.440
<v Speaker 1>you want to solve the irritating issue of cheese slip

0:04:14.520 --> 0:04:18.760
<v Speaker 1>sliding away off your ding dang dern pizza. Clearly this

0:04:18.800 --> 0:04:22.800
<v Speaker 1>answer is not acceptable. Adding glue, non-toxic or otherwise,

0:04:23.160 --> 0:04:25.560
<v Speaker 1>is not a way of making good eats. I'm pretty

0:04:25.560 --> 0:04:28.600
<v Speaker 1>sure Alton Brown would agree with me, and actually I

0:04:28.600 --> 0:04:31.400
<v Speaker 1>would argue this is one of the less egregious cases

0:04:31.440 --> 0:04:34.240
<v Speaker 1>of AI providing a bad answer. It's famous because it

0:04:34.320 --> 0:04:36.800
<v Speaker 1>got a lot of traction. It went viral for how

0:04:36.880 --> 0:04:39.479
<v Speaker 1>bad the answer was. But in the grand scheme of things,

0:04:39.480 --> 0:04:43.400
<v Speaker 1>there are other examples that were far more potentially harmful.

0:04:43.680 --> 0:04:47.520
<v Speaker 1>So why does AI do this sometimes? Well, there are

0:04:47.560 --> 0:04:51.479
<v Speaker 1>a few different contributing factors that lead AI to making

0:04:51.480 --> 0:04:54.040
<v Speaker 1>these mistakes. By the way, the reason why some people

0:04:54.240 --> 0:04:59.520
<v Speaker 1>prefer confabulations as opposed to hallucinations: hallucination sounds like the

0:04:59.680 --> 0:05:03.880
<v Speaker 1>AI has somehow been tricked into thinking something is

0:05:04.279 --> 0:05:07.800
<v Speaker 1>what it isn't right, like the idea that you hallucinate

0:05:07.839 --> 0:05:11.960
<v Speaker 1>you're seeing or hearing or experiencing something that's not really there.

0:05:12.400 --> 0:05:17.400
<v Speaker 1>Confabulation suggests that the AI is inventing something. It is confabulating,

0:05:17.440 --> 0:05:20.560
<v Speaker 1>it is creating an answer where there was none, and

0:05:20.640 --> 0:05:23.560
<v Speaker 1>so some people prefer the second one because

0:05:23.680 --> 0:05:26.520
<v Speaker 1>it puts more of the onus on the AI model itself.

0:05:26.920 --> 0:05:30.799
<v Speaker 1>So, one of the factors that contributes to AI making mistakes:

0:05:31.520 --> 0:05:34.800
<v Speaker 1>you know, large language models and the like are in

0:05:34.880 --> 0:05:39.720
<v Speaker 1>part focused on pattern recognition, and this can lead to issues. Now,

0:05:39.760 --> 0:05:43.680
<v Speaker 1>recognizing patterns is what gives these models the ability to

0:05:43.960 --> 0:05:48.559
<v Speaker 1>form relevant and coherent responses to queries, and obviously pattern

0:05:48.600 --> 0:05:53.240
<v Speaker 1>recognition is important; otherwise you're just gonna perceive everything as

0:05:53.279 --> 0:05:58.040
<v Speaker 1>being random and meaningless and then really, this whole conversation

0:05:58.240 --> 0:06:02.000
<v Speaker 1>doesn't mean anything either, or if the whole universe is meaningless,

0:06:02.440 --> 0:06:05.200
<v Speaker 1>then what are we even doing here? But I don't

0:06:05.200 --> 0:06:07.479
<v Speaker 1>want you to go down that path of existential dread.

0:06:07.920 --> 0:06:12.239
<v Speaker 1>So sometimes AI will detect a pattern where there really

0:06:12.360 --> 0:06:15.760
<v Speaker 1>isn't a pattern. And we humans do this too, you know,

0:06:15.800 --> 0:06:19.160
<v Speaker 1>we sometimes experience, like, pareidolia, for example. That's when we

0:06:19.200 --> 0:06:24.680
<v Speaker 1>perceive something meaningful within an otherwise meaningless thing, like we

0:06:24.760 --> 0:06:28.040
<v Speaker 1>see a pattern where there is none. So if you

0:06:28.080 --> 0:06:30.880
<v Speaker 1>were to look at the clouds and you say that

0:06:31.160 --> 0:06:34.680
<v Speaker 1>one of them looks very like a whale, that's pareidolia.

0:06:34.880 --> 0:06:39.680
<v Speaker 1>It's also a reference to Hamlet. The infamous face on Mars,

0:06:39.880 --> 0:06:42.440
<v Speaker 1>which was really just a hill with some shadows cast

0:06:42.480 --> 0:06:46.080
<v Speaker 1>on it because of the angle of the image, that was

0:06:46.080 --> 0:06:48.960
<v Speaker 1>another example of pareidolia. People began to think that there

0:06:49.040 --> 0:06:52.279
<v Speaker 1>was actually a big sculpted face on Mars. There's not.

0:06:52.880 --> 0:06:55.080
<v Speaker 1>It's a hill. The shadows hit the hill in a

0:06:55.080 --> 0:06:57.560
<v Speaker 1>specific way that made it look kind of like the

0:06:57.600 --> 0:07:01.560
<v Speaker 1>face of an enormous statue, something like the Sphinx, something

0:07:01.600 --> 0:07:04.320
<v Speaker 1>along those lines. But in fact it was just a hill.

0:07:04.600 --> 0:07:07.679
<v Speaker 1>And if you took another image from a different angle,

0:07:07.680 --> 0:07:12.000
<v Speaker 1>which people have done, the illusion of a face disappears.

0:07:12.320 --> 0:07:15.680
<v Speaker 1>So again, that was us inventing a pattern where there

0:07:15.880 --> 0:07:19.080
<v Speaker 1>was none. Now, much of the time we humans can

0:07:19.120 --> 0:07:22.320
<v Speaker 1>recognize when the things we see, you know, the shapes

0:07:22.360 --> 0:07:26.320
<v Speaker 1>of faces or whatever it may be, aren't actually there. Right,

0:07:26.360 --> 0:07:29.960
<v Speaker 1>we can recognize, oh, that looks like a blah, blah blah,

0:07:30.000 --> 0:07:33.000
<v Speaker 1>but we know it's not actually a real image of that.

0:07:33.160 --> 0:07:38.200
<v Speaker 1>It just happens. Now, sometimes we don't recognize this. Sometimes

0:07:38.200 --> 0:07:41.640
<v Speaker 1>there are times where people will assume that what they're

0:07:41.680 --> 0:07:46.680
<v Speaker 1>seeing is an actual image made with intent and intelligence,

0:07:46.760 --> 0:07:49.560
<v Speaker 1>perhaps not by humans but by something. So there are

0:07:49.560 --> 0:07:52.119
<v Speaker 1>all those stories of people going bonkers because they believe

0:07:52.120 --> 0:07:54.280
<v Speaker 1>they saw an image of like the Virgin Mary in

0:07:54.320 --> 0:07:58.120
<v Speaker 1>a potato chip or whatever. And machines don't necessarily have

0:07:58.160 --> 0:08:02.400
<v Speaker 1>any checks against false hits when it comes to pattern recognition,

0:08:02.800 --> 0:08:06.640
<v Speaker 1>and then they might act on a perceived pattern, which

0:08:06.680 --> 0:08:10.360
<v Speaker 1>means the machines produce bad results. What's more, machines can see

0:08:10.400 --> 0:08:13.720
<v Speaker 1>patterns where we can't. Like sometimes there are patterns present

0:08:13.800 --> 0:08:17.720
<v Speaker 1>that we cannot perceive because maybe the dataset is far

0:08:17.800 --> 0:08:22.400
<v Speaker 1>too large or far too complicated, and so we can't

0:08:22.440 --> 0:08:26.560
<v Speaker 1>perceive where the pattern is. It's just beyond our abilities

0:08:27.080 --> 0:08:31.360
<v Speaker 1>to do so. But sometimes machines can detect those patterns,

0:08:31.360 --> 0:08:34.720
<v Speaker 1>and sometimes they are meaningful. So it can be really tricky.

0:08:34.840 --> 0:08:37.400
<v Speaker 1>If a machine thinks it's found a pattern, it can

0:08:37.440 --> 0:08:42.079
<v Speaker 1>be hard for people to verify or discredit that because

0:08:42.400 --> 0:08:44.600
<v Speaker 1>it's on a scale that we humans are not really

0:08:44.640 --> 0:08:48.320
<v Speaker 1>well equipped to handle. With generative AI, this can mean

0:08:48.320 --> 0:08:52.240
<v Speaker 1>that the AI model correctly identifies that it needs to

0:08:52.320 --> 0:08:56.319
<v Speaker 1>use a specific syntax to craft a response to whatever

0:08:56.520 --> 0:09:01.240
<v Speaker 1>query or direction it was given, and it can thus

0:09:01.559 --> 0:09:06.600
<v Speaker 1>put together a sentence that grammatically makes sense. What's happening

0:09:06.640 --> 0:09:11.360
<v Speaker 1>is it's essentially statistically analyzing the structure of hundreds of

0:09:11.520 --> 0:09:14.760
<v Speaker 1>millions of sentences, as well as the role that certain

0:09:14.800 --> 0:09:18.120
<v Speaker 1>words play within those sentences, so that it quote unquote

0:09:18.280 --> 0:09:22.120
<v Speaker 1>knows how to write a grammatically correct response, and ultimately

0:09:22.480 --> 0:09:25.439
<v Speaker 1>it's using statistics to pick what should be the most

0:09:25.520 --> 0:09:30.000
<v Speaker 1>correct word in each position of that sentence. So ideally,

0:09:30.440 --> 0:09:34.000
<v Speaker 1>it's pulling information from various sources that are related to

0:09:34.000 --> 0:09:38.240
<v Speaker 1>whatever it is you're asking about and pulling the words

0:09:38.240 --> 0:09:43.439
<v Speaker 1>together in a way that makes logical sense and is accurate,

0:09:43.559 --> 0:09:45.880
<v Speaker 1>and it's a correct answer to whatever your question is.
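
NOTE
Editor's sketch, not part of the spoken audio. The idea of using statistics to pick a likely word for each position can be illustrated, very loosely, with a toy bigram counter. Everything here is an assumption made for illustration: the corpus and the names build_bigram_counts and most_likely_next are hypothetical, and real large language models use neural networks over tokens rather than raw word counts.
```python
from collections import Counter, defaultdict
# Hypothetical toy corpus, invented for this sketch.
CORPUS = "the cheese slides off the pizza . the cheese melts on the pizza ."
def build_bigram_counts(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts
def most_likely_next(counts, word):
    """Return the most frequent follower of word, or None if word was never seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]
counts = build_bigram_counts(CORPUS)
print(most_likely_next(counts, "off"))   # the only word that ever followed "off"
print(most_likely_next(counts, "glue"))  # never seen in the corpus
```
The counter has no notion of whether a continuation is true. It only knows which word tended to come next in the text it was given, which is roughly why a fluent sentence can still be factually wrong.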

0:09:46.000 --> 0:09:49.440
<v Speaker 1>But that doesn't always happen, right? Sometimes it can't find

0:09:49.880 --> 0:09:52.840
<v Speaker 1>the right word. Sometimes it finds a different word that

0:09:52.920 --> 0:09:56.080
<v Speaker 1>it thinks is right, but it's not. And the real

0:09:56.160 --> 0:10:00.680
<v Speaker 1>problem is it will present this to you authoritatively as

0:10:00.720 --> 0:10:04.160
<v Speaker 1>if the AI is absolutely certain this is the right answer,

0:10:04.360 --> 0:10:07.560
<v Speaker 1>when in fact it's wrong and the AI has no

0:10:07.600 --> 0:10:10.600
<v Speaker 1>way of knowing it's wrong. It's not purposefully trying to

0:10:10.600 --> 0:10:13.800
<v Speaker 1>mislead you, or at least not necessarily. Maybe it was

0:10:13.800 --> 0:10:17.000
<v Speaker 1>given direction to try and do that, but that's another matter.

0:10:17.400 --> 0:10:22.360
<v Speaker 1>It's just trying to complete its task and failing to

0:10:22.400 --> 0:10:26.640
<v Speaker 1>do so accurately. Sometimes the word or a series of

0:10:26.640 --> 0:10:30.400
<v Speaker 1>words can be wrong. Now, grammatically it could be correct,

0:10:30.480 --> 0:10:33.679
<v Speaker 1>but factually it could be completely made up. And why

0:10:33.720 --> 0:10:36.640
<v Speaker 1>this all happens, it does get really complicated. It's not

0:10:36.720 --> 0:10:40.440
<v Speaker 1>necessarily due to just one specific flaw. It's not always

0:10:40.480 --> 0:10:43.840
<v Speaker 1>the case that, oh, that data point didn't appear in

0:10:43.880 --> 0:10:47.120
<v Speaker 1>the data set for some reason, and so the computer

0:10:47.440 --> 0:10:50.400
<v Speaker 1>made something up. There are other issues that could also

0:10:50.440 --> 0:10:53.120
<v Speaker 1>be at play. So, for example, one possible reason for

0:10:53.160 --> 0:10:57.679
<v Speaker 1>hallucinations is something that's called overfitting. IBM defines this as

0:10:57.720 --> 0:11:01.600
<v Speaker 1>what happens quote when an algorithm fits too closely

0:11:01.800 --> 0:11:05.280
<v Speaker 1>or even exactly to its training data, resulting in a

0:11:05.320 --> 0:11:08.920
<v Speaker 1>model that can't make accurate predictions or conclusions from any

0:11:09.040 --> 0:11:12.440
<v Speaker 1>data other than the training data. End quote. That's from

0:11:12.440 --> 0:11:16.439
<v Speaker 1>a piece on IBM.com. It's titled "What Is Overfitting?"

0:11:16.800 --> 0:11:21.440
<v Speaker 1>Sometimes models get so complex or they're trained so closely

0:11:21.520 --> 0:11:24.800
<v Speaker 1>on a specific data set that they start to pick

0:11:24.880 --> 0:11:30.320
<v Speaker 1>up more noise than signal. They give significance to insignificant things.
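
NOTE
Editor's sketch, not part of the spoken audio. Overfitting as described here, fitting the training data so closely that noise gets treated as signal, can be shown with a tiny comparison. The data and the names memorizer, simple_line and mean_abs_error are hypothetical, and the nearest-point lookup is just one convenient stand-in for an overfitted model.
```python
# Hypothetical data: the true rule is y = 2 * x, and the training targets carry noise.
TRAIN = [(1.0, 2.5), (2.0, 3.5), (3.0, 6.5), (4.0, 7.5)]
TEST = [(1.2, 2.4), (2.2, 4.4), (3.2, 6.4)]
def memorizer(x):
    """Overfitted model: repeat the target of the nearest training point, noise included."""
    nearest = min(TRAIN, key=lambda point: abs(point[0] - x))
    return nearest[1]
# Least-squares slope for a line through the origin, fitted on the training data only.
SLOPE = sum(x * y for x, y in TRAIN) / sum(x * x for x, _ in TRAIN)
def simple_line(x):
    """Simpler model: a single slope that smooths over the noise."""
    return SLOPE * x
def mean_abs_error(model, data):
    """Average absolute gap between the model's predictions and the targets."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)
for name, model in [("memorizer", memorizer), ("simple_line", simple_line)]:
    print(name, mean_abs_error(model, TRAIN), mean_abs_error(model, TEST))
```
The memorizer is perfect on the data it was trained on and noticeably worse than the simple line on the unseen points, which is the pattern the IBM definition describes.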

0:11:30.600 --> 0:11:32.800
<v Speaker 1>I think of this kind of like the character Drax

0:11:33.000 --> 0:11:36.640
<v Speaker 1>in the Guardians of the Galaxy movies. Drax takes things literally,

0:11:37.000 --> 0:11:40.120
<v Speaker 1>so if you use a saying or an idiom on him,

0:11:40.480 --> 0:11:43.959
<v Speaker 1>he's likely to interpret what you're saying as being what

0:11:44.040 --> 0:11:47.760
<v Speaker 1>you mean. So if you say, oh, that's like throwing

0:11:47.800 --> 0:11:50.839
<v Speaker 1>the baby out with the bathwater, he would assume you're

0:11:50.880 --> 0:11:54.080
<v Speaker 1>talking about something you have literally done before in your life,

0:11:54.080 --> 0:11:56.880
<v Speaker 1>that you have literally thrown out a baby with bathwater,

0:11:57.320 --> 0:12:00.000
<v Speaker 1>and he would not understand you were using an analogy

0:12:00.600 --> 0:12:03.960
<v Speaker 1>to describe getting rid of important stuff along with the

0:12:04.040 --> 0:12:06.640
<v Speaker 1>unimportant stuff you want to get rid of. If a

0:12:06.720 --> 0:12:09.719
<v Speaker 1>model has been overfitted, if it's been trained too much

0:12:09.840 --> 0:12:12.679
<v Speaker 1>on a relatively narrow set of data, it might have

0:12:12.720 --> 0:12:16.800
<v Speaker 1>trouble taking what it has learned and generalizing those learnings

0:12:16.800 --> 0:12:20.440
<v Speaker 1>towards something else that's outside the data set. And rather

0:12:20.520 --> 0:12:23.080
<v Speaker 1>than saying I'm sorry, I don't know the answer to that,

0:12:23.440 --> 0:12:27.000
<v Speaker 1>it could produce an answer that follows the statistical rules

0:12:27.240 --> 0:12:29.640
<v Speaker 1>that the model is set to. In other words, it'll

0:12:29.800 --> 0:12:33.640
<v Speaker 1>create something that grammatically makes sense, but it won't necessarily

0:12:33.640 --> 0:12:38.079
<v Speaker 1>be relevant or, you know, make sense thematically.

0:12:38.640 --> 0:12:41.360
<v Speaker 1>So in this way, an AI model can become like

0:12:41.400 --> 0:12:44.760
<v Speaker 1>that stereotypical person in the car who absolutely refuses to

0:12:44.800 --> 0:12:47.080
<v Speaker 1>pull over and ask for directions when they get lost,

0:12:47.400 --> 0:12:50.160
<v Speaker 1>because that would be showing weakness. No, gosh darn it,

0:12:50.200 --> 0:12:53.079
<v Speaker 1>we'll somehow reason our way out of taking that wrong

0:12:53.200 --> 0:12:56.320
<v Speaker 1>turn forty five minutes ago. That'll fix everything. Except it

0:12:56.320 --> 0:12:59.360
<v Speaker 1>doesn't fix everything, and it can make things worse. But

0:12:59.400 --> 0:13:02.480
<v Speaker 1>it's not just pattern recognition that can trip up AI models.

0:13:02.760 --> 0:13:07.119
<v Speaker 1>Another issue is bias. I've talked about bias in other episodes,

0:13:07.280 --> 0:13:10.319
<v Speaker 1>but it's really important that we understand what we mean

0:13:10.360 --> 0:13:13.559
<v Speaker 1>when we're talking about bias and how it can happen, because

0:13:14.000 --> 0:13:16.520
<v Speaker 1>I think a lot of people get tripped up. They

0:13:16.559 --> 0:13:22.120
<v Speaker 1>think it's a machine, right, it doesn't possess opinions. How

0:13:22.160 --> 0:13:26.640
<v Speaker 1>can it have bias? Well, we'll explore that in just

0:13:26.800 --> 0:13:29.679
<v Speaker 1>a couple of moments, but first let's take a quick

0:13:29.720 --> 0:13:43.520
<v Speaker 1>break to thank our sponsors. How can an AI model

0:13:43.920 --> 0:13:47.880
<v Speaker 1>have bias? Well, the answer is that the machines that

0:13:47.920 --> 0:13:51.640
<v Speaker 1>AI runs on, the algorithms that AI is built upon,

0:13:51.920 --> 0:13:55.839
<v Speaker 1>all this stuff, it didn't just pop out of nowhere. Ultimately,

0:13:55.920 --> 0:13:59.200
<v Speaker 1>this stuff was designed, built, and programmed by human beings.

0:13:59.280 --> 0:14:01.880
<v Speaker 1>Even if you have had a piece of software that

0:14:02.080 --> 0:14:06.160
<v Speaker 1>was designed by AI, well, the AI that designed it

0:14:06.280 --> 0:14:08.959
<v Speaker 1>in turn had been designed by humans at least somewhere

0:14:09.000 --> 0:14:11.560
<v Speaker 1>down the line once you trace it back far enough. So

0:14:12.080 --> 0:14:16.360
<v Speaker 1>human beings absolutely do have biases, and those biases can

0:14:16.400 --> 0:14:20.920
<v Speaker 1>make their way into the routines and processes of machines.

0:14:21.480 --> 0:14:25.280
<v Speaker 1>MIT has a great introduction to AI hallucinations and bias

0:14:25.360 --> 0:14:28.040
<v Speaker 1>on a web page that has the fitting title "When

0:14:28.200 --> 0:14:32.800
<v Speaker 1>AI Gets It Wrong: Addressing AI Hallucinations and Bias." Now,

0:14:32.840 --> 0:14:35.600
<v Speaker 1>in that article, the author points out that AI has

0:14:35.640 --> 0:14:39.440
<v Speaker 1>had issues with bias for years and uses the example

0:14:39.600 --> 0:14:45.720
<v Speaker 1>of image analysis. The author cites a project called Gender Shades.

0:14:46.040 --> 0:14:51.440
<v Speaker 1>This was led by Joy Adowaa Buolamwini, and I apologize

0:14:51.760 --> 0:14:56.080
<v Speaker 1>for my pronunciation of the name. But the project examined

0:14:56.320 --> 0:15:02.280
<v Speaker 1>how an AI powered gender classification tool performed when presented

0:15:02.320 --> 0:15:06.880
<v Speaker 1>with subjects of varying genders, ethnicities, and skin tones from

0:15:06.960 --> 0:15:14.280
<v Speaker 1>the IARPA Janus Benchmark A data set, or IJB-A. This

0:15:14.320 --> 0:15:17.360
<v Speaker 1>is a database of facial images taken from various angles

0:15:17.400 --> 0:15:20.880
<v Speaker 1>and lighting conditions of lots of different people. It's used

0:15:20.880 --> 0:15:25.440
<v Speaker 1>as a government benchmark for testing stuff like facial recognition technologies. Now,

0:15:25.480 --> 0:15:30.640
<v Speaker 1>the project also used a gender classification benchmark from Adience,

0:15:31.240 --> 0:15:35.560
<v Speaker 1>and this was in part to try and address shortcomings

0:15:35.600 --> 0:15:40.840
<v Speaker 1>with the IJB-A benchmark set. Plus, due to

0:15:40.880 --> 0:15:43.360
<v Speaker 1>the limitations of both of these data sets, which I'll

0:15:43.360 --> 0:15:46.480
<v Speaker 1>talk about in just a moment, the project also outlines

0:15:46.520 --> 0:15:49.640
<v Speaker 1>a process to create a better data set for the

0:15:49.640 --> 0:15:54.160
<v Speaker 1>purposes of training technologies like facial recognition and gender classification.

0:15:54.720 --> 0:15:59.480
<v Speaker 1>The project aimed to test several gender classifier programs from

0:15:59.480 --> 0:16:04.120
<v Speaker 1>companies Microsoft and IBM, among others, all with regard to

0:16:04.320 --> 0:16:08.640
<v Speaker 1>quote gender, skin type, and the intersection of skin type

0:16:08.680 --> 0:16:12.640
<v Speaker 1>and gender end quote. So Joy found that the data

0:16:12.680 --> 0:16:17.360
<v Speaker 1>sets from IJB-A skewed male and toward lighter skin

0:16:17.480 --> 0:16:21.440
<v Speaker 1>tones, and skewed heavily. In fact,

0:16:21.480 --> 0:16:24.040
<v Speaker 1>she said between seventy nine point six percent and eighty

0:16:24.040 --> 0:16:26.640
<v Speaker 1>six point twenty four percent of all the images in

0:16:26.680 --> 0:16:31.040
<v Speaker 1>the database were of people with lighter skin tones, and

0:16:31.520 --> 0:16:34.440
<v Speaker 1>fewer than twenty five percent of all the images were

0:16:34.480 --> 0:16:38.480
<v Speaker 1>of women or female presenting people. Worse yet, only four

0:16:38.520 --> 0:16:41.760
<v Speaker 1>point four percent of all the images were of female

0:16:41.840 --> 0:16:46.880
<v Speaker 1>presenting people who had dark skin. Adience's data set had

0:16:46.920 --> 0:16:50.840
<v Speaker 1>a better distribution of photos, at least between genders. Female

0:16:50.840 --> 0:16:54.120
<v Speaker 1>presenting people made up fifty two percent of the images

0:16:54.160 --> 0:16:58.480
<v Speaker 1>in Adience's data set, but again, lighter skin tones made

0:16:58.520 --> 0:17:02.320
<v Speaker 1>up the majority of these images. Less than fifteen percent

0:17:02.360 --> 0:17:05.000
<v Speaker 1>of all the images in that data set contained people

0:17:05.080 --> 0:17:08.840
<v Speaker 1>of darker skin tones. So I'm sure you can already

0:17:08.920 --> 0:17:12.800
<v Speaker 1>see where this is going. If you train an AI

0:17:12.840 --> 0:17:18.159
<v Speaker 1>model on data that has a disproportionate emphasis on certain factors,

0:17:18.440 --> 0:17:23.720
<v Speaker 1>such as certain genders or certain skin tones, then you

0:17:23.760 --> 0:17:27.280
<v Speaker 1>would expect the AI to be better at handling cases

0:17:27.280 --> 0:17:31.760
<v Speaker 1>that fall into those categories, right? Like, if most of

0:17:31.800 --> 0:17:34.800
<v Speaker 1>the data you've fed to your AI model is of

0:17:35.040 --> 0:17:37.760
<v Speaker 1>men who have a lighter skin tone, then when you

0:17:37.800 --> 0:17:43.040
<v Speaker 1>are serving the AI model a picture of someone who's

0:17:43.320 --> 0:17:46.000
<v Speaker 1>male presenting and has a lighter skin tone, chances are

0:17:46.080 --> 0:17:49.600
<v Speaker 1>the tool's going to work better. If you are instead

0:17:50.160 --> 0:17:55.800
<v Speaker 1>feeding it images of people who fall outside those majority cases,

0:17:56.080 --> 0:17:59.000
<v Speaker 1>the AI tool is probably not going to work as

0:17:59.040 --> 0:18:02.159
<v Speaker 1>well with them, and that's exactly what Joy found in

0:18:02.200 --> 0:18:06.679
<v Speaker 1>her research. She discovered that gender classification tools from all

0:18:06.840 --> 0:18:10.920
<v Speaker 1>of the providers performed better with lighter skinned men than

0:18:10.960 --> 0:18:14.600
<v Speaker 1>with any other group. They performed the worst with darker

0:18:14.640 --> 0:18:17.959
<v Speaker 1>skinned women. Thus we have a bias in the system.
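
NOTE
Editor's sketch, not part of the spoken audio. One reason this kind of bias can go unnoticed is that a single overall accuracy figure is dominated by the largest group. The sketch below breaks accuracy out per group. The counts are invented purely for illustration and are not the Gender Shades results, and the names RESULTS, overall_accuracy and per_group_accuracy are hypothetical.
```python
# Hypothetical evaluation counts per group: (test images, correctly classified).
# These numbers are made up for the sketch and do not come from any real benchmark.
RESULTS = {
    "lighter-skinned men": (800, 792),
    "lighter-skinned women": (100, 93),
    "darker-skinned men": (60, 54),
    "darker-skinned women": (40, 28),
}
def overall_accuracy(results):
    """Accuracy over every image, regardless of group."""
    total = sum(count for count, _ in results.values())
    correct = sum(hits for _, hits in results.values())
    return correct / total
def per_group_accuracy(results):
    """Accuracy computed separately for each group."""
    return {group: hits / count for group, (count, hits) in results.items()}
by_group = per_group_accuracy(RESULTS)
worst_group = min(by_group, key=by_group.get)
print("overall:", overall_accuracy(RESULTS))
for group, value in by_group.items():
    print(group, value)
print("worst:", worst_group)
```
With these made-up counts the overall figure looks strong because most test images come from the majority group, while the smallest group fares far worse. Reporting results per group, as the project did, is what exposes the gap.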

0:18:18.320 --> 0:18:21.160
<v Speaker 1>The data that folks use to train these systems had

0:18:21.200 --> 0:18:25.320
<v Speaker 1>that bias, and it unsurprisingly affects how the AI does

0:18:25.359 --> 0:18:29.639
<v Speaker 1>its job. Now, this isn't just a curiosity for research labs,

0:18:29.680 --> 0:18:34.800
<v Speaker 1>of course. Around the world, various organizations and companies are

0:18:34.840 --> 0:18:38.720
<v Speaker 1>making use of facial recognition tools and gender classification tools.

0:18:39.119 --> 0:18:42.440
<v Speaker 1>There are numerous stories of law enforcement agencies getting into

0:18:42.480 --> 0:18:45.719
<v Speaker 1>hot water for relying on this kind of technology. So

0:18:46.000 --> 0:18:51.120
<v Speaker 1>we know that this technology isn't reliable, particularly if someone

0:18:51.240 --> 0:18:55.280
<v Speaker 1>belongs to a group that's outside of lighter skinned men,

0:18:55.840 --> 0:18:59.000
<v Speaker 1>and the data being used to train these tools is limited.

0:18:59.320 --> 0:19:02.760
<v Speaker 1>That's why we're having these issues, or one of the

0:19:02.840 --> 0:19:05.760
<v Speaker 1>main reasons why we're having these issues. So it stands

0:19:05.800 --> 0:19:09.080
<v Speaker 1>to reason we should not employ those tools for anything

0:19:09.560 --> 0:19:13.480
<v Speaker 1>really at all, other than maybe working to make them better.

0:19:13.680 --> 0:19:16.040
<v Speaker 1>But we definitely shouldn't be using them for things like

0:19:16.200 --> 0:19:19.160
<v Speaker 1>law enforcement, for example. At least we should not use

0:19:19.200 --> 0:19:23.320
<v Speaker 1>them until we can address the problem of bias. Generative

0:19:23.359 --> 0:19:27.399
<v Speaker 1>AI can actually have similar issues with bias. That MIT

0:19:27.640 --> 0:19:30.760
<v Speaker 1>article that I mentioned earlier in this episode cites another

0:19:30.880 --> 0:19:36.720
<v Speaker 1>article by Leonardo Nicoletti and Dina Bass titled "Humans Are Biased.

0:19:36.920 --> 0:19:41.520
<v Speaker 1>Generative AI Is Even Worse." This piece appeared in Bloomberg.

0:19:41.960 --> 0:19:46.000
<v Speaker 1>So this article explores how a generative AI platform called

0:19:46.080 --> 0:19:50.280
<v Speaker 1>Stable Diffusion had a tendency to make assumptions based on

0:19:50.440 --> 0:19:57.359
<v Speaker 1>racial and gender stereotypes, thus repeating and even amplifying those stereotypes.

0:19:57.760 --> 0:20:01.880
<v Speaker 1>Nicoletti and Bass performed an informal test with Stable Diffusion,

0:20:02.040 --> 0:20:05.520
<v Speaker 1>a pretty thorough one, but still informal. They asked Stable

0:20:05.560 --> 0:20:10.760
<v Speaker 1>Diffusion to generate images of people who were working one

0:20:10.800 --> 0:20:14.520
<v Speaker 1>of fourteen different jobs. Now, half of those jobs belonged

0:20:14.560 --> 0:20:18.119
<v Speaker 1>to what they called high paying positions, like things that

0:20:18.200 --> 0:20:21.720
<v Speaker 1>you would typically associate as a high paying job. The

0:20:21.800 --> 0:20:26.800
<v Speaker 1>other half typically were low paying jobs, and actually

0:20:26.840 --> 0:20:28.879
<v Speaker 1>a little less than half of them were low paying jobs.

0:20:28.920 --> 0:20:31.480
<v Speaker 1>Three of them actually fell into the category of crime,

0:20:31.880 --> 0:20:34.679
<v Speaker 1>so like you know, thief or something like that. The

0:20:34.840 --> 0:20:38.720
<v Speaker 1>two had Stable Diffusion generate more than five thousand images

0:20:38.760 --> 0:20:42.640
<v Speaker 1>total so that they could really compare. They didn't want

0:20:42.680 --> 0:20:46.040
<v Speaker 1>to just create, you know, a single image each that's

0:20:46.040 --> 0:20:48.359
<v Speaker 1>a terrible test. They wanted to see, all right, is

0:20:48.400 --> 0:20:51.960
<v Speaker 1>this something that's actually appearing over and over again when

0:20:52.000 --> 0:20:55.000
<v Speaker 1>we make use of this tool, or is it possible

0:20:55.080 --> 0:20:58.119
<v Speaker 1>that you know, you run fourteen tests and it just

0:20:58.320 --> 0:21:03.840
<v Speaker 1>happens to go along with racial stereotypes. Nope. They classified

0:21:03.880 --> 0:21:07.720
<v Speaker 1>the generated images based off of the Fitzpatrick skin scale.

0:21:08.240 --> 0:21:12.040
<v Speaker 1>This is actually a skin pigmentation metric that's used by

0:21:12.440 --> 0:21:16.359
<v Speaker 1>dermatologists as well as like other researchers, and the scale

0:21:16.440 --> 0:21:19.440
<v Speaker 1>goes from one to six, so one would be very

0:21:19.520 --> 0:21:23.080
<v Speaker 1>light skinned and six would be very dark skinned. The

0:21:23.440 --> 0:21:27.679
<v Speaker 1>researchers found that Stable Diffusion was far more likely to

0:21:27.680 --> 0:21:31.360
<v Speaker 1>create a person with a lighter skin tone for positions

0:21:31.400 --> 0:21:36.159
<v Speaker 1>that traditionally fall into the higher paid categories, and that

0:21:36.200 --> 0:21:38.760
<v Speaker 1>it was more likely to generate someone with a darker

0:21:38.800 --> 0:21:44.040
<v Speaker 1>skin tone for lower paid or criminal categories. What's more,

0:21:44.280 --> 0:21:48.080
<v Speaker 1>Stable Diffusion generated images of people appearing to be men

0:21:48.240 --> 0:21:52.159
<v Speaker 1>or male presenting for most of those higher paid positions.

0:21:52.280 --> 0:21:55.080
<v Speaker 1>It was very rare for it to generate the image

0:21:55.080 --> 0:21:58.560
<v Speaker 1>of a female presenting person in the role of one

0:21:58.600 --> 0:22:04.000
<v Speaker 1>of these traditionally higher paid jobs. So the AI was

0:22:04.040 --> 0:22:09.280
<v Speaker 1>perpetuating and amplifying these racial and gender stereotypes. This actually

0:22:09.280 --> 0:22:11.720
<v Speaker 1>reminds me of a classic riddle that was intended to

0:22:11.760 --> 0:22:14.040
<v Speaker 1>reveal bias. I'm sure most of you have heard this

0:22:14.119 --> 0:22:17.480
<v Speaker 1>before or some variation. So the riddle typically goes something

0:22:17.560 --> 0:22:20.080
<v Speaker 1>like this. A father and a son are in a

0:22:20.200 --> 0:22:23.680
<v Speaker 1>terrible car accident, and the father tragically dies at the scene.

0:22:24.040 --> 0:22:27.480
<v Speaker 1>The son is badly injured. EMTs arrive. They rush the

0:22:27.480 --> 0:22:30.600
<v Speaker 1>boy to a surgical ward. The surgeon on duty looks

0:22:30.600 --> 0:22:32.960
<v Speaker 1>at the boy and says, I can't operate on him,

0:22:33.520 --> 0:22:37.360
<v Speaker 1>he's my son. Well, how could that be true? Now,

0:22:37.400 --> 0:22:41.080
<v Speaker 1>the obvious answer is the surgeon is the boy's mother.

0:22:41.400 --> 0:22:43.399
<v Speaker 1>And I think a lot of people arrive at that

0:22:43.880 --> 0:22:47.600
<v Speaker 1>conclusion much more easily today than they did when I

0:22:47.760 --> 0:22:50.000
<v Speaker 1>was a kid. Like when I was a kid, the

0:22:50.160 --> 0:22:55.159
<v Speaker 1>sexist stereotype was that all real quote unquote real doctors

0:22:55.200 --> 0:23:00.760
<v Speaker 1>and surgeons were men, and women, they were nurses or administrators. Right?

0:23:00.840 --> 0:23:04.960
<v Speaker 1>That was the stereotype that people kind of believed in.

0:23:05.320 --> 0:23:08.320
<v Speaker 1>But I'm sure most of y'all understood this answer, or

0:23:08.440 --> 0:23:11.200
<v Speaker 1>you've been exposed to this riddle numerous times. I mean,

0:23:11.240 --> 0:23:13.479
<v Speaker 1>it is a meme at this point, but again, back

0:23:13.560 --> 0:23:15.200
<v Speaker 1>in my day, a lot of folks would likely get

0:23:15.240 --> 0:23:18.240
<v Speaker 1>stumped by this, or they would say something dumb like, oh,

0:23:18.240 --> 0:23:21.600
<v Speaker 1>it turns out the surgeon was the real dad and

0:23:21.680 --> 0:23:25.119
<v Speaker 1>the father who died at the scene had been the

0:23:25.160 --> 0:23:28.520
<v Speaker 1>adoptive father, he adopted the boy, or something along those lines,

0:23:28.520 --> 0:23:32.240
<v Speaker 1>which reveals the bias of the listener. It reminds the

0:23:32.320 --> 0:23:36.040
<v Speaker 1>listener to think critically and be aware of sexist stereotypes.

0:23:36.359 --> 0:23:39.760
<v Speaker 1>So AI can produce the wrong results due to bias

0:23:39.800 --> 0:23:44.320
<v Speaker 1>built into the underlying model and end up making these

0:23:44.320 --> 0:23:47.320
<v Speaker 1>same mistakes, right? Like if you say surgeon, it may

0:23:47.359 --> 0:23:51.439
<v Speaker 1>mistakenly just believe ah, you meant man. It has to

0:23:51.440 --> 0:23:55.240
<v Speaker 1>be a man that I generate in this image because

0:23:55.960 --> 0:23:59.399
<v Speaker 1>the user said surgeon, so that means man. That's a

0:23:59.400 --> 0:24:02.639
<v Speaker 1>real problem. With enough work and attention, we can actually

0:24:02.680 --> 0:24:07.240
<v Speaker 1>create training materials that minimize bias and can help reverse

0:24:07.359 --> 0:24:12.080
<v Speaker 1>this trend. But even doing that is not enough to

0:24:12.560 --> 0:24:17.119
<v Speaker 1>eliminate errors in generative AI. There are other problems we

0:24:17.200 --> 0:24:20.879
<v Speaker 1>have to look out for. So what happens when you

0:24:21.040 --> 0:24:26.120
<v Speaker 1>have an AI model, like a large language model, for example,

0:24:26.520 --> 0:24:30.280
<v Speaker 1>and part of the massive amount of material that it's

0:24:30.359 --> 0:24:34.560
<v Speaker 1>training itself on includes data sets that were generated by

0:24:34.680 --> 0:24:38.720
<v Speaker 1>other AI? When an AI image generator is pulling images

0:24:38.760 --> 0:24:41.760
<v Speaker 1>that were made by other image generators and then training

0:24:41.800 --> 0:24:44.800
<v Speaker 1>itself on that, or you know, even if it's pulling

0:24:45.240 --> 0:24:49.000
<v Speaker 1>images that an earlier version of that very same generator

0:24:49.040 --> 0:24:53.879
<v Speaker 1>had created, the mistakes that exist in those AI generated images,

0:24:54.440 --> 0:24:56.840
<v Speaker 1>or, you know, if we're not talking images, like

0:24:56.920 --> 0:25:01.000
<v Speaker 1>in text or whatever, those things can become like you

0:25:01.000 --> 0:25:05.080
<v Speaker 1>would argue, oh, those things are noise, right, that's those

0:25:05.080 --> 0:25:09.280
<v Speaker 1>are mistakes. But AI doesn't know that they're mistakes. They don't.

0:25:09.280 --> 0:25:11.840
<v Speaker 1>It doesn't know that it's noise. If you're training it

0:25:11.880 --> 0:25:14.359
<v Speaker 1>on the data, it thinks it's significant. And if it

0:25:14.400 --> 0:25:18.240
<v Speaker 1>thinks it's significant, it's going to incorporate it and perhaps

0:25:18.520 --> 0:25:22.359
<v Speaker 1>even dial it up quite a bit. So a great

0:25:22.400 --> 0:25:25.800
<v Speaker 1>way of illustrating this, in my opinion, is to talk

0:25:25.840 --> 0:25:29.280
<v Speaker 1>about fingers. I mean, I'm sure all of you out

0:25:29.320 --> 0:25:34.720
<v Speaker 1>there have experienced seeing AI generated images that hilariously get

0:25:34.760 --> 0:25:38.280
<v Speaker 1>the fingers totally wrong. A lot of AI image generators

0:25:38.280 --> 0:25:43.320
<v Speaker 1>have real problems with fingers, So you might have folks

0:25:43.400 --> 0:25:46.440
<v Speaker 1>and images who wind up with way too many fingers,

0:25:46.840 --> 0:25:49.560
<v Speaker 1>like seven or eight per hand, or maybe they have not

0:25:49.800 --> 0:25:53.679
<v Speaker 1>enough fingers, or maybe all their fingers are thumbs, or

0:25:53.720 --> 0:25:56.600
<v Speaker 1>maybe they bend in unnatural ways, or they all look

0:25:56.680 --> 0:26:00.960
<v Speaker 1>like long strands of spaghetti. These are clearly mistakes. You know,

0:26:01.040 --> 0:26:05.359
<v Speaker 1>image generators have identified fingers are appendages, and these appendages

0:26:05.440 --> 0:26:09.000
<v Speaker 1>attached to hands. But the machines don't really follow the

0:26:09.080 --> 0:26:12.320
<v Speaker 1>rules when it comes to portraying those fingers, and they

0:26:12.680 --> 0:26:16.200
<v Speaker 1>do the best they can, and sometimes the best they

0:26:16.200 --> 0:26:21.760
<v Speaker 1>can is hilariously bad. But if image generator models train

0:26:22.160 --> 0:26:26.320
<v Speaker 1>on material that was created by AI, those weird fingers

0:26:26.440 --> 0:26:30.080
<v Speaker 1>are seen as a feature, not a bug. Like the

0:26:30.280 --> 0:26:33.639
<v Speaker 1>AI model doesn't know, Oh, fingers don't actually look like that,

0:26:33.640 --> 0:26:37.440
<v Speaker 1>that's wrong. It just says, ah, this is how fingers

0:26:37.480 --> 0:26:40.119
<v Speaker 1>sometimes look based upon these images I've been trained on,

0:26:40.480 --> 0:26:44.440
<v Speaker 1>which means the next generation of image generators will stress

0:26:44.520 --> 0:26:48.040
<v Speaker 1>these features more instead of correcting for them, which means

0:26:48.080 --> 0:26:50.880
<v Speaker 1>you're going to get some really weird images as a result.

0:26:51.359 --> 0:26:55.119
<v Speaker 1>And this process can repeat itself, and it gets worse

0:26:55.160 --> 0:26:58.480
<v Speaker 1>and worse each time. It's like making a copy of

0:26:58.520 --> 0:27:02.240
<v Speaker 1>a copy of a copy. You eventually reach a point

0:27:02.280 --> 0:27:06.359
<v Speaker 1>where the copy you have produced is illegible or doesn't

0:27:06.400 --> 0:27:08.879
<v Speaker 1>look enough like the original at all for you to

0:27:08.920 --> 0:27:12.360
<v Speaker 1>even easily say, oh, this is a copy of that.

0:27:12.359 --> 0:27:15.240
<v Speaker 1>That can be a real problem. And of course this

0:27:15.320 --> 0:27:18.320
<v Speaker 1>is just one example, the fingers in AI. That's an

0:27:18.359 --> 0:27:23.480
<v Speaker 1>easy mark to hit, right, but there are countless other examples.

0:27:23.960 --> 0:27:27.639
<v Speaker 1>In a paper titled The Curse of Recursion: Training on

0:27:27.840 --> 0:27:32.080
<v Speaker 1>Generated Data Makes Models Forget, a group of researchers from

0:27:32.160 --> 0:27:36.800
<v Speaker 1>the University of Cambridge, Oxford University, Imperial College London, the

0:27:36.920 --> 0:27:40.639
<v Speaker 1>University of Edinburgh, and the University of Toronto present an

0:27:40.800 --> 0:27:45.520
<v Speaker 1>argument of a pretty bleak future if AI researchers don't

0:27:45.600 --> 0:27:49.040
<v Speaker 1>take the proper measures to head it off. We're going

0:27:49.119 --> 0:27:52.000
<v Speaker 1>to talk more about that in just a moment, but

0:27:52.119 --> 0:28:05.680
<v Speaker 1>first let's take another quick break to thank our sponsors. Okay,

0:28:05.800 --> 0:28:09.520
<v Speaker 1>before the break, I mentioned this paper, The Curse of Recursion:

0:28:09.760 --> 0:28:13.840
<v Speaker 1>Training on Generated Data Makes Models Forget. It's a great article.

0:28:13.960 --> 0:28:17.720
<v Speaker 1>It does get very technical at one point, but the

0:28:17.760 --> 0:28:22.800
<v Speaker 1>researchers did a great job explaining the top level problem

0:28:23.040 --> 0:28:26.360
<v Speaker 1>and the potential outcome of that problem in a way

0:28:26.359 --> 0:28:29.040
<v Speaker 1>that I think anyone could find accessible. When you get

0:28:29.080 --> 0:28:32.440
<v Speaker 1>to the actual analysis part, that's when it gets really technical.

0:28:32.600 --> 0:28:36.280
<v Speaker 1>But the summary, the conclusions, all of that, I think

0:28:36.320 --> 0:28:41.040
<v Speaker 1>is easy to understand. So in that paper, the researchers say, quote,

0:28:41.160 --> 0:28:45.560
<v Speaker 1>we discover that learning from data produced by other models

0:28:45.600 --> 0:28:51.800
<v Speaker 1>causes model collapse, a degenerative process whereby over time models

0:28:51.840 --> 0:28:57.360
<v Speaker 1>forget the true underlying data distribution end quote. So essentially,

0:28:57.960 --> 0:29:01.880
<v Speaker 1>these AI models will quote unquote forget information while

0:29:01.960 --> 0:29:10.080
<v Speaker 1>simultaneously a set of learned behaviors they have created through synthesizing

0:29:10.080 --> 0:29:13.240
<v Speaker 1>all this information will begin to converge and lead to

0:29:13.400 --> 0:29:16.760
<v Speaker 1>a broken model that's no longer really useful. It won't

0:29:16.760 --> 0:29:21.640
<v Speaker 1>present anything that's of real value. So the researchers argue

0:29:21.640 --> 0:29:26.080
<v Speaker 1>that, quote, the use of LLMs at scale to publish

0:29:26.160 --> 0:29:29.720
<v Speaker 1>content on the Internet will pollute the collection of data

0:29:29.800 --> 0:29:34.080
<v Speaker 1>to train them, end quote. That's bad news, and it's

0:29:34.160 --> 0:29:37.040
<v Speaker 1>definitely going to be an issue, particularly with sites that

0:29:37.080 --> 0:29:41.720
<v Speaker 1>fall into the content farm category, because it's already happening, right?

0:29:42.160 --> 0:29:45.160
<v Speaker 1>There are already websites out there that have turned to

0:29:45.280 --> 0:29:50.160
<v Speaker 1>AI generation to flesh out the articles that they have

0:29:50.680 --> 0:29:55.600
<v Speaker 1>in their database, and these articles are of a varying quality,

0:29:55.920 --> 0:29:58.880
<v Speaker 1>and all of those are getting scooped up in a future

0:29:59.040 --> 0:30:03.600
<v Speaker 1>AI model session and used side by side with articles

0:30:03.640 --> 0:30:07.920
<v Speaker 1>that were researched, written, and edited by human beings and

0:30:07.960 --> 0:30:12.320
<v Speaker 1>therefore potentially at least of higher quality. I'm not saying

0:30:12.360 --> 0:30:16.040
<v Speaker 1>that all human written articles and edited articles are great.

0:30:16.440 --> 0:30:19.160
<v Speaker 1>They're not. There's some bad stuff out there that human

0:30:19.160 --> 0:30:24.120
<v Speaker 1>beings have written. But with those steps in place, you

0:30:24.240 --> 0:30:29.640
<v Speaker 1>have the potential for really great work. With AI, you

0:30:29.680 --> 0:30:33.040
<v Speaker 1>don't necessarily get that. You hope you get it, but

0:30:33.120 --> 0:30:36.720
<v Speaker 1>there's no guarantee, and there aren't enough, I would say,

0:30:37.720 --> 0:30:40.960
<v Speaker 1>safety valves to make sure that things don't go off

0:30:40.960 --> 0:30:44.520
<v Speaker 1>the rails. So getting back to content farms, if you

0:30:44.600 --> 0:30:48.200
<v Speaker 1>are unfamiliar with that term, well don't worry. You've almost

0:30:48.280 --> 0:30:51.000
<v Speaker 1>certainly come across a content farm at some point in

0:30:51.040 --> 0:30:54.440
<v Speaker 1>the past. So these are sites that just churn out

0:30:54.480 --> 0:30:58.880
<v Speaker 1>an enormous amount of content, typically in an effort to

0:30:58.960 --> 0:31:03.320
<v Speaker 1>tap into the sweet sweet waters of SEO, which stands

0:31:03.320 --> 0:31:07.120
<v Speaker 1>for search engine optimization. So for a lot of websites

0:31:07.200 --> 0:31:11.120
<v Speaker 1>out there, the majority of traffic coming to the website

0:31:11.440 --> 0:31:14.600
<v Speaker 1>comes courtesy of a search engine. And when I say

0:31:14.760 --> 0:31:17.040
<v Speaker 1>a search engine, you might as well fill in the

0:31:17.080 --> 0:31:19.840
<v Speaker 1>name Google there, because that's the big one. I mean,

0:31:19.840 --> 0:31:22.560
<v Speaker 1>there are other search engines out there, and some of

0:31:22.560 --> 0:31:26.520
<v Speaker 1>them do contribute to this too, but Google commands somewhere

0:31:26.560 --> 0:31:30.400
<v Speaker 1>between eighty and like ninety five percent of the search market.

0:31:30.480 --> 0:31:33.840
<v Speaker 1>Exactly where it falls is a matter of debate.

0:31:34.080 --> 0:31:37.960
<v Speaker 1>Like I looked at a few different Internet analytics sites, right,

0:31:38.040 --> 0:31:41.600
<v Speaker 1>and they had different percentages, but they were always above

0:31:41.680 --> 0:31:44.360
<v Speaker 1>eighty percent, and some as high as like ninety two

0:31:44.480 --> 0:31:47.080
<v Speaker 1>or ninety three. So it's safe to say that Google

0:31:47.200 --> 0:31:50.440
<v Speaker 1>dominates the search space. You know, technically it may not

0:31:50.480 --> 0:31:53.600
<v Speaker 1>be a monopoly, but effectively it kind of is. So

0:31:54.320 --> 0:31:58.800
<v Speaker 1>sites that depend on traffic from search naturally want to

0:31:58.840 --> 0:32:01.960
<v Speaker 1>find ways for their pages to rank high in

0:32:02.040 --> 0:32:05.560
<v Speaker 1>search results and to appear in more search results. Now

0:32:05.560 --> 0:32:09.200
<v Speaker 1>that's actually easier said than done. Google has changed its

0:32:09.240 --> 0:32:13.000
<v Speaker 1>page ranking algorithm a few different times, and some search

0:32:13.000 --> 0:32:16.960
<v Speaker 1>results are dependent upon who is doing the searching. That

0:32:17.040 --> 0:32:19.800
<v Speaker 1>means that you and I might each search for the

0:32:19.880 --> 0:32:23.560
<v Speaker 1>exact same thing, maybe we word it the exact same way,

0:32:24.080 --> 0:32:27.960
<v Speaker 1>but we'll end up getting different results. Google says, quote

0:32:28.200 --> 0:32:31.720
<v Speaker 1>personalization is only used in your results if it can

0:32:31.760 --> 0:32:36.080
<v Speaker 1>provide more relevant and helpful information end quote. So presumably

0:32:36.120 --> 0:32:38.760
<v Speaker 1>it doesn't happen all the time. That means that in

0:32:38.800 --> 0:32:41.680
<v Speaker 1>some cases you and I will get identical results depending

0:32:41.720 --> 0:32:44.080
<v Speaker 1>upon what it is we're searching for, and in other

0:32:44.160 --> 0:32:48.360
<v Speaker 1>cases we will get very different search results. I do

0:32:48.520 --> 0:32:52.840
<v Speaker 1>know this makes SEO a much larger challenge because it's

0:32:52.880 --> 0:32:56.320
<v Speaker 1>impossible to be all things to all people. You know,

0:32:56.560 --> 0:32:59.120
<v Speaker 1>you can only do the best you can to try

0:32:59.200 --> 0:33:02.440
<v Speaker 1>and show up for any given search query. It is

0:33:02.600 --> 0:33:07.880
<v Speaker 1>super duper hard if you're dependent upon human writers and

0:33:08.000 --> 0:33:11.240
<v Speaker 1>editors to generate all the stuff that you're shoving out

0:33:11.320 --> 0:33:14.719
<v Speaker 1>in an effort to get clicks. So most of your

0:33:14.760 --> 0:33:17.360
<v Speaker 1>traffic is coming from search. We talked about this already.

0:33:17.640 --> 0:33:20.120
<v Speaker 1>You need to have lots of stuff on your site

0:33:20.160 --> 0:33:23.560
<v Speaker 1>that people could be searching for so that traffic comes

0:33:23.600 --> 0:33:25.800
<v Speaker 1>your way, and that way you can make money through

0:33:25.920 --> 0:33:29.680
<v Speaker 1>web advertising. Essentially, you could try to be reactionary, right?

0:33:29.760 --> 0:33:33.360
<v Speaker 1>You could try to generate new content as things capture

0:33:33.400 --> 0:33:36.360
<v Speaker 1>the public interest, but you run the danger of

0:33:36.440 --> 0:33:39.440
<v Speaker 1>getting to the party too late and that by the

0:33:39.480 --> 0:33:42.080
<v Speaker 1>time you have something up, no one's talking about it

0:33:42.080 --> 0:33:45.480
<v Speaker 1>anymore and you're not really seeing any real traffic from that.

0:33:46.000 --> 0:33:48.000
<v Speaker 1>What if instead you could just kind of open up

0:33:48.040 --> 0:33:52.640
<v Speaker 1>a fire hose of content using generative AI. Well, if

0:33:52.640 --> 0:33:55.360
<v Speaker 1>you just had AI write a whole bunch of articles

0:33:55.360 --> 0:33:59.640
<v Speaker 1>in the style that you've established for your company, and maybe,

0:33:59.680 --> 0:34:02.520
<v Speaker 1>if you're feeling a little cautious, you'll even employ a

0:34:02.520 --> 0:34:05.040
<v Speaker 1>couple of human being editors to take on the job

0:34:05.080 --> 0:34:08.399
<v Speaker 1>of reading over these generated articles and to correct any

0:34:08.440 --> 0:34:11.279
<v Speaker 1>mistakes that were made, and perhaps even tweak a couple

0:34:11.320 --> 0:34:13.040
<v Speaker 1>of things here and there to make it sound more

0:34:13.120 --> 0:34:16.680
<v Speaker 1>human if necessary. But now you can push out way

0:34:16.880 --> 0:34:20.040
<v Speaker 1>more content without having to wait on human writers to

0:34:20.120 --> 0:34:24.879
<v Speaker 1>research and write everything. Plus, AI does not complain if

0:34:24.880 --> 0:34:27.560
<v Speaker 1>you assign it to write a suite of articles about

0:34:27.560 --> 0:34:31.279
<v Speaker 1>gluten free skincare products. By the way, I'm using my

0:34:31.440 --> 0:34:34.920
<v Speaker 1>real world life experience with that last example. I once

0:34:35.080 --> 0:34:39.080
<v Speaker 1>got that writing assignment. It was dumb then and it's

0:34:39.320 --> 0:34:42.040
<v Speaker 1>dumb now, But I guess people were searching for it,

0:34:42.160 --> 0:34:44.640
<v Speaker 1>so I got an assignment to write it. Now. I

0:34:44.680 --> 0:34:47.440
<v Speaker 1>would like to think that the site I was writing for,

0:34:47.520 --> 0:34:51.680
<v Speaker 1>which was HowStuffWorks.com, wasn't really a content farm.

0:34:52.040 --> 0:34:54.360
<v Speaker 1>I would love to think that, and I would argue

0:34:54.360 --> 0:34:56.680
<v Speaker 1>that for many years when I wrote there, it did

0:34:56.680 --> 0:34:59.560
<v Speaker 1>not qualify as a content farm. We did try to

0:34:59.600 --> 0:35:05.040
<v Speaker 1>write in depth, authoritative articles about all sorts of stuff,

0:35:05.640 --> 0:35:09.600
<v Speaker 1>like whether we were talking about technology or society, or

0:35:09.719 --> 0:35:14.400
<v Speaker 1>money or entertainment, whatever it might be. We applied rigor,

0:35:14.840 --> 0:35:19.160
<v Speaker 1>you know, journalistic rigor, toward the research and writing and

0:35:19.280 --> 0:35:23.239
<v Speaker 1>editing of those pieces. Over time, things changed where we

0:35:23.280 --> 0:35:27.040
<v Speaker 1>started to cater more toward ad deals, where we would

0:35:27.160 --> 0:35:30.200
<v Speaker 1>get this big ad deal with a company like a

0:35:30.880 --> 0:35:34.240
<v Speaker 1>you know, cosmetics company, for example, and we would suddenly

0:35:34.280 --> 0:35:39.720
<v Speaker 1>have hundreds of articles assigned in the field of cosmetics,

0:35:40.160 --> 0:35:44.160
<v Speaker 1>articles that were incredibly niche. Like, there was no way

0:35:44.239 --> 0:35:46.360
<v Speaker 1>that they were going to drive a ton of traffic. But

0:35:46.560 --> 0:35:51.600
<v Speaker 1>collectively then these articles could get a lot of traffic.

0:35:51.960 --> 0:35:54.760
<v Speaker 1>Not a single one, but across the board. If someone

0:35:54.880 --> 0:35:58.120
<v Speaker 1>happened to be searching for this thing, they could find

0:35:58.120 --> 0:36:00.840
<v Speaker 1>their way to our article and that would be another

0:36:00.920 --> 0:36:03.759
<v Speaker 1>click coming our way. It was very much a

0:36:03.800 --> 0:36:08.719
<v Speaker 1>shotgun approach to writing content. I hated it. There were

0:36:08.840 --> 0:36:12.439
<v Speaker 1>articles I wrote that I am not at all. It's

0:36:12.440 --> 0:36:14.319
<v Speaker 1>not that I'm not proud of the work I did.

0:36:14.360 --> 0:36:17.120
<v Speaker 1>I'm not proud of getting the assignment, like it was

0:36:17.160 --> 0:36:20.600
<v Speaker 1>a joke in my opinion, But that's what we were

0:36:20.760 --> 0:36:23.279
<v Speaker 1>trying to do in order to survive. Because again,

0:36:23.320 --> 0:36:25.439
<v Speaker 1>HowStuffWorks was like one of these websites in that

0:36:25.800 --> 0:36:28.240
<v Speaker 1>most of the traffic coming through HowStuffWorks came

0:36:28.400 --> 0:36:31.319
<v Speaker 1>through a search engine. Someone was looking to learn how

0:36:31.400 --> 0:36:34.840
<v Speaker 1>something worked and they got sent our way. People weren't,

0:36:34.920 --> 0:36:37.360
<v Speaker 1>as a rule, just coming to HowStuffWorks to

0:36:37.719 --> 0:36:40.640
<v Speaker 1>peruse the website. We always wanted that. That was what

0:36:40.760 --> 0:36:43.759
<v Speaker 1>our goal was, to create a destination website that people

0:36:43.760 --> 0:36:46.000
<v Speaker 1>would want to go to just to see, oh, what's

0:36:46.120 --> 0:36:48.759
<v Speaker 1>new on the site, But we never really achieved that.

0:36:48.920 --> 0:36:51.279
<v Speaker 1>It's a really hard thing to do. There are people

0:36:51.320 --> 0:36:54.200
<v Speaker 1>who do it and it's amazing, but it's not easy

0:36:54.200 --> 0:36:59.160
<v Speaker 1>to replicate. So instead we wrote tons of articles about

0:36:59.200 --> 0:37:03.319
<v Speaker 1>stuff that people were searching for, and that just kind

0:37:03.320 --> 0:37:07.239
<v Speaker 1>of was our MO at that point. Anyway, if you're

0:37:07.320 --> 0:37:10.480
<v Speaker 1>using AI to create these kinds of articles, it's going

0:37:10.520 --> 0:37:12.960
<v Speaker 1>to generate a lot of stuff that's just not very good.

0:37:13.080 --> 0:37:17.319
<v Speaker 1>But then who cares? Like you don't necessarily care if

0:37:17.360 --> 0:37:22.719
<v Speaker 1>the material is good. If the only traffic you're really

0:37:22.719 --> 0:37:26.880
<v Speaker 1>getting on your website is coming from search engines, you

0:37:27.080 --> 0:37:29.960
<v Speaker 1>just need it to show up in the search engines. Now,

0:37:30.000 --> 0:37:32.879
<v Speaker 1>if the search engine is able to determine, hey, this

0:37:32.960 --> 0:37:37.560
<v Speaker 1>is low quality content, and it disincentivizes people visiting by

0:37:37.920 --> 0:37:40.799
<v Speaker 1>making it go further down the search results, then you're

0:37:40.800 --> 0:37:42.960
<v Speaker 1>going to have a problem, and a lot of content

0:37:43.000 --> 0:37:46.920
<v Speaker 1>farms ran into that problem. Google downgraded content farms in

0:37:47.000 --> 0:37:51.000
<v Speaker 1>their search algorithm. Other sites like DuckDuckGo removed

0:37:51.320 --> 0:37:56.600
<v Speaker 1>websites that were considered content farms because the people running

0:37:56.680 --> 0:38:00.720
<v Speaker 1>DuckDuckGo realized, hey, these sites aren't providing anything

0:38:00.760 --> 0:38:03.960
<v Speaker 1>of real value to visitors. Why are we even serving

0:38:03.960 --> 0:38:07.640
<v Speaker 1>it up. That's not really a good use of anyone's time.

0:38:08.080 --> 0:38:12.439
<v Speaker 1>But if you're in a space where the jig isn't

0:38:12.520 --> 0:38:14.520
<v Speaker 1>up yet, you might as well just go ahead and

0:38:14.560 --> 0:38:17.080
<v Speaker 1>create as much garbage as you can because you just

0:38:17.120 --> 0:38:19.839
<v Speaker 1>want the clicks. You don't care if people actually think

0:38:19.880 --> 0:38:23.200
<v Speaker 1>the articles are of good quality or that they're going

0:38:23.239 --> 0:38:26.239
<v Speaker 1>to learn anything useful. You don't even necessarily care if

0:38:26.239 --> 0:38:29.280
<v Speaker 1>the articles are accurate. You care that people are clicking

0:38:29.360 --> 0:38:34.720
<v Speaker 1>on the articles. So if that's your perspective, ultimately, then

0:38:35.239 --> 0:38:38.000
<v Speaker 1>the goal for you is to push as much of

0:38:38.000 --> 0:38:40.800
<v Speaker 1>this stuff out the door as you possibly can, generate

0:38:40.840 --> 0:38:43.520
<v Speaker 1>it as fast as possible, get it online as quickly

0:38:43.560 --> 0:38:47.520
<v Speaker 1>as you can, and hope that it starts to rank in

0:38:47.600 --> 0:38:51.600
<v Speaker 1>search so that people flood in to read about whatever

0:38:51.600 --> 0:38:54.640
<v Speaker 1>it is you're writing about. But it's not just people

0:38:55.040 --> 0:38:58.040
<v Speaker 1>who are going to your links, is it. There are

0:38:58.200 --> 0:39:01.160
<v Speaker 1>bots crawling the web now. Some of them are crawling

0:39:01.200 --> 0:39:04.000
<v Speaker 1>the web in order to index those web pages for

0:39:04.080 --> 0:39:07.240
<v Speaker 1>the purposes of things like search engines, but other bots

0:39:07.239 --> 0:39:10.720
<v Speaker 1>are there to scrape data for the purposes of training

0:39:10.760 --> 0:39:15.040
<v Speaker 1>the next generation of large language models. Essentially, at this point,

0:39:15.120 --> 0:39:18.120
<v Speaker 1>bots are reading articles that were written by other bots,

0:39:18.440 --> 0:39:22.160
<v Speaker 1>and so when the next large language model launches, it

0:39:22.160 --> 0:39:24.600
<v Speaker 1>does so on a data set that has been polluted

0:39:24.880 --> 0:39:29.080
<v Speaker 1>by bot generated information. That means the next generation will

0:39:29.080 --> 0:39:31.759
<v Speaker 1>be even worse, and so on, and eventually we arrive

0:39:31.800 --> 0:39:35.840
<v Speaker 1>at a point where the Internet, this amazing invention that

0:39:35.920 --> 0:39:40.520
<v Speaker 1>provides access to practically all of human knowledge, becomes absolutely

0:39:40.800 --> 0:39:46.560
<v Speaker 1>infested with junk that is inaccurate and increasingly nonsensical, and

0:39:46.680 --> 0:39:54.480
<v Speaker 1>we render this incredible invention useless. This isn't just speculation either.

0:39:54.880 --> 0:39:58.320
<v Speaker 1>We have examples of companies turning to AI to generate articles.

0:39:58.520 --> 0:40:01.680
<v Speaker 1>CNET famously did this early in the days of

0:40:01.760 --> 0:40:06.440
<v Speaker 1>generative AI, and CNET properly got roasted for doing it,

0:40:06.480 --> 0:40:09.040
<v Speaker 1>first roasted for not presenting it in a way that

0:40:09.120 --> 0:40:12.960
<v Speaker 1>was transparent, and then also for including articles that just

0:40:13.000 --> 0:40:17.920
<v Speaker 1>had outright wrong information in them and publishing them as

0:40:18.000 --> 0:40:21.399
<v Speaker 1>if they were vetted pieces that editors had gone through.

0:40:21.640 --> 0:40:25.200
<v Speaker 1>HowStuffWorks, again, my old employer where I got

0:40:25.200 --> 0:40:28.319
<v Speaker 1>that skincare writing assignment once upon a time, they've done

0:40:28.360 --> 0:40:31.680
<v Speaker 1>this too. They laid off their human writers, they stopped

0:40:32.160 --> 0:40:36.080
<v Speaker 1>giving assignments to freelancers. Later on they laid off the

0:40:36.239 --> 0:40:40.399
<v Speaker 1>entire editorial staff after the editors protested this move toward

0:40:40.560 --> 0:40:46.400
<v Speaker 1>AI-generated content. This trend is happening. Not only are

0:40:46.520 --> 0:40:49.040
<v Speaker 1>talented people being put out of work, which is bad

0:40:49.160 --> 0:40:52.400
<v Speaker 1>enough already. These editors and writers, they believed in what

0:40:52.440 --> 0:40:56.960
<v Speaker 1>they were doing. Yeah, sometimes the assignments stank, sometimes they

0:40:57.000 --> 0:41:01.400
<v Speaker 1>were not good, but the writers and editors still believed

0:41:01.440 --> 0:41:04.080
<v Speaker 1>in doing as good a job as they possibly could.

0:41:04.520 --> 0:41:08.239
<v Speaker 1>But their replacements, the AI, they're just making the Internet

0:41:08.520 --> 0:41:12.839
<v Speaker 1>worse by generating unreliable and terrible content that no one

0:41:12.960 --> 0:41:15.960
<v Speaker 1>actually wants to read unless they just happen to put

0:41:16.200 --> 0:41:20.799
<v Speaker 1>that particular set of terms into a search engine and

0:41:20.880 --> 0:41:24.840
<v Speaker 1>the search engine couldn't find anything better to serve them. Again,

0:41:25.200 --> 0:41:28.880
<v Speaker 1>it's as if you needed to learn something important, but

0:41:29.000 --> 0:41:32.320
<v Speaker 1>all you have access to are just sloppily written articles

0:41:32.360 --> 0:41:35.680
<v Speaker 1>by people who had no understanding or passion about the

0:41:35.680 --> 0:41:38.480
<v Speaker 1>subject matter they were writing on, and there were no

0:41:38.680 --> 0:41:41.840
<v Speaker 1>editors to steer the writer toward creating a more accurate

0:41:42.000 --> 0:41:47.520
<v Speaker 1>or informative piece. It gets pretty darn bleak. Is it inevitable, though? No,

0:41:47.800 --> 0:41:51.560
<v Speaker 1>it's not inevitable. This future happens if the people who

0:41:51.600 --> 0:41:54.799
<v Speaker 1>are training the AI models allow it to happen, but

0:41:54.920 --> 0:41:58.719
<v Speaker 1>with careful stewardship, by guiding the AI models so that

0:41:58.800 --> 0:42:01.920
<v Speaker 1>they don't pull training data from garbage sites and they

0:42:02.080 --> 0:42:07.120
<v Speaker 1>really focus on reputable sources, it's possible to avoid these

0:42:07.160 --> 0:42:10.080
<v Speaker 1>issues at least in some part. I mean some things

0:42:10.120 --> 0:42:13.960
<v Speaker 1>like hallucinations, confabulations, that kind of stuff, that can happen anyway,

0:42:14.200 --> 0:42:17.960
<v Speaker 1>but you can at least limit it. That's not really

0:42:18.040 --> 0:42:20.240
<v Speaker 1>what we're seeing right now, though, because at the moment,

0:42:20.360 --> 0:42:23.959
<v Speaker 1>companies are rushing into the AI space, right? They are

0:42:24.160 --> 0:42:29.000
<v Speaker 1>pushing so hard to create large language models that dwarf

0:42:29.200 --> 0:42:32.799
<v Speaker 1>the previous generation's capabilities. So to do that, they have

0:42:32.880 --> 0:42:35.960
<v Speaker 1>to seek out training data from all across the internet.

0:42:36.160 --> 0:42:38.880
<v Speaker 1>You have to train these AI models on tons and

0:42:38.920 --> 0:42:42.080
<v Speaker 1>tons of information to make them useful. The more data

0:42:42.120 --> 0:42:45.640
<v Speaker 1>you have access to, the better. Social platforms have provided

0:42:45.680 --> 0:42:49.160
<v Speaker 1>a popular source of information. We know that Reddit has

0:42:49.200 --> 0:42:52.960
<v Speaker 1>struck deals with OpenAI, for example, in order to

0:42:53.440 --> 0:42:57.280
<v Speaker 1>crawl Reddit to pull information. But you know what, social

0:42:57.280 --> 0:43:01.280
<v Speaker 1>platforms are also really popular with bots, not just with people.

0:43:01.560 --> 0:43:04.440
<v Speaker 1>So even this approach brings with it the risk of

0:43:04.520 --> 0:43:08.920
<v Speaker 1>AI training on other AI-generated data, which again leads

0:43:08.960 --> 0:43:13.480
<v Speaker 1>to model collapse further down the road. I might one

0:43:13.560 --> 0:43:16.560
<v Speaker 1>day do a much more in depth episode about this paper,

0:43:16.840 --> 0:43:21.160
<v Speaker 1>"The Curse of Recursion: Training on Generated Data Makes Models Forget."

0:43:21.440 --> 0:43:24.200
<v Speaker 1>I've given a very high level summary of what the

0:43:24.280 --> 0:43:28.440
<v Speaker 1>researchers say in that paper, but it might benefit us

0:43:28.440 --> 0:43:31.040
<v Speaker 1>to take a much closer look at what they found

0:43:31.320 --> 0:43:35.040
<v Speaker 1>and their conclusions. So I may revisit this topic in

0:43:35.080 --> 0:43:37.160
<v Speaker 1>the future, but for now, I think it's just good

0:43:37.160 --> 0:43:39.839
<v Speaker 1>to remember that AI does have the potential to do

0:43:39.880 --> 0:43:43.919
<v Speaker 1>great things. I mean, it can potentially augment our work

0:43:43.920 --> 0:43:48.080
<v Speaker 1>efforts and let us accomplish goals more quickly and efficiently

0:43:48.280 --> 0:43:51.960
<v Speaker 1>and accurately. But AI also has the potential to make

0:43:52.040 --> 0:43:55.440
<v Speaker 1>things miserable and churn out content that no one wants

0:43:55.480 --> 0:43:59.200
<v Speaker 1>to see other than other bots, creating a cynical

0:43:59.239 --> 0:44:02.440
<v Speaker 1>cycle that ultimately could make the Internet into a cluttered,

0:44:02.560 --> 0:44:06.120
<v Speaker 1>practically useless mess. So which way are we going to go?

0:44:06.440 --> 0:44:09.319
<v Speaker 1>I think my answer day to day depends on how

0:44:09.320 --> 0:44:11.959
<v Speaker 1>optimistic I feel, but at the very least, I think

0:44:12.040 --> 0:44:16.360
<v Speaker 1>knowing about the risks is important. That's it for today's episode.

0:44:16.600 --> 0:44:19.319
<v Speaker 1>I hope you are all well. I will try to

0:44:19.360 --> 0:44:21.600
<v Speaker 1>get away from AI topics. I know I've been covering

0:44:21.600 --> 0:44:23.640
<v Speaker 1>a lot of that recently, and it'd be nice to

0:44:23.719 --> 0:44:26.839
<v Speaker 1>kind of branch into other areas of tech, So I'm

0:44:26.840 --> 0:44:29.360
<v Speaker 1>going to try and do that. It's just AI stuff

0:44:29.360 --> 0:44:32.400
<v Speaker 1>just keeps on happening, y'all. But I will talk to

0:44:32.400 --> 0:44:42.720
<v Speaker 1>you again really soon. Tech Stuff is an iHeartRadio production.

0:44:43.040 --> 0:44:48.080
<v Speaker 1>For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts,

0:44:48.200 --> 0:44:53.880
<v Speaker 1>or wherever you listen to your favorite shows.