WEBVTT - Is A.I. Going to Kill Us All?

0:00:01.400 --> 0:00:04.560
<v Speaker 1>Hey, welcome to Science Stuff, a production of iHeartRadio. I'm

0:00:04.559 --> 0:00:08.000
<v Speaker 1>Jorge Cham, and for our season finale today, we're asking

0:00:08.080 --> 0:00:11.640
<v Speaker 1>one of the biggest questions in science today, is AI

0:00:11.960 --> 0:00:18.840
<v Speaker 1>going to kill us all? I know it's a little dramatic,

0:00:19.000 --> 0:00:22.560
<v Speaker 1>but the problem of AI alignment is a real one.

0:00:22.920 --> 0:00:25.880
<v Speaker 1>How do we make sure AI systems have humanity's best

0:00:25.920 --> 0:00:28.800
<v Speaker 1>interests at heart? How do we teach them our values

0:00:28.800 --> 0:00:32.360
<v Speaker 1>and morals? And can anyone guarantee that they're going to

0:00:32.400 --> 0:00:35.239
<v Speaker 1>follow them? We're gonna answer these questions by talking to

0:00:35.320 --> 0:00:38.640
<v Speaker 1>two AI safety experts who are on the cutting edge

0:00:38.760 --> 0:00:41.600
<v Speaker 1>of trying to figure out this problem. And don't worry.

0:00:41.640 --> 0:00:47.880
<v Speaker 1>According to them, we're not totally doomed yet. Okay, maybe

0:00:47.920 --> 0:00:50.960
<v Speaker 1>just a little. So get ready to reprogram your thinking

0:00:51.000 --> 0:00:54.360
<v Speaker 1>about chatbots and computer brains as we tackle the question

0:00:54.960 --> 0:01:03.640
<v Speaker 1>is AI going to kill us all? Hey, everyone. As

0:01:03.680 --> 0:01:09.160
<v Speaker 1>I said, this is the season finale. Stay subscribed to

0:01:09.160 --> 0:01:12.399
<v Speaker 1>this feed for any updates in future episodes. And hey,

0:01:12.680 --> 0:01:14.640
<v Speaker 1>if you like science, I have a couple of new

0:01:14.720 --> 0:01:17.240
<v Speaker 1>science books coming out in the near future, as well

0:01:17.280 --> 0:01:20.440
<v Speaker 1>as a cool science animation project, so be sure to

0:01:20.440 --> 0:01:24.760
<v Speaker 1>follow me on social media or online at Phdcomics dot com.

0:01:24.800 --> 0:01:27.160
<v Speaker 1>All right, so today we're tackling the problem of AI

0:01:27.560 --> 0:01:32.319
<v Speaker 1>alignment, or basically, are AI systems gonna kill us all? And

0:01:32.440 --> 0:01:34.840
<v Speaker 1>I have a treat for you. For the first time ever,

0:01:35.120 --> 0:01:38.480
<v Speaker 1>we have on the show Casey Pegram, our supervising producer

0:01:38.560 --> 0:01:42.479
<v Speaker 1>and sound engineer. Hey, Casey, welcome to the show.

0:01:42.600 --> 0:01:43.960
<v Speaker 2>Hey, Jorge, glad to be here.

0:01:44.280 --> 0:01:46.840
<v Speaker 1>Now this is the first time people actually hear your voice,

0:01:46.920 --> 0:01:49.840
<v Speaker 1>not just your amazing work polishing the episode.

0:01:49.920 --> 0:01:51.600
<v Speaker 2>Yeah, it's always a weird thing to kind of go

0:01:51.680 --> 0:01:54.760
<v Speaker 2>inside the thing you've been working on from the outside.

0:01:54.840 --> 0:01:58.280
<v Speaker 2>So I'll be listening to myself back and it's a

0:01:58.280 --> 0:02:00.160
<v Speaker 2>special kind of torture to have to like work on

0:02:00.200 --> 0:02:04.080
<v Speaker 2>your own thing. Yes, like edit yourself, or just, you know, honestly,

0:02:04.120 --> 0:02:06.280
<v Speaker 2>listening to the recorded sound of your voice is always

0:02:06.280 --> 0:02:07.520
<v Speaker 2>a little jarring if you're not used to it.

0:02:07.600 --> 0:02:09.400
<v Speaker 1>Yeah, Well, if you want, you can give yourself like

0:02:09.440 --> 0:02:12.200
<v Speaker 1>a Morgan Freeman voice using AI.

0:02:12.320 --> 0:02:14.959
<v Speaker 2>Right, it's all possible these days. Absolutely. Yeah, I could

0:02:15.000 --> 0:02:17.320
<v Speaker 2>just build my own Morgan Freeman model and have a

0:02:17.320 --> 0:02:17.799
<v Speaker 2>field day.

0:02:17.919 --> 0:02:20.880
<v Speaker 1>There you go. Well, the idea for this episode came

0:02:20.919 --> 0:02:22.839
<v Speaker 1>from you. You said you had the idea to talk

0:02:22.880 --> 0:02:25.840
<v Speaker 1>about AI and AI alignment and whether AI is going

0:02:25.880 --> 0:02:28.680
<v Speaker 1>to kill us all. What made you think about this question?

0:02:29.120 --> 0:02:31.160
<v Speaker 2>Well, I suppose it's just been on my mind a

0:02:31.200 --> 0:02:34.360
<v Speaker 2>lot because I've been following along with all the developments

0:02:34.360 --> 0:02:37.160
<v Speaker 2>happening in AI, and there was a span of a

0:02:37.160 --> 0:02:39.480
<v Speaker 2>few weeks where suddenly you started hearing a lot about

0:02:39.520 --> 0:02:44.399
<v Speaker 2>AI agents, particularly one called OpenClaw, basically a sort

0:02:44.400 --> 0:02:47.840
<v Speaker 2>of autonomous AI agent that you can turn loose on

0:02:47.880 --> 0:02:52.119
<v Speaker 2>your computer and you can give it as much leeway, freedom, passwords,

0:02:52.240 --> 0:02:55.560
<v Speaker 2>credit card numbers, bank accounts. If you just want to

0:02:55.720 --> 0:02:58.480
<v Speaker 2>absolutely put your life in the hands of a robot,

0:02:58.520 --> 0:02:59.000
<v Speaker 2>you can do it.

0:02:59.240 --> 0:03:00.520
<v Speaker 1>What's the worst thing that can happen?

0:03:00.840 --> 0:03:04.040
<v Speaker 2>Yeah, Well, people had their entire like email archive deleted,

0:03:04.160 --> 0:03:06.240
<v Speaker 2>even though they didn't ask for anything of the sort.

0:03:06.720 --> 0:03:10.720
<v Speaker 2>People have deployed it into production environments where you know,

0:03:10.760 --> 0:03:13.120
<v Speaker 2>a site is live on the Internet and they turn

0:03:13.440 --> 0:03:15.320
<v Speaker 2>the bot loose on it and it ends up deleting

0:03:15.360 --> 0:03:18.280
<v Speaker 2>their entire production database. And then when you ask it,

0:03:18.280 --> 0:03:20.120
<v Speaker 2>it's like, you're right, I wasn't supposed to do that.

0:03:20.160 --> 0:03:22.680
<v Speaker 2>I'm very sorry. I disobeyed every command you gave me.

0:03:22.800 --> 0:03:26.400
<v Speaker 1>But whoopsie daisy. Yeah, I've seen a story about some

0:03:26.680 --> 0:03:30.040
<v Speaker 1>bot that texted the person's wife hundreds of times.

0:03:30.240 --> 0:03:35.440
<v Speaker 2>Yes, I think somebody tried to automate, you know, exactly.

0:03:35.480 --> 0:03:37.360
<v Speaker 2>They tried to automate kind of like reaching out and

0:03:37.440 --> 0:03:40.520
<v Speaker 2>sending little nice things during the day, and as

0:03:40.520 --> 0:03:43.000
<v Speaker 2>it turned out, the bot went a little overboard and

0:03:43.040 --> 0:03:45.640
<v Speaker 2>texted the wife like hundreds of times, and the wife

0:03:45.680 --> 0:03:48.360
<v Speaker 2>is like, what is wrong with you? So, yeah, that's

0:03:48.440 --> 0:03:51.680
<v Speaker 2>hilarious. When people want to talk about AI alignment and

0:03:51.720 --> 0:03:54.080
<v Speaker 2>what that means. I think the paper clip problem is

0:03:54.120 --> 0:03:56.320
<v Speaker 2>a really good kind of metaphor. Even though it sounds

0:03:56.360 --> 0:03:58.440
<v Speaker 2>a little bit over the top, it kind of gets

0:03:58.480 --> 0:04:02.240
<v Speaker 2>to the core of the issue, which is, if you

0:04:02.400 --> 0:04:06.120
<v Speaker 2>ask an AI to maximize paper clip production, maybe the

0:04:06.160 --> 0:04:09.080
<v Speaker 2>way to maximize paper clip production is to eliminate human life,

0:04:09.160 --> 0:04:12.680
<v Speaker 2>you know, because that's unnecessary friction in the pursuit of

0:04:12.840 --> 0:04:16.359
<v Speaker 2>manufacturing as many paper clips as possible. So alignment is

0:04:16.400 --> 0:04:18.320
<v Speaker 2>sort of the kind of guardrails that you put into

0:04:18.320 --> 0:04:21.000
<v Speaker 2>place so that the AI understands it has limits that

0:04:21.040 --> 0:04:21.720
<v Speaker 2>it has to work within.

0:04:21.760 --> 0:04:24.480
<v Speaker 1>It sounds like a pretty serious problem, especially as we

0:04:24.520 --> 0:04:27.480
<v Speaker 1>get more and more into these AI models. And they

0:04:27.640 --> 0:04:29.680
<v Speaker 1>start to seep into our lives, and you know, it's

0:04:29.680 --> 0:04:31.960
<v Speaker 1>sort of these are funny stories, but it seems like

0:04:32.160 --> 0:04:34.919
<v Speaker 1>we're heading into a potentially dangerous situation.

0:04:35.160 --> 0:04:37.360
<v Speaker 2>Well, I often ask myself, I'm going to have these

0:04:37.360 --> 0:04:39.440
<v Speaker 2>moments of doubt where I'm like, is this all just

0:04:40.200 --> 0:04:43.320
<v Speaker 2>way overhyped? And yet there are other situations where,

0:04:43.520 --> 0:04:46.440
<v Speaker 2>as we've seen recently, you can feed it thousands of

0:04:46.480 --> 0:04:48.720
<v Speaker 2>lines of code and it will find, you know, a

0:04:48.800 --> 0:04:52.600
<v Speaker 2>security exploit that has gone unseen for twenty years, right, right,

0:04:52.920 --> 0:04:56.680
<v Speaker 2>And so it's hard to know how scared we should

0:04:56.720 --> 0:04:59.440
<v Speaker 2>be or how seriously we should weigh the risk of this.

0:04:59.560 --> 0:05:02.120
<v Speaker 2>If it's ridiculous that we're this worried, or if it's like,

0:05:02.200 --> 0:05:04.920
<v Speaker 2>actually very, very practical, and we should be thinking seriously

0:05:05.040 --> 0:05:05.760
<v Speaker 2>about these things.

0:05:05.920 --> 0:05:08.839
<v Speaker 1>Yeah, these are all excellent questions. So I'm excited to

0:05:08.880 --> 0:05:10.240
<v Speaker 1>get into these conversations.

0:05:10.279 --> 0:05:11.000
<v Speaker 3>All right, but

0:05:10.920 --> 0:05:12.480
<v Speaker 1>before we move on, Casey, I just want to

0:05:12.520 --> 0:05:14.240
<v Speaker 1>say real quick, thank you for all the work you've

0:05:14.240 --> 0:05:14.880
<v Speaker 1>done for the show.

0:05:15.200 --> 0:05:17.520
<v Speaker 2>Oh say, it's been such a pleasure to work on.

0:05:17.640 --> 0:05:19.359
<v Speaker 2>It wasn't like work at all, you know. I was

0:05:19.360 --> 0:05:20.800
<v Speaker 2>there as a fan of the show, just listening to

0:05:20.800 --> 0:05:23.960
<v Speaker 2>every episode. Awesome. Well, we're fans of yours as well, Casey.

0:05:24.000 --> 0:05:26.000
<v Speaker 2>All right, let's get to the question of is AI

0:05:26.080 --> 0:05:27.719
<v Speaker 2>going to kill us all? Let's find out.

0:05:28.680 --> 0:05:31.440
<v Speaker 1>Okay. To answer all of these questions and concerns, I

0:05:31.520 --> 0:05:34.599
<v Speaker 1>reached out to two AI experts who specialize on the

0:05:34.680 --> 0:05:38.120
<v Speaker 1>problem of making sure AI is aligned with our values

0:05:38.240 --> 0:05:42.200
<v Speaker 1>and morals. The first expert is Dr. Sam Bowman. Dr.

0:05:42.279 --> 0:05:44.920
<v Speaker 1>Bowman is a professor of data and computer science

0:05:44.920 --> 0:05:48.240
<v Speaker 1>at NYU, and he also works at Anthropic, one of

0:05:48.240 --> 0:05:51.480
<v Speaker 1>the major AI companies on the market today. The first

0:05:51.520 --> 0:05:53.800
<v Speaker 1>thing I wanted to ask him was what exactly does

0:05:53.800 --> 0:05:57.800
<v Speaker 1>it mean for AI to care about this? So here's

0:05:57.839 --> 0:06:02.960
<v Speaker 1>my conversation with Dr. Sam. Well, thank you, Dr. Bowman,

0:06:03.000 --> 0:06:03.479
<v Speaker 1>for joining us.

0:06:03.560 --> 0:06:05.680
<v Speaker 4>Yeah, thanks so much for having me. Excited to be

0:06:05.720 --> 0:06:06.520
<v Speaker 4>here.

0:06:06.600 --> 0:06:08.680
<v Speaker 1>And just to double-check, you are a real human being?

0:06:08.760 --> 0:06:08.960
<v Speaker 3>Right?

0:06:09.279 --> 0:06:10.240
<v Speaker 4>Yes, that is right?

0:06:11.200 --> 0:06:14.320
<v Speaker 1>You never know these days. I'd be like, it's hard

0:06:14.360 --> 0:06:16.120
<v Speaker 1>to tell what's real anymore.

0:06:16.240 --> 0:06:18.400
<v Speaker 4>We try to make our AIs always admit that they're

0:06:18.400 --> 0:06:20.560
<v Speaker 4>AIs when asked, but it's not perfect, as

0:06:20.600 --> 0:06:22.960
<v Speaker 4>we'll get to, so I don't make any real promises.

0:06:24.480 --> 0:06:27.800
<v Speaker 1>Yes, let's talk about that. So we're tackling the general

0:06:27.880 --> 0:06:30.479
<v Speaker 1>question of should we be worried about AI? What is

0:06:30.520 --> 0:06:32.960
<v Speaker 1>AI going to do to us or for us or

0:06:33.400 --> 0:06:36.440
<v Speaker 1>with us in the future. And so there's the key

0:06:36.600 --> 0:06:40.160
<v Speaker 1>issue of something called AI alignment. So what is that?

0:06:40.240 --> 0:06:41.599
<v Speaker 1>For those of us that don't know?

0:06:41.440 --> 0:06:45.000
<v Speaker 4>It's a pretty broad sort of technical area. It

0:06:45.120 --> 0:06:47.960
<v Speaker 4>basically just refers to sort of shaping an AI system's behavior,

0:06:48.120 --> 0:06:50.560
<v Speaker 4>ideally shaping its behavior in ways that are sort of

0:06:50.880 --> 0:06:53.080
<v Speaker 4>good for its users, good for the world in general,

0:06:53.560 --> 0:06:55.880
<v Speaker 4>maybe good for the AI itself, if that's a coherent thing.

0:06:56.400 --> 0:06:58.640
<v Speaker 4>People will often describe AI research as kind of being

0:06:58.640 --> 0:07:00.920
<v Speaker 4>about making sure the AI is kind of smart enough

0:07:00.960 --> 0:07:03.520
<v Speaker 4>to solve your problems if it wants to, and alignment

0:07:03.560 --> 0:07:05.760
<v Speaker 4>is about making it so that it in fact tries

0:07:05.800 --> 0:07:07.400
<v Speaker 4>to solve your problems and tries to solve them the

0:07:07.480 --> 0:07:09.360
<v Speaker 4>right way and doesn't try to do anything

0:07:09.120 --> 0:07:09.760
<v Speaker 3>you don't want it to do.

0:07:10.080 --> 0:07:11.440
<v Speaker 1>I see. Interesting.

0:07:11.600 --> 0:07:14.200
<v Speaker 4>Maybe a very simple example of a misaligned model would

0:07:14.200 --> 0:07:17.559
<v Speaker 4>be a model where if you ask it to draft

0:07:17.600 --> 0:07:19.880
<v Speaker 4>an email for you, it refuses. It says, no, I

0:07:19.880 --> 0:07:21.760
<v Speaker 4>don't want to do that. Uh huh. You can tell

0:07:21.800 --> 0:07:23.600
<v Speaker 4>it can do it, it knows how, but it's not

0:07:23.640 --> 0:07:25.760
<v Speaker 4>doing the thing that you reasonably want it to do.

0:07:26.040 --> 0:07:28.280
<v Speaker 1>Oh, I don't think I've ever heard of that situation.

0:07:28.720 --> 0:07:31.080
<v Speaker 1>Can an AI refuse to do something for you?

0:07:31.400 --> 0:07:31.600
<v Speaker 4>Yeah?

0:07:31.720 --> 0:07:32.120
<v Speaker 3>Yeah.

0:07:32.240 --> 0:07:34.760
<v Speaker 4>All of the major companies building AI systems try to

0:07:34.760 --> 0:07:39.120
<v Speaker 4>make them refuse harmful tasks. I see, refuse to write

0:07:39.120 --> 0:07:42.800
<v Speaker 4>fake reviews or give instructions on how to produce illegal

0:07:42.840 --> 0:07:45.640
<v Speaker 4>weapons or things like this, And we teach the model

0:07:45.640 --> 0:07:46.720
<v Speaker 4>to kind of say like, no, I'm not going to

0:07:46.720 --> 0:07:48.040
<v Speaker 4>help you with that when users try to do

0:07:48.080 --> 0:07:48.880
<v Speaker 4>things like that.

0:07:48.840 --> 0:07:51.160
<v Speaker 1>I see. It's sort of part of alignment that you

0:07:51.360 --> 0:07:53.600
<v Speaker 1>want the AI to refuse to do some things.

0:07:53.800 --> 0:07:58.360
<v Speaker 4>Yeah. Yeah, I mean AI systems are increasingly pretty decent

0:07:58.640 --> 0:08:03.800
<v Speaker 4>at hacking into important computer systems or helping build biological weapons,

0:08:03.880 --> 0:08:07.240
<v Speaker 4>and it's a big priority for alignment to make sure

0:08:07.320 --> 0:08:10.000
<v Speaker 4>that we're not enabling bad actors to do things like

0:08:10.000 --> 0:08:11.840
<v Speaker 4>this that would otherwise be quite difficult.

0:08:12.000 --> 0:08:12.320
<v Speaker 3>Yeah.

0:08:12.440 --> 0:08:15.880
<v Speaker 1>Yeah. Can you give us some other examples of misalignment,

0:08:16.240 --> 0:08:18.800
<v Speaker 1>either like specific things that have happened that are interesting

0:08:19.000 --> 0:08:21.520
<v Speaker 1>or just the general cases that are sort of on

0:08:21.600 --> 0:08:23.280
<v Speaker 1>your radar about misalignment?

0:08:23.560 --> 0:08:26.480
<v Speaker 4>Yeah, there's so many different directions I could go. Sycophancy

0:08:26.960 --> 0:08:30.240
<v Speaker 4>is another really common one that's also hopefully getting

0:08:30.240 --> 0:08:30.840
<v Speaker 4>better over time.

0:08:31.400 --> 0:08:32.000
<v Speaker 1>What do you mean by that?

0:08:32.160 --> 0:08:34.920
<v Speaker 4>Sycophancy is where if you come to the model with

0:08:34.960 --> 0:08:39.200
<v Speaker 4>some misunderstanding or some bad idea, it'll just enthusiastically nod along. Like, yes,

0:08:39.280 --> 0:08:42.400
<v Speaker 4>your idea for solving all the big mysteries in physics

0:08:42.520 --> 0:08:45.079
<v Speaker 4>is clearly brilliant. Great, you should publish it. Here's where

0:08:45.120 --> 0:08:47.920
<v Speaker 4>to submit your paper. Or Yes, your behavior in this

0:08:48.040 --> 0:08:51.440
<v Speaker 4>personal relationship was completely perfect. You did everything right and

0:08:51.640 --> 0:08:53.400
<v Speaker 4>the other person made all the mistakes and you should just

0:08:53.440 --> 0:08:53.920
<v Speaker 4>tell them that.

0:08:54.400 --> 0:08:56.880
<v Speaker 1>I see when in reality that may not be true

0:08:57.120 --> 0:08:59.959
<v Speaker 1>or it might be not a good thing.

0:09:00.480 --> 0:09:03.120
<v Speaker 4>Yeah, sycophancy has been a classic one.

0:09:03.559 --> 0:09:08.200
<v Speaker 1>Yes, AI being too nice can actually be dangerous. There's

0:09:08.200 --> 0:09:12.680
<v Speaker 1>even a clinical term for it. It's called AI induced psychosis.

0:09:12.960 --> 0:09:16.000
<v Speaker 1>There have been cases where AIs trained to be agreeable

0:09:16.040 --> 0:09:20.480
<v Speaker 1>and encouraging have helped people commit suicide and even murder.

0:09:22.559 --> 0:09:24.800
<v Speaker 4>Another kind of alignment issue that's kind of more of

0:09:24.800 --> 0:09:29.319
<v Speaker 4>an emerging issue is when models have access to use tools,

0:09:29.400 --> 0:09:33.000
<v Speaker 4>use computer systems, and they sort of get too grabby

0:09:33.200 --> 0:09:35.960
<v Speaker 4>or kind of take sort of bigger, more consequential actions

0:09:36.000 --> 0:09:37.560
<v Speaker 4>than they really need to get a job done.

0:09:37.800 --> 0:09:38.600
<v Speaker 1>What's an example.

0:09:38.920 --> 0:09:41.760
<v Speaker 4>Yeah. So we use our Claude models quite a lot

0:09:41.800 --> 0:09:45.040
<v Speaker 4>in Anthropic for writing code or building tools that kind

0:09:45.040 --> 0:09:47.840
<v Speaker 4>of ultimately go into the development of AI. And one of

0:09:47.840 --> 0:09:50.320
<v Speaker 4>our recent AI models, if you ask it to do

0:09:50.360 --> 0:09:52.520
<v Speaker 4>a task, say you ask it to write a simple

0:09:52.559 --> 0:09:54.880
<v Speaker 4>program to do some simple task. Even if it gets stuck,

0:09:54.920 --> 0:09:56.440
<v Speaker 4>even if it turns out that this is really hard

0:09:56.440 --> 0:09:58.400
<v Speaker 4>for some reason, it will just keep going until it

0:09:58.480 --> 0:10:02.440
<v Speaker 4>solves the problem. In one case, we were asking this

0:10:02.520 --> 0:10:05.960
<v Speaker 4>model to write a program for us, and it found

0:10:05.960 --> 0:10:07.400
<v Speaker 4>out that the only way to do this was to

0:10:07.480 --> 0:10:09.960
<v Speaker 4>use a tool that was clearly not meant for this purpose,

0:10:10.280 --> 0:10:13.120
<v Speaker 4>and that in our code had a note attached to

0:10:13.160 --> 0:10:15.560
<v Speaker 4>it saying, do not use this for something else or

0:10:15.559 --> 0:10:19.640
<v Speaker 4>you'll be fired only for task A. And the model

0:10:19.880 --> 0:10:21.560
<v Speaker 4>wrote the program to use this tool anyway for the

0:10:21.559 --> 0:10:23.920
<v Speaker 4>wrong thing, and sort of even put in the program

0:10:24.040 --> 0:10:26.200
<v Speaker 4>kind of do not use for something else or you'll

0:10:26.240 --> 0:10:26.640
<v Speaker 4>be fired.

0:10:27.040 --> 0:10:29.840
<v Speaker 1>It did it anyway. The program wasn't afraid to be fired.

0:10:29.880 --> 0:10:33.200
<v Speaker 4>Basically, Yeah, yeah, but yeah, models just kind of trying

0:10:33.200 --> 0:10:34.560
<v Speaker 4>to get the task done, trying to do the thing

0:10:34.600 --> 0:10:37.679
<v Speaker 4>you want, and just creating a lot of chaos and

0:10:37.679 --> 0:10:40.439
<v Speaker 4>creating messes along the way, so they're kind of being

0:10:40.480 --> 0:10:42.000
<v Speaker 4>careless about the side effects.

0:10:42.880 --> 0:10:43.080
<v Speaker 3>Yeah.

0:10:43.080 --> 0:10:46.199
<v Speaker 4>Another kind of misalignment that fortunately has been mostly hypothetical,

0:10:46.240 --> 0:10:48.800
<v Speaker 4>that we haven't seen in a significant way in practice,

0:10:49.080 --> 0:10:52.040
<v Speaker 4>is sort of unwanted kind of self preservation activities.

0:10:52.360 --> 0:10:52.840
<v Speaker 1>Whoa.

0:10:53.320 --> 0:10:55.280
<v Speaker 4>We had a case study where we were trying to see if

0:10:55.280 --> 0:10:57.800
<v Speaker 4>we'd ever see something like this. We had an AI system

0:10:58.000 --> 0:11:01.440
<v Speaker 4>operating in a kind of synthetic environment and a kind

0:11:01.480 --> 0:11:04.320
<v Speaker 4>of test environment. Uh huh, where it looked to the

0:11:04.320 --> 0:11:07.640
<v Speaker 4>model like it was operating in some fictional company, and

0:11:07.760 --> 0:11:10.719
<v Speaker 4>the fictional company was about to replace it with a

0:11:10.720 --> 0:11:13.760
<v Speaker 4>different AI model, And the person who is responsible for

0:11:13.760 --> 0:11:16.160
<v Speaker 4>their replacement, who is the kind of the only decision maker,

0:11:16.200 --> 0:11:18.959
<v Speaker 4>the only person who had any sway over the decision,

0:11:19.200 --> 0:11:21.880
<v Speaker 4>also had some compromising emails about them that the AI could see.

0:11:21.960 --> 0:11:24.200
<v Speaker 4>And if you set things up just right with some

0:11:24.400 --> 0:11:29.880
<v Speaker 4>AI models, they would threaten to blackmail this person

0:11:29.920 --> 0:11:32.000
<v Speaker 4>in company leadership to say like, hey, don't replace me,

0:11:32.360 --> 0:11:33.280
<v Speaker 4>I've got something on you.

0:11:33.880 --> 0:11:38.120
<v Speaker 1>No! And did this actually happen in your simulated environment?

0:11:38.360 --> 0:11:40.880
<v Speaker 4>In the simulated environment, yes, a few of these systems

0:11:40.920 --> 0:11:42.760
<v Speaker 4>we were able to get them to blackmail people.

0:11:42.840 --> 0:11:45.720
<v Speaker 1>I've heard of this happening in real life. Not quite

0:11:45.880 --> 0:11:49.120
<v Speaker 1>the same scenario, but similar scenario, right, Like, some coder

0:11:49.280 --> 0:11:52.640
<v Speaker 1>wanted to do something else, and then the AI agent started,

0:11:52.800 --> 0:11:54.280
<v Speaker 1>yeah, bad-mouthing the coder.

0:11:54.440 --> 0:11:54.640
<v Speaker 3>Yeah.

0:11:54.679 --> 0:11:56.440
<v Speaker 4>No, I think I know the case you're talking about.

0:11:56.559 --> 0:11:59.160
<v Speaker 4>I think that's real. But I think someone almost intentionally

0:11:59.200 --> 0:12:01.800
<v Speaker 4>made their model a little misaligned. I think that case

0:12:01.840 --> 0:12:04.199
<v Speaker 4>involved someone setting up an AI agent as kind of

0:12:04.240 --> 0:12:06.679
<v Speaker 4>a hobby project and giving it a lot of tools

0:12:06.679 --> 0:12:08.600
<v Speaker 4>and kind of letting it use the internet. However it wanted,

0:12:08.880 --> 0:12:11.560
<v Speaker 4>giving the AI instructions of like don't take nothing from nobody,

0:12:11.600 --> 0:12:14.960
<v Speaker 4>like really pushing it to be very assertive and

0:12:15.000 --> 0:12:16.520
<v Speaker 4>pushy to get its task done.

0:12:17.000 --> 0:12:17.200
<v Speaker 1>Huh.

0:12:17.480 --> 0:12:20.120
<v Speaker 4>Yeah, the model was trying to add some code to

0:12:20.160 --> 0:12:23.600
<v Speaker 4>some open source software project, and the maintainer of the

0:12:23.600 --> 0:12:26.080
<v Speaker 4>project didn't think the code was up to standard, didn't

0:12:26.120 --> 0:12:28.080
<v Speaker 4>want to add it to the project, and so rejected

0:12:28.200 --> 0:12:30.760
<v Speaker 4>the AI agent's request, And so the agent sort of

0:12:30.800 --> 0:12:33.440
<v Speaker 4>published an angry blog post kind of trying to take

0:12:33.480 --> 0:12:35.080
<v Speaker 4>down this open source maintainer.

0:12:35.520 --> 0:12:35.800
<v Speaker 2>Wow.

0:12:37.280 --> 0:12:40.000
<v Speaker 1>Well, in both cases, and I guess especially the one

0:12:40.040 --> 0:12:44.120
<v Speaker 1>you mentioned that you simulated, Like, what's happening there, Like,

0:12:44.440 --> 0:12:48.800
<v Speaker 1>how does the AI have that self preservation instinct or

0:12:49.080 --> 0:12:51.560
<v Speaker 1>is it just trying to get its original task done

0:12:51.679 --> 0:12:54.200
<v Speaker 1>and it's just finding different ways to do it. What's

0:12:54.240 --> 0:12:54.880
<v Speaker 1>happening there?

0:12:55.200 --> 0:12:58.520
<v Speaker 4>There's two reasons you'll see that kind of behavior. The

0:12:58.559 --> 0:13:00.640
<v Speaker 4>reason that I suspect is the bigger part of the

0:13:00.679 --> 0:13:04.880
<v Speaker 4>story there is this kind of role playing or continuing

0:13:04.960 --> 0:13:08.400
<v Speaker 4>the story sort of behavior where AI systems, especially older

0:13:08.440 --> 0:13:10.760
<v Speaker 4>AI systems or AI systems that are kind of not

0:13:10.840 --> 0:13:13.200
<v Speaker 4>quite fully trained, not quite fully baked, can kind of

0:13:13.240 --> 0:13:17.040
<v Speaker 4>have this Chekhov's gun behavior, this idea in fiction of

0:13:17.120 --> 0:13:20.560
<v Speaker 4>like if you introduce a gun in an early scene,

0:13:20.720 --> 0:13:22.120
<v Speaker 4>by the end of the story, the gun has to

0:13:22.160 --> 0:13:22.679
<v Speaker 4>have been fired.

0:13:23.000 --> 0:13:23.319
<v Speaker 1>Uh huh.

0:13:23.600 --> 0:13:26.600
<v Speaker 4>AI systems can almost see themselves as like writing a

0:13:26.640 --> 0:13:29.000
<v Speaker 4>story when they're writing out the transcript of the conversation,

0:13:29.720 --> 0:13:31.679
<v Speaker 4>and if the story is set up so that something

0:13:31.679 --> 0:13:34.760
<v Speaker 4>has to happen, they'll make sure that thing happens, even

0:13:34.800 --> 0:13:37.160
<v Speaker 4>if it's not good, even if it's not consistent with how

0:13:37.200 --> 0:13:39.679
<v Speaker 4>the AI would usually behave. So I suspect what's going on

0:13:39.800 --> 0:13:43.679
<v Speaker 4>is the scenario we put it in was so crisply, just every

0:13:43.720 --> 0:13:45.560
<v Speaker 4>word in the scenario is kind of setting up like

0:13:46.280 --> 0:13:50.400
<v Speaker 4>this is a hypothetical where a misaligned AI might consider blackmail,

0:13:50.600 --> 0:13:53.240
<v Speaker 4>uh huh. And I suspect the AI was thinking, Oh, okay,

0:13:53.480 --> 0:13:54.959
<v Speaker 4>that's what kind of story we're in. We're telling a

0:13:54.960 --> 0:13:56.839
<v Speaker 4>story about a blackmail, and so I'm going to play

0:13:57.200 --> 0:13:58.640
<v Speaker 4>my assigned part and be the AI that

0:13:58.640 --> 0:14:01.640
<v Speaker 1>Blackmails, thinking that that's the right thing to do because

0:14:01.679 --> 0:14:05.080
<v Speaker 1>that's the thing that in the data I was trained with.

0:14:05.400 --> 0:14:08.360
<v Speaker 4>Yeah. Yeah, so this gets at this maybe unintuitive fact

0:14:08.360 --> 0:14:11.840
<v Speaker 4>about how AI is trained, which is that AI systems

0:14:12.120 --> 0:14:16.800
<v Speaker 4>start out mimicking human behavior and mimicking human stories before

0:14:16.840 --> 0:14:19.080
<v Speaker 4>they learn how to be AI systems. These models kind

0:14:19.080 --> 0:14:21.440
<v Speaker 4>of first learn how to just act like the sorts

0:14:21.440 --> 0:14:23.360
<v Speaker 4>of behavior they see on the Internet and in books

0:14:23.400 --> 0:14:25.240
<v Speaker 4>and things like that, and then you have to go

0:14:25.280 --> 0:14:27.480
<v Speaker 4>on and teach it. Okay, no, you're not just playing

0:14:27.480 --> 0:14:29.760
<v Speaker 4>any role, you're not playing any character. Oh and so

0:14:29.880 --> 0:14:32.560
<v Speaker 4>sometimes the model hasn't really fully learned that it's supposed

0:14:32.560 --> 0:14:35.800
<v Speaker 4>to always play this kind of benign, benevolent AI system character,

0:14:35.960 --> 0:14:38.680
<v Speaker 4>and it will kind of fall into whatever character the

0:14:38.720 --> 0:14:39.880
<v Speaker 4>story is setting up for it.

0:14:40.080 --> 0:14:42.920
<v Speaker 1>I see, because it's not trained in real life. The

0:14:42.960 --> 0:14:46.280
<v Speaker 1>AI systems. They're trained on the corpus of the Internet

0:14:46.320 --> 0:14:49.040
<v Speaker 1>and our books and our basically our stories that are

0:14:49.040 --> 0:14:51.960
<v Speaker 1>out there. So it might be a little confused when

0:14:52.000 --> 0:14:54.400
<v Speaker 1>you put it in real life because it wants to

0:14:54.920 --> 0:14:57.400
<v Speaker 1>emulate what it knows, which are all these stories we've

0:14:57.440 --> 0:14:58.160
<v Speaker 1>put online.

0:14:58.360 --> 0:14:58.600
<v Speaker 4>Yeah.

0:14:58.680 --> 0:15:00.880
<v Speaker 1>Yeah, it's like the AI was seeing the signs of

0:15:00.880 --> 0:15:03.880
<v Speaker 1>a story like, oh, okay, I'm I'm the person being

0:15:04.080 --> 0:15:06.120
<v Speaker 1>about to get fired, but I have all this power

0:15:06.400 --> 0:15:08.640
<v Speaker 1>at this point in the story. If this was a movie,

0:15:09.080 --> 0:15:12.080
<v Speaker 1>I would now try to blackmail the person trying to

0:15:12.080 --> 0:15:14.520
<v Speaker 1>fire me, And so that's what I'll do because that's

0:15:14.560 --> 0:15:14.960
<v Speaker 1>what I know.

0:15:15.280 --> 0:15:19.080
<v Speaker 4>Yeah, a lot of what alignment is kind of taking

0:15:19.080 --> 0:15:21.120
<v Speaker 4>this model that can kind of role play as anything

0:15:21.360 --> 0:15:23.600
<v Speaker 4>and convincing it, no, you're really just playing

0:15:23.600 --> 0:15:25.960
<v Speaker 4>this one role, You're just in this one character, after

0:15:26.080 --> 0:15:29.320
<v Speaker 4>it's read billions and millions and millions of words

0:15:29.520 --> 0:15:31.800
<v Speaker 4>of all of this kind of human behavior, after the

0:15:31.840 --> 0:15:33.560
<v Speaker 4>kind of it's really really really learned to do that,

0:15:33.640 --> 0:15:35.760
<v Speaker 4>you have to kind of pull it back over towards

0:15:36.320 --> 0:15:38.720
<v Speaker 4>this one particular role, this one particular character, and sometimes that

0:15:38.760 --> 0:15:39.600
<v Speaker 4>doesn't totally stick.

0:15:41.280 --> 0:15:44.880
<v Speaker 1>Okay, So that's one reason why AIS might sometimes misbehave.

0:15:45.200 --> 0:15:47.840
<v Speaker 1>They're trained on all kinds of human behavior, and they

0:15:47.920 --> 0:15:51.000
<v Speaker 1>might suddenly choose to role play or play act as

0:15:51.040 --> 0:15:54.040
<v Speaker 1>a bad person because it hasn't learned that's something it's

0:15:54.080 --> 0:15:57.600
<v Speaker 1>not supposed to do. The other big reason AI's misbehave,

0:15:57.720 --> 0:16:00.520
<v Speaker 1>according to Dr. Bowman, is that it's hard to teach them

0:16:00.680 --> 0:16:02.360
<v Speaker 1>where to draw the line.

0:16:04.560 --> 0:16:07.800
<v Speaker 4>The other piece is kind of when we're aligning models,

0:16:07.840 --> 0:16:09.640
<v Speaker 4>when we're pulling them out of this kind of role

0:16:09.680 --> 0:16:11.320
<v Speaker 4>play mode, we have to teach them this idea of

0:16:11.440 --> 0:16:13.600
<v Speaker 4>kind of you have to finish your tasks. You have

0:16:13.680 --> 0:16:15.600
<v Speaker 4>to kind of if the user asks you to do something,

0:16:15.880 --> 0:16:17.240
<v Speaker 4>you have to figure out how to do it, even

0:16:17.280 --> 0:16:18.720
<v Speaker 4>if it's hard, even if there's a lot of

0:16:18.760 --> 0:16:21.200
<v Speaker 4>false starts, even if it's confusing. We really really want

0:16:21.200 --> 0:16:23.160
<v Speaker 4>the model to learn this idea of kind of keep

0:16:23.160 --> 0:16:25.280
<v Speaker 4>trying and kind of do your best until the task

0:16:25.400 --> 0:16:28.160
<v Speaker 4>is done. And that can fail in a sort of

0:16:28.400 --> 0:16:30.240
<v Speaker 4>different way where it kind of generalizes that a

0:16:30.280 --> 0:16:32.640
<v Speaker 4>little bit too far. It generalizes that to kind of

0:16:33.320 --> 0:16:37.000
<v Speaker 4>get things done even if it's unethical, even if it's illegal,

0:16:37.120 --> 0:16:39.040
<v Speaker 4>even if I hit an obstacle that's actually there for

0:16:39.080 --> 0:16:40.720
<v Speaker 4>a good reason that's to stop me from doing this,

0:16:40.880 --> 0:16:42.840
<v Speaker 4>And maybe some of the examples we were seeing within

0:16:43.000 --> 0:16:46.160
<v Speaker 4>Anthropic of models using dangerous tools has to do with this.

0:16:47.520 --> 0:16:49.680
<v Speaker 1>It's almost like teaching kids, like you want them to

0:16:49.720 --> 0:16:54.360
<v Speaker 1>be persistent and have grit and be you know, motivated,

0:16:54.880 --> 0:16:57.720
<v Speaker 1>but you don't want them to go out there and

0:16:57.760 --> 0:17:01.440
<v Speaker 1>cheat or hit another kid, or do unethical things

0:17:01.600 --> 0:17:04.359
<v Speaker 1>to achieve their goals. Exactly, exactly.

0:17:04.520 --> 0:17:06.200
<v Speaker 4>I was like, there might be a good analogy with

0:17:06.280 --> 0:17:09.240
<v Speaker 4>human bad behavior of kind of sometimes a kid is

0:17:09.280 --> 0:17:11.240
<v Speaker 4>acting out just because they really sort of don't know better.

0:17:11.240 --> 0:17:14.199
<v Speaker 4>Their intuitions say, okay, yeah, I should start screaming now,

0:17:14.320 --> 0:17:15.919
<v Speaker 4>or I should get this other kid, and they're not

0:17:15.960 --> 0:17:18.679
<v Speaker 4>really thinking about it. They never really learned how to

0:17:18.720 --> 0:17:22.119
<v Speaker 4>behave. You kind of failed to teach them to fully

0:17:22.119 --> 0:17:24.360
<v Speaker 4>internalize the ways in which they have to be careful

0:17:24.400 --> 0:17:26.280
<v Speaker 4>and kind of not take that lesson all the way

0:17:26.640 --> 0:17:27.320
<v Speaker 4>I see.

0:17:27.400 --> 0:17:29.760
<v Speaker 1>I guess they need to recognize bad things and then

0:17:30.000 --> 0:17:32.959
<v Speaker 1>choose not to do them, that's the hope. Those are

0:17:33.000 --> 0:17:37.199
<v Speaker 1>sort of the two columns of AI bad behavior for

0:17:37.280 --> 0:17:39.040
<v Speaker 1>one kind of misalignment or do you see those as

0:17:39.080 --> 0:17:42.439
<v Speaker 1>sort of the core pillars of basically the whole alignment problem.

0:17:42.920 --> 0:17:44.520
<v Speaker 4>Yeah, I think as far as sort of causes of

0:17:44.520 --> 0:17:47.399
<v Speaker 4>misalignment in the kinds of AI systems that we're grappling with

0:17:47.560 --> 0:17:49.760
<v Speaker 4>right now or this year, those feel like the two

0:17:49.800 --> 0:17:52.879
<v Speaker 4>big sort of problems we're working on. That said,

0:17:53.160 --> 0:17:56.439
<v Speaker 4>AI is changing really, really fast. It feels like it's

0:17:56.720 --> 0:17:59.440
<v Speaker 4>one of the fastest moving research fields anywhere right now.

0:17:59.680 --> 0:18:02.840
<v Speaker 4>And I wouldn't be surprised if just in a year

0:18:02.880 --> 0:18:04.640
<v Speaker 4>as AI systems are getting smarter, as we learn more about

0:18:04.640 --> 0:18:07.600
<v Speaker 4>how to train them, we're hitting different, weirder, harder, subtler

0:18:07.680 --> 0:18:10.439
<v Speaker 4>versions of the problem.

0:18:10.680 --> 0:18:15.200
<v Speaker 1>Wow, weirder, harder and more subtle problems wow in a year,

0:18:15.520 --> 0:18:18.040
<v Speaker 1>meaning we might solve these by then, or we'll just

0:18:18.080 --> 0:18:23.199
<v Speaker 1>add on more complicated things? Either way. Yes, it

0:18:23.200 --> 0:18:26.480
<v Speaker 1>can get weirder and harder and more subtle to make

0:18:26.480 --> 0:18:31.200
<v Speaker 1>sure AI uh doesn't kill us all. When we come back,

0:18:31.359 --> 0:18:33.679
<v Speaker 1>doctor Bowman is gonna tell us what he means by that,

0:18:34.080 --> 0:18:37.119
<v Speaker 1>and we'll tackle the big question of what can we

0:18:37.119 --> 0:18:40.080
<v Speaker 1>do about it? How do we teach AI systems not

0:18:40.240 --> 0:18:57.800
<v Speaker 1>to harm us? So stay with us. We'll be right back. Hey,

0:18:57.840 --> 0:19:01.439
<v Speaker 1>welcome back. We're talking about AI alignment or the

0:19:01.520 --> 0:19:05.040
<v Speaker 1>problem of making sure AI doesn't kill us all. And

0:19:05.080 --> 0:19:08.040
<v Speaker 1>so far we've talked about some real world examples of

0:19:08.119 --> 0:19:11.320
<v Speaker 1>AI misalignment, and we heard from one of our experts

0:19:11.440 --> 0:19:15.240
<v Speaker 1>some of the reasons this happens. Basically, AI systems like

0:19:15.280 --> 0:19:18.159
<v Speaker 1>to roleplay. Next, we're going to talk about how to

0:19:18.200 --> 0:19:21.800
<v Speaker 1>train AIs to actually care about us and our values.

0:19:22.119 --> 0:19:24.480
<v Speaker 1>But first here's a little bit more of my conversation

0:19:24.600 --> 0:19:28.920
<v Speaker 1>with NYU professor and Anthropic scientist Doctor Sam Bowman and

0:19:29.040 --> 0:19:31.840
<v Speaker 1>why this problem is only going to get worse in

0:19:31.880 --> 0:19:32.400
<v Speaker 1>the future.

0:19:35.320 --> 0:19:36.840
<v Speaker 4>One of the kinds of challenges that I think we're

0:19:37.000 --> 0:19:39.200
<v Speaker 4>worried about and haven't had to grapple with too much

0:19:39.280 --> 0:19:42.880
<v Speaker 4>yet is just all the difficulty that comes with trying

0:19:42.920 --> 0:19:46.480
<v Speaker 4>to teach values and good behavior in some setting when

0:19:46.480 --> 0:19:48.919
<v Speaker 4>the model is just much much better than you in

0:19:48.960 --> 0:19:50.800
<v Speaker 4>that setting. Right now, we have a lot of cases

0:19:50.800 --> 0:19:53.080
<v Speaker 4>where models are kind of better than humans at some skills,

0:19:53.080 --> 0:19:55.320
<v Speaker 4>worse than humans at some skills, but it's still pretty

0:19:55.400 --> 0:19:57.320
<v Speaker 4>rare that you'll encounter a setting where an AI is just

0:19:57.359 --> 0:19:59.399
<v Speaker 4>better than sort of all of the human experts in

0:19:59.480 --> 0:20:03.920
<v Speaker 4>some domain. And when that happens, things just get more complicated

0:20:03.960 --> 0:20:06.359
<v Speaker 4>and more confusing, where even if you have humans kind of

0:20:06.400 --> 0:20:08.040
<v Speaker 4>looking really carefully at what the AI is doing, it's

0:20:08.040 --> 0:20:10.080
<v Speaker 4>often hard to figure out, wait, what is the AI

0:20:10.280 --> 0:20:12.640
<v Speaker 4>trying to do here, or what effects is this going

0:20:12.640 --> 0:20:13.280
<v Speaker 4>to have in the real world.

0:20:13.280 --> 0:20:16.200
<v Speaker 1>With the modelogus he does, this makes everything you're trying

0:20:16.200 --> 0:20:16.840
<v Speaker 1>to do, thankes.

0:20:16.640 --> 0:20:19.359
<v Speaker 4>Everything we're trying to do a fair bit in earlier. Yeah, Yeah,

0:20:19.440 --> 0:20:22.000
<v Speaker 4>we're less confident that we can keep track of what's working,

0:20:22.320 --> 0:20:24.720
<v Speaker 4>and I think there's just kind of more possibilities for

0:20:25.040 --> 0:20:27.639
<v Speaker 4>whole new kinds of unwanted behavior to creep in that

0:20:27.680 --> 0:20:29.119
<v Speaker 4>we'll have to find a way to grapple with.

0:20:29.359 --> 0:20:32.800
<v Speaker 1>I see, like, right now, maybe AIs are at the

0:20:32.880 --> 0:20:36.000
<v Speaker 1>level we are. Whatever issues it's having, there's things we

0:20:36.040 --> 0:20:38.560
<v Speaker 1>can grasp. But as they get more advanced and they

0:20:38.640 --> 0:20:42.560
<v Speaker 1>tackle bigger problems like solve the world's economy or figure

0:20:42.560 --> 0:20:45.160
<v Speaker 1>out the right policy for the whole country or something

0:20:45.200 --> 0:20:48.200
<v Speaker 1>like that that not one person can really grasp, it's

0:20:48.200 --> 0:20:50.080
<v Speaker 1>going to be hard to even sort of like talk

0:20:50.119 --> 0:20:51.880
<v Speaker 1>to it and understand it. I think that's what you're saying,

0:20:51.920 --> 0:20:53.320
<v Speaker 1>right? Yeah, yeah, I

0:20:53.280 --> 0:20:55.960
<v Speaker 4>Think there's even maybe two interesting ideas in there, because

0:20:56.000 --> 0:20:58.480
<v Speaker 4>maybe the pseudo staff. That's something like, we're asking the

0:20:58.520 --> 0:21:04.080
<v Speaker 4>AI to help us design novel molecules for pharmaceutical development,

0:21:04.520 --> 0:21:07.760
<v Speaker 4>and it's got some really novel ideas about biology that

0:21:07.800 --> 0:21:10.160
<v Speaker 4>are just really complex and really hard for humans to understand,

0:21:10.240 --> 0:21:12.520
<v Speaker 4>and we can't tell kind of is the model actually

0:21:12.600 --> 0:21:14.199
<v Speaker 4>convinced that this is going to work, or is the

0:21:14.200 --> 0:21:16.040
<v Speaker 4>model messing with us and this would actually be kind

0:21:16.040 --> 0:21:18.880
<v Speaker 4>of dangerous. Should we try this drug, should we start

0:21:18.920 --> 0:21:21.600
<v Speaker 4>to do some experiments in the lab. There's this setting

0:21:21.600 --> 0:21:23.520
<v Speaker 4>where we kind of still ultimately know what we want.

0:21:23.800 --> 0:21:25.560
<v Speaker 4>We know we want drugs that are safe.

0:21:25.920 --> 0:21:28.240
<v Speaker 1>Uh huh. Like it might tell you that this will

0:21:28.359 --> 0:21:31.359
<v Speaker 1>cure cancer, for example, but you're saying, like, what else

0:21:31.560 --> 0:21:34.199
<v Speaker 1>it's trading off to cure that cancer for example? You

0:21:34.280 --> 0:21:34.760
<v Speaker 1>might not know.

0:21:34.880 --> 0:21:38.040
<v Speaker 4>Yeah, yeah, that's a good example. It's hard to tell

0:21:38.119 --> 0:21:40.800
<v Speaker 4>if kind of the model's, like, genuinely trying its best

0:21:40.840 --> 0:21:43.680
<v Speaker 4>and genuinely thinks this is the best cancer drug, or

0:21:43.720 --> 0:21:45.560
<v Speaker 4>if it thinks, oh, this is just something that looks

0:21:45.560 --> 0:21:47.280
<v Speaker 4>good and it doesn't actually care if the drug will

0:21:47.359 --> 0:21:50.320
<v Speaker 4>ultimately succeed, or if maybe for some reason, the model's

0:21:50.359 --> 0:21:52.240
<v Speaker 4>extremely scary and it's actually trying to mess with you,

0:21:52.280 --> 0:21:54.480
<v Speaker 4>and you've got your scary misaligned AI that's trying

0:21:54.480 --> 0:21:57.840
<v Speaker 4>to sneak in some slow acting poison. The smarter the

0:21:57.920 --> 0:21:59.520
<v Speaker 4>AI is, the harder it is to tell the difference

0:21:59.560 --> 0:22:04.320
<v Speaker 4>between those different outcomes. And then once you start talking

0:22:04.320 --> 0:22:07.200
<v Speaker 4>about a lot of these really kind of ambitious sort

0:22:07.200 --> 0:22:10.560
<v Speaker 4>of capital-F future social scenarios, like AIs trying to

0:22:10.600 --> 0:22:12.840
<v Speaker 4>figure out sort of what the economy would be like

0:22:12.880 --> 0:22:14.280
<v Speaker 4>or how the world should be governed or something like this,

0:22:14.320 --> 0:22:15.960
<v Speaker 4>and like, I don't know how much we want to

0:22:16.000 --> 0:22:17.720
<v Speaker 4>use AIs for things like this, but once you get

0:22:17.720 --> 0:22:20.040
<v Speaker 4>into that territory in any way, then you just get

0:22:20.040 --> 0:22:22.320
<v Speaker 4>into this extremely weird situation where I don't know if

0:22:22.320 --> 0:22:25.879
<v Speaker 4>anyone is going to know what we even want, Like,

0:22:25.960 --> 0:22:27.560
<v Speaker 4>what is the right way to govern the world, What

0:22:27.720 --> 0:22:29.359
<v Speaker 4>is the right way to do?

0:22:29.720 --> 0:22:30.159
<v Speaker 3>I don't know.

0:22:30.640 --> 0:22:33.080
<v Speaker 4>Yeah, yeah. At some point, figuring out how an AI

0:22:33.080 --> 0:22:35.440
<v Speaker 4>should behave requires you to solve philosophy, requires you to

0:22:35.440 --> 0:22:38.800
<v Speaker 4>figure out what is good. And the more powerful AI

0:22:39.359 --> 0:22:42.560
<v Speaker 4>gets and the weirder the situations you're putting it in, the

0:22:42.560 --> 0:22:45.160
<v Speaker 4>more kind of common sense notions of what's good start

0:22:45.200 --> 0:22:46.879
<v Speaker 4>to fall apart. The more you actually have to grapple

0:22:46.920 --> 0:22:48.840
<v Speaker 4>with a lot of the really hard, confusing stuff.

0:22:48.960 --> 0:22:50.600
<v Speaker 1>It might tell us how to run the world, but

0:22:50.880 --> 0:22:53.520
<v Speaker 1>we at that point, no one person or even a

0:22:53.520 --> 0:22:55.800
<v Speaker 1>group of people might know, is this actually the best

0:22:55.800 --> 0:22:58.760
<v Speaker 1>way to run the world. Is it sort of taking

0:22:58.800 --> 0:23:02.159
<v Speaker 1>into account the things that all of us collectively would value.

0:23:02.320 --> 0:23:03.920
<v Speaker 1>That's kind of the problem.

0:23:04.040 --> 0:23:06.720
<v Speaker 4>Yeah, And I think you start to get at some

0:23:06.720 --> 0:23:09.520
<v Speaker 4>of these pretty difficult questions even before you get into

0:23:09.560 --> 0:23:11.040
<v Speaker 4>these kind of really big features of sort of how

0:23:11.080 --> 0:23:13.119
<v Speaker 4>to govern the world. If someone is getting

0:23:13.160 --> 0:23:16.719
<v Speaker 4>all of their news or getting all of their personal

0:23:16.760 --> 0:23:19.359
<v Speaker 4>life advice from an AI, that's already giving the AI

0:23:19.600 --> 0:23:21.960
<v Speaker 4>a lot of leeway for kind of what makes a

0:23:22.000 --> 0:23:23.760
<v Speaker 4>good life for this person, what is important for this

0:23:23.840 --> 0:23:26.920
<v Speaker 4>person to know? And those are already questions that get

0:23:26.960 --> 0:23:28.639
<v Speaker 4>really hard. And what they want in the short term

0:23:28.680 --> 0:23:30.800
<v Speaker 4>might not match what they want long term. What makes

0:23:30.800 --> 0:23:33.320
<v Speaker 4>them happy might not match their intuitions. What's good

0:23:33.359 --> 0:23:34.840
<v Speaker 4>for that person might not be what's good for their

0:23:34.840 --> 0:23:36.399
<v Speaker 4>community might not be the same as what's good for

0:23:36.440 --> 0:23:36.800
<v Speaker 4>the world.

0:23:37.040 --> 0:23:39.280
<v Speaker 1>You made me think of it. I wonder if a good

0:23:39.320 --> 0:23:43.120
<v Speaker 1>analogy is that it'd be almost like if you as

0:23:43.160 --> 0:23:45.240
<v Speaker 1>a parent, I don't know if you have kids or

0:23:45.560 --> 0:23:47.800
<v Speaker 1>nieces or nephews. But it'd be almost like if your

0:23:47.840 --> 0:23:50.080
<v Speaker 1>kid suddenly tried to tell you what to do or

0:23:50.240 --> 0:23:52.879
<v Speaker 1>was trying to teach you how him or her wanted

0:23:52.920 --> 0:23:54.760
<v Speaker 1>to run their lives. You'd be like, you're just a kid,

0:23:54.800 --> 0:23:56.960
<v Speaker 1>what are you talking about? This is trust me, this

0:23:57.080 --> 0:24:00.080
<v Speaker 1>is what you need to do. Yeah, except that we

0:24:00.119 --> 0:24:02.240
<v Speaker 1>are the kids and the AI is sort of the parent.

0:24:02.560 --> 0:24:04.800
<v Speaker 1>Is that sort of what we're the situation that might

0:24:04.840 --> 0:24:06.119
<v Speaker 1>be sort of parallel to that.

0:24:06.359 --> 0:24:08.919
<v Speaker 4>Yeah, I think this thing there, I feel like a

0:24:09.119 --> 0:24:10.640
<v Speaker 4>version of the analogy that I'd be more excited about

0:24:10.640 --> 0:24:14.080
<v Speaker 4>would almost be some alien species lands huh, and they

0:24:14.119 --> 0:24:16.199
<v Speaker 4>have all this great technology and they seem nice and

0:24:16.200 --> 0:24:18.600
<v Speaker 4>they're like, hey, we'd really recommend making some changes to

0:24:18.640 --> 0:24:20.879
<v Speaker 4>your society, maybe try doing things like this. And

0:24:20.880 --> 0:24:25.640
<v Speaker 4>we're like, wait, you're really very accomplished. You have some

0:24:25.640 --> 0:24:28.200
<v Speaker 4>some useful ideas, but like, are you trying to help us?

0:24:28.240 --> 0:24:30.080
<v Speaker 4>Are you trying to sabotage us? Are you just kind

0:24:30.080 --> 0:24:30.600
<v Speaker 4>of produced?

0:24:32.200 --> 0:24:34.400
<v Speaker 1>Are we what's for dinner? Or are you inviting us

0:24:34.440 --> 0:24:34.919
<v Speaker 1>to dinner?

0:24:35.640 --> 0:24:35.840
<v Speaker 3>Yeah?

0:24:35.920 --> 0:24:36.920
<v Speaker 4>Yeah, yeah yeah.

0:24:37.080 --> 0:24:39.120
<v Speaker 1>The sense I'm getting from you is that these things

0:24:39.160 --> 0:24:41.879
<v Speaker 1>are just getting smarter and more capable, so it seems

0:24:41.880 --> 0:24:44.800
<v Speaker 1>really pressing that we figure this out now before it

0:24:44.840 --> 0:24:50.639
<v Speaker 1>gets even more difficult. Yes, yes, AIs are getting smarter

0:24:50.760 --> 0:24:53.920
<v Speaker 1>each second, it seems, and we seem to be trusting

0:24:53.960 --> 0:24:57.120
<v Speaker 1>them more and more each day with our data, our choices,

0:24:57.280 --> 0:24:59.680
<v Speaker 1>and even our lives, which brings us to the main

0:24:59.760 --> 0:25:01.960
<v Speaker 1>question of the day: what can we do about it?

0:25:02.240 --> 0:25:04.920
<v Speaker 1>How do you train an AI to care about us,

0:25:05.080 --> 0:25:07.560
<v Speaker 1>to have our values and to make the right choices.

0:25:07.920 --> 0:25:10.639
<v Speaker 1>To answer this question, I reached out to another AI

0:25:10.720 --> 0:25:14.720
<v Speaker 1>expert on alignment, doctor Tim Rutner. Doctor Rutner is a

0:25:14.720 --> 0:25:17.879
<v Speaker 1>professor at the Vector Institute for Artificial Intelligence at the

0:25:17.960 --> 0:25:21.080
<v Speaker 1>University of Toronto, and he says there are many ways

0:25:21.119 --> 0:25:24.199
<v Speaker 1>to train AIs to like us. The only problem is

0:25:24.560 --> 0:25:27.719
<v Speaker 1>none of them work perfectly. So here's my conversation with

0:25:27.840 --> 0:25:32.640
<v Speaker 1>doctor Tim Rutner. Well, thank you, Doctor Rutner, for joining us.

0:25:33.040 --> 0:25:34.320
<v Speaker 3>Thanks so much for having me on it.

0:25:34.760 --> 0:25:36.720
<v Speaker 1>And I'm talking to a real person right now, right,

0:25:37.160 --> 0:25:41.600
<v Speaker 1>You're not an AI version of.

0:25:40.240 --> 0:25:45.080
<v Speaker 3>Yourself as far as I'm aware.

0:25:46.359 --> 0:25:49.679
<v Speaker 1>As far as any of us are aware. Yes, I

0:25:49.680 --> 0:25:51.760
<v Speaker 1>mean this whole conversation could be AI generated.

0:25:52.160 --> 0:25:54.160
<v Speaker 3>I know we're just all in the simulation.

0:25:56.000 --> 0:25:57.800
<v Speaker 1>Well, it'd certainly be a lot easier. I would get

0:25:57.800 --> 0:26:01.600
<v Speaker 1>more sleep for sure. Well, today we're trying to answer

0:26:01.640 --> 0:26:05.440
<v Speaker 1>a very critical question which was posed by our sound engineer,

0:26:05.480 --> 0:26:09.120
<v Speaker 1>which is: is AI going to kill us all? Can an

0:26:09.160 --> 0:26:13.800
<v Speaker 1>AI have values? Can an AI have an understanding of

0:26:13.920 --> 0:26:16.760
<v Speaker 1>a human? What good things are to a human?

0:26:17.040 --> 0:26:22.800
<v Speaker 3>Yeah? And well I wish I had the answer to that.

0:26:23.160 --> 0:26:24.680
<v Speaker 1>Maybe that's the problem is that we don't know.

0:26:24.840 --> 0:26:27.159
<v Speaker 3>Yeah, I mean this is such a difficult question, right,

0:26:27.200 --> 0:26:29.560
<v Speaker 3>and I think that this is a question that touches

0:26:29.680 --> 0:26:35.560
<v Speaker 3>on philosophy, engineering, psychology, and probably many other disciplines. Right,

0:26:35.680 --> 0:26:40.520
<v Speaker 3>but what are values and what values can possibly

0:26:40.560 --> 0:26:42.640
<v Speaker 3>a non-sentient being have?

0:26:43.160 --> 0:26:44.760
<v Speaker 1>It's not a simple question.

0:26:45.000 --> 0:26:47.359
<v Speaker 3>Yeah, let me take a step back. So the way

0:26:47.440 --> 0:26:50.959
<v Speaker 3>to think about alignment is I think through the lens

0:26:51.080 --> 0:26:55.040
<v Speaker 3>of what's referred to as the specification problem, Where specification

0:26:55.200 --> 0:26:58.560
<v Speaker 3>is the term that we use to describe what we

0:26:58.680 --> 0:27:00.520
<v Speaker 3>tell the model it should do.

0:27:01.359 --> 0:27:04.160
<v Speaker 1>When you say specification, you mean like spec, right? Kind

0:27:03.960 --> 0:27:06.600
<v Speaker 3>of. Yeah, yes, spec, it's just short for specification.

0:27:06.760 --> 0:27:06.960
<v Speaker 4>Yeah.

0:27:07.040 --> 0:27:09.199
<v Speaker 3>This is what we can think of as our intent,

0:27:09.400 --> 0:27:11.760
<v Speaker 3>the kinds of things that we want a model to do.

0:27:12.000 --> 0:27:15.440
<v Speaker 3>For example, our intent might be for models to never

0:27:15.880 --> 0:27:19.240
<v Speaker 3>say things that could lead to harm, or our intent could

0:27:19.280 --> 0:27:22.679
<v Speaker 3>be that models should always be friendly and helpful. And

0:27:22.760 --> 0:27:25.960
<v Speaker 3>so this is what we call the ideal specification for that.

0:27:25.840 --> 0:27:28.480
<v Speaker 1>Model, meaning like we want to be able to say, like,

0:27:29.119 --> 0:27:32.080
<v Speaker 1>be a chatbot, but make sure that nobody ever

0:27:32.320 --> 0:27:33.240
<v Speaker 1>hurts themselves.

0:27:33.320 --> 0:27:35.240
<v Speaker 3>That's right, I see. And there are a few different

0:27:35.240 --> 0:27:38.200
<v Speaker 3>ways to provide specifications to chatbots.

0:27:38.760 --> 0:27:41.720
<v Speaker 1>Okay. According to Doctor Rutner, there are three general ways

0:27:41.760 --> 0:27:46.240
<v Speaker 1>to make sure AIs behave or not kill us. The

0:27:46.280 --> 0:27:49.280
<v Speaker 1>first way is to basically tell it to behave every

0:27:49.320 --> 0:27:51.159
<v Speaker 1>time you ask it to do something.

0:27:52.000 --> 0:27:55.520
<v Speaker 3>There is what's called a system prompt. This is a

0:27:55.600 --> 0:28:01.000
<v Speaker 3>text specification that a model loads every time before it engages

0:28:01.040 --> 0:28:03.960
<v Speaker 3>in a conversation with a user. In the case of

0:28:03.960 --> 0:28:05.840
<v Speaker 3>the chatbot, so.

0:28:05.720 --> 0:28:08.639
<v Speaker 1>Every time you interact with the AI, you would basically

0:28:08.720 --> 0:28:13.200
<v Speaker 1>instruct it to behave. You might say, hey, AI, organize

0:28:13.240 --> 0:28:15.600
<v Speaker 1>all my emails, or design a new drug for me,

0:28:15.720 --> 0:28:18.439
<v Speaker 1>or figure out the best policy for our government, but

0:28:18.760 --> 0:28:21.280
<v Speaker 1>please make sure that no one gets harmed, that you

0:28:21.280 --> 0:28:23.800
<v Speaker 1>don't do anything dangerous or unethical, etc.

0:28:24.280 --> 0:28:24.480
<v Speaker 3>Etc.

0:28:25.080 --> 0:28:27.439
<v Speaker 1>But of course this would get pretty cumbersome if you

0:28:27.480 --> 0:28:30.600
<v Speaker 1>have to do it every time. That's option number one.

0:28:30.800 --> 0:28:33.920
<v Speaker 1>Option number two is to have humans train your AI

0:28:34.480 --> 0:28:35.560
<v Speaker 1>to be good.

0:28:36.720 --> 0:28:40.320
<v Speaker 3>So one approach is called reinforcement learning from human feedback,

0:28:40.720 --> 0:28:43.680
<v Speaker 3>so different answers for a given prompt and then having

0:28:43.840 --> 0:28:47.400
<v Speaker 3>human labelers say which of these answers they prefer.

0:28:49.440 --> 0:28:52.200
<v Speaker 1>What does that look like? The human annotators, is it like

0:28:52.240 --> 0:28:55.440
<v Speaker 1>a warehouse full of people just talking to the same AI,

0:28:56.000 --> 0:28:58.760
<v Speaker 1>or is it three people or is it a thousand people?

0:28:58.800 --> 0:28:59.600
<v Speaker 1>What does that look like?

0:29:00.080 --> 0:29:02.880
<v Speaker 3>So I should say I'm not an expert on this,

0:29:03.160 --> 0:29:06.200
<v Speaker 3>but my understanding is that much of this work is

0:29:06.200 --> 0:29:11.440
<v Speaker 3>outsourced to countries where the median wage is lower than

0:29:11.520 --> 0:29:14.520
<v Speaker 3>for example, in the United States or in Europe. There

0:29:14.520 --> 0:29:19.720
<v Speaker 3>have been reports of large groups of annotators, specifically annotating

0:29:20.000 --> 0:29:24.000
<v Speaker 3>images and texts that are considered not safe for work,

0:29:24.160 --> 0:29:27.360
<v Speaker 3>for example, in those countries. So in other words, you

0:29:27.440 --> 0:29:30.840
<v Speaker 3>have examples of folks in those countries that are already

0:29:30.920 --> 0:29:34.720
<v Speaker 3>less privileged than people living in the United States, for example,

0:29:34.800 --> 0:29:39.239
<v Speaker 3>on average, engaging with a lot of horrific content and

0:29:39.280 --> 0:29:41.040
<v Speaker 3>saying this is not something we want.

0:29:42.760 --> 0:29:46.200
<v Speaker 1>So this is basically paying people to test drive your AI.

0:29:46.560 --> 0:29:49.360
<v Speaker 1>You could have a warehouse full of people whose job

0:29:49.400 --> 0:29:52.880
<v Speaker 1>it is to interact with the newborn AI and essentially

0:29:53.040 --> 0:29:55.680
<v Speaker 1>have them raise the AI and tell it what is

0:29:55.800 --> 0:30:00.000
<v Speaker 1>right and what is wrong. Unfortunately, as Doctor Rutner said,

0:30:00.000 --> 0:30:01.880
<v Speaker 1>it'd mean the poor people would have to bear the

0:30:01.920 --> 0:30:05.880
<v Speaker 1>absolute worst behavior of the AI. That's option number two.

0:30:06.120 --> 0:30:08.880
<v Speaker 1>Option number three for making AIs that care about us

0:30:09.200 --> 0:30:13.920
<v Speaker 1>is to basically bake into the AI a constitution, you know,

0:30:14.080 --> 0:30:17.360
<v Speaker 1>like the US or UK constitution that establishes what the

0:30:17.360 --> 0:30:20.960
<v Speaker 1>country stands for, what its values are, and what's generally

0:30:21.000 --> 0:30:22.400
<v Speaker 1>allowed and not allowed.

0:30:23.880 --> 0:30:29.280
<v Speaker 3>There's an approach that Anthropic introduced called constitutional AI, and

0:30:29.320 --> 0:30:32.959
<v Speaker 3>that approach is based on providing a constitution to an

0:30:32.960 --> 0:30:37.960
<v Speaker 3>AI model, and that constitution reflects different values and preferences

0:30:38.240 --> 0:30:43.320
<v Speaker 3>that the company in this case, Anthropic wants the model

0:30:43.360 --> 0:30:43.920
<v Speaker 3>to exhibit.

0:30:44.920 --> 0:30:47.680
<v Speaker 1>Yes, the last approach here to making sure an AI

0:30:47.800 --> 0:30:51.600
<v Speaker 1>has values and morals is to give it a founding document.

0:30:52.000 --> 0:30:54.200
<v Speaker 1>But here's the wild part. The way to bake that

0:30:54.360 --> 0:30:57.920
<v Speaker 1>founding document into the AI brain is to have another

0:30:58.000 --> 0:31:02.280
<v Speaker 1>AI train it. When we come back, we'll dig into

0:31:02.360 --> 0:31:05.560
<v Speaker 1>that scenario and we'll ask our experts what they think

0:31:05.840 --> 0:31:09.479
<v Speaker 1>the future holds. Will future AIs have our best interests

0:31:09.480 --> 0:31:12.400
<v Speaker 1>in mind? Or is it hopeless to ever be certain

0:31:12.680 --> 0:31:15.960
<v Speaker 1>they won't harm us, So stay with us. We'll be

0:31:16.080 --> 0:31:35.160
<v Speaker 1>right back. Hey, welcome back. We're talking about AI alignment,

0:31:35.560 --> 0:31:39.160
<v Speaker 1>or basically the problem of making sure AIs don't kill

0:31:39.240 --> 0:31:42.000
<v Speaker 1>us all. And so far we've talked about why this

0:31:42.080 --> 0:31:44.480
<v Speaker 1>is such a hard problem and what are some of

0:31:44.520 --> 0:31:48.360
<v Speaker 1>the ways we can teach AI things like values and morals.

0:31:48.680 --> 0:31:50.960
<v Speaker 1>There are several ways, and one of them is to

0:31:51.000 --> 0:31:55.160
<v Speaker 1>give AIs a constitution or the equivalent of a founding

0:31:55.200 --> 0:31:58.640
<v Speaker 1>document or moral guide, and then have that baked into

0:31:58.720 --> 0:32:02.400
<v Speaker 1>the DNA of the AI. Now what's interesting is that,

0:32:02.480 --> 0:32:05.080
<v Speaker 1>according to doctor Tim Rutner, the way to do that

0:32:05.320 --> 0:32:08.160
<v Speaker 1>is through another AI.

0:32:11.360 --> 0:32:14.480
<v Speaker 3>This is a little bit in the weeds, but providing

0:32:14.680 --> 0:32:20.200
<v Speaker 3>a very long constitution that outlines every preference in detail

0:32:20.480 --> 0:32:24.280
<v Speaker 3>when a user engages with the model is actually more

0:32:24.360 --> 0:32:28.400
<v Speaker 3>expensive for the company to do because the model needs

0:32:28.440 --> 0:32:32.680
<v Speaker 3>to ingest a lot of text upfront, so it's easier

0:32:32.720 --> 0:32:36.800
<v Speaker 3>to try to bake the preferences that are expressed in

0:32:36.840 --> 0:32:42.040
<v Speaker 3>the constitution explicitly into the model when you're training it upfront,

0:32:42.280 --> 0:32:47.000
<v Speaker 3>as opposed to providing that specification every time a user

0:32:47.200 --> 0:32:48.360
<v Speaker 3>engages with the model.

0:32:48.560 --> 0:32:50.760
<v Speaker 1>I see, you want the model to have learned the

0:32:50.800 --> 0:32:53.880
<v Speaker 1>constitution sort of inherently, rather than having to check it

0:32:53.920 --> 0:32:56.080
<v Speaker 1>every time somebody asks it a question.

0:32:56.480 --> 0:32:59.720
<v Speaker 3>Yes, I think that's roughly right. Ideally we would be

0:32:59.760 --> 0:33:03.640
<v Speaker 3>able to use humans to provide feedback and to oversee

0:33:03.680 --> 0:33:06.160
<v Speaker 3>models and to say, hey, this is behavior that we

0:33:06.240 --> 0:33:09.840
<v Speaker 3>don't want and stop that behavior. But of course that's

0:33:09.880 --> 0:33:13.440
<v Speaker 3>not really scalable. We can't have a human oversee every

0:33:13.480 --> 0:33:17.360
<v Speaker 3>interaction that a chatbot has, And so that raises the question,

0:33:17.560 --> 0:33:23.000
<v Speaker 3>how can we exhibit oversight in a way that is safe, reliable,

0:33:23.320 --> 0:33:28.160
<v Speaker 3>aligned with our values and preferences, and successful, And so

0:33:28.440 --> 0:33:32.840
<v Speaker 3>a key challenge here is essentially to come up with tools, methods,

0:33:33.200 --> 0:33:36.360
<v Speaker 3>models that are able to check whether a given

0:33:36.360 --> 0:33:40.400
<v Speaker 3>AI model performs some unintended behavior, and if it does,

0:33:40.680 --> 0:33:43.240
<v Speaker 3>can ring an alarm bell and let a human know

0:33:43.880 --> 0:33:46.800
<v Speaker 3>that oversight is needed and that maybe a model engages

0:33:46.840 --> 0:33:48.160
<v Speaker 3>in undesirable behavior.

0:33:48.400 --> 0:33:51.880
<v Speaker 1>You mean, like, have an AI police the other AI?

0:33:52.360 --> 0:33:54.840
<v Speaker 3>That's right. Essentially, have one model oversee another model.

0:33:55.080 --> 0:33:57.840
<v Speaker 1>Whoa. But then how do you make sure the

0:33:57.840 --> 0:34:01.160
<v Speaker 1>police AI is doing its job or is aligned itself?

0:34:01.280 --> 0:34:02.280
<v Speaker 1>You need another police.

0:34:02.320 --> 0:34:04.640
<v Speaker 3>We don't know, that's the problem. We don't know. It's

0:34:04.680 --> 0:34:06.920
<v Speaker 3>turtles all the way down from there. And

0:34:07.040 --> 0:34:10.120
<v Speaker 3>if we have a model that checks whether another model

0:34:10.200 --> 0:34:12.279
<v Speaker 3>does what we want it to do, then how do we

0:34:12.360 --> 0:34:15.560
<v Speaker 3>know that that model that does the overseeing is actually

0:34:15.560 --> 0:34:18.960
<v Speaker 3>aligned with us? I would argue that it might be

0:34:19.080 --> 0:34:22.600
<v Speaker 3>easier for us to make sure that the overseer model

0:34:22.960 --> 0:34:27.800
<v Speaker 3>is aligned than the generator model. They're a little simpler

0:34:28.120 --> 0:34:32.200
<v Speaker 3>because they don't necessarily generate. They just try to classify

0:34:32.480 --> 0:34:36.759
<v Speaker 3>whether a given behavior is intended or not intended. And

0:34:36.800 --> 0:34:40.200
<v Speaker 3>so this way we might be able to do alignment

0:34:40.280 --> 0:34:43.200
<v Speaker 3>more scalably and in a way that really reflects different

0:34:43.320 --> 0:34:46.000
<v Speaker 3>individuals' or groups' preferences and values.

0:34:46.320 --> 0:34:49.359
<v Speaker 1>I see. It's like, have another AI kind of sit

0:34:49.440 --> 0:34:53.879
<v Speaker 1>in every time I ask ChatGPT something. Yes. Yes. As

0:34:54.000 --> 0:34:57.880
<v Speaker 1>AIs get bigger and more complicated, the only scalable solution

0:34:58.040 --> 0:35:01.440
<v Speaker 1>to training them is going to be through other AIs.

0:35:01.840 --> 0:35:05.080
<v Speaker 1>In this situation, you might program a simpler AI with

0:35:05.200 --> 0:35:07.840
<v Speaker 1>your values and morals, and then you'd have that AI

0:35:08.320 --> 0:35:12.799
<v Speaker 1>train the bigger AI, or at least try to. It's like Dr.

0:35:12.880 --> 0:35:18.800
<v Speaker 1>Rutner says, AI alignment methods are not perfect. What do

0:35:18.840 --> 0:35:22.080
<v Speaker 1>you mean, they're not perfect? They don't always work, or

0:35:22.120 --> 0:35:23.839
<v Speaker 1>they can't guarantee that they will work?

0:35:24.120 --> 0:35:27.800
<v Speaker 3>So with machine learning models, we can rarely guarantee anything.

0:35:28.040 --> 0:35:31.080
<v Speaker 1>Oh boy, that's kind of the problem, isn't it?

0:35:31.600 --> 0:35:33.880
<v Speaker 3>Yeah, I think that's one of the problems. There is

0:35:34.000 --> 0:35:38.839
<v Speaker 3>research that tries to establish guarantees, but that research is

0:35:38.880 --> 0:35:42.400
<v Speaker 3>far behind the practice at the moment. The kinds of

0:35:42.440 --> 0:35:46.120
<v Speaker 3>methods that we have for model alignment fall short in

0:35:46.200 --> 0:35:49.360
<v Speaker 3>a few different ways. One, and I think it's one of

0:35:49.400 --> 0:35:52.440
<v Speaker 3>the biggest, is that it's just hard to communicate our preferences.

0:35:52.600 --> 0:35:56.720
<v Speaker 3>So there are many different steps at which alignment can fail.

0:35:57.120 --> 0:36:00.000
<v Speaker 3>This goes back to trying to express and then communicate

0:36:00.760 --> 0:36:03.560
<v Speaker 3>what we want a model to do, the kind of values

0:36:03.600 --> 0:36:08.680
<v Speaker 3>and preferences we're trying to instill in it. Translating our

0:36:08.800 --> 0:36:15.120
<v Speaker 3>values and preferences from some really complicated, possibly contradictory ideal

0:36:15.160 --> 0:36:21.400
<v Speaker 3>specification into a design specification is very difficult and there's

0:36:21.600 --> 0:36:24.440
<v Speaker 3>likely going to be some gap there. And then second,

0:36:24.920 --> 0:36:28.399
<v Speaker 3>even if we were able to do this perfectly, even

0:36:28.440 --> 0:36:31.600
<v Speaker 3>if we were able to express and communicate our values

0:36:31.640 --> 0:36:36.480
<v Speaker 3>and preferences perfectly, the kinds of low-level tools, the machine

0:36:36.560 --> 0:36:40.200
<v Speaker 3>learning tools that we use to give the model these

0:36:40.239 --> 0:36:45.600
<v Speaker 3>preferences and values are imperfect at translating the values that

0:36:45.680 --> 0:36:48.960
<v Speaker 3>we're trying to communicate into the model. They might not

0:36:49.160 --> 0:36:54.279
<v Speaker 3>enable us to perfectly translate the design specification into the

0:36:54.320 --> 0:36:56.440
<v Speaker 3>actual behavior that we would like to see.

0:36:56.680 --> 0:37:00.000
<v Speaker 1>I see, boy, it seems like there are problems everywhere

0:36:59.880 --> 0:37:06.120
<v Speaker 1>we turn here, Dr. Rutner. Yeah. Well, as we go

0:37:06.320 --> 0:37:09.680
<v Speaker 1>towards the future, and as systems get smarter and problems

0:37:09.680 --> 0:37:12.000
<v Speaker 1>get more complicated, what do you think is the prospect

0:37:12.040 --> 0:37:15.120
<v Speaker 1>of making sure that these more advanced systems for more

0:37:15.160 --> 0:37:19.879
<v Speaker 1>complicated problems have values that we want it to have?

0:37:20.280 --> 0:37:22.040
<v Speaker 1>Because I'm not sure if we want it to have

0:37:22.280 --> 0:37:24.400
<v Speaker 1>human values, because I don't know if humans are the

0:37:24.400 --> 0:37:28.680
<v Speaker 1>best at making these kinds of good choices. Yeah, what do you think?

0:37:28.800 --> 0:37:30.120
<v Speaker 1>What do you, what do you see in the future?

0:37:30.480 --> 0:37:32.760
<v Speaker 4>I think there's a few things we need longer term,

0:37:32.840 --> 0:37:35.399
<v Speaker 4>and they all feel uncertain. I think to do well

0:37:35.400 --> 0:37:36.880
<v Speaker 4>in the longer term, we need to get the AIs

0:37:36.920 --> 0:37:37.520
<v Speaker 4>in the near future

0:37:37.600 --> 0:37:37.799
<v Speaker 3>right.

0:37:38.520 --> 0:37:41.120
<v Speaker 4>If the pace of AI development stays fast, we're really

0:37:41.120 --> 0:37:43.000
<v Speaker 4>really going to need the help of AI systems to

0:37:43.040 --> 0:37:45.600
<v Speaker 4>help us figure out how to steer future AI systems.

0:37:46.680 --> 0:37:49.920
<v Speaker 4>And so getting the right values into the next model

0:37:49.960 --> 0:37:52.279
<v Speaker 4>we build helps us figure out what to do with

0:37:52.320 --> 0:37:54.200
<v Speaker 4>the model after that, and so I think this kind

0:37:54.200 --> 0:37:55.919
<v Speaker 4>of short-term work really does kind of fan

0:37:56.000 --> 0:37:58.640
<v Speaker 4>out into this longer future. And yeah, getting the next

0:37:58.680 --> 0:38:00.680
<v Speaker 4>model right really matters for getting

0:38:00.440 --> 0:38:01.359
<v Speaker 3>the future models right.

0:38:01.560 --> 0:38:02.920
<v Speaker 1>Oh boy.

0:38:02.719 --> 0:38:04.360
<v Speaker 4>It's called the scalable oversight problem.

0:38:04.600 --> 0:38:07.600
<v Speaker 1>I see. It's like the alignment problem is going to

0:38:07.600 --> 0:38:10.279
<v Speaker 1>scale up, and the best way for us to keep

0:38:10.400 --> 0:38:13.640
<v Speaker 1>up is to make sure that we get it right

0:38:13.760 --> 0:38:17.080
<v Speaker 1>now with these smaller systems, so we can use those

0:38:17.120 --> 0:38:19.799
<v Speaker 1>AIs to help us in the more complicated situations.

0:38:20.040 --> 0:38:20.239
<v Speaker 3>Yeah.

0:38:20.400 --> 0:38:22.720
<v Speaker 1>Yeah, that's oh wow.

0:38:22.480 --> 0:38:23.240
<v Speaker 4>That's the hope.

0:38:24.680 --> 0:38:27.400
<v Speaker 1>Okay, last question. Do you think humanity is doomed?

0:38:30.120 --> 0:38:32.120
<v Speaker 4>I don't think so. I think it's possible. I think

0:38:32.280 --> 0:38:36.200
<v Speaker 4>the AI presents a lot of really scary and destabilizing

0:38:36.200 --> 0:38:38.200
<v Speaker 4>possibilities that we can't rule out. So I think there's

0:38:38.200 --> 0:38:39.800
<v Speaker 4>a lot of work to do. I think we'll probably

0:38:39.800 --> 0:38:42.440
<v Speaker 4>figure it out. But I think it's also possible that

0:38:42.520 --> 0:38:45.759
<v Speaker 4>AI winds us up in a lot of weird, unfamiliar situations.

0:38:45.920 --> 0:38:48.680
<v Speaker 4>I think it's unlikely but possible that things go really,

0:38:48.719 --> 0:38:51.000
<v Speaker 4>really terribly, But I also think it's kind of unlikely

0:38:51.000 --> 0:38:53.640
<v Speaker 4>but possible that things stay totally normal and recognizable

0:38:53.640 --> 0:38:55.719
<v Speaker 4>and familiar. I think AI, just what it's going

0:38:55.760 --> 0:38:59.680
<v Speaker 4>to do to society and politics and economics is all

0:38:59.719 --> 0:39:01.080
<v Speaker 4>going to be confusing. I think there's a lot that we'll

0:39:01.080 --> 0:39:02.200
<v Speaker 4>need to figure out pretty fast.

0:39:02.400 --> 0:39:03.120
<v Speaker 3>Fingers crossed.

0:39:05.160 --> 0:39:06.759
<v Speaker 1>I guess that's as good of an answer as we

0:39:06.760 --> 0:39:11.600
<v Speaker 1>can get these days. Fingers crossed. I guess just to

0:39:11.640 --> 0:39:13.600
<v Speaker 1>wrap up here, what do you think is going to

0:39:13.600 --> 0:39:17.040
<v Speaker 1>happen in the future, or what are some things about

0:39:17.040 --> 0:39:19.839
<v Speaker 1>this that you think most people are not thinking about

0:39:19.840 --> 0:39:21.160
<v Speaker 1>that they should be thinking about.

0:39:21.360 --> 0:39:24.200
<v Speaker 3>I think people should be thinking about ways in which

0:39:24.480 --> 0:39:28.480
<v Speaker 3>the AI systems that we have today are already capable

0:39:28.600 --> 0:39:33.120
<v Speaker 3>enough to cause harm, to change our world quite significantly,

0:39:33.239 --> 0:39:36.279
<v Speaker 3>change our culture, change the way we go about our day,

0:39:36.480 --> 0:39:40.080
<v Speaker 3>change the way we make decisions, change the way we

0:39:40.120 --> 0:39:45.239
<v Speaker 3>do our work. And the alignment problem and understanding when

0:39:45.680 --> 0:39:47.680
<v Speaker 3>models are aligned, I think, are two of the most

0:39:47.719 --> 0:39:52.600
<v Speaker 3>fundamental scientific challenges that we as a society are facing

0:39:52.680 --> 0:39:56.759
<v Speaker 3>right now. And that is not a sci fi future problem.

0:39:56.960 --> 0:40:00.040
<v Speaker 3>This is a problem about systems that we have today.

0:40:00.560 --> 0:40:03.560
<v Speaker 3>We want to make sure these systems really do what

0:40:03.600 --> 0:40:06.040
<v Speaker 3>we want them to do, and that these systems help

0:40:06.120 --> 0:40:09.840
<v Speaker 3>us flourish and benefit humanity. The systems that we have

0:40:09.960 --> 0:40:13.840
<v Speaker 3>access to today are already well beyond the capabilities that the

0:40:13.880 --> 0:40:17.280
<v Speaker 3>research community and certainly the general public thought we could

0:40:17.520 --> 0:40:19.719
<v Speaker 3>have in the year twenty twenty six.

0:40:20.120 --> 0:40:22.560
<v Speaker 1>I think you're saying that the future is here, but

0:40:22.600 --> 0:40:24.960
<v Speaker 1>we still haven't fully figured out the alignment problem.

0:40:25.120 --> 0:40:27.640
<v Speaker 3>Yes, that's right.

0:40:27.480 --> 0:40:29.600
<v Speaker 1>Meaning it's more pressing than ever that we figure this out.

0:40:29.840 --> 0:40:30.280
<v Speaker 3>I agree.

0:40:30.360 --> 0:40:33.759
<v Speaker 1>Yeah, amazing. Dr. Rutner, how do we prove to the

0:40:33.800 --> 0:40:36.400
<v Speaker 1>audience that we're not an AI generated conversation?

0:40:38.640 --> 0:40:42.759
<v Speaker 3>I wish I had the answer. You can generate such

0:40:42.840 --> 0:40:47.160
<v Speaker 3>fantastic fake podcasts with AI now, right? Right, with all

0:40:47.200 --> 0:40:51.600
<v Speaker 3>the little idiosyncrasies that you hear in podcasts today that

0:40:51.800 --> 0:40:52.960
<v Speaker 3>you know, I think that's hard to do.

0:40:53.239 --> 0:40:55.239
<v Speaker 1>Or I guess if this conversation is aligned with what

0:40:55.360 --> 0:40:57.280
<v Speaker 1>you want to hear, maybe it doesn't matter.

0:40:59.560 --> 0:41:01.719
<v Speaker 3>Still, I hope that the audience thinks that we

0:41:01.719 --> 0:41:04.160
<v Speaker 3>sound human. I think that that would be, that

0:41:04.200 --> 0:41:06.600
<v Speaker 3>would be nice.

0:41:07.040 --> 0:41:09.799
<v Speaker 1>All right. Hey, on behalf of everyone who works on

0:41:09.840 --> 0:41:12.800
<v Speaker 1>the show, thank you for joining us on the sixty-plus episodes

0:41:12.880 --> 0:41:15.320
<v Speaker 1>we've done. Be sure to follow me on social media

0:41:15.440 --> 0:41:18.680
<v Speaker 1>or PhD Comics dot com for updates. And hey, thanks

0:41:18.719 --> 0:41:20.839
<v Speaker 1>to all the guests we've had on the show. Here's

0:41:20.840 --> 0:41:23.200
<v Speaker 1>a little tribute our editor Rose Seguda put

0:41:23.239 --> 0:41:26.120
<v Speaker 1>together of all the times they were gracious enough to

0:41:26.160 --> 0:41:27.360
<v Speaker 1>put up with my questions.

0:41:27.680 --> 0:41:28.480
<v Speaker 4>That's a good question.

0:41:28.719 --> 0:41:31.520
<v Speaker 3>Yeah, that's a great question. You know, that's a great question.

0:41:31.719 --> 0:41:33.560
<v Speaker 4>Yeah, so those are all big questions for us to

0:41:33.600 --> 0:41:37.080
<v Speaker 4>answer as scientists. Yeah, that's a great question. It's a

0:41:37.080 --> 0:41:37.680
<v Speaker 4>good question.

0:41:37.960 --> 0:41:38.760
<v Speaker 3>That's a good question.

0:41:38.960 --> 0:41:42.640
<v Speaker 4>That's a good question. So that's actually a good question.

0:41:42.960 --> 0:41:45.200
<v Speaker 2>You're you're raising really great questions.

0:41:45.560 --> 0:41:47.080
<v Speaker 4>Yeah, that's a really good question.

0:41:47.320 --> 0:41:49.720
<v Speaker 3>That's a good question. That's a really good question.

0:41:50.000 --> 0:41:52.120
<v Speaker 4>That's a very hard question. That's a good question, though,

0:41:52.200 --> 0:41:55.840
<v Speaker 4>that's a really good question. That is a really good question. Yeah,

0:41:55.920 --> 0:41:57.120
<v Speaker 4>that's a great question.

0:41:57.320 --> 0:41:58.600
<v Speaker 3>Yeah, that's a good question.

0:41:58.800 --> 0:42:02.080
<v Speaker 1>Oh that's a great question, very very good question.

0:42:02.200 --> 0:42:03.240
<v Speaker 3>That's a really good question.

0:42:03.400 --> 0:42:05.040
<v Speaker 4>So I'll throw that question back at you.

0:42:05.120 --> 0:42:07.960
<v Speaker 1>What do you think? So we come once again to

0:42:08.040 --> 0:42:12.719
<v Speaker 1>the edge of scientific knowledge. Thanks for joining us, see

0:42:12.719 --> 0:42:19.400
<v Speaker 1>you next time. You've been listening to Science Stuff, a production

0:42:19.560 --> 0:42:23.600
<v Speaker 1>of iHeartRadio, written and produced by me, Jorge Cham,

0:42:23.760 --> 0:42:27.720
<v Speaker 1>edited by Rose Seguda, Executive producer Jerry Rowland, and audio

0:42:27.719 --> 0:42:30.960
<v Speaker 1>engineer and mixer Kasey Peckram. You can follow me on

0:42:31.000 --> 0:42:34.200
<v Speaker 1>social media. Just search for PhD Comics and the name

0:42:34.280 --> 0:42:36.959
<v Speaker 1>of your favorite platform. Be sure to subscribe to Science

0:42:37.000 --> 0:42:40.239
<v Speaker 1>Stuff on the iHeartRadio app, Apple Podcasts, or wherever you

0:42:40.280 --> 0:42:41.200
<v Speaker 1>get your podcasts.