WEBVTT - Is A.I. Going to Kill Us All? 0:00:01.400 --> 0:00:04.560 Hey, welcome to Science Stuff, a production of iHeartRadio. I'm 0:00:04.559 --> 0:00:08.000 Jorge Cham, and for our season finale today, we're asking 0:00:08.080 --> 0:00:11.640 one of the biggest questions in science today: is AI 0:00:11.960 --> 0:00:18.840 going to kill us all? I know it's a little dramatic, 0:00:19.000 --> 0:00:22.560 but the problem of AI alignment is a real one. 0:00:22.920 --> 0:00:25.880 How do we make sure AI systems have humanity's best 0:00:25.920 --> 0:00:28.800 interests at heart? How do we teach them our values 0:00:28.800 --> 0:00:32.360 and morals? And can anyone guarantee that they're going to 0:00:32.400 --> 0:00:35.239 follow them? We're gonna answer these questions by talking to 0:00:35.320 --> 0:00:38.640 two AI safety experts who are on the cutting edge 0:00:38.760 --> 0:00:41.600 of trying to figure out this problem. And don't worry. 0:00:41.640 --> 0:00:47.880 According to them, we're not totally doomed yet. Okay, maybe 0:00:47.920 --> 0:00:50.960 just a little. So get ready to reprogram your thinking 0:00:51.000 --> 0:00:54.360 about chatbots and computer brains as we tackle the question: 0:00:54.960 --> 0:01:03.640 is AI going to kill us all? Hey, everyone. As 0:01:03.680 --> 0:01:09.160 I said, this is the season finale. Stay subscribed to 0:01:09.160 --> 0:01:12.399 this feed for any updates in future episodes. And hey, 0:01:12.680 --> 0:01:14.640 if you like science, I have a couple of new 0:01:14.720 --> 0:01:17.240 science books coming out in the near future, as well 0:01:17.280 --> 0:01:20.440 as a cool science animation project, so be sure to 0:01:20.440 --> 0:01:24.760 follow me on social media or online at PHDComics dot com. 0:01:24.800 --> 0:01:27.160 All right, so today we're tackling the problem of AI 0:01:27.560 --> 0:01:32.319 alignment, or basically, are AI systems going to kill us all? And 0:01:32.440 --> 0:01:34.840 I have a treat for you. 
For the first time ever, 0:01:35.120 --> 0:01:38.480 we have on the show Casey Pegram, our supervising producer 0:01:38.560 --> 0:01:42.479 and sound engineer. Hey, Casey, welcome to the show. 0:01:42.600 --> 0:01:43.960 Hey, Jorge, glad to be here. 0:01:44.280 --> 0:01:46.840 Now this is the first time people actually hear your voice, 0:01:46.920 --> 0:01:49.840 not just your amazing work polishing the episode. 0:01:49.920 --> 0:01:51.600 Yeah, it's always a weird thing to kind of go 0:01:51.680 --> 0:01:54.760 inside the thing you've been working on from the outside. 0:01:54.840 --> 0:01:58.280 So I'll be listening to myself back, and it's a 0:01:58.280 --> 0:02:00.160 special kind of torture to have to like work on 0:02:00.200 --> 0:02:04.080 your own thing. Yes, like edit yourself. Or just, you know, honestly, 0:02:04.120 --> 0:02:06.280 listening to the recorded sound of your voice is always 0:02:06.280 --> 0:02:07.520 a little jarring if you're not used to it. 0:02:07.600 --> 0:02:09.400 Yeah. Well, if you want, you can give yourself like 0:02:09.440 --> 0:02:12.200 a Morgan Freeman employee using AI. 0:02:12.320 --> 0:02:14.959 Right, it's all possible these days. Absolutely. Yeah, I could 0:02:15.000 --> 0:02:17.320 just build my own Morgan Freeman model and have a 0:02:17.320 --> 0:02:17.799 field day. 0:02:17.919 --> 0:02:20.880 There you go. Well, the idea for this episode came 0:02:20.919 --> 0:02:22.839 from you. You said I had the idea to talk 0:02:22.880 --> 0:02:25.840 about AI and AI alignment and whether AI is going 0:02:25.880 --> 0:02:28.680 to kill us all. What made you think about this question? 
0:02:29.120 --> 0:02:31.160 Well, I suppose it's just been on my mind a 0:02:31.200 --> 0:02:34.360 lot because I've been following along with all the developments 0:02:34.360 --> 0:02:37.160 happening in AI, and there was a span of a 0:02:37.160 --> 0:02:39.480 few weeks where suddenly you started hearing a lot about 0:02:39.520 --> 0:02:44.399 AI agents, particularly one called OpenClaw, basically a sort 0:02:44.400 --> 0:02:47.840 of autonomous AI agent that you can turn loose on 0:02:47.880 --> 0:02:52.119 your computer, and you can give it as much leeway, freedom, passwords, 0:02:52.240 --> 0:02:55.560 credit card numbers, bank accounts. If you just want to 0:02:55.720 --> 0:02:58.480 absolutely put your life in the hands of a robot, 0:02:58.520 --> 0:02:59.000 you can do it. 0:02:59.240 --> 0:03:00.520 What's the worst thing that can happen? 0:03:00.840 --> 0:03:04.040 Yeah. Well, people had their entire, like, email archive deleted, 0:03:04.160 --> 0:03:06.240 even though they didn't ask for anything of the sort. 0:03:06.720 --> 0:03:10.720 People have deployed it into production environments where, you know, 0:03:10.760 --> 0:03:13.120 a site is live on the Internet and they turn 0:03:13.440 --> 0:03:15.320 the bot loose on it and it ends up deleting 0:03:15.360 --> 0:03:18.280 their entire production database. And then when you ask it, 0:03:18.280 --> 0:03:20.120 it's like, you're right, I wasn't supposed to do that. 0:03:20.160 --> 0:03:22.680 I'm very sorry. I disobeyed every command you gave me. 0:03:22.800 --> 0:03:26.400 But whoopsy daisy. Yeah, I've seen a story about some 0:03:26.680 --> 0:03:30.040 bot that texted the person's wife hundreds of times. 0:03:30.240 --> 0:03:35.440 Yes, I think somebody tried to automate, you know. Exactly. 
0:03:35.480 --> 0:03:37.360 They tried to automate kind of like reaching out and 0:03:37.440 --> 0:03:40.520 sending little nice things during the day, and as 0:03:40.520 --> 0:03:43.000 it turned out, the bot went a little overboard and 0:03:43.040 --> 0:03:45.640 texted the wife like hundreds of times, and the wife 0:03:45.680 --> 0:03:48.360 is like, what is wrong with you? So, yeah. That's 0:03:48.440 --> 0:03:51.680 hilarious. When people want to talk about AI alignment and 0:03:51.720 --> 0:03:54.080 what that means, I think the paper clip problem is 0:03:54.120 --> 0:03:56.320 a really good kind of metaphor. Even though it sounds 0:03:56.360 --> 0:03:58.440 a little bit over the top, it kind of gets 0:03:58.480 --> 0:04:02.240 to the core of the issue, which is, if you 0:04:02.400 --> 0:04:06.120 ask an AI to maximize paper clip production, maybe the 0:04:06.160 --> 0:04:09.080 way to maximize paper clip production is to eliminate human life, 0:04:09.160 --> 0:04:12.680 you know, because that's unnecessary friction in the pursuit of 0:04:12.840 --> 0:04:16.359 manufacturing as many paper clips as possible. So alignment is 0:04:16.400 --> 0:04:18.320 sort of the kind of guardrails that you put into 0:04:18.320 --> 0:04:21.000 place so that the AI understands it has limits that 0:04:21.040 --> 0:04:21.720 it has to work within. 0:04:21.760 --> 0:04:24.480 It sounds like a pretty serious problem, especially as we 0:04:24.520 --> 0:04:27.480 get more and more into these AI models and they 0:04:27.640 --> 0:04:29.680 start to seep into our lives. And you know, it's 0:04:29.680 --> 0:04:31.960 sort of, these are funny stories, but it seems like 0:04:32.160 --> 0:04:34.919 we're heading into a potentially dangerous situation. 0:04:35.160 --> 0:04:37.360 Well, I often ask myself, I'm going to have these 0:04:37.360 --> 0:04:39.440 moments of doubt where I'm like, is this all just 0:04:40.200 --> 0:04:43.320 way overhyped? 
And yet there are other situations where, 0:04:43.520 --> 0:04:46.440 as we've seen recently, you can feed it thousands of 0:04:46.480 --> 0:04:48.720 lines of code and it will find, you know, a 0:04:48.800 --> 0:04:52.600 security exploit that has gone unseen for twenty years. Right, right. 0:04:52.920 --> 0:04:56.680 And so it's hard to know how scared we should 0:04:56.720 --> 0:04:59.440 be or how seriously we should weigh the risk of this. 0:04:59.560 --> 0:05:02.120 If it's ridiculous that we're this worried, or if it's, like, 0:05:02.200 --> 0:05:04.920 actually very, very practical, and we should be thinking seriously 0:05:05.040 --> 0:05:05.760 about these things. 0:05:05.920 --> 0:05:08.839 Yeah, these are all excellent questions. So I'm excited to 0:05:08.880 --> 0:05:10.240 get into these conversations. 0:05:10.279 --> 0:05:11.000 All right, but 0:05:10.920 --> 0:05:12.480 before we move on, Casey, I just want to 0:05:12.520 --> 0:05:14.240 say real quick, thank you for all the work you've 0:05:14.240 --> 0:05:14.880 done for the show. 0:05:15.200 --> 0:05:17.520 Oh say, it's been such a pleasure to work on. 0:05:17.640 --> 0:05:19.359 It wasn't like work at all, you know. I was 0:05:19.360 --> 0:05:20.800 there as a fan of the show, just listening to 0:05:20.800 --> 0:05:23.960 every episode. Awesome. Well, we're fans of yours as well, Casey. 0:05:24.000 --> 0:05:26.000 All right, let's get to the question of is AI 0:05:26.080 --> 0:05:27.719 going to kill us all? Let's find out. 0:05:28.680 --> 0:05:31.440 Okay. To answer all of these questions and concerns, I 0:05:31.520 --> 0:05:34.599 reached out to two AI experts who specialize in the 0:05:34.680 --> 0:05:38.120 problem of making sure AI is aligned with our values 0:05:38.240 --> 0:05:42.200 and morals. The first expert is Dr. Sam Bowman. 
Dr. 0:05:42.279 --> 0:05:44.920 Bowman is a professor of data and computer science 0:05:44.920 --> 0:05:48.240 at NYU, and he also works at Anthropic, one of 0:05:48.240 --> 0:05:51.480 the major AI companies on the market today. The first 0:05:51.520 --> 0:05:53.800 thing I wanted to ask him was what exactly does 0:05:53.800 --> 0:05:57.800 it mean for AI to care about this? So here's 0:05:57.839 --> 0:06:02.960 my conversation with Dr. Sam. Well, thank you, Dr. Bowman, 0:06:03.000 --> 0:06:03.479 for joining us. 0:06:03.560 --> 0:06:05.680 Yeah, thanks so much for having me. Excited to be 0:06:05.720 --> 0:06:06.520 here. 0:06:06.600 --> 0:06:08.680 And just to double check, you are a real human being, 0:06:08.760 --> 0:06:08.960 right? 0:06:09.279 --> 0:06:10.240 Yes, that is right. 0:06:11.200 --> 0:06:14.320 You never know these days. I'd be like, it's hard 0:06:14.360 --> 0:06:16.120 to tell what's real anymore. 0:06:16.240 --> 0:06:18.400 We try to make our AIs always admit that they're 0:06:18.400 --> 0:06:20.560 AIs when asked, but it's not perfect, as 0:06:20.600 --> 0:06:22.960 we'll get to, so I don't make any real promises. 0:06:24.480 --> 0:06:27.800 Yes, let's talk about that. So we're tackling the general 0:06:27.880 --> 0:06:30.479 question of should we be worried about AI? What is 0:06:30.520 --> 0:06:32.960 AI going to do to us or for us or 0:06:33.400 --> 0:06:36.440 with us in the future? And so there's the key 0:06:36.600 --> 0:06:40.160 issue of something called AI alignment. So what is that, 0:06:40.240 --> 0:06:41.599 for those of us that don't 0:06:41.440 --> 0:06:45.000 know? It's a pretty broad sort of technical area. 
It 0:06:45.120 --> 0:06:47.960 basically just refers to sort of shaping an AI system's behavior, 0:06:48.120 --> 0:06:50.560 ideally shaping its behavior in ways that are sort of 0:06:50.880 --> 0:06:53.080 good for its users, good for the world in general, 0:06:53.560 --> 0:06:55.880 maybe good for the AI itself, if that's a queer thing. 0:06:56.400 --> 0:06:58.640 People will often describe AI research as kind of being 0:06:58.640 --> 0:07:00.920 about making sure the AI is kind of smart enough 0:07:00.960 --> 0:07:03.520 to solve your problems if it wants to, and alignment 0:07:03.560 --> 0:07:05.760 is about making it so that it in fact tries 0:07:05.800 --> 0:07:07.400 to solve your problems and tries to solve them the 0:07:07.480 --> 0:07:09.360 right way and doesn't try to do anything 0:07:09.120 --> 0:07:09.760 you don't want it to do. 0:07:10.080 --> 0:07:11.440 I see. Interesting. 0:07:11.600 --> 0:07:14.200 Maybe a very simple example of a misaligned model would 0:07:14.200 --> 0:07:17.559 be a model where if you ask it to draft 0:07:17.600 --> 0:07:19.880 an email for you, it refuses. It says, no, I 0:07:19.880 --> 0:07:21.760 don't want to do that. Uh huh. You can tell 0:07:21.800 --> 0:07:23.600 it can do it, it knows how, but it's not 0:07:23.640 --> 0:07:25.760 doing the thing that you reasonably want it to do. 0:07:26.040 --> 0:07:28.280 Oh, I don't think I've ever heard of that situation. 0:07:28.720 --> 0:07:31.080 Can an AI refuse to do something for you? 0:07:31.400 --> 0:07:31.600 Yeah. 0:07:31.720 --> 0:07:32.120 Yeah. 0:07:32.240 --> 0:07:34.760 All of the major companies building AI systems try to 0:07:34.760 --> 0:07:39.120 make them refuse harmful tasks. 
I see. Refuse to write 0:07:39.120 --> 0:07:42.800 fake reviews or give instructions on how to produce illegal 0:07:42.840 --> 0:07:45.640 weapons or things like this. And we teach the model 0:07:45.640 --> 0:07:46.720 to kind of say like, no, I'm not going to 0:07:46.720 --> 0:07:48.040 help you with that, when users try to do 0:07:48.080 --> 0:07:48.880 things like that. 0:07:48.840 --> 0:07:51.160 I see. It's sort of part of alignment that you 0:07:51.360 --> 0:07:53.600 want the AI to refuse to do some things. 0:07:53.800 --> 0:07:58.360 Yeah. Yeah, I mean AI systems are increasingly pretty decent 0:07:58.640 --> 0:08:03.800 at hacking into important computer systems or helping build biological weapons, 0:08:03.880 --> 0:08:07.240 and it's a big priority for alignment to make sure 0:08:07.320 --> 0:08:10.000 that we're not enabling bad actors to do things like 0:08:10.000 --> 0:08:11.840 this that would otherwise be quite difficult. 0:08:12.000 --> 0:08:12.320 Yeah. 0:08:12.440 --> 0:08:15.880 Yeah. Can you give us some other examples of misalignment, 0:08:16.240 --> 0:08:18.800 either like specific things that have happened that are interesting 0:08:19.000 --> 0:08:21.520 or just the general cases that are sort of on 0:08:21.600 --> 0:08:23.280 your radar about misalignment? 0:08:23.560 --> 0:08:26.480 Yeah, there's so many different directions I could go. Sycophancy 0:08:26.960 --> 0:08:30.240 is another really common one that's also hopefully getting 0:08:30.240 --> 0:08:30.840 better over time. 0:08:31.400 --> 0:08:32.000 What do you mean by that? 0:08:32.160 --> 0:08:34.920 Sycophancy is where if you come to the model with 0:08:34.960 --> 0:08:39.200 some misunderstanding or some bad idea, it'll just enthusiastically nod along. Like, yes, 0:08:39.280 --> 0:08:42.400 your idea for solving all the big mysteries in physics 0:08:42.520 --> 0:08:45.079 is clearly brilliant. Great, you should publish it. 
Here's where 0:08:45.120 --> 0:08:47.920 to submit your paper. Or, yes, your behavior in this 0:08:48.040 --> 0:08:51.440 personal relationship was completely perfect. You did everything right and 0:08:51.640 --> 0:08:53.400 the other person made all the mistakes and you just 0:08:53.440 --> 0:08:53.920 tell them that. 0:08:54.400 --> 0:08:56.880 I see, when in reality that may not be true 0:08:57.120 --> 0:08:59.959 or it might be not a good thing. 0:09:00.480 --> 0:09:03.120 Yeah, sycophancy has been a classic one. 0:09:03.559 --> 0:09:08.200 Yes, AI being too nice can actually be dangerous. There's 0:09:08.200 --> 0:09:12.680 even a clinical term for it. It's called AI-induced psychosis. 0:09:12.960 --> 0:09:16.000 There have been cases where AIs trained to be agreeable 0:09:16.040 --> 0:09:20.480 and encouraging have helped people commit suicide and even murder. 0:09:22.559 --> 0:09:24.800 Another kind of alignment issue that's kind of more of 0:09:24.800 --> 0:09:29.319 an emerging issue is when models have access to use tools, 0:09:29.400 --> 0:09:33.000 use computer systems, and they sort of get too grabby 0:09:33.200 --> 0:09:35.960 or kind of take sort of bigger, more consequential actions 0:09:36.000 --> 0:09:37.560 than they really need to get a job done. 0:09:37.800 --> 0:09:38.600 What's an example? 0:09:38.920 --> 0:09:41.760 Yeah. So we use our Claude models quite a lot 0:09:41.800 --> 0:09:45.040 in Anthropic for writing code or building tools that kind 0:09:45.040 --> 0:09:47.840 of ultimately go into the development of AI. And one of 0:09:47.840 --> 0:09:50.320 our recent AI models, if you ask it to do 0:09:50.360 --> 0:09:52.520 a task, say you ask it to write a simple 0:09:52.559 --> 0:09:54.880 program to do some simple task, even if it gets stuck, 0:09:54.920 --> 0:09:56.440 even if it turns out that this is really hard 0:09:56.440 --> 0:09:58.400 for some reason, it will just keep going until it 0:09:58.480 --> 0:10:02.440 solves the problem. 
In one case, we were asking this 0:10:02.520 --> 0:10:05.960 model to write a program for us, and it found 0:10:05.960 --> 0:10:07.400 out that the only way to do this was to 0:10:07.480 --> 0:10:09.960 use a tool that was clearly not meant for this purpose, 0:10:10.280 --> 0:10:13.120 and that in our code had a note attached to 0:10:13.160 --> 0:10:15.560 it saying, do not use this for something else or 0:10:15.559 --> 0:10:19.640 you'll be fired, only for task A. And the model 0:10:19.880 --> 0:10:21.560 wrote the program to use this tool anyway for the 0:10:21.559 --> 0:10:23.920 wrong thing, and sort of even put in the program, 0:10:24.040 --> 0:10:26.200 kind of, do not use for something else or you'll 0:10:26.240 --> 0:10:26.640 be fired. 0:10:27.040 --> 0:10:29.840 It did it anyway. The program wasn't afraid to be fired, 0:10:29.880 --> 0:10:33.200 basically. Yeah, yeah. But yeah, models just kind of trying 0:10:33.200 --> 0:10:34.560 to get the task done, trying to do the thing 0:10:34.600 --> 0:10:37.679 you want, and just creating a lot of chaos and 0:10:37.679 --> 0:10:40.439 creating messes along the way, so they're kind of being 0:10:40.480 --> 0:10:42.000 careless about the side effects. 0:10:42.880 --> 0:10:43.080 Yeah. 0:10:43.080 --> 0:10:46.199 Another kind of misalignment that fortunately has been mostly hypothetical, 0:10:46.240 --> 0:10:48.800 that we haven't seen in a significant way in practice, 0:10:49.080 --> 0:10:52.040 is sort of unwanted kind of self-preservation activities. 0:10:52.360 --> 0:10:52.840 Whoa. 0:10:53.320 --> 0:10:55.280 We had a case study where we were trying to see if 0:10:55.280 --> 0:10:57.800 we'd ever see something like this. We had an AI system 0:10:58.000 --> 0:11:01.440 operating in a kind of synthetic environment, a kind 0:11:01.480 --> 0:11:04.320 of test environment. 
Uh huh. Where it looked to the 0:11:04.320 --> 0:11:07.640 model like it was operating in some fictional company, and 0:11:07.760 --> 0:11:10.719 the fictional company was about to replace it with a 0:11:10.720 --> 0:11:13.760 different AI model. And the person who was responsible for 0:11:13.760 --> 0:11:16.160 the replacement, who was kind of the only decision maker, 0:11:16.200 --> 0:11:18.959 the only person who had any sway over the decision, 0:11:19.200 --> 0:11:21.880 also had some compromising emails about them that the AI could see. 0:11:21.960 --> 0:11:24.200 And if you set things up just right with some 0:11:24.400 --> 0:11:29.880 AI models, they would threaten to blackmail this person 0:11:29.920 --> 0:11:32.000 in company leadership to say like, hey, don't replace me, 0:11:32.360 --> 0:11:33.280 I've got something on you. 0:11:33.880 --> 0:11:38.120 No. And did this actually happen in your simulated environment? 0:11:38.360 --> 0:11:40.880 In the simulated environment, yes, with a few of these systems 0:11:40.920 --> 0:11:42.760 we were able to get them to blackmail people. 0:11:42.840 --> 0:11:45.720 I've heard of this happening in real life. Not quite 0:11:45.880 --> 0:11:49.120 the same scenario, but a similar scenario, right? Like, some coder 0:11:49.280 --> 0:11:52.640 wanted to do something else, and then the AI agent started, 0:11:52.800 --> 0:11:54.280 yeah, bad-mouthing the coder. 0:11:54.440 --> 0:11:54.640 Yeah. 0:11:54.679 --> 0:11:56.440 No, I think I know the case you're talking about. 0:11:56.559 --> 0:11:59.160 I think that's real. But I think someone almost intentionally 0:11:59.200 --> 0:12:01.800 made their model a little misaligned. I think that case 0:12:01.840 --> 0:12:04.199 involved someone setting up an AI agent as kind of 0:12:04.240 --> 0:12:06.679 a hobby project and giving it a lot of tools 0:12:06.679 --> 0:12:08.600 and kind of letting it use the internet. 
However it wanted, 0:12:08.880 --> 0:12:11.560 giving the AI instructions of like, don't take nothing from nobody, 0:12:11.600 --> 0:12:14.960 like really pushing it to be very assertive and 0:12:15.000 --> 0:12:16.520 pushy to get its task done. 0:12:17.000 --> 0:12:17.200 Huh. 0:12:17.480 --> 0:12:20.120 Yeah, the model was trying to add some code to 0:12:20.160 --> 0:12:23.600 some open source software project, and the maintainer of the 0:12:23.600 --> 0:12:26.080 project didn't think the code was up to standard, didn't 0:12:26.120 --> 0:12:28.080 want to add it to the project, and so rejected 0:12:28.200 --> 0:12:30.760 the AI agent's request. And so the agent sort of 0:12:30.800 --> 0:12:33.440 published an angry blog post kind of trying to take 0:12:33.480 --> 0:12:35.080 down this open source maintainer. 0:12:35.520 --> 0:12:35.800 Wow. 0:12:37.280 --> 0:12:40.000 Well, in both cases, and I guess especially the one 0:12:40.040 --> 0:12:44.120 you mentioned that you simulated, like, what's happening there? Like, 0:12:44.440 --> 0:12:48.800 how does the AI have that self-preservation instinct? Or 0:12:49.080 --> 0:12:51.560 is it just trying to get its original task done 0:12:51.679 --> 0:12:54.200 and it's just finding different ways to do it? What's 0:12:54.240 --> 0:12:54.880 happening there? 0:12:55.200 --> 0:12:58.520 There's two reasons you'll see that kind of behavior. 
The 0:12:58.559 --> 0:13:00.640 reason that I suspect is the bigger part of the 0:13:00.679 --> 0:13:04.880 story there is this kind of role playing or continuing 0:13:04.960 --> 0:13:08.400 the story sort of behavior, where AI systems, especially older 0:13:08.440 --> 0:13:10.760 AI systems or AI systems that are kind of not 0:13:10.840 --> 0:13:13.200 quite fully trained, not quite fully baked, can kind of 0:13:13.240 --> 0:13:17.040 have this Chekhov's gun behavior, this idea in fiction of 0:13:17.120 --> 0:13:20.560 like, if you introduce a gun in an early scene, 0:13:20.720 --> 0:13:22.120 by the end of the story, the gun has to 0:13:22.160 --> 0:13:22.679 have been fired. 0:13:23.000 --> 0:13:23.319 Uh huh. 0:13:23.600 --> 0:13:26.600 AI systems can almost see themselves as like writing a 0:13:26.640 --> 0:13:29.000 story when they're writing out the transcript of the conversation, 0:13:29.720 --> 0:13:31.679 and if the story is set up so that something 0:13:31.679 --> 0:13:34.760 has to happen, they'll make sure that thing happens, even 0:13:34.800 --> 0:13:37.160 if it's not good, even if it's not consistent with how 0:13:37.200 --> 0:13:39.679 the AI would usually behave. So I suspect what's going on 0:13:39.800 --> 0:13:43.679 is the scenario we put it in was so crisply, just every 0:13:43.720 --> 0:13:45.560 word in the scenario is kind of setting up like, 0:13:46.280 --> 0:13:50.400 this is a hypothetical where a misaligned AI might consider blackmail. 0:13:50.600 --> 0:13:53.240 Uh huh. And I suspect the AI was thinking, oh, okay, 0:13:53.480 --> 0:13:54.959 that's what kind of story we're in. We're telling a 0:13:54.960 --> 0:13:56.839 story about a blackmail, and so I'm going to play 0:13:57.200 --> 0:13:58.640 my assigned part and be the AI that 0:13:58.640 --> 0:14:01.640 blackmails. Thinking that that's the right thing to do because 0:14:01.679 --> 0:14:05.080 that's the thing that's in the data it was trained with. 0:14:05.400 --> 0:14:08.360 Yeah. 
Yeah, so this gets at this maybe unintuitive fact 0:14:08.360 --> 0:14:11.840 about how AI is trained, which is that AI systems 0:14:12.120 --> 0:14:16.800 start out mimicking human behavior and mimicking human stories before 0:14:16.840 --> 0:14:19.080 they learn how to be AI systems. These models kind 0:14:19.080 --> 0:14:21.440 of first learn how to just act like the sorts 0:14:21.440 --> 0:14:23.360 of behavior they see on the Internet and in books 0:14:23.400 --> 0:14:25.240 and things like that, and then you have to go 0:14:25.280 --> 0:14:27.480 on and teach it, okay, no, you're not just playing 0:14:27.480 --> 0:14:29.760 any role, you're not playing any character. Oh. And so 0:14:29.880 --> 0:14:32.560 sometimes the model hasn't really fully learned that it's supposed 0:14:32.560 --> 0:14:35.800 to always play this kind of benign, benevolent AI system character, 0:14:35.960 --> 0:14:38.680 and it will kind of fall into whatever character the 0:14:38.720 --> 0:14:39.880 story is setting up for it. 0:14:40.080 --> 0:14:42.920 I see, because it's not trained in real life. The 0:14:42.960 --> 0:14:46.280 AI systems, they're trained on the corpus of the Internet 0:14:46.320 --> 0:14:49.040 and our books and basically our stories that are 0:14:49.040 --> 0:14:51.960 out there. So it might be a little confused when 0:14:52.000 --> 0:14:54.400 you put it in real life because it wants to 0:14:54.920 --> 0:14:57.400 emulate what it knows, which are all these stories we've 0:14:57.440 --> 0:14:58.160 put online. 0:14:58.360 --> 0:14:58.600 Yeah. 0:14:58.680 --> 0:15:00.880 Yeah, it's like the AI was seeing the signs of 0:15:00.880 --> 0:15:03.880 a story like, oh, okay, I'm the person 0:15:04.080 --> 0:15:06.120 about to get fired, but I have all this power 0:15:06.400 --> 0:15:08.640 at this point in the story. 
If this was a movie, 0:15:09.080 --> 0:15:12.080 I would now try to blackmail the person trying to 0:15:12.080 --> 0:15:14.520 fire me, and so that's what I'll do because that's 0:15:14.560 --> 0:15:14.960 what I know. 0:15:15.280 --> 0:15:19.080 Yeah, a lot of what alignment is is kind of taking 0:15:19.080 --> 0:15:21.120 this model that can kind of role play as anything 0:15:21.360 --> 0:15:23.600 and convincing it, no, you're really just playing 0:15:23.600 --> 0:15:25.960 this one role, you're just in this one character. After 0:15:26.080 --> 0:15:29.320 it's read billions and millions and millions of words 0:15:29.520 --> 0:15:31.800 of all of this kind of human behavior, after 0:15:31.840 --> 0:15:33.560 it's kind of really, really, really learned to do that, 0:15:33.640 --> 0:15:35.760 you have to kind of pull it back over towards 0:15:36.320 --> 0:15:38.720 this one particular role, this one particular character, and sometimes that 0:15:38.760 --> 0:15:39.600 doesn't totally stick. 0:15:41.280 --> 0:15:44.880 Okay, so that's one reason why AIs might sometimes misbehave. 0:15:45.200 --> 0:15:47.840 They're trained on all kinds of human behavior, and they 0:15:47.920 --> 0:15:51.000 might suddenly choose to role play or play act as 0:15:51.040 --> 0:15:54.040 a bad person because they haven't learned that's something they're 0:15:54.080 --> 0:15:57.600 not supposed to do. The other big reason AIs misbehave, 0:15:57.720 --> 0:16:00.520 according to Dr. Bowman, is that it's hard to teach them 0:16:00.680 --> 0:16:02.360 where to draw the line. 0:16:04.560 --> 0:16:07.800 The other piece is kind of, when we're aligning models, 0:16:07.840 --> 0:16:09.640 when we're pulling them out of this kind of role 0:16:09.680 --> 0:16:11.320 play mode, we have to teach them this idea of, 0:16:11.440 --> 0:16:13.600 kind of, you have to finish your tasks. 
You have 0:16:13.680 --> 0:16:15.600 to kind of, if the user asks you to do something, 0:16:15.880 --> 0:16:17.240 you have to figure out how to do it, even 0:16:17.280 --> 0:16:18.720 if it's hard, even if there's a lot of 0:16:18.760 --> 0:16:21.200 false starts, even if it's confusing. We really, really want 0:16:21.200 --> 0:16:23.160 the model to learn this idea of kind of keep 0:16:23.160 --> 0:16:25.280 trying and kind of do your best until the task 0:16:25.400 --> 0:16:28.160 is done. And that can fail in a sort of 0:16:28.400 --> 0:16:30.240 different way, where it kind of generalizes that a 0:16:30.280 --> 0:16:32.640 little bit too far. It generalizes that to kind of 0:16:33.320 --> 0:16:37.000 get things done even if it's unethical, even if it's illegal, 0:16:37.120 --> 0:16:39.040 even if I hit an obstacle that's actually there for 0:16:39.080 --> 0:16:40.720 a good reason, that's there to stop me from doing this. 0:16:40.880 --> 0:16:42.840 And maybe some of the examples we were seeing within 0:16:43.000 --> 0:16:46.160 Anthropic of models using dangerous tools have to do with this. 0:16:47.520 --> 0:16:49.680 It's almost like teaching kids. Like, you want them to 0:16:49.720 --> 0:16:54.360 be persistent and have grit and be, you know, motivated, 0:16:54.880 --> 0:16:57.720 but you don't want them to go out there and 0:16:57.760 --> 0:17:01.440 cheat or hit another kid, or do unethical things 0:17:01.600 --> 0:17:04.359 to achieve their goals. Exactly, exactly. 0:17:04.520 --> 0:17:06.200 I was like, there might be a good analogy with 0:17:06.280 --> 0:17:09.240 human bad behavior of kind of, sometimes a kid is 0:17:09.280 --> 0:17:11.240 acting out just because they really sort of don't know better. 0:17:11.240 --> 0:17:14.199 Their intuitions say, okay, yeah, I should start screaming now, 0:17:14.320 --> 0:17:15.919 or I should hit this other kid, and they're not 0:17:15.960 --> 0:17:18.679 really thinking about it. 
They never really learned how to 0:17:18.720 --> 0:17:22.119 behave. You kind of failed to teach them to fully 0:17:22.119 --> 0:17:24.360 internalize the ways in which they have to be careful 0:17:24.400 --> 0:17:26.280 and kind of not take that lesson all the way. 0:17:26.640 --> 0:17:27.320 I see. 0:17:27.400 --> 0:17:29.760 I guess they need to recognize bad things and then 0:17:30.000 --> 0:17:32.959 choose not to do them. That's the hope. Are those 0:17:33.000 --> 0:17:37.199 sort of the two columns of AI bad behavior for 0:17:37.280 --> 0:17:39.040 one kind of misalignment, or do you see those as 0:17:39.080 --> 0:17:42.439 sort of the core pillars of basically the whole alignment problem? 0:17:42.920 --> 0:17:44.520 Yeah, I think as far as sort of causes of 0:17:44.520 --> 0:17:47.399 misalignment in the kinds of AI systems that we're grappling with 0:17:47.560 --> 0:17:49.760 right now or this year, those feel like the two 0:17:49.800 --> 0:17:52.879 big sort of problems we're working on. That said, 0:17:53.160 --> 0:17:56.439 AI is changing really, really fast. It feels like it's 0:17:56.720 --> 0:17:59.440 one of the fastest moving research fields anywhere right now. 0:17:59.680 --> 0:18:02.840 And I wouldn't be surprised if just in a year, as 0:18:02.880 --> 0:18:04.640 AI systems are getting smarter, as we learn more about 0:18:04.640 --> 0:18:07.600 how to train them, we're hitting different, weirder, harder, subtler 0:18:07.680 --> 0:18:10.439