WEBVTT - Monologue: Don't Be Scared Of Sora 0:00:03.160 --> 0:00:08.039 Media. All right, Matt, I've read the YouTube comments and 0:00:08.039 --> 0:00:09.639 this time I want it so you do not cut 0:00:09.720 --> 0:00:12.040 me off with the music too fast. Okay, good, right, 0:00:12.039 --> 0:00:15.080 all right, let's go. This is this week's Better Offline monologue. 0:00:15.080 --> 0:00:28.320 And I'm Ed Zitron. A lot of you have been 0:00:28.360 --> 0:00:30.320 saying you want me to do something about Sora, and 0:00:30.400 --> 0:00:32.240 if I'm honest, I haven't wanted to because I find 0:00:32.280 --> 0:00:35.000 the whole thing so utterly pathetic. A few weeks ago, 0:00:35.080 --> 0:00:38.200 OpenAI launched a half-baked social networking app attached 0:00:38.200 --> 0:00:41.520 to a compute-intensive video and audio generator, and people 0:00:41.520 --> 0:00:44.600 immediately began to do two things: freak out and generate 0:00:44.640 --> 0:00:47.879 as many copyright violations as humanly possible, all because 0:00:47.960 --> 0:00:50.880 OpenAI's original plan was to ask copyright holders to 0:00:50.920 --> 0:00:53.760 opt out of having their content presented in these videos. 0:00:54.400 --> 0:00:57.720 Sora spent several days covered in Nazi SpongeBobs and Pikachus 0:00:57.720 --> 0:01:00.600 with guns before multiple Hollywood talent agencies, along with 0:01:00.640 --> 0:01:04.000 the estate of Martin Luther King Jr., intervened and complained, 0:01:04.080 --> 0:01:07.000 leading to OpenAI creating, to quote NPR, an opt-in 0:01:07.040 --> 0:01:10.319 policy allowing all artists, performers, and individuals the right 0:01:10.360 --> 0:01:13.080 to determine how and whether they can be simulated, with 0:01:13.200 --> 0:01:15.840 OpenAI blocking the generation of well-known characters on 0:01:15.840 --> 0:01:18.319 its public feed and offering to take down material not 0:01:18.360 --> 0:01:21.200 in compliance.
It's unclear what happened with Nintendo, but I 0:01:21.200 --> 0:01:24.440 imagine one of their seventy million lawyers attacked. And now 0:01:24.440 --> 0:01:26.440 we've got that out of the way, let's talk about 0:01:26.480 --> 0:01:29.560 Sora itself. I understand a lot of the people who 0:01:29.640 --> 0:01:32.080 listen are in film and TV, and they're kind of scared. And 0:01:32.120 --> 0:01:34.040 I understand that you've seen a few clips that look 0:01:34.200 --> 0:01:36.840 kind of sort of realistic, and that this, especially if 0:01:36.880 --> 0:01:39.480 you're in the creative arts, is quite terrifying because your 0:01:39.520 --> 0:01:41.920 mind naturally assumes that these clips can be strung together 0:01:41.959 --> 0:01:45.160 into some sort of coherent whole. This isn't the case. 0:01:45.720 --> 0:01:48.360 Every single good, and I use the term loosely, Sora 0:01:48.520 --> 0:01:51.760 video is cherry-picked from many, many, many terrible generations. 0:01:52.200 --> 0:01:54.560 Every time you use Sora, it's random. It doesn't matter 0:01:54.560 --> 0:01:56.880 how specific your prompt is or however many times you've 0:01:56.960 --> 0:01:59.960 used it. Sora is effectively a giant video and audio slot machine. 0:02:00.960 --> 0:02:04.560 You can never, ever guarantee that Sora will generate something useful, 0:02:04.760 --> 0:02:07.160 and as a result, you can never really budget for using it. 0:02:07.600 --> 0:02:11.040 The human eye is remarkably demanding, and little visual inconsistencies 0:02:11.080 --> 0:02:14.760 between scenes will make people feel weird and uncomfortable. Imagine 0:02:14.760 --> 0:02:17.079 that extrapolated to ten or fifteen seconds at a time, 0:02:17.080 --> 0:02:19.200 and how difficult it will be to get something that 0:02:19.240 --> 0:02:21.519 makes visual sense before you have to think about things 0:02:21.560 --> 0:02:23.480 like, does this connect to the rest of the footage 0:02:23.520 --> 0:02:28.200 I'm using?
Okay, so the majority of actual professionals who 0:02:28.240 --> 0:02:30.440 would use Sora would not be using the app. They'll 0:02:30.480 --> 0:02:33.280 be connecting directly to the model on OpenAI's API. 0:02:33.639 --> 0:02:37.960 It's just, it's not done via a classical app interface. Now, 0:02:38.360 --> 0:02:40.600 then there's the problem of cost. This is where you 0:02:40.639 --> 0:02:43.600 really need to start worrying if you're building things with Sora. 0:02:44.200 --> 0:02:47.840 So let's start off with the first problem: cost. So 0:02:48.240 --> 0:02:51.920 OpenAI offers two different Sora models: Sora 2, which 0:02:51.919 --> 0:02:54.480 they say is designed for speed and flexibility and is 0:02:54.560 --> 0:02:57.799 ideal for the exploration phase, and that costs ten cents 0:02:57.840 --> 0:02:59.959 per second, and then there's Sora 2 Pro, which is 0:03:00.160 --> 0:03:03.160 either thirty cents or fifty cents a second depending on resolution, 0:03:03.639 --> 0:03:05.239 and, I quote, it's the thing you go to for 0:03:05.480 --> 0:03:10.160 production-quality outputs. So you're either spending one, three, or 0:03:10.200 --> 0:03:12.680 five dollars for every ten seconds of footage. And like 0:03:12.760 --> 0:03:16.000 every generative model, the longer you generate, the higher the 0:03:16.040 --> 0:03:18.560 likelihood of hallucinations, which in the case of Sora means 0:03:18.560 --> 0:03:22.480 bizarre animations, inconsistent details, or just flat-out useless crap. 0:03:23.120 --> 0:03:26.720 Then there's the problem of time. OpenAI's own documentation 0:03:26.840 --> 0:03:29.800 says that a single render may take several minutes. At 0:03:29.800 --> 0:03:31.920 the end of those several minutes, out pops a video 0:03:31.960 --> 0:03:34.440 that may or may not be of any use.
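As a rough sketch of the pricing arithmetic above: the snippet below multiplies the per-second rates quoted in the monologue by a clip length to reach the one, three, and five dollar figures. The tier labels and the `clip_cost_cents` helper are made up for this illustration and are not OpenAI API identifiers; prices are held in whole cents to keep the arithmetic exact.

```python
# Per-second prices as quoted in the monologue, in US cents.
# The dictionary keys are descriptive labels chosen for this sketch,
# not official model identifiers.
CENTS_PER_SECOND = {
    "Sora 2": 10,
    "Sora 2 Pro (lower resolution)": 30,
    "Sora 2 Pro (higher resolution)": 50,
}


def clip_cost_cents(cents_per_second: int, seconds: int) -> int:
    """Return the price of one generation attempt, in cents."""
    return cents_per_second * seconds


if __name__ == "__main__":
    for label, rate in CENTS_PER_SECOND.items():
        dollars = clip_cost_cents(rate, 10) / 100
        print(f"{label}: ${dollars:.2f} per 10-second attempt")
```

This is the price of a single attempt, charged whether or not the resulting clip is usable.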
OpenAI 0:03:34.480 --> 0:03:37.280 allows you to remix using more prompts, which allows 0:03:37.400 --> 0:03:41.080 some iterative development, but these remixes also cost money and 0:03:41.160 --> 0:03:44.320 also take several minutes. So let me walk you through 0:03:44.360 --> 0:03:47.280 a scenario. You're making a short film. Let's just say 0:03:47.280 --> 0:03:50.240 it's fifteen minutes long, which is nine hundred seconds. You 0:03:50.280 --> 0:03:52.360 ask Sora to generate a man putting on a hat. 0:03:52.480 --> 0:03:55.560 Your first eight generations, each taking four minutes and five 0:03:55.600 --> 0:03:58.800 dollars apiece, which takes about thirty-two minutes and forty dollars, 0:04:00.000 --> 0:04:01.800 don't really do the job, so you do two more, 0:04:01.880 --> 0:04:04.920 taking another four minutes apiece and ten more dollars. You 0:04:05.000 --> 0:04:08.160 finally, on the next try, get something kind of useful, 0:04:08.320 --> 0:04:11.040 which costs you another five dollars, and then you realize 0:04:11.080 --> 0:04:13.280 you wanted him to wear a specific kind of hat. 0:04:13.440 --> 0:04:16.200 This happens all the time when directing stuff. There are 0:04:16.360 --> 0:04:18.800 minor changes you make that you realize, when you're finally 0:04:18.839 --> 0:04:22.800 in the moment, would look or sound or be better. So, yeah, 0:04:22.920 --> 0:04:26.680 that doesn't go so well with probabilistic models. So shit, fuck, 0:04:26.720 --> 0:04:29.719 you gotta do something, so you remix: another four minutes, 0:04:29.760 --> 0:04:33.400 another five dollars. Fuck. Wrong hat. Four minutes, five dollars. Right 0:04:33.440 --> 0:04:36.479 hat, his hand blends through it for some reason. Okay, 0:04:36.600 --> 0:04:39.440 four minutes, five dollars. The hat's right, but when he 0:04:39.480 --> 0:04:41.640 puts it on, his eye blinks.
One of his eyes 0:04:41.760 --> 0:04:44.400 just blinks three times for some reason, so you can't 0:04:44.400 --> 0:04:47.599 really use that. Okay, four minutes, five dollars. Looks kind 0:04:47.600 --> 0:04:51.560 of good. Different hat again. Four minutes, five dollars. Hmmm. 0:04:52.400 --> 0:04:55.160 You've now spent eighty dollars and over an hour generating 0:04:55.160 --> 0:04:56.800 a man trying to put on a hat. You're not 0:04:56.839 --> 0:05:00.640 really much closer to having useful footage. And because, as 0:05:00.640 --> 0:05:03.800 you remix it again and again, it keeps making these little errors, 0:05:03.800 --> 0:05:06.360 because that's how these models go, it's impossible to tell 0:05:06.400 --> 0:05:08.480 whether the next generation will be the one that works 0:05:08.520 --> 0:05:10.960 or whether Sora will spit out some new little fuck-up. 0:05:12.320 --> 0:05:15.159 So the more intricate something is, the more expensive it gets. 0:05:15.440 --> 0:05:18.200 But you know what, you can find money places. You 0:05:18.279 --> 0:05:21.200 can't find more goddamn time. I guess you could have 0:05:21.240 --> 0:05:24.520 a separate computer running more, but that's still gonna cost 0:05:24.520 --> 0:05:26.960 a bunch of money. How many of these slot machines 0:05:27.000 --> 0:05:29.720 are you gonna run at once? How many times are 0:05:29.760 --> 0:05:31.440 you going to allow them to edit? How can you 0:05:31.480 --> 0:05:34.400 have a coherent vision when you've got multiple people generating things? 0:05:34.720 --> 0:05:37.680 You can't. But you know what, perhaps, perhaps the next 0:05:37.760 --> 0:05:40.880 generation will be great, or perhaps it will be dogshit. 0:05:41.080 --> 0:05:43.280 You have no way to know, because that's the magic 0:05:43.320 --> 0:05:47.080 of generative AI. Yet these problems compound aggressively once you 0:05:47.120 --> 0:05:50.680 need any kind of visual consistency.
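The hat scenario can be tallied the same way. This is a minimal sketch assuming, as the monologue does, that every attempt costs five dollars and takes four minutes. The stage labels and the `project` helper are illustrative names, and the per-film extrapolation at the end is a hypothetical added here, not a figure from the episode.

```python
DOLLARS_PER_ATTEMPT = 5   # ten seconds at the fifty-cents-a-second rate
MINUTES_PER_ATTEMPT = 4   # the scenario's stand-in for "several minutes"

# (stage, number of attempts) as narrated in the scenario.
STAGES = [
    ("first generations that don't do the job", 8),
    ("two more tries", 2),
    ("the kind-of-useful take", 1),
    ("remixes chasing the right hat", 5),
]


def project(attempts: int) -> tuple[int, int]:
    """Return (dollars, minutes) for a given number of attempts."""
    return attempts * DOLLARS_PER_ATTEMPT, attempts * MINUTES_PER_ATTEMPT


total_attempts = sum(count for _, count in STAGES)
total_dollars, total_minutes = project(total_attempts)

if __name__ == "__main__":
    print(f"{total_attempts} attempts: ${total_dollars}, {total_minutes} minutes")

    # Hypothetical extrapolation: a 900-second film is 90 ten-second clips.
    # Assuming (purely for illustration) each clip needs the same 16 attempts:
    clips = 900 // 10
    dollars, minutes = project(clips * total_attempts)
    print(f"{clips} clips: ${dollars}, about {minutes / 60:.0f} hours of rendering")
```

The tally comes to 16 attempts, 80 dollars, and 64 minutes, which matches the "eighty dollars and over an hour" in the scenario. The attempt count per clip is the one number nobody can know in advance, which is why it is a parameter rather than a constant.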
The man now has 0:05:50.720 --> 0:05:52.440 to put the hat on and leave the house. How 0:05:52.480 --> 0:05:54.840 does the house look? Is the hat the same? Does 0:05:54.880 --> 0:05:57.360 he have wallpaper on his walls? Is there anyone else 0:05:57.400 --> 0:05:59.880 in the house? What kind of table? Two chairs, one chair, 0:06:00.080 --> 0:06:02.560 five chairs? How do you possibly keep all of these 0:06:02.600 --> 0:06:06.680 things consistent? You don't. You can't. That's part of what 0:06:06.800 --> 0:06:10.760 makes Sora so goddamn awful. It's built specifically to make 0:06:10.839 --> 0:06:14.400 you scared of them, to create superficially impressive clips, so 0:06:14.400 --> 0:06:17.280 that brain-dead Hollywood executives can claim it's the future. 0:06:17.360 --> 0:06:19.800 Yet in a practical sense, it's impossible to budget or 0:06:19.880 --> 0:06:22.919 plan or guarantee anything about what Sora might do. And 0:06:23.000 --> 0:06:26.120 this is pretty much across the board for these generative 0:06:26.160 --> 0:06:30.080 models making video and audio. Now, I've heard from a 0:06:30.080 --> 0:06:33.000 few people that Sora is cheaper because it doesn't involve labor, 0:06:33.160 --> 0:06:35.200 which is something you could say only if you believed 0:06:35.200 --> 0:06:38.520 Sora would give consistent outputs. And really, the only thing 0:06:38.560 --> 0:06:42.479 that a probabilistic model like Sora can do is guarantee inconsistency. 0:06:43.080 --> 0:06:46.240 Even by Hollywood accounting standards, a generative tool that will 0:06:46.279 --> 0:06:49.279 cost hundreds or thousands of dollars to generate ten seconds 0:06:49.279 --> 0:06:52.640 of shitty footage that is impossible to coherently connect to 0:06:52.640 --> 0:06:56.320 more footage is a really terrible idea, and also very 0:06:56.800 --> 0:07:00.320 inconsistent in its costs too. And like I said earlier, 0:07:00.320 --> 0:07:03.919 there's the issue of time.
Every single entertainment product requires 0:07:03.920 --> 0:07:06.360 some sort of time budgeting, and it's impossible to say 0:07:06.400 --> 0:07:09.120 how long it will take Sora to generate something. OpenAI 0:07:09.120 --> 0:07:12.120 doesn't even specify what several minutes means, meaning you 0:07:12.120 --> 0:07:15.560 can't really plan a production using it. Sora isn't cheaper, 0:07:15.640 --> 0:07:19.960 Sora isn't easier, and Sora certainly isn't more efficient. But 0:07:20.720 --> 0:07:23.120 you need to remember also that generative video models have 0:07:23.200 --> 0:07:25.920 been around for over a year and they're not really 0:07:26.000 --> 0:07:29.800 seeing mass use now. If this thing were capable of 0:07:29.800 --> 0:07:32.920 making anything truly useful, you'd see it everywhere right now. 0:07:33.040 --> 0:07:34.520 But you are seeing a little bit of it. And 0:07:34.560 --> 0:07:37.280 I do want to address that. You probably saw Kalshi's 0:07:37.320 --> 0:07:39.680 ad and heard that it cost two thousand dollars 0:07:39.720 --> 0:07:41.280 to make and took only a few days, but I 0:07:41.320 --> 0:07:43.760 really encourage you to look at the actual commercial itself. 0:07:44.080 --> 0:07:48.080 It's completely incoherent nonsense. Each shot completely disconnected, with weird 0:07:48.120 --> 0:07:51.400 glitches and animations in the crowds, and at one point towards 0:07:51.400 --> 0:07:53.480 the end, a woman is meant to say OKC, 0:07:53.480 --> 0:07:56.000 but the C part does not map to her mouth. 0:07:56.400 --> 0:07:58.680 It looks really bad, and the only way you could 0:07:58.720 --> 0:08:00.960 get away with something like this is having these quick-hit 0:08:01.040 --> 0:08:04.280 shots. And also, please go and view the comments 0:08:04.280 --> 0:08:06.440 about this. People just rip the fuck out of 0:08:06.440 --> 0:08:09.320 this thing.
But nevertheless, it was made using Veo 3, 0:08:09.360 --> 0:08:12.520 Google's generative video model, and it apparently took three hundred 0:08:12.520 --> 0:08:15.600 to four hundred clips to get fifteen usable shots stitched 0:08:15.600 --> 0:08:19.320 together using traditional editing tools. Now, the reason this cost 0:08:19.320 --> 0:08:21.080 two grand is that it sucked. And the reason you're 0:08:21.120 --> 0:08:23.800 not seeing more advertisers do this is because it's impossible 0:08:23.800 --> 0:08:26.640 to make a coherent video out of this footage. I 0:08:26.720 --> 0:08:29.880 realize most commercials you see on TV may feel chaotic 0:08:30.000 --> 0:08:32.680 or kind of bland, but they're remarkably precise, and the 0:08:32.720 --> 0:08:35.800 generative shots used for the Kalshi commercial are chaotic and 0:08:35.840 --> 0:08:38.960 fail to convey any real meaning beyond a person yelling 0:08:39.000 --> 0:08:42.600 Indiana or OKC. The only reason it cost so little 0:08:42.720 --> 0:08:45.199 was one guy put several days of prompting into 0:08:45.360 --> 0:08:48.520 it, and the end result was shitty. Kalshi didn't 0:08:48.559 --> 0:08:51.400 mind because this was a publicity move. Kalshi put 0:08:51.440 --> 0:08:54.040 out the commercial specifically so the media would write it up, 0:08:54.040 --> 0:08:56.360 and they succeeded because the media loves to feed on 0:08:56.440 --> 0:08:59.480 scary stories like AI is going to replace human actors. 0:09:00.320 --> 0:09:03.360 Since the Kalshi ad, PJ Ace, who made it, has made 0:09:03.360 --> 0:09:06.040 a few others: a Popeyes wrap one where, again, go 0:09:06.120 --> 0:09:07.560 and look at the comments. I'm not linking to it, 0:09:07.600 --> 0:09:08.760 by the way, I don't want to send them any 0:09:08.760 --> 0:09:10.880 fucking traffic. But the Popeyes one, people are just 0:09:10.880 --> 0:09:14.240 responding saying, this looks like shit, what is this?
It's incoherent, 0:09:14.240 --> 0:09:18.240 it's inconsistent. But the funniest one I found was David 0:09:18.240 --> 0:09:21.760 Beckham's IM8 health supplement ad, which ends with a shot 0:09:21.800 --> 0:09:23.320 of the bottle of the product with a bunch of 0:09:23.360 --> 0:09:27.000 garbled generative text. It does not appear that PJ Ace has 0:09:27.080 --> 0:09:29.720 got a ton more work than this, probably because the 0:09:29.760 --> 0:09:32.320 outputs kind of suck and brands really do not like 0:09:32.360 --> 0:09:37.600 inconsistent things. And also, a fucking health supplement from David Beckham. 0:09:37.679 --> 0:09:40.800 Jesus Christ, just say it's a private equity firm. Anyway, 0:09:41.360 --> 0:09:43.400 to conclude, I also want to be clear that the 0:09:43.480 --> 0:09:46.120 rates for these videos are heavily subsidized by big tech, 0:09:46.280 --> 0:09:49.240 just like every other generative AI product. Sora 0:09:49.360 --> 0:09:52.040 might cost thirty or fifty cents a second right now. 0:09:52.360 --> 0:09:54.680 Once the AI bubble bursts, these prices will either 0:09:54.720 --> 0:09:58.040 skyrocket or these models will cease to exist for public consumption. 0:09:58.679 --> 0:10:00.680 The biggest clue I can give you is Google only 0:10:00.720 --> 0:10:03.480 allows you to generate four or five Veo 3 videos 0:10:03.520 --> 0:10:05.319 a day on their two hundred and fifty dollars a 0:10:05.400 --> 0:10:09.319 month Gemini Ultra plan. That suggests that Google's video costs 0:10:09.320 --> 0:10:11.320 are brutal and that OpenAI is burning money by 0:10:11.320 --> 0:10:13.000 the bucket to let you fuck around on the 0:10:13.040 --> 0:10:15.559 Sora app. I don't recommend you do that, but if 0:10:15.600 --> 0:10:17.560 you have, just know you're burning a hole in Clammy 0:10:17.559 --> 0:10:20.320 Sammy's pocket. I will add that you may worry about 0:10:20.320 --> 0:10:23.120 these models getting better.
While they might get more nuanced 0:10:23.160 --> 0:10:25.439 in their ability to generate video in five or ten 0:10:25.520 --> 0:10:28.520 second bursts, generating longer or consistent videos 0:10:28.720 --> 0:10:32.080 is inherently impossible due to the probabilistic nature of transformer-based 0:10:32.120 --> 0:10:35.240 models. In simple terms, these things are rolling the 0:10:35.280 --> 0:10:38.199 dice every time. The way you prompt them is what 0:10:38.280 --> 0:10:41.520 makes them generate, and they don't have minds or thoughts. 0:10:41.920 --> 0:10:45.480 They're just rolling the dice every time on whatever you 0:10:45.520 --> 0:10:48.679 say and trying to interpret what you mean. Human beings, 0:10:48.720 --> 0:10:51.920 by the way, are extremely magical. I think you really 0:10:52.040 --> 0:10:55.760 underestimate how amazing people are when we direct someone on 0:10:55.800 --> 0:10:59.280 a film set. Even, like, an assistant director: that person 0:10:59.360 --> 0:11:01.719 keeps the production moving and makes sure everyone gets what 0:11:01.760 --> 0:11:03.760 they need and pushes back on a director when something 0:11:03.800 --> 0:11:06.480 might be impractical. A director is a visionary, but also 0:11:06.679 --> 0:11:10.520 an actor is someone that takes interpretation and then is 0:11:10.600 --> 0:11:13.320 directed to do different things. But that direction is not 0:11:13.360 --> 0:11:16.960 a fucking prompt. Move your elbow, look, look this way, 0:11:17.040 --> 0:11:19.840 look that way. The things that operate on a film 0:11:19.920 --> 0:11:23.280 or TV set are inherently different to just plugging words 0:11:23.320 --> 0:11:27.320 into a fucking model, and I get it. I get 0:11:27.360 --> 0:11:30.000 everyone in Hollywood who's scared right now. I get everyone 0:11:30.040 --> 0:11:33.440 in creative, in creative arts even, who is scared right now. 0:11:33.760 --> 0:11:37.560 I feel for you. These people are losing.
These people 0:11:37.720 --> 0:11:42.480 are losing. This stuff does not work, it's inconsistent, it's 0:11:42.520 --> 0:11:46.200 incredibly expensive on subsidized rates, and in the end, I 0:11:46.320 --> 0:11:49.280 really, really believe that once the bubble pops, these things 0:11:49.320 --> 0:11:52.800 are going away. Thank you so much for listening. Reach 0:11:52.840 --> 0:11:54.719 out if you have any thoughts. I always love to 0:11:54.760 --> 0:11:58.360 hear from people: ez@betteroffline.com. 0:11:58.600 --> 0:12:01.280 I love getting your emails. I love getting your, your 0:12:01.280 --> 0:12:04.960 weird little missives on Reddit. I really am, I'm truly blessed, 0:12:05.720 --> 0:12:07.800 and I love you all. I love how many of 0:12:07.800 --> 0:12:10.600 you listen. I love how communicative you are. It's been 0:12:10.600 --> 0:12:14.319 a big week with the Anthropic exclusive, and yeah, I'm 0:12:14.320 --> 0:12:17.040 gonna have already a Better Offline next week as well. Crap, 0:12:17.040 --> 0:12:21.040 I've gotta go do an episode. Shit, damn. Oh well, 0:12:21.080 --> 0:12:22.719 I have the best job in the world. Anyway, thank 0:12:22.760 --> 0:12:23.360 you for listening.