WEBVTT - Exclusive: How GPT-5 Actually Works 0:00:02.800 --> 0:00:03.560 Ze Media. 0:00:05.320 --> 0:00:07.880 Hi, my name's Ed Zitron, and welcome to Better Offline. 0:00:07.880 --> 0:00:22.280 This is also Jackass. So you've just had a cheery 0:00:22.280 --> 0:00:25.080 two-part chucklefest about how generative AI may tank our 0:00:25.239 --> 0:00:27.319 markets and our economy. So I'm going to give you 0:00:27.320 --> 0:00:30.319 a lighter one, an episode about GPT-5, which is 0:00:30.360 --> 0:00:33.080 a model from OpenAI, and why just under three 0:00:33.159 --> 0:00:35.360 years of hype have led to the software equivalent of 0:00:35.360 --> 0:00:38.239 the launch of St. Anger, except every time Lars hit 0:00:38.280 --> 0:00:41.800 the snare drum, it cost them fifty-five thousand dollars. Now, 0:00:41.800 --> 0:00:44.480 if we look at the positive reviews, we see takes 0:00:44.600 --> 0:00:48.000 ranging from Simon Willison's tempered remark that GPT-5 is 0:00:48.240 --> 0:00:51.280 "just good at stuff" to SemiAnalysis's, and this is completely 0:00:51.320 --> 0:00:54.800 insane, statement that GPT-5 is setting the stage for 0:00:54.880 --> 0:00:59.920 ad monetization and the OpenAI GPT, ChatGPT super app, 0:01:00.120 --> 0:01:02.440 in a piece that makes several assertions about how the 0:01:02.520 --> 0:01:05.520 router that underpins GPT-5 is somehow the secret way 0:01:05.600 --> 0:01:09.959 that OpenAI will inject ads, which is just distinctly silly. 0:01:10.080 --> 0:01:13.400 I'll get into this in the episode a little bit, 0:01:13.400 --> 0:01:15.480 but with everything you're going to hear, you're going 0:01:15.560 --> 0:01:18.160 to realize that this is just someone just saying stuff. 0:01:18.200 --> 0:01:21.120 Took four bylines to do that shit too. I'm also British. 0:01:21.120 --> 0:01:23.080 I'm gonna say router. I might say router as well, 0:01:23.120 --> 0:01:24.760 because I've been here a while. 
Make fun of my 0:01:24.840 --> 0:01:27.399 voice if you really must. But with that out the way, 0:01:27.440 --> 0:01:30.640 here's a quote from SemiAnalysis's coverage: "Before the router, 0:01:30.720 --> 0:01:32.640 there was no way for a query to be distinguished, 0:01:32.680 --> 0:01:35.880 and after the router, the first low-value query could 0:01:35.920 --> 0:01:38.679 be routed to a GPT-5 mini model that can 0:01:38.760 --> 0:01:41.959 answer with zero tool calls and no reasoning. This likely 0:01:41.959 --> 0:01:44.160 means serving this user is approaching the cost of a 0:01:44.200 --> 0:01:48.120 search query." This does not make any sense. None 0:01:48.120 --> 0:01:50.480 of this makes... like, it's just a bunch of assumptions. 0:01:50.600 --> 0:01:53.120 Why would this be the case? The article also makes 0:01:53.120 --> 0:01:54.840 a lot of claims about the value of a question 0:01:54.920 --> 0:01:58.440 and how ChatGPT could, I am serious, a... agent... 0:01:59.000 --> 0:02:02.000 agentically reach out to lawyers. I'm not going to edit 0:02:02.040 --> 0:02:04.560 that out, because "agentically" is not a fun word to say. 0:02:05.640 --> 0:02:07.760 It is just complete nonsense, and in fact, I'm not 0:02:07.840 --> 0:02:11.320 sure this piece reflects how GPT-5 even works at all. Again, 0:02:11.400 --> 0:02:14.520 quoting it: "The router serves multiple purposes on both the 0:02:14.560 --> 0:02:17.320 cost and performance side. On the cost side, routing users 0:02:17.320 --> 0:02:19.400 to mini versions of each model allows OpenAI to 0:02:19.440 --> 0:02:22.480 service users at a lower cost." Or with lower costs, 0:02:22.520 --> 0:02:25.160 even. To be fair on SemiAnalysis, it's not as 0:02:25.200 --> 0:02:27.920 if OpenAI gave them much help. 
OpenAI's official 0:02:27.919 --> 0:02:31.520 writings about the router aren't exactly filled with details, talking 0:02:31.560 --> 0:02:34.000 in glowing terms about what it does, but not how. 0:02:34.480 --> 0:02:38.440 Here's what they say: "ChatGPT's real-time router quickly 0:02:38.480 --> 0:02:41.760 decides which model to use based on the conversation type, complexity, 0:02:41.800 --> 0:02:44.640 tool needs, and your explicit intent, for example, if you 0:02:44.720 --> 0:02:47.520 say 'think hard about this' in the prompt. The router 0:02:47.600 --> 0:02:51.480 is continuously trained on real signals, including when users switch models, 0:02:51.520 --> 0:02:56.200 preference rates for responses, and measured correctness, improving over time. 0:02:56.600 --> 0:02:59.120 Once usage limits are reached, a mini version of each 0:02:59.160 --> 0:03:02.040 model handles remaining queries. In the near future, we plan 0:03:02.080 --> 0:03:05.280 to integrate these capabilities into a single model." And that 0:03:05.360 --> 0:03:08.359 last bit really doesn't make sense, but in any case, 0:03:08.400 --> 0:03:11.760 the launch of GPT-5 has been very, very weird. At first, 0:03:11.760 --> 0:03:14.120 some people seemed really happy about it, chief of them 0:03:14.120 --> 0:03:16.640 software YouTuber Theo Browne, who has over four hundred 0:03:16.680 --> 0:03:19.520 and sixty-eight thousand subscribers. He's also known as theogg, 0:03:19.760 --> 0:03:20.560 who said: 0:03:20.840 --> 0:03:24.400 I didn't know it could get this good. This was 0:03:24.520 --> 0:03:29.280 kind of the, like, "oh fuck" moment for me in 0:03:29.320 --> 0:03:33.040 a lot of ways, and I've had to fight, like, 0:03:33.120 --> 0:03:38.560 a slow spiral into insanity. It's a really, really good model. 0:03:39.600 --> 0:03:41.120 He finished by saying: 0:03:41.120 --> 0:03:42.960 And keep an eye on your job, because I don't know 0:03:43.000 --> 0:03:44.840 what this means for us long term. 
0:03:45.360 --> 0:03:48.480 Pretty crazy, right? Comments on the video included people saying 0:03:48.520 --> 0:03:51.200 things like, if OpenAI is holding you hostage, blink 0:03:51.280 --> 0:03:54.200 twice (and yes, that is a verbatim quote). Another saying, 0:03:54.240 --> 0:03:57.040 this dude, is everything wrong in it today? Another saying, 0:03:57.080 --> 0:03:59.600 this video was sponsored by OpenAI. Another saying, 0:03:59.800 --> 0:04:02.360 GPT-5 failed every test project I gave it today. 0:04:02.440 --> 0:04:04.640 It's a lie in my experience. Maybe they haven't ramped 0:04:04.720 --> 0:04:08.040 up the GPUs. Now, from what I can tell, Theo 0:04:08.160 --> 0:04:10.800 Browne played with GPT-5 in OpenAI's offices and 0:04:10.800 --> 0:04:14.640 did all the benchmarking there. OpenAI, by the way... 0:04:14.880 --> 0:04:19.520 fucking, how? Come on. You can't benchmark in their offices. Anyway, 0:04:19.560 --> 0:04:22.599 OpenAI's API-based access to GPT-5 models, you 0:04:22.640 --> 0:04:24.000 know, the thing that you use if you want to 0:04:24.000 --> 0:04:26.720 integrate GPT into your app, does not route them, by 0:04:26.760 --> 0:04:29.000 the way, nor does OpenAI offer access to its 0:04:29.080 --> 0:04:32.440 router or any associated models. Important detail. Just want you 0:04:32.480 --> 0:04:34.400 to know that, because we need to make that very 0:04:34.400 --> 0:04:37.080 clear. Now, a week later, Theo Browne would put out 0:04:37.120 --> 0:04:39.680 another video called "I was wrong about GPT-5," which 0:04:39.960 --> 0:04:41.560 he would open by saying: 0:04:41.880 --> 0:04:43.760 So first and foremost, I want to make sure it 0:04:43.800 --> 0:04:47.359 is very, very clear that the experience that you probably 0:04:47.400 --> 0:04:50.000 are having with ChatGPT and GPT-5 right now 0:04:50.400 --> 0:04:52.760 is not the experience that I had when I was 0:04:52.760 --> 0:04:53.600 first testing it. 
0:04:53.960 --> 0:04:55.880 Browne goes on to explain that he was not paid 0:04:55.880 --> 0:04:59.120 by OpenAI at all, that he was sincerely impressed 0:04:59.120 --> 0:05:01.599 by the company and GPT-5, and that he'd actually 0:05:01.680 --> 0:05:04.200 spent over twenty-five thousand dollars in inference testing it 0:05:04.240 --> 0:05:06.720 on his own company's software, and indeed also that he 0:05:06.800 --> 0:05:10.280 turned down a grand appearance fee. Sorry, I mean, that's 0:05:10.320 --> 0:05:13.160 a very British thing: a one-thousand-dollar appearance fee, not 0:05:13.240 --> 0:05:16.160 just, like, a really nice one. Browne claims he asked 0:05:16.160 --> 0:05:18.240 OpenAI to try it out, and after they declined 0:05:18.279 --> 0:05:20.240 to let him test it early on his own, he 0:05:20.360 --> 0:05:22.159 was invited to try it on camera with a small 0:05:22.200 --> 0:05:24.679 group of other people at OpenAI's offices, where they'd film 0:05:24.720 --> 0:05:27.919 his reactions. He said that the API was incredible, but 0:05:28.000 --> 0:05:30.039 that it's become apparent that the models he was using 0:05:30.080 --> 0:05:31.799 in the video were not the same as those released 0:05:31.839 --> 0:05:34.200 to the public, making a post on August thirteenth on 0:05:34.440 --> 0:05:37.000 X, the Everything App, that GPT-5 was nowhere near as 0:05:37.040 --> 0:05:39.360 good in Cursor as it 0:05:39.440 --> 0:05:40.960 was when he was using it a few weeks ago, 0:05:41.040 --> 0:05:43.760 complaining that things that worked while demoing it at 0:05:43.800 --> 0:05:47.159 OpenAI no longer did, adding that there was somebody 0:05:47.160 --> 0:05:49.680 else on Twitter that said they'd had a similarly great 0:05:49.720 --> 0:05:53.560 experience with GPT-5 on launch that has since decayed. 
It 0:05:53.640 --> 0:05:55.880 isn't completely clear what happened here, but I'm going to 0:05:55.880 --> 0:05:58.040 guess that OpenAI showed Theo Browne and others in 0:05:58.080 --> 0:06:01.200 their offices some sort of heavily modded version of the 0:06:01.200 --> 0:06:04.560 model that burns significantly more compute to provide its outputs, 0:06:04.680 --> 0:06:07.599 though I'm also very suspicious of how significant the difference 0:06:07.640 --> 0:06:11.040 is here. Browne's videos attempt to show the difference between 0:06:11.080 --> 0:06:12.840 the generations that you received from the model when it 0:06:12.880 --> 0:06:14.920 was good and when it was bad, in this video, 0:06:15.160 --> 0:06:17.000 which I'll include a link to in the episode notes. 0:06:17.000 --> 0:06:20.280 But if I'm honest, they look pretty similar, in that 0:06:20.279 --> 0:06:23.440 they're kind of mediocre. I'm not saying that as a hater, 0:06:23.480 --> 0:06:25.120 by the way. They just kind of look like shit. 0:06:26.000 --> 0:06:28.080 It's just kind of okay, like, shit. They look like 0:06:28.160 --> 0:06:31.240 regular fucking generated websites. They don't look special. The good 0:06:31.279 --> 0:06:34.839 one is fine, and the bad one has weird gradients 0:06:34.880 --> 0:06:37.919 on it. This whole thing sucks, though, and was a 0:06:37.960 --> 0:06:41.000 clear setup by OpenAI to overstate the abilities 0:06:41.000 --> 0:06:43.320 of GPT-5, one that fell apart with the lightest 0:06:43.320 --> 0:06:46.480 brush with reality. I imagine their assumption was that Browne 0:06:46.480 --> 0:06:48.720 would post the glossy video and then walk away, and 0:06:48.760 --> 0:06:51.320 I give Theo some credit for straight up stating he 0:06:51.360 --> 0:06:53.919 was misled. This was a desperate move, and one that 0:06:53.960 --> 0:06:56.000 blew up in the face of OpenAI, along with 0:06:56.040 --> 0:06:58.919 the rest of the GPT-5 launch. 
People hate the model, 0:06:59.000 --> 0:07:01.960 customers are mad at OpenAI for taking models away, like 0:07:02.040 --> 0:07:04.560 4o, and have remained mad even with their return, and 0:07:04.600 --> 0:07:07.919 the ChatGPT subreddit is almost entirely people complaining about 0:07:08.320 --> 0:07:11.320 how ineffective the new version is and how even 0:07:11.360 --> 0:07:13.760 GPT-4o is not the same. They got gamer 0:07:13.760 --> 0:07:16.640 brain, baby. As I said in last week's monologue, I 0:07:16.680 --> 0:07:18.800 believe OpenAI has grown a fandom rather than any 0:07:18.840 --> 0:07:21.880 kind of sustainable product-market fit, and they're now suffering 0:07:21.920 --> 0:07:24.520 fandom-like hate with every minor change they make in 0:07:24.520 --> 0:07:27.680 an attempt to push GPT-5 further, further aggravating people 0:07:27.680 --> 0:07:30.640 that barely understand why they use the product to begin with. Yet 0:07:30.760 --> 0:07:33.720 at the center of the anger lay the reason for 0:07:33.800 --> 0:07:36.520 GPT-5's launch, the belief that this was somehow a 0:07:36.520 --> 0:07:39.240 cost-cutting measure, where OpenAI had added a router to 0:07:39.280 --> 0:07:41.920 ChatGPT as a means of sending certain requests to cheaper 0:07:41.920 --> 0:07:45.080 models to save money. But when I hear router, I 0:07:45.160 --> 0:07:47.680 hear latency, and I never for even a second believed 0:07:47.720 --> 0:07:49.760 that this would somehow be cheaper to run. It didn't 0:07:49.760 --> 0:07:52.720 make sense. I'm a curious little critter, so I went 0:07:52.760 --> 0:07:55.920 and found out how ChatGPT-5 actually works, and 0:07:56.040 --> 0:07:59.160 unlike the following incredible products that you should buy, it's 0:07:59.200 --> 0:08:12.679 actually kind of a big piece of shit. And we're back, 0:08:13.120 --> 0:08:14.960 and from here on out, I will define two things: 
0:08:15.000 --> 0:08:17.720 GPT-5, referring to the model and its associated mini 0:08:17.720 --> 0:08:20.400 and nano models, and ChatGPT-5, referring to the 0:08:20.400 --> 0:08:23.520 current state of ChatGPT, which features Auto, Fast, 0:08:23.560 --> 0:08:27.120 Thinking, and Thinking Mini model selections. You also can 0:08:27.160 --> 0:08:30.239 see legacy models, but that's not what we're talking about today, 0:08:30.240 --> 0:08:32.760 and that's also only for a little bit. It's a 0:08:32.800 --> 0:08:34.959 distinction I have to make, by the way, and make early, 0:08:34.960 --> 0:08:37.480 because the two things are different, they work in different ways, 0:08:37.480 --> 0:08:40.600 and ChatGPT-5's structure induces a bunch of 0:08:40.600 --> 0:08:43.600 trade-offs and downsides that, as I'll discuss later, make this 0:08:43.640 --> 0:08:47.320 whole thing even more wasteful. In discussions with a source 0:08:47.360 --> 0:08:50.360 at an infrastructure provider familiar with the architecture, it appears 0:08:50.400 --> 0:08:53.320 that ChatGPT-5 is in fact potentially more expensive 0:08:53.320 --> 0:08:55.679 to run than previous models, and due to the complex 0:08:55.679 --> 0:08:58.200 and chaotic nature of said architecture, can at times burn 0:08:58.320 --> 0:09:02.400 upwards of double the tokens per query. Tokens, for those 0:09:02.400 --> 0:09:04.560 who don't know, are basically chunks of text that the 0:09:04.600 --> 0:09:08.000 AI models do stuff with. I'm simplifying this. Do not 0:09:08.120 --> 0:09:11.600 email me to correct some minor thing. Nobody cares. A 0:09:11.679 --> 0:09:14.320 sentence like "the quick brown fox jumps over the lazy 0:09:14.360 --> 0:09:17.160 dog" will be broken into lots of smaller four-character chunks. 0:09:17.400 --> 0:09:19.720 There are different kinds of tokens, and they're all priced differently. 
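To put rough numbers on the token point, here is a minimal sketch. It assumes the episode's own simplification (fixed four-character chunks, which is not how a real tokenizer works) and made-up prices in arbitrary units; `rough_chunks`, `query_cost`, `PRICE_IN` and `PRICE_OUT` are hypothetical names for illustration, not anything from OpenAI.

```python
def rough_chunks(text, size=4):
    """Split text into fixed-size character chunks (a crude stand-in for tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def query_cost(input_tokens, output_tokens, price_in, price_out):
    """Cost of one query when input and output tokens are priced separately."""
    return input_tokens * price_in + output_tokens * price_out


sentence = "The quick brown fox jumps over the lazy dog"
chunks = rough_chunks(sentence)

# Hypothetical prices in arbitrary units per token (assumptions, not real rates).
PRICE_IN, PRICE_OUT = 1, 8

# Same question and answer, once as-is and once with double the tokens burned.
baseline = query_cost(len(chunks), 100, PRICE_IN, PRICE_OUT)
doubled = query_cost(2 * len(chunks), 200, PRICE_IN, PRICE_OUT)

print(len(chunks), baseline, doubled)
```

Under these assumptions, burning twice the tokens on a query doubles what that query costs to serve, whatever the actual prices are.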
0:09:20.080 --> 0:09:22.120 An input token refers to the data you send to 0:09:22.160 --> 0:09:24.280 the model when you ask a question. Output tokens are 0:09:24.360 --> 0:09:26.199 used to measure the size of its response, with bigger 0:09:26.200 --> 0:09:30.240 responses requiring more tokens. The more tokens you burn per query, 0:09:30.280 --> 0:09:32.480 the more expensive it is to run that query. The 0:09:32.520 --> 0:09:35.560 fact that ChatGPT-5 can, in certain circumstances, burn 0:09:35.600 --> 0:09:37.920 twice the number of tokens per query means that every 0:09:38.000 --> 0:09:41.839 question costs more. ChatGPT is also significantly more convoluted, 0:09:41.840 --> 0:09:45.280 plagued by latency issues, and is more compute-intensive thanks 0:09:45.280 --> 0:09:49.319 to OpenAI's new, "smarter, more efficient" model routing system. 0:09:50.040 --> 0:09:52.880 In simpler terms, every user prompt on ChatGPT, whether 0:09:52.920 --> 0:09:55.920 it's in Auto, Fast, Thinking, or Thinking Mini, starts by 0:09:55.920 --> 0:09:59.120 putting the user's prompt before the static prompt. I don't 0:09:59.160 --> 0:10:01.480 want to lose you here. This is important. A static 0:10:01.480 --> 0:10:04.079 prompt is the invisible instructions given by OpenAI to 0:10:04.160 --> 0:10:07.080 ChatGPT, and the models themselves and the tools associated 0:10:07.160 --> 0:10:09.800 with them, to tell them how to operate. Instructions like, 0:10:09.840 --> 0:10:12.199 you are ChatGPT, you're a large language model, you're 0:10:12.200 --> 0:10:14.720 a helpful chatbot, do not threaten them with a knife, 0:10:14.720 --> 0:10:17.280 and so on and so forth. These static prompts are 0:10:17.280 --> 0:10:19.480 different with each model you use. A reasoning model will 0:10:19.480 --> 0:10:22.400 have a different instruction set to a more chat-focused one, 0:10:22.440 --> 0:10:24.760 such as: think harder about a particular problem before giving 0:10:24.800 --> 0:10:27.760 an answer. 
Break down problems into component answers. When you 0:10:27.840 --> 0:10:30.200 get a certain thing, like if someone asks you a 0:10:30.240 --> 0:10:33.080 coding question, query a coding tool. That kind of thing. 0:10:33.760 --> 0:10:35.800 A user prompt is exactly what it sounds like: the 0:10:35.840 --> 0:10:37.760 thing that a user wants the AI model to do. 0:10:38.320 --> 0:10:40.560 The new order in ChatGPT-5 becomes an issue 0:10:40.600 --> 0:10:43.080 when you use multiple different models in the same conversation, 0:10:43.160 --> 0:10:45.199 because the router, the thing that selects the right model 0:10:45.200 --> 0:10:47.520 for the request, has to look at the user prompt. 0:10:47.760 --> 0:10:50.800 It can't consider static instructions first, because they may be 0:10:50.840 --> 0:10:53.920 different based on what the user asked. In fact, the 0:10:54.120 --> 0:10:56.000 order has to be flipped for the whole thing to work. 0:10:56.679 --> 0:11:00.240 Put simpler, previous versions of ChatGPT would take the 0:11:00.240 --> 0:11:03.360 static prompt and then invisibly append the user prompt onto it. 0:11:03.400 --> 0:11:06.080 This static prompt would typically be cached, massively reducing the 0:11:06.080 --> 0:11:08.040 amount of compute the model needs to perform a task. 0:11:08.559 --> 0:11:12.400 ChatGPT-5 cannot do this. Every time you use 0:11:12.440 --> 0:11:15.480 ChatGPT-5, every single thing you say or do can 0:11:15.520 --> 0:11:17.880 cause it to do something different. Attach a file? Might 0:11:17.880 --> 0:11:20.080 need a different model. Ask it to look into something 0:11:20.120 --> 0:11:22.600 and be detailed? Might trigger a reasoning model or a 0:11:22.600 --> 0:11:26.600 different depth of reasoning. Ask a question in a weird way? 
Sorry, 0:11:26.600 --> 0:11:27.880 the router is going to need to send you to 0:11:27.880 --> 0:11:30.800 a different model entirely, each time coming up with new 0:11:30.800 --> 0:11:33.839 instructions based on the subtle interpretation of what you asked it. 0:11:34.559 --> 0:11:36.600 Every single thing that can happen when you ask 0:11:36.640 --> 0:11:39.280 ChatGPT to do something may trigger the router to change model 0:11:39.400 --> 0:11:41.559 or request a new tool, and each time it does 0:11:41.600 --> 0:11:44.680 so requires a completely fresh static prompt, regardless of whether 0:11:44.679 --> 0:11:46.920 you select Auto, Thinking, Fast, or any other option on 0:11:47.040 --> 0:11:50.400 ChatGPT. This in turn requires it to expend more 0:11:50.400 --> 0:11:53.640 compute, with queries consuming more tokens compared to previous versions. 0:11:54.960 --> 0:11:56.640 It's like you started a job, and every time you 0:11:56.720 --> 0:11:58.800 do a task (write an email, make a cup of coffee, 0:11:58.920 --> 0:12:03.440 attend a meeting, email someone with a threat), your workplace 0:12:03.480 --> 0:12:06.640 requires you to complete the entire mandatory onboarding training first. 0:12:06.760 --> 0:12:08.800 Want to edit a spreadsheet? Not before you brush up 0:12:08.800 --> 0:12:13.040 on your anti-bribery legislation first, you prick. As a result, 0:12:13.120 --> 0:12:16.160 ChatGPT may be smart, but it doesn't really seem 0:12:16.160 --> 0:12:20.320 efficient in the GPT-5 version. Now, to play devil's advocate, 0:12:20.480 --> 0:12:22.840 OpenAI likely added the routing model as a means 0:12:22.840 --> 0:12:25.440 of creating a more sophisticated output for a user, and, 0:12:25.520 --> 0:12:28.959 I imagine, with the intention of cost saving. Then again, 0:12:29.000 --> 0:12:30.800 this might just be the thing it had to ship. 
0:12:30.920 --> 0:12:32.760 After all, GPT-5 was meant to be the next 0:12:32.840 --> 0:12:35.000 great leap in AI, and the pressure was on to 0:12:35.040 --> 0:12:37.480 get it out the door. By creating a system that 0:12:37.520 --> 0:12:41.040 depends on an external routing model, likely another LLM 0:12:41.080 --> 0:12:43.280 in this case, OpenAI has removed the ability to 0:12:43.280 --> 0:12:46.200 cache the hidden instructions that dictate how the models 0:12:46.240 --> 0:12:50.840 generate answers in ChatGPT, creating massive infrastructural overhead. Worse still, 0:12:51.000 --> 0:12:53.880 this happens with every single turn, as in message, on 0:12:53.960 --> 0:12:56.880 ChatGPT-5, regardless of the model you choose, creating 0:12:57.000 --> 0:12:59.800 endless infrastructural baggage with no real way out, that only 0:12:59.800 --> 0:13:02.880 compounds based on how complex the user's queries get 0:13:02.920 --> 0:13:05.280 or how much they change. They could be simple, but 0:13:05.400 --> 0:13:08.560 just going in different directions every time. Could OpenAI 0:13:08.679 --> 0:13:10.800 make a better router? Sure. Does it have a good 0:13:10.840 --> 0:13:13.959 one today? No. Every time you message ChatGPT, it has the 0:13:13.960 --> 0:13:16.640 potential to change model or tooling based on its own whims, 0:13:16.760 --> 0:13:19.200 each time requiring a fresh static prompt, and short of 0:13:19.480 --> 0:13:22.240 totally reworking the architecture of ChatGPT-5, there's no 0:13:22.280 --> 0:13:25.280 way to change this. And if it's an LLM choosing 0:13:25.320 --> 0:13:28.640 which model, I don't know, maybe it hallucinates. Just a guess. 0:13:29.400 --> 0:13:30.840 It doesn't even need to be the case where a 0:13:30.920 --> 0:13:33.560 user asks ChatGPT-5 to think, and based on 0:13:33.600 --> 0:13:36.480 my tests with GPT-5, sometimes you can just ask 0:13:36.480 --> 0:13:38.800 it a straightforward question and it will think about it. 
0:13:38.800 --> 0:13:41.840 For no apparent reason. OpenAI has created a product 0:13:41.840 --> 0:13:45.680 with latency issues and an overwhelmingly convoluted routing system that's 0:13:45.720 --> 0:13:48.560 already straining capacity, to the point that this announcement feels 0:13:48.640 --> 0:13:51.880 like OpenAI is walking away from its API entirely. This, 0:13:52.000 --> 0:13:53.880 as a reminder, is the thing that people use to 0:13:53.920 --> 0:13:56.800 incorporate OpenAI's models into their apps, while also running 0:13:56.800 --> 0:13:59.560 said models on the infrastructure OpenAI rents from Microsoft 0:14:00.040 --> 0:14:02.400 and CoreWeave, at some point, as well as Oracle. 0:14:03.200 --> 0:14:05.600 And this API thing is really weird, by the way, 0:14:05.640 --> 0:14:08.559 because these are new models, but OpenAI's really not 0:14:08.600 --> 0:14:11.760 talking about the models themselves that much. Unlike the 0:14:11.840 --> 0:14:14.840 GPT-4o announcement, which mentions the API in the first paragraph, 0:14:14.920 --> 0:14:17.440 the GPT-5 announcement has no reference to it, and 0:14:17.520 --> 0:14:19.720 only has a single reference to developers at all, when 0:14:19.760 --> 0:14:22.560 talking about coding. Sam Altman has already hinted that he 0:14:22.640 --> 0:14:25.680 intends to deprecate any new API demand, though I imagine 0:14:25.680 --> 0:14:27.920 it will let in anyone who will pay for priority processing, 0:14:27.960 --> 0:14:31.400 which is essentially OpenAI's way to require minimum commitments 0:14:31.400 --> 0:14:34.040 and extra payments from API customers just so they never 0:14:34.120 --> 0:14:37.200 feel the bite of any compute shortages and throttling, which 0:14:37.200 --> 0:14:40.520 they absolutely will do to people that don't pay. 
0:14:40.520 --> 0:14:43.000 ChatGPT-5 feels like the ultimate comeuppance for a company 0:14:43.000 --> 0:14:45.040 that has never been forced to build a product, choosing 0:14:45.120 --> 0:14:48.200 instead to bolt increasingly complex tools onto the side of 0:14:48.280 --> 0:14:51.280 models in the hopes that one will magically appear. Now, 0:14:51.360 --> 0:14:53.880 each and every feature of ChatGPT burns more money 0:14:53.880 --> 0:14:56.760 than it ever did before. ChatGPT-5 feels like 0:14:56.800 --> 0:14:58.600 a product that was rushed to market by a desperate 0:14:58.600 --> 0:15:00.680 company that had to get something out the door. In 0:15:00.720 --> 0:15:04.120 simpler terms, here, it's actually really funny. When I worked 0:15:04.160 --> 0:15:07.200 this out, I chuckled. I chuckled vigorously. This is just 0:15:07.240 --> 0:15:10.200 a case where OpenAI has given ChatGPT a middle manager. 0:15:10.960 --> 0:15:12.640 But now I'm giving you the chance to open up 0:15:12.680 --> 0:15:15.680 your hearts and do something better. Open up your wallets too, 0:15:15.680 --> 0:15:18.800 and send money to a company that follows. Here, 0:15:19.000 --> 0:15:38.280 behold my advertisements. And we're back. Like every great middle manager, 0:15:38.480 --> 0:15:41.280 ChatGPT-5's router creates more work based on its 0:15:41.320 --> 0:15:43.840 own interpretation of what's going on, and as a separate 0:15:43.920 --> 0:15:45.960 large language model, I can't imagine it has a ton 0:15:46.000 --> 0:15:48.520 of training data available. If I had to guess, and 0:15:48.560 --> 0:15:51.080 this is a guess, by the way, OpenAI has done, 0:15:51.120 --> 0:15:53.160 and will do, a lot of fine-tuning and reinforcement 0:15:53.240 --> 0:15:55.680 learning to make it work. Though, to give it a 0:15:55.680 --> 0:15:57.640 little grace, this is a new thing that it's doing, 0:15:57.680 --> 0:16:01.840 and it's doing so at a huge scale. 
The problems start, 0:16:01.880 --> 0:16:03.600 by the way, with the fact that ChatGPT-5 0:16:03.680 --> 0:16:06.280 is taking the user's initial prompt and then deciding which 0:16:06.280 --> 0:16:09.720 model to use, unlike previous models, which sent your prompt 0:16:09.760 --> 0:16:11.920 directly to the model along with the static prompt, which 0:16:11.960 --> 0:16:13.880 was cached and came first, an important feature in how 0:16:13.960 --> 0:16:17.080 these models limit token burn. OpenAI starts with a router 0:16:17.160 --> 0:16:20.400 model that takes what you ask and give 0:16:20.480 --> 0:16:22.560 ChatGPT and tags it based on what kind of 0:16:22.640 --> 0:16:25.400 thing your question might need. The thing might be a tool, 0:16:25.480 --> 0:16:27.400 such as whether it has to do a web search 0:16:27.480 --> 0:16:30.360 to spit out the thing at the end, a reasoning model, 0:16:30.520 --> 0:16:32.360 whether it needs to use a coding language, and so 0:16:32.520 --> 0:16:35.760 on and so forth. Once ChatGPT has bounced your 0:16:35.800 --> 0:16:38.800 query across various models, burning compute along the way, 0:16:39.040 --> 0:16:41.600 it then pushes it towards the chat portion of the generation. 0:16:42.080 --> 0:16:44.480 And each time you ask ChatGPT a question or 0:16:44.600 --> 0:16:47.520 to do something, a new specialized static prompt is generated, 0:16:47.800 --> 0:16:50.920 sometimes several, making it impossible to cache them in advance. 0:16:51.240 --> 0:16:53.520 In simpler terms, each time you message it, ChatGPT 0:16:53.640 --> 0:16:56.760 has to dump all cached information and instructions for what 0:16:56.800 --> 0:16:59.120 it needs to do and reload it with each prompt. 
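The caching point can be sketched in a few lines. This is a toy illustration, assuming a cache that can only reuse the leading part of a request that matches an earlier request (prefix caching). The whitespace "tokens", the `STATIC_PROMPT` text and the helper names are all made up for the example and are not OpenAI's.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two requests have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Hypothetical static prompt; splitting on whitespace is a toy stand-in for tokenization.
STATIC_PROMPT = "You are a helpful chatbot . Follow the house rules .".split()


def static_first(user_prompt):
    """Older ordering described above: static prompt, then the user prompt."""
    return STATIC_PROMPT + user_prompt.split()


def user_first(user_prompt):
    """Ordering described for ChatGPT-5: user prompt, then the static prompt."""
    return user_prompt.split() + STATIC_PROMPT


turn_1 = "what is a token"
turn_2 = "write me a limerick"

reusable_static_first = shared_prefix_len(static_first(turn_1), static_first(turn_2))
reusable_user_first = shared_prefix_len(user_first(turn_1), user_first(turn_2))

print(reusable_static_first, reusable_user_first)
```

With the static prompt first, the whole static prompt is a reusable prefix across the two requests. With the user prompt first, the requests diverge at the very first token, so under this assumption nothing carries over.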
0:16:59.520 --> 0:17:02.120 Now here's some examples of what ChatGPT-5 has 0:17:02.200 --> 0:17:04.879 to reload every single time you prompt it: whether or 0:17:04.880 --> 0:17:06.560 not to use a browser or search the internet, and 0:17:06.640 --> 0:17:09.200 under what conditions to do so, because they will change 0:17:09.200 --> 0:17:12.040 with each prompt; how to approach a particular problem based 0:17:12.080 --> 0:17:14.439 on what the user asked, including any specific ways it's 0:17:14.480 --> 0:17:16.840 meant to answer (tone, brevity, and so on) based on 0:17:16.920 --> 0:17:20.840 their request; specifics around how it might use, say, 0:17:20.880 --> 0:17:23.800 OpenAI's code interpreter, such as the usage rules for running 0:17:23.800 --> 0:17:25.920 a Python script, or how you want the code's output, 0:17:25.960 --> 0:17:28.359 which again will be different based on each prompt. And 0:17:28.520 --> 0:17:30.199 you can even say, do it in exactly the 0:17:30.200 --> 0:17:32.919 same way, and because it's a large language model, it 0:17:32.960 --> 0:17:37.480 may hallucinate something different. Every single goddamn time you prompt 0:17:37.560 --> 0:17:40.520 ChatGPT-5, it has to do this. Worse still, 0:17:40.560 --> 0:17:43.480 a particular conversation can involve you using multiple different models 0:17:43.520 --> 0:17:47.119 and tools, requiring it, with each and every prompt, 0:17:47.119 --> 0:17:49.639 to inject a different static prompt for each component that 0:17:49.720 --> 0:17:52.800 ChatGPT-5 uses. 
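As a rough sketch of why the instructions change from turn to turn: the toy router below picks components from keywords in the user prompt, and each component brings its own static prompt. Everything here is invented for illustration (the `INSTRUCTIONS` text, the keyword rules, the function names). The router described above is reportedly a model, not a keyword match.

```python
# Hypothetical instruction snippets, keyed by component (all invented for illustration).
INSTRUCTIONS = {
    "chat": "Answer briefly and directly.",
    "reasoning": "Think through the problem step by step before answering.",
    "web_search": "Search only when the question needs fresh facts.",
    "code_interpreter": "Run Python in the sandbox and report the output.",
}


def route(user_prompt):
    """Toy keyword router: choose the components a prompt seems to need."""
    text = user_prompt.lower()
    components = ["reasoning" if "think hard" in text else "chat"]
    if "latest" in text or "search" in text:
        components.append("web_search")
    if "python" in text or "script" in text:
        components.append("code_interpreter")
    return components


def build_request(user_prompt):
    """User prompt first, then one static prompt per selected component."""
    return [user_prompt] + [INSTRUCTIONS[name] for name in route(user_prompt)]


turns = [
    "What is a token?",
    "Think hard about the latest GPU prices.",
    "Write a Python script to sort a list.",
]

bundles = [tuple(route(turn)) for turn in turns]
for turn, bundle in zip(turns, bundles):
    print(turn, bundle)
```

Three turns in one conversation produce three different instruction bundles, so each request has to be assembled after the user prompt has been read.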
And you can't cache the static 0:17:52.800 --> 0:17:54.760 prompt before the user's intent, because if you did that, 0:17:55.040 --> 0:17:57.040 it might send an instruction to a model that doesn't 0:17:57.040 --> 0:17:59.199 make sense, such as telling a reasoning model to give 0:17:59.200 --> 0:18:01.840 a quick and simple one-line answer, or a mini or nano model to 0:18:01.880 --> 0:18:04.000 do some sort of deep reasoning, which would create a 0:18:04.000 --> 0:18:07.920 crappy answer and burn tokens in the process. And this 0:18:07.960 --> 0:18:10.040 is all thanks to the complicated way that OpenAI 0:18:10.160 --> 0:18:14.400 insisted on building GPT-5. Every single time you send 0:18:14.480 --> 0:18:16.399 something to ChatGPT, it can trigger it to use a 0:18:16.560 --> 0:18:21.199 different series of models (audio, vision, reasoning), each with their 0:18:21.240 --> 0:18:24.680 own instructions, static prompts, all while pulling different tools, each 0:18:24.720 --> 0:18:27.359 requiring their own instructions based on what you asked, and 0:18:27.440 --> 0:18:30.679 reasoning models even have different depths of reasoning. Unlike 0:18:30.720 --> 0:18:33.800 4o, which is a multimodal model combining text, vision, 0:18:33.800 --> 0:18:36.399 and voice, GPT-5 is a rat king of OpenAI's 0:18:36.440 --> 0:18:38.720 models and tools that gets reborn every single time you 0:18:38.760 --> 0:18:41.640 ask it to do anything. It can prompt-cache 0:18:41.720 --> 0:18:45.199 some things, but the core instructions, not so much. But 0:18:45.280 --> 0:18:47.600 let's get a little more granular, because I know I've 0:18:47.720 --> 0:18:51.480 been quite repetitive, but this is detailed. So from what 0:18:51.520 --> 0:18:53.879 I've been told, there are either one or two models 0:18:53.880 --> 0:18:55.639 at work for the routing. 
I'm going to go with 0:18:55.680 --> 0:18:57.600 what I think is most likely based on the discussions 0:18:57.600 --> 0:19:00.640 I've had with people familiar with the architecture. I've heard 0:19:00.680 --> 0:19:04.040 the term "orchestrator" thrown around, potentially suggesting the 0:19:04.119 --> 0:19:06.840 router may be more omnipresent throughout the process, but I 0:19:06.880 --> 0:19:09.479 was unable to confirm its existence. Reach out if you 0:19:09.480 --> 0:19:12.480 hear differently. I'll explain things as they were explained to me, though. 0:19:13.080 --> 0:19:15.760 When a user sends a prompt, it goes through the Splitter leg, 0:19:15.760 --> 0:19:18.480 which decides to send the query on one of two paths. 0:19:18.760 --> 0:19:21.399 One is called the fast path, where a query is straightforward, 0:19:21.400 --> 0:19:24.240 such as a text-only conversation that doesn't require any 0:19:24.400 --> 0:19:27.399 analysis or extra tools, or thinking, a path where the 0:19:27.440 --> 0:19:30.679 query may require reasoning or more complex tools like code generation 0:19:30.800 --> 0:19:33.560 or access to a web browser for research. To be clear, 0:19:33.640 --> 0:19:35.639 there are prompts where it may be split into multiple 0:19:35.680 --> 0:19:38.320 paths that trigger multiple models or tools, each requiring their 0:19:38.320 --> 0:19:41.720