WEBVTT - Techstuff Classic: How MP3 Compression Works 0:00:04.160 --> 0:00:07.160 Get in touch with technology with TechStuff from How 0:00:07.240 --> 0:00:14.160 Stuff Works dot com. Hey everybody, and welcome to TechStuff. 0:00:14.200 --> 0:00:16.680 I'm Jonathan Strickland. I'm the host of the show, and 0:00:16.800 --> 0:00:20.439 this is a Saturday morning rerun episode where we take 0:00:20.440 --> 0:00:23.560 a classic episode of TechStuff and we present it 0:00:23.560 --> 0:00:25.760 to you guys who may have missed it. I've been 0:00:25.800 --> 0:00:29.000 talking a lot about tech and music recently. If you've 0:00:29.000 --> 0:00:31.360 been listening to the recent episodes, you know all about that, 0:00:31.920 --> 0:00:34.640 and there have been some great discussions. But it also 0:00:34.760 --> 0:00:38.760 requires a little bit of knowledge of previous episodes 0:00:38.880 --> 0:00:40.960 at times, and I know it can be tricky to 0:00:41.040 --> 0:00:44.559 dig through the archives. So in this classic episode, I 0:00:44.600 --> 0:00:48.919 talk about how the MP three compression format works, so 0:00:48.920 --> 0:00:51.559 that you can actually understand how MP three works as 0:00:51.560 --> 0:00:54.600 opposed to something like MIDI, and you can get an 0:00:54.600 --> 0:00:58.960 appreciation for the differences between the two formats. This episode 0:00:58.960 --> 0:01:02.320 originally published on January two thousand and seventeen. This is 0:01:02.360 --> 0:01:06.039 a whole year ago, more than that. Now we're in 0:01:06.080 --> 0:01:08.920 April twenty eighteen as I record this. I hope you 0:01:09.000 --> 0:01:11.120 enjoy this classic episode. I hope it gives you a 0:01:11.160 --> 0:01:15.959 deeper appreciation of the technical aspect of creating digital music, 0:01:16.360 --> 0:01:19.440 and I'll see you guys on the other side. 
So 0:01:19.560 --> 0:01:23.840 let's remember that the heart of digital information is the 0:01:23.959 --> 0:01:28.080 bit, that's either a zero or a one. The basic 0:01:28.360 --> 0:01:34.720 unit of information for digital formats: zeros and ones. Now 0:01:34.720 --> 0:01:36.840 we can use those zeros and ones to describe all 0:01:36.840 --> 0:01:41.120 sorts of information, from text to audio, to video and 0:01:41.480 --> 0:01:45.120 really pretty much anything you can think of that's represented digitally. Ultimately, 0:01:45.120 --> 0:01:46.680 when you get down to it, it's a bunch of 0:01:46.760 --> 0:01:49.840 zeros and ones. So let's say you start off with 0:01:49.880 --> 0:01:54.440 your uncompressed audio file. You've got this enormous audio file 0:01:54.480 --> 0:01:56.600 in front of you. It's made up of zeros and ones. 0:01:57.120 --> 0:02:00.480 How do you make that file smaller? So in the world, 0:02:00.480 --> 0:02:04.120 we can compress stuff, right? We can apply physical pressure 0:02:04.160 --> 0:02:07.400 to things. Think about packing a suitcase. You can make 0:02:07.400 --> 0:02:09.640 sure you get that extra outfit in if you just 0:02:09.919 --> 0:02:12.480 press it down hard enough and get that zipper zipped 0:02:12.480 --> 0:02:15.600 before it can burst open. But once you get to 0:02:15.639 --> 0:02:19.239 a certain level of compression, you cannot make things smaller, 0:02:19.480 --> 0:02:21.880 at least not without hurting yourself or whatever it is 0:02:21.919 --> 0:02:25.440 you're trying to compress. Digital files are a little different 0:02:25.760 --> 0:02:29.800 because you cannot physically cram the zeros and ones closer together. 0:02:29.880 --> 0:02:33.400 It doesn't work like that. These are abstract things. You 0:02:33.440 --> 0:02:36.840 can't make them smaller, right? You can't decrease the font. 0:02:36.960 --> 0:02:40.840 It doesn't work that way. The numbers represent two different states. 
0:02:41.400 --> 0:02:43.640 So if you want to create a smaller audio file 0:02:44.080 --> 0:02:47.280 containing the recording that was in a larger audio file, 0:02:47.760 --> 0:02:51.200 you have to start getting creative. Now, in the last 0:02:51.240 --> 0:02:53.720 part of this series, I talked about how the MP 0:02:53.800 --> 0:02:57.560 three compression algorithm was born from an applied research institution 0:02:57.600 --> 0:03:00.240 in Germany and the team behind the MP three 0:03:00.280 --> 0:03:03.239 wanted to find a way to compress audio, specifically music, 0:03:03.600 --> 0:03:08.280 for transmission over phone lines. Eventually this evolved into the 0:03:08.400 --> 0:03:13.000 Motion Pictures Expert Group audio Layer three compression methodology, better 0:03:13.080 --> 0:03:17.960 known as the MP three, and there's also MPEG two 0:03:18.000 --> 0:03:20.519 and MPEG four standards. MPEG two, by the way, is 0:03:20.560 --> 0:03:23.799 the basis of compression on DVDs, although the actual DVD 0:03:23.880 --> 0:03:28.240 format is really a modification of MPEG two. And MPEG 0:03:28.280 --> 0:03:30.600 four is a compression strategy for audio and video that's 0:03:30.639 --> 0:03:34.320 frequently used in lots of different capacities, including streaming 0:03:34.320 --> 0:03:38.480 media services. So by the late nineteen seventies, researchers began 0:03:38.560 --> 0:03:42.280 to explore the possibility of leveraging psychoacoustics to figure out 0:03:42.320 --> 0:03:46.640 how to compress audio. And psychoacoustics refers to the way 0:03:46.640 --> 0:03:51.360 we perceive sound, and also the physiological effects 0:03:51.400 --> 0:03:55.080 of sound on us. So this involves not just 0:03:55.160 --> 0:03:58.200 our physical sense of hearing, but also our brains and 0:03:58.240 --> 0:04:01.480 the way our brains interpret sound. 
So, for example, 0:04:01.760 --> 0:04:05.520 there's a psychoacoustic phenomenon that's called the Haas effect, H 0:04:05.680 --> 0:04:08.600 A A S. And I think it's pretty interesting. So 0:04:08.800 --> 0:04:11.240 here's how the Haas effect works. If you hear the 0:04:11.320 --> 0:04:16.320 exact same sound coming from different directions, but the two 0:04:16.320 --> 0:04:19.680 sounds arrive within thirty to forty milliseconds of each other, 0:04:20.080 --> 0:04:23.039 your brain will be convinced that you really only heard 0:04:23.080 --> 0:04:26.479 one sound and it came from the direction that hit 0:04:26.560 --> 0:04:30.240 you first. So let's say a sound's coming from directly 0:04:30.279 --> 0:04:32.720 in front of you and from your left, and you 0:04:33.080 --> 0:04:36.520 get both of them within that thirty to forty millisecond range, 0:04:37.360 --> 0:04:39.479 and you hear the one coming from ahead of you 0:04:39.560 --> 0:04:43.080 first. You're convinced that you only heard that 0:04:43.160 --> 0:04:46.119 sound once and it came from dead on straight ahead 0:04:46.160 --> 0:04:49.719 of you. Your brain kind of discounts the one that 0:04:49.800 --> 0:04:53.200 came off from the left, although it can reinforce it, 0:04:53.320 --> 0:04:55.560 which ends up being really useful if you're planning out 0:04:55.560 --> 0:04:58.320 PA systems for stage shows. I'm not joking. That 0:04:58.360 --> 0:05:01.120 really is the way that people plan those things out. 0:05:01.400 --> 0:05:04.120 It's pretty neat. Humans perceive sounds in a way that's 0:05:04.120 --> 0:05:08.240 not necessarily representational of all the sounds surrounding us. You 0:05:08.240 --> 0:05:11.640 can think of your brain as the filter between your 0:05:11.760 --> 0:05:15.719 understanding and what reality actually is. 
A lot of stuff 0:05:15.760 --> 0:05:18.640 goes on where it ends up getting rid of information 0:05:18.680 --> 0:05:21.080 that your brain just says, you know what, he or 0:05:21.120 --> 0:05:25.080 she doesn't need that, it's just gonna confuse things. We're 0:05:25.080 --> 0:05:28.440 gonna dump it. And that's kind of how it works. 0:05:28.480 --> 0:05:30.640 It's all on an unconscious level. It's not like you're 0:05:30.839 --> 0:05:34.960 actively working to do this. So let's say you're in 0:05:34.960 --> 0:05:37.320 a relatively busy hallway and there could be a lot 0:05:37.400 --> 0:05:40.839 of sounds in that hallway, stuff that's going on constantly 0:05:40.839 --> 0:05:44.080 around you. Maybe there are doors opening and closing. Maybe 0:05:44.080 --> 0:05:47.000 there are footsteps going up and down the hallway. Maybe someone's 0:05:47.120 --> 0:05:50.760 shoes are squeaking against the linoleum floor. People are chattering 0:05:50.800 --> 0:05:53.880 away in there. But you are having a conversation with someone, 0:05:54.279 --> 0:05:57.000 so you turn your focus on that person and other 0:05:57.080 --> 0:06:01.240 sounds seemingly fade away. They're still there, but they're not important. 0:06:01.839 --> 0:06:04.560 So in this example, you would actually call those other 0:06:04.640 --> 0:06:08.520 sounds a distraction and you would really focus on the conversation. 0:06:08.560 --> 0:06:13.040 That also shows how we're able to consciously direct our 0:06:13.120 --> 0:06:16.760 sense, our perception of hearing. So both of these factors 0:06:16.800 --> 0:06:20.159 come into play. Now, one thing that MP three encoding 0:06:20.200 --> 0:06:24.120 takes advantage of is something called masking, and there are 0:06:24.120 --> 0:06:27.160 a couple of different variations of the masking effect. One 0:06:27.200 --> 0:06:30.560 of them is called frequency masking. 
So let's say you've 0:06:30.600 --> 0:06:33.520 got two sound frequencies that are similar, perhaps they're just 0:06:33.560 --> 0:06:37.240 a few hertz apart. Remember, frequencies are measured in hertz, 0:06:37.720 --> 0:06:41.560 which is really the number of oscillations per second. So 0:06:41.680 --> 0:06:47.000 let's say you've got a sound that's at, I don't know, 0:06:47.400 --> 0:06:52.400 one thousand hertz, and another one that's at one 0:06:52.520 --> 0:06:56.599 thousand and ten hertz. Now, the human ear is 0:06:56.640 --> 0:07:00.080 precise enough to be able to tell the difference of 0:07:00.160 --> 0:07:02.840 two sounds that are at least two hertz apart from 0:07:02.880 --> 0:07:06.400 each other. That's how precise our resolution of hearing is, 0:07:06.480 --> 0:07:09.840 it's at that level. But if you get two sounds 0:07:09.880 --> 0:07:13.560 played at the same time and they are that close 0:07:13.600 --> 0:07:17.160 together in frequency, and one of those frequencies is played 0:07:17.160 --> 0:07:20.320 at a greater volume than the other, our brains will 0:07:20.320 --> 0:07:23.200 pick up on the louder sound and ignore the quieter sound, 0:07:23.280 --> 0:07:26.920 even though both of them are present. What becomes important 0:07:26.920 --> 0:07:29.560 at that point is the amplitude. Now, the further apart 0:07:29.600 --> 0:07:33.400 in frequencies you get, the less that has an effect. 0:07:33.520 --> 0:07:35.400 So if you get far enough apart where there are 0:07:35.400 --> 0:07:38.720 two pitches, one of them noticeably louder than the other, 0:07:39.080 --> 0:07:41.360 but they're far enough apart, you will hear both of them. 0:07:41.400 --> 0:07:44.600 It only works if the two pitches are relatively close together, 0:07:45.720 --> 0:07:48.600 and there's not a universal formula for frequency masking. 
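A toy sketch of the frequency masking idea just described may help. Since there is no universal formula for frequency masking, the two thresholds below (`max_gap_hz` and `min_level_gap_db`) are purely hypothetical numbers chosen for illustration, and `is_masked` is an invented helper, not part of any real MP three encoder.

```python
def is_masked(loud_hz, loud_db, quiet_hz, quiet_db,
              max_gap_hz=100.0, min_level_gap_db=10.0):
    """Toy rule of thumb: the quieter tone counts as masked only when it is
    both close in frequency to the louder tone and sufficiently quieter.

    Both thresholds are hypothetical; real psychoacoustic models vary them
    with frequency and are far more detailed.
    """
    close_in_frequency = abs(loud_hz - quiet_hz) <= max_gap_hz
    much_quieter = (loud_db - quiet_db) >= min_level_gap_db
    return close_in_frequency and much_quieter


# Two tones ten hertz apart, one much louder: the quiet one is masked.
print(is_masked(1000, 70, 1010, 50))
# Same loudness gap, but far apart in pitch: both tones are heard.
print(is_masked(1000, 70, 5000, 50))
```

The point of the sketch is only the shape of the decision: closeness in frequency and a gap in amplitude both have to hold before the quieter sound can be dropped.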
As 0:07:48.600 --> 0:07:51.560 you get closer to the boundaries of human hearing, frequency 0:07:51.600 --> 0:07:53.960 masking becomes easier. So if it's a really low pitch 0:07:54.040 --> 0:07:56.640 or a really high pitch, it's easier to get away 0:07:56.640 --> 0:07:59.400 with it. Once you start getting into what is 0:07:59.440 --> 0:08:02.040 thought of as the sweet spot for human hearing, which 0:08:02.080 --> 0:08:05.160 is generally considered to be between two and five kilohertz, 0:08:06.240 --> 0:08:10.240 you need a greater difference in volume or a smaller 0:08:10.280 --> 0:08:14.720 difference in frequency in order for masking to work. Frequency 0:08:14.760 --> 0:08:18.560 masking, at any rate. But then there's also temporal masking, 0:08:19.640 --> 0:08:21.920 and you might say, okay, I got it. Temporal, that 0:08:21.960 --> 0:08:26.080 means time. Indeed it does, my friend. This describes the 0:08:26.080 --> 0:08:29.080 effect of a short but loud sound masking a softer 0:08:29.160 --> 0:08:33.360 sound for a short time. Weird thing is the loud 0:08:33.400 --> 0:08:37.000 sound can actually mask sounds that precede it slightly, not 0:08:37.080 --> 0:08:39.800 by a whole lot, but a little bit. MP three 0:08:39.800 --> 0:08:43.920 compression takes advantage of both frequency and temporal masking when 0:08:43.920 --> 0:08:47.120 it's trying to determine which data needs to be included 0:08:47.200 --> 0:08:49.960 and which data can be dumped, because it won't affect 0:08:50.000 --> 0:08:52.880 your perception of whatever the audio file is in 0:08:52.920 --> 0:08:56.760 the first place. So you also probably remember I talked 0:08:56.760 --> 0:08:59.600 about the physical limitation to what we humans can hear, 0:08:59.800 --> 0:09:01.960 no matter what our brains might be up to. So 0:09:02.040 --> 0:09:04.440 this doesn't have to do with our brains, you know, 0:09:04.520 --> 0:09:07.280 filtering through the information that's coming in. 
This has to 0:09:07.320 --> 0:09:11.240 do with the physical limitations of the human ear. In 0:09:11.280 --> 0:09:14.240 the last episode of the series, I said typical human hearing, 0:09:14.880 --> 0:09:18.599 keep in mind typical, there are exceptions, covers the 0:09:18.720 --> 0:09:21.600 range of frequencies between about twenty hertz and twenty 0:09:21.679 --> 0:09:24.800 kilohertz, or twenty thousand hertz. So twenty to twenty thousand. 0:09:25.840 --> 0:09:30.360 Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right? 0:09:31.120 --> 0:09:33.679 And as you get older, your ability to perceive those 0:09:33.760 --> 0:09:38.080 higher frequencies starts to diminish. So most adults actually have 0:09:38.360 --> 0:09:44.480 an upper range closer to sixteen kilohertz, not twenty. Kids, 0:09:44.720 --> 0:09:46.920 they can hear those higher pitches. You may have heard 0:09:46.920 --> 0:09:51.480 the story about how some convenience stores experimented with getting 0:09:51.559 --> 0:09:57.280 rid of teenage loiterers by projecting out these 0:09:57.280 --> 0:10:00.760 super high pitches that adults could not hear but 0:10:00.920 --> 0:10:03.800 kids could, and it discouraged kids from hanging out at 0:10:03.800 --> 0:10:08.600 the convenience store and loitering. I love that idea 0:10:09.559 --> 0:10:12.959 so much. Anyway, that's because I'm old and my hearing 0:10:13.040 --> 0:10:16.920 is terrible. Well, remember I also mentioned you can detect 0:10:17.000 --> 0:10:19.760 changes in pitch at two hertz increments. If you get 0:10:19.880 --> 0:10:23.440 below two hertz in change, like, if it's just a 0:10:23.520 --> 0:10:27.760 one hertz difference between two frequencies, it's too low a 0:10:27.800 --> 0:10:30.080 resolution for us to detect. To us, it will sound 0:10:30.160 --> 0:10:34.599 exactly the same. 
So if you were to hear a 0:10:35.400 --> 0:10:40.400 frequency at one thousand one hertz, or one point zero 0:10:40.679 --> 0:10:43.960 zero one kilohertz, and one point zero zero two 0:10:44.160 --> 0:10:47.120 kilohertz, you wouldn't notice the difference. They would sound 0:10:47.120 --> 0:10:50.199 exactly the same to you. So if you're gonna take 0:10:50.200 --> 0:10:52.439 audio and compress it, one step you could consider is 0:10:52.480 --> 0:10:57.240 eliminating anything that's outside the actual range of frequencies that 0:10:57.280 --> 0:11:00.719 we can hear, or simplifying any changes in frequency that 0:11:00.760 --> 0:11:04.439 are smaller than two hertz. If you take all 0:11:04.440 --> 0:11:07.920 that data and you say it is physically impossible for 0:11:08.000 --> 0:11:11.479 a human to perceive this, get rid of that information, 0:11:11.600 --> 0:11:14.800 then in theory it wouldn't have any effect on the 0:11:14.880 --> 0:11:19.160 rest of the recording. But how do you go further than that, right? 0:11:19.240 --> 0:11:22.000 How do you create a method so that you can 0:11:22.040 --> 0:11:24.160 really compress this file? You want a method that will 0:11:24.160 --> 0:11:27.479 preserve the important sounds while potentially ignoring all the unimportant 0:11:27.559 --> 0:11:31.360 or incidental sounds. And you want it to be automatic, because 0:11:31.800 --> 0:11:34.920 if you have to do it manually, then that's going to take 0:11:35.679 --> 0:11:40.000 countless hours just to edit a single sound file. So 0:11:41.160 --> 0:11:44.360 that was the challenge that the MP three research team 0:11:44.400 --> 0:11:49.480 faced as a group. Now, their solution, which ultimately created 0:11:49.520 --> 0:11:51.800 even more challenges, was to come up with what was 0:11:51.920 --> 0:11:55.640 essentially a simulated human ear and brain. 
They needed to 0:11:55.679 --> 0:12:01.559 replicate the experience of perceiving music so that an algorithm 0:12:01.559 --> 0:12:05.720 could evaluate every sound in an audio file and judge 0:12:05.800 --> 0:12:08.719 if it in fact was relevant enough to include in the 0:12:08.720 --> 0:12:13.000 final compressed version. If a sound were imperceptible, then it 0:12:13.000 --> 0:12:15.520 wouldn't make sense to include it in the MP three file. 0:12:15.800 --> 0:12:18.080 So by leaving out all the irrelevant data, they could 0:12:18.160 --> 0:12:22.199 make the audio information take up less bandwidth. The file 0:12:22.240 --> 0:12:24.800 itself would be smaller because you just dumped everything that 0:12:24.880 --> 0:12:28.400 wasn't important. So the team used an algorithm called the 0:12:28.559 --> 0:12:33.760 low complexity Adaptive Transform Coding, or LC-ATC, 0:12:34.080 --> 0:12:36.520 as the foundation for their research. This was kind of 0:12:36.559 --> 0:12:40.319 their starting point, and this is an approach that 0:12:40.600 --> 0:12:43.800 tries to do away with redundancy as much as possible, 0:12:43.840 --> 0:12:48.520 and it also incorporates adaptation to perceptual requirements. Also, MP 0:12:48.640 --> 0:12:52.239 threes owe a lot to the MPEG Layer two standard. 0:12:52.800 --> 0:12:56.600 So Layer two obviously came out before Layer three, 0:12:56.760 --> 0:12:59.160 and so a lot of the features of Layer three 0:12:59.320 --> 0:13:04.800 are really legacy features from Layer two. 0:13:04.840 --> 0:13:07.040 In other words, the MP three group kind of got stuck 0:13:07.040 --> 0:13:09.600 with them because otherwise they would have had a problem 0:13:09.600 --> 0:13:12.880 with backwards compatibility. 
So the result is kind of a 0:13:12.960 --> 0:13:16.439 clunky arrangement under the hood, and some of the features 0:13:16.640 --> 0:13:19.600 may make very little sense when I go through them, 0:13:19.640 --> 0:13:21.839 but some of that is because it's a holdover from 0:13:21.840 --> 0:13:26.840 an earlier compression strategy, which isn't terribly satisfying as an answer. 0:13:26.880 --> 0:13:29.240 But the reason many parts of the MP three compression 0:13:29.280 --> 0:13:31.480 algorithm are the way they are is because that's the 0:13:31.480 --> 0:13:35.520 way we've always done it. So next I'm gonna dive 0:13:35.600 --> 0:13:41.240 into the phases of compression. But before I do that, 0:13:41.440 --> 0:13:44.160 let's all take a deep breath and take a moment 0:13:44.200 --> 0:13:55.880 to thank our sponsor, and we're back. So there are 0:13:55.920 --> 0:13:58.760 two big phases we'll need to talk about with MP 0:13:58.920 --> 0:14:03.320 three compression. The first phase is analysis and the second 0:14:03.320 --> 0:14:07.559 phase is the actual compression itself. And after that there's 0:14:07.559 --> 0:14:10.680 the process of decoding an MP three for playback. But 0:14:10.760 --> 0:14:13.520 that's way simpler once we get an understanding of how 0:14:13.720 --> 0:14:18.959 the encoding process actually happens. So let's begin with analysis. Now, 0:14:19.000 --> 0:14:22.560 this is the part where the standard has to figure 0:14:22.560 --> 0:14:26.840 out which frequencies within an audio range, or recording rather, 0:14:26.960 --> 0:14:32.760 are important or perceptible. So how does a program, an 0:14:33.000 --> 0:14:35.920 encoder, figure out what we can hear and what we 0:14:36.000 --> 0:14:40.400 cannot hear? Alright, time to get technical. So you start 0:14:40.440 --> 0:14:45.000 off with your pulse code modulation audio file or PCM file. 
0:14:45.160 --> 0:14:47.560 And you might remember I talked about PCM audio in 0:14:47.600 --> 0:14:50.400 the first episode of this series, but just in case 0:14:50.440 --> 0:14:54.160 you don't, it's a lossless digital audio file. The actual 0:14:54.200 --> 0:14:57.040 format could be a WAV or AIFF or 0:14:57.080 --> 0:15:00.400 something along those lines, but the important thing to keep 0:15:00.440 --> 0:15:04.520 in mind is that it is uncompressed. Now, that means 0:15:04.560 --> 0:15:06.880 those files tend to be pretty big. This is our 0:15:06.960 --> 0:15:09.840 raw material that we want to take and squish down 0:15:09.880 --> 0:15:14.120 to a more manageable, transferable size. And in our 0:15:14.200 --> 0:15:16.640 last episode in this series, I also mentioned that the 0:15:16.760 --> 0:15:20.120 standard for CD audio is a sample rate of 0:15:20.160 --> 0:15:23.400 forty four point one kilohertz. And we learned that 0:15:23.440 --> 0:15:26.120 you need a sample rate twice the frequency of the 0:15:26.240 --> 0:15:30.520 highest frequency in your recording, and since human hearing tops 0:15:30.520 --> 0:15:32.800 out at around twenty kilohertz, the standard for 0:15:32.960 --> 0:15:35.880 CDs is forty four point one kilohertz. The MP 0:15:36.000 --> 0:15:38.840 three standard can support lots of different sample rates, but 0:15:39.000 --> 0:15:41.320 forty four point one kilohertz is pretty much the 0:15:41.480 --> 0:15:45.800 common standard. So you've got a number of samples with 0:15:45.880 --> 0:15:48.400 your audio file, and that number will depend upon how 0:15:48.440 --> 0:15:53.160 long the audio file is. You've got forty four thousand one hundred samples 0:15:53.200 --> 0:15:56.720 per second, actually twice that for stereo. But for the 0:15:56.720 --> 0:15:59.680 purposes of this discussion, let's kind of stick with mono 0:15:59.720 --> 0:16:02.720 sound so that I don't start having math coming out 0:16:02.760 --> 0:16:06.040 of my ears. 
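The sample rate arithmetic above can be written out as a short sketch. The function names are made up for illustration; the only numbers taken from the episode are the 44.1 kilohertz CD sample rate, the rule that the sample rate must be twice the highest frequency you want to capture, and the doubling for stereo.

```python
SAMPLE_RATE_HZ = 44_100  # CD-quality sample rate mentioned in the episode


def highest_capturable_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent (half the rate)."""
    return sample_rate_hz / 2


def total_samples(seconds, sample_rate_hz=SAMPLE_RATE_HZ, channels=1):
    """Number of PCM samples in a recording of the given length."""
    return round(seconds * sample_rate_hz) * channels


print(highest_capturable_hz(SAMPLE_RATE_HZ))   # a bit above the 20 kHz hearing limit
print(total_samples(1))                        # one second of mono audio
print(total_samples(1, channels=2))            # one second of stereo audio
print(total_samples(180))                      # a three-minute mono track
```

The half-rate limit comes out slightly above twenty kilohertz, which is why 44.1 kilohertz comfortably covers typical human hearing.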
And we're still in the very easy, 0:16:06.080 --> 0:16:08.480 simple part as far as math goes. We haven't gotten 0:16:08.520 --> 0:16:11.080 to the complicated stuff yet. All right, so you've got 0:16:11.080 --> 0:16:15.880 forty four thousand, one hundred samples per second. To compress 0:16:15.920 --> 0:16:19.280 it into an MP three format, the algorithm first groups 0:16:19.320 --> 0:16:24.520 all of these samples into collections called frames. So take 0:16:24.560 --> 0:16:27.840 those forty four thousand one hundred per second, and then you start saying, okay, 0:16:27.840 --> 0:16:30.880 we're gonna group you in batches. Each batch is called 0:16:30.920 --> 0:16:34.800 a frame, and each frame contains one thousand, one hundred fifty 0:16:34.800 --> 0:16:39.320 two samples. Now that's specifically to maintain backwards compatibility with 0:16:39.560 --> 0:16:43.520 MPEG Layer two, which established that one thousand, one hundred fifty 0:16:43.520 --> 0:16:46.720 two number. But we're not talking about MPEG Layer two. 0:16:46.720 --> 0:16:50.760 We're talking about MPEG Layer three, and that means 0:16:50.760 --> 0:16:52.560 we have to get a little more complicated. So each 0:16:52.600 --> 0:16:59.280 frame consists of two subgroups called granules. So each granule 0:16:59.320 --> 0:17:04.240 has five hundred seventy six samples. Five hundred seventy six times two is one thousand, one hundred fifty two, 0:17:04.400 --> 0:17:08.560 so five seventy six samples per granule. Now, technically MP 0:17:08.640 --> 0:17:11.520 three encoders only work on one granule at a time, 0:17:11.560 --> 0:17:15.040 but they may reference the granules immediately before and immediately 0:17:15.160 --> 0:17:17.639 after the current one in order to see how the 0:17:17.680 --> 0:17:21.960 audio within the file changes over time. All right. So 0:17:22.040 --> 0:17:24.960 now you've got your granules of five hundred seventy six 0:17:25.119 --> 0:17:29.200 samples each. 
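The frame and granule bookkeeping can be sketched in a few lines. The constants come from the episode (frames of 1,152 samples, two granules per frame); the helper names and the choice to round the last partial frame up are assumptions made for illustration, not a description of how any particular encoder pads its input.

```python
import math

SAMPLES_PER_FRAME = 1152
GRANULES_PER_FRAME = 2
SAMPLES_PER_GRANULE = SAMPLES_PER_FRAME // GRANULES_PER_FRAME  # 576


def frame_count(total_samples):
    """Frames needed to hold the samples, counting a partial last frame."""
    return math.ceil(total_samples / SAMPLES_PER_FRAME)


def frame_duration_ms(sample_rate_hz=44_100):
    """How much audio time one frame covers at the given sample rate."""
    return 1000 * SAMPLES_PER_FRAME / sample_rate_hz


def split_into_granules(frame):
    """Cut one frame (a list of 1,152 samples) into its two granules."""
    return [frame[i:i + SAMPLES_PER_GRANULE]
            for i in range(0, len(frame), SAMPLES_PER_GRANULE)]


one_second = list(range(44_100))           # stand-in for one second of mono PCM
print(frame_count(len(one_second)))        # frames needed for one second
print(round(frame_duration_ms(), 2))       # milliseconds of audio per frame
print([len(g) for g in split_into_granules(one_second[:SAMPLES_PER_FRAME])])
```

At 44.1 kilohertz a frame covers roughly 26 milliseconds of audio, so one second of mono sound needs a little over 38 frames.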
Then the MP three encoder runs the samples 0:17:29.240 --> 0:17:33.439 through a filter bank, which sorts the sound into thirty 0:17:33.440 --> 0:17:36.359 two frequency ranges. Are you crazy about the 0:17:36.400 --> 0:17:41.000 numbers yet, Dylan? Are you? Dylan's nodding. Dylan, it gets 0:17:41.040 --> 0:17:45.360 worse from here. So you have thirty two frequency ranges, 0:17:45.600 --> 0:17:47.720 which is another nod to the Layer two method, which 0:17:47.800 --> 0:17:50.880 used those thirty two ranges for encoding purposes. But we're 0:17:50.880 --> 0:17:54.440 not talking about Layer two, are we? No, we're talking 0:17:54.560 --> 0:17:57.760 MP three, gosh darn it. That means we take those 0:17:57.800 --> 0:18:00.679 thirty two ranges and we subdivide them by a factor 0:18:00.720 --> 0:18:05.240 of eighteen. That means we have five hundred seventy six 0:18:05.440 --> 0:18:10.199 bands of frequencies, each band containing one five hundred seventy sixth of 0:18:10.200 --> 0:18:14.439 the frequency range of the original sample. So what that 0:18:14.520 --> 0:18:17.840 actually means, and this is actually pretty easy: the 0:18:17.880 --> 0:18:21.440 bands are not limited to a specific number for their 0:18:21.480 --> 0:18:26.399 frequency range, right? The bands don't mean that 0:18:26.560 --> 0:18:29.640 on band number one it goes from twenty hertz up 0:18:29.680 --> 0:18:32.359 to a certain range, and on band five hundred seventy 0:18:32.400 --> 0:18:35.439 six it ends at twenty kilohertz. That's not what 0:18:35.480 --> 0:18:38.639 it means. They're dependent upon the original audio. So if 0:18:38.680 --> 0:18:42.720 the original audio contains sounds within a narrow range of frequencies, 0:18:43.080 --> 0:18:46.680 the five seventy six bands will be more precise. But if 0:18:46.720 --> 0:18:50.280 the original recording has a vast range of frequencies, the 0:18:50.320 --> 0:18:53.280 bands are less precise. 
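The band arithmetic above is easy to check in code. This is only a sketch of the numbers in the episode: thirty two ranges, each split eighteen ways, with every band taking an equal share of whatever overall frequency range is being divided. The function name and the example ranges are assumptions for illustration; the first example uses zero up to half of a 44.1 kilohertz sample rate simply as one plausible range.

```python
FILTER_BANK_RANGES = 32    # frequency ranges from the filter bank
SUBDIVISIONS = 18          # each range is subdivided by a factor of eighteen
BANDS = FILTER_BANK_RANGES * SUBDIVISIONS  # 576 bands in total


def band_width_hz(low_hz, high_hz):
    """Width of one band when the range is split into equal shares."""
    return (high_hz - low_hz) / BANDS


print(BANDS)
print(band_width_hz(0, 22_050))   # a wide range gives wider, coarser bands
print(band_width_hz(0, 5_760))    # a narrow range gives narrower, finer bands
```

The narrower the range being divided, the narrower each band, which is the sense in which the bands become more precise.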
So another way to think about 0:18:53.320 --> 0:18:56.840 this is with a pizza. So let's say you get an 0:18:56.960 --> 0:19:00.000 extra large pizza and you cut it into eight equal slices, 0:19:00.640 --> 0:19:03.320 and then you get a small pizza and you cut 0:19:03.359 --> 0:19:06.679 that into eight equal slices. Well, in both cases you 0:19:06.680 --> 0:19:10.800 have with each slice one eighth of a pizza. But 0:19:10.880 --> 0:19:15.119 the extra large pizza slice is bigger than the 0:19:15.160 --> 0:19:18.320 small pizza slice. It all depends on the size 0:19:18.320 --> 0:19:21.000 of the pizza. So in this case, it depends upon 0:19:21.040 --> 0:19:24.120 the range of frequencies. And Dylan, do you think 0:19:24.119 --> 0:19:26.320 we could go for some pizza, you know, just 0:19:26.359 --> 0:19:29.199 put the episode on hold and go get pizza? Dylan's nodding. 0:19:29.760 --> 0:19:33.919 It's great for audio. Yeah, so, pizza. We'll be 0:19:34.000 --> 0:19:38.840 right back. Okay, that was good pizza. Now, oh, man, 0:19:38.880 --> 0:19:41.440 I got a whole bunch more notes. Okay, well, 0:19:41.480 --> 0:19:43.919 let's go ahead and do the rest of this. 0:19:43.960 --> 0:19:45.840 All right, so you've got your sound divided up into 0:19:45.920 --> 0:19:49.359 those five seventy six subbands of frequencies, you know, 0:19:49.680 --> 0:19:52.879 the thing I compared to pizza slices earlier. Now you 0:19:52.920 --> 0:19:58.399 get two different mathematical processes applied to this data. One 0:19:58.560 --> 0:20:01.959 is the fast Fourier transform or FFT, and 0:20:02.000 --> 0:20:05.720 the other is the modified discrete cosine transform or 0:20:05.840 --> 0:20:09.800 MDCT. Now, I am not going to dive 0:20:09.840 --> 0:20:13.080 deeply into how these transforms work, because frankly, they are 0:20:13.160 --> 0:20:17.480 beyond my mathematical understanding. But I know what they do. 
0:20:17.760 --> 0:20:22.320 I just cannot explain the process, like how they do 0:20:22.440 --> 0:20:24.520 what they do. So I'm going to give you the 0:20:24.560 --> 0:20:27.760 explanation of what they do, what the outcome of each 0:20:27.800 --> 0:20:31.880 of these transform processes happens to be, but I'm not 0:20:31.960 --> 0:20:33.840 going to be able to tell you the actual mathematical 0:20:33.880 --> 0:20:36.520 steps involved in each, because I don't math so good, guys. 0:20:37.680 --> 0:20:40.560 But let's start with the fast Fourier transform. So 0:20:40.680 --> 0:20:42.760 transform is kind of what it sounds like. It's all 0:20:42.760 --> 0:20:47.000 about transforming information in some way. So in this particular case, 0:20:47.160 --> 0:20:50.399 the FFT transforms the frequency bands we just 0:20:50.440 --> 0:20:55.400 talked about into data that can be further analyzed by 0:20:55.520 --> 0:20:59.639 a psychoacoustic model that's in the encoder. So this is 0:20:59.680 --> 0:21:03.000 that simulated human ear and brain we were talking about earlier. 0:21:03.880 --> 0:21:07.840 So what the encoder does is it analyzes each bit 0:21:07.960 --> 0:21:11.639 of data and looks for signs that it represents audio 0:21:11.720 --> 0:21:14.640 that wouldn't be perceived by a human. So it's 0:21:14.840 --> 0:21:19.280 looking for any potential for masking possibilities. So are there 0:21:19.280 --> 0:21:21.840 collections of frequencies that are grouped close together, and is 0:21:21.880 --> 0:21:24.359 one of those frequencies louder than the others? You might 0:21:24.400 --> 0:21:27.000 be able to do away with those softer frequencies because 0:21:27.000 --> 0:21:30.520 of frequency masking. The encoder will also look at whether 0:21:30.640 --> 0:21:33.000 or not the audio has a lot of complexity to it, 0:21:33.840 --> 0:21:36.000 if it has a lot of changes, or if it's 0:21:36.040 --> 0:21:40.879 just relatively steady or simple audio. 
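For anyone who wants to see what this kind of transform produces, here is a minimal sketch. It uses a plain discrete Fourier transform, which gives the same numbers a fast Fourier transform would, just more slowly; it is not the code of any MP three encoder. The helper names, the 64-sample block, and the two test tones are all assumptions chosen for illustration.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Energy per frequency bin for a block of time-domain samples.

    A naive discrete Fourier transform; an FFT computes the same values
    with far fewer operations.
    """
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        total = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        magnitudes.append(abs(total) / n)
    return magnitudes


def loudest_bin(samples):
    """Index of the frequency bin carrying the most energy."""
    magnitudes = dft_magnitudes(samples)
    return max(range(len(magnitudes)), key=magnitudes.__getitem__)


# A loud tone (5 cycles per block) next to a quiet neighbor (6 cycles per block).
N = 64
block = [math.sin(2 * math.pi * 5 * t / N) + 0.1 * math.sin(2 * math.pi * 6 * t / N)
         for t in range(N)]

mags = dft_magnitudes(block)
print(loudest_bin(block))                  # the bin of the loud tone
print(round(mags[5], 3), round(mags[6], 3))  # loud bin versus its quiet neighbor
```

Once the audio is in this form, a loud bin sitting right beside a much quieter one is exactly the pattern a psychoacoustic model would inspect as a candidate for frequency masking.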
Any transient sounds that 0:21:40.920 --> 0:21:44.640 are present in the audio might end up causing temporal masking, 0:21:44.720 --> 0:21:47.080 so it'll analyze those as well and see if that's 0:21:47.080 --> 0:21:52.040 a possibility. So really what they're looking for is, you know, 0:21:53.320 --> 0:21:56.399 just any really loud sounds that stand out above the 0:21:56.440 --> 0:21:59.159 rest of the recording. That's what the FFT 0:21:59.320 --> 0:22:03.240 is doing. So what about the modified discrete cosine transform? Well, 0:22:03.280 --> 0:22:05.399 this is happening in parallel with the FFT, 0:22:05.840 --> 0:22:10.360 and the samples get sorted into different patterns called windows. 0:22:10.359 --> 0:22:12.920 And the criterion for sorting all has to do with 0:22:12.920 --> 0:22:16.760 whether the sample represents a steady sound or a varied sound. 0:22:17.280 --> 0:22:20.400 So if you have a simple steady sound, that goes 0:22:20.440 --> 0:22:24.240 into a long window. If there's a lot of variation 0:22:24.280 --> 0:22:27.000 in the sound, like there are a lot of consonants 0:22:27.000 --> 0:22:29.800 in a vocal line, or it's like a drum solo 0:22:30.000 --> 0:22:32.720 or something like that, it would get sorted into a 0:22:32.800 --> 0:22:36.480 series of three short windows. And each short window contains 0:22:36.520 --> 0:22:42.560 one hundred ninety two samples. That amounts to four whole milliseconds, so 0:22:42.720 --> 0:22:48.159 four thousandths of a second, in three patterned windows. So 0:22:48.200 --> 0:22:51.440 you've got these windows now, either long windows for simple 0:22:51.480 --> 0:22:54.760 sounds or short windows for the more complex sounds, and 0:22:54.760 --> 0:22:57.800 then the modified discrete cosine transform kicks into gear. It 0:22:57.800 --> 0:23:00.200 looks at each long window or set of three short 0:23:00.240 --> 0:23:03.960 windows and converts them into a set of spectral values. 
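The window timing can be checked with a quick sketch. It assumes a 44.1 kilohertz sample rate, a long window covering one full granule of 576 samples, and each of the three short windows covering a third of that, 192 samples; those sample counts are assumptions made to match the roughly four milliseconds mentioned, and the function name is invented for illustration.

```python
SAMPLE_RATE_HZ = 44_100
LONG_WINDOW_SAMPLES = 576    # assumed: one full granule
SHORT_WINDOW_SAMPLES = 192   # assumed: one third of a granule


def window_duration_ms(samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Length of audio, in milliseconds, covered by a window of samples."""
    return 1000 * samples / sample_rate_hz


print(3 * SHORT_WINDOW_SAMPLES == LONG_WINDOW_SAMPLES)    # three short windows fill a granule
print(round(window_duration_ms(SHORT_WINDOW_SAMPLES), 2))  # roughly four milliseconds
print(round(window_duration_ms(LONG_WINDOW_SAMPLES), 2))   # roughly thirteen milliseconds
```

Shorter windows track fast changes such as drum hits more closely in time, while the long window suits steady sounds.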
0:23:04.560 --> 0:23:06.840 To some of you, that probably sounds meaningless. So let's 0:23:06.880 --> 0:23:10.760 talk about spectral analysis for a second. First, I was 0:23:11.040 --> 0:23:13.960 very disappointed to learn that spectral analysis doesn't involve a 0:23:13.960 --> 0:23:19.280 psychologist talking to a ghost about its emotional state. So, bummer. 0:23:20.040 --> 0:23:23.600 But spectral analysis is when you look at a spectrum 0:23:23.640 --> 0:23:27.840 of information, like a spectrum of frequencies or related information 0:23:27.880 --> 0:23:31.480 like energy states. That's what this transform does. It takes 0:23:31.560 --> 0:23:35.159 data that originally represented a slice of time in a 0:23:35.240 --> 0:23:38.400 sound waveform, that's what a sample is, a sample is an 0:23:38.440 --> 0:23:42.320 instance of time in a waveform, and converts it 0:23:42.359 --> 0:23:48.880 into information representing sound as energy across a range of frequencies. Now, 0:23:48.880 --> 0:23:51.119 you can plot out spectral information in a lot of 0:23:51.119 --> 0:23:54.040 different ways, but one common method is to use brightness 0:23:54.080 --> 0:23:58.840 to indicate energy levels. Higher energy levels are brighter patches 0:23:59.080 --> 0:24:03.840 in your visual representation of spectral data. High frequencies 0:24:03.920 --> 0:24:06.720