WEBVTT - How MP3 Compression Works 0:00:04.160 --> 0:00:07.160 Get in touch with technology with TechStuff from How 0:00:07.240 --> 0:00:14.040 Stuff Works dot com. Hey there, and welcome to TechStuff. 0:00:14.080 --> 0:00:17.520 I'm your host, Jonathan Strickland. And in a recent episode 0:00:17.560 --> 0:00:20.560 I explored how digital audio works and gave kind of 0:00:20.560 --> 0:00:24.639 a brief history on the MP three file format. I 0:00:24.760 --> 0:00:27.680 warned you back then that that was part one of 0:00:27.720 --> 0:00:30.760 a three part series, and today we're gonna explore part two. 0:00:31.440 --> 0:00:34.599 So I hadn't forgotten about it. We're back to it, uh, 0:00:34.640 --> 0:00:36.440 and today we're gonna do a deeper dive with MP 0:00:36.479 --> 0:00:39.959 threes: how do they compress audio? And how 0:00:39.960 --> 0:00:42.239 can you take a file filled with information and make 0:00:42.280 --> 0:00:44.920 it a smaller size? What do you have to give 0:00:45.080 --> 0:00:48.159 up in order to make files smaller? And today we're 0:00:48.159 --> 0:00:51.280 gonna try and unravel the technical mystery behind the MP 0:00:51.400 --> 0:00:54.760 three. And I am not going to lie to you people. 0:00:55.720 --> 0:01:01.240 This is gonna get a bit, you know, mathy. 0:01:01.440 --> 0:01:04.759 And I was an English major, so you mathematicians out there, 0:01:04.760 --> 0:01:07.400 get ready with your corrections because I'm probably gonna make 0:01:07.440 --> 0:01:10.760 some overgeneralizations for the purposes of my own sanity. 0:01:11.280 --> 0:01:14.160 It does get to a point where to really get 0:01:14.200 --> 0:01:19.000 into the technical details, it would likely be, uh, impossible 0:01:19.040 --> 0:01:21.080 for me to describe it in a way that would 0:01:21.080 --> 0:01:25.880 make sense and be accurate. 
Um, and I have given 0:01:26.120 --> 0:01:30.399 my producer Dylan the mandate that, should I get too 0:01:31.120 --> 0:01:36.200 cryptic and incomprehensible with my explanation, he is to 0:01:36.240 --> 0:01:40.200 intervene in a way that he sees fit. Just not 0:01:40.240 --> 0:01:44.120 in the face, Dylan. It's not in the face. It's the moneymaker, man. 0:01:44.240 --> 0:01:47.120 I gotta, gotta take care of it. So let's remember 0:01:47.160 --> 0:01:52.320 that the heart of digital information is the bit. That's 0:01:52.320 --> 0:01:56.440 either a zero or a one. The basic unit of 0:01:56.960 --> 0:02:01.920 information for digital formats: zeros and ones. Now we can 0:02:02.000 --> 0:02:05.160 use those zeros and ones to describe all sorts of information, 0:02:05.800 --> 0:02:09.280 from text to audio, to video and really pretty much 0:02:09.280 --> 0:02:12.240 anything you can think of that's represented digitally. Ultimately, when 0:02:12.240 --> 0:02:14.000 you get down to it, it's a bunch of zeros 0:02:14.040 --> 0:02:17.000 and ones. So let's say you start off with your 0:02:17.080 --> 0:02:21.520 uncompressed audio file. You've got this enormous audio file in 0:02:21.520 --> 0:02:23.560 front of you. It's made up of zeros and ones. 0:02:24.080 --> 0:02:26.840 How do you make that file smaller? So in the 0:02:26.840 --> 0:02:29.560 real world, we can compress stuff, right? We can apply 0:02:29.800 --> 0:02:33.760 physical pressure to things. Think about packing a suitcase. You 0:02:33.760 --> 0:02:36.240 can make sure you get that extra outfit in if 0:02:36.280 --> 0:02:38.600 you just press it down hard enough and get that 0:02:38.680 --> 0:02:42.240 zipper zipped before it can burst open. But once you 0:02:42.280 --> 0:02:44.920 get to a certain level of compression, you cannot make 0:02:45.080 --> 0:02:48.600 things smaller, at least not without hurting yourself or whatever 0:02:48.639 --> 0:02:51.720 it is you're trying to compress. 
Digital files are a 0:02:51.720 --> 0:02:55.400 little different because you cannot physically cram the zeros and 0:02:55.520 --> 0:02:58.120 ones closer together. It doesn't work like that. These are 0:02:58.240 --> 0:03:02.600 abstract things. You can't make them smaller, right? You can't 0:03:02.720 --> 0:03:06.000 decrease the font. It doesn't work that way. The numbers 0:03:06.040 --> 0:03:09.240 represent two different states. So if you want to create 0:03:09.240 --> 0:03:12.840 a smaller audio file containing the recording that was in 0:03:12.880 --> 0:03:17.680 a larger audio file, you have to start getting creative. Now, 0:03:17.720 --> 0:03:20.120 in the last part of this series, I talked about 0:03:20.160 --> 0:03:22.920 how the MP three compression algorithm was born from an 0:03:22.960 --> 0:03:26.600 applied research institution in Germany, and the team behind the 0:03:26.720 --> 0:03:29.040 MP three wanted to find a way to compress audio, 0:03:29.160 --> 0:03:34.800 specifically music, for transmission over phone lines. Eventually, this evolved 0:03:34.840 --> 0:03:39.480 into the Motion Pictures Expert Group Audio Layer three compression methodology, 0:03:39.680 --> 0:03:44.560 better known as the MP three, and there's also MPEG 0:03:44.640 --> 0:03:47.360 two and MPEG four standards. MPEG two, by the way, 0:03:47.400 --> 0:03:50.320 is the basis of compression on DVDs, although the actual 0:03:50.440 --> 0:03:54.720 DVD format is really a modification of MPEG two, and 0:03:54.840 --> 0:03:57.360 MPEG four is a compression strategy for audio and video 0:03:57.400 --> 0:04:00.840 that's frequently used in lots of different capacities, including 0:04:00.880 --> 0:04:05.160 streaming media services. So by the late nineteen seventies, researchers 0:04:05.200 --> 0:04:08.720 began to explore the possibility of leveraging psychoacoustics to 0:04:08.760 --> 0:04:12.960 figure out how to compress audio. 
And psychoacoustics refers to 0:04:13.200 --> 0:04:17.120 the way we perceive sound, uh, and also the 0:04:17.120 --> 0:04:21.360 physiological effects of sound on us. So this involves not 0:04:21.480 --> 0:04:24.640 just our physical sense of hearing, but also our 0:04:24.680 --> 0:04:28.400 brains and the way our brains interpret sound. So, for example, 0:04:28.720 --> 0:04:32.480 there's a psychoacoustic phenomenon that's called the Haas effect, H 0:04:32.640 --> 0:04:35.560 A A S. And I think it's pretty interesting. So 0:04:35.760 --> 0:04:38.200 here's how the Haas effect works. If you hear the 0:04:38.279 --> 0:04:43.280 exact same sound coming from different directions, but the two 0:04:43.279 --> 0:04:46.640 sounds arrive within thirty to forty milliseconds of each other, 0:04:47.040 --> 0:04:50.000 your brain will be convinced that you really only heard 0:04:50.040 --> 0:04:53.440 one sound and it came from the direction that hit 0:04:53.520 --> 0:04:57.200 you first. So let's say a sound's coming from directly 0:04:57.240 --> 0:04:59.680 in front of you and to your left, and you 0:05:00.080 --> 0:05:03.480 get both of them within that thirty to forty millisecond range, 0:05:04.279 --> 0:05:06.440 and you hear the one coming from ahead of you 0:05:06.520 --> 0:05:10.039 first. You're convinced that you only heard that 0:05:10.120 --> 0:05:13.080 sound once and it came from dead on straight ahead 0:05:13.080 --> 0:05:16.680 of you. Your brain kind of discounts the one that 0:05:16.760 --> 0:05:20.159 came off from the left, although it can reinforce it, 0:05:20.279 --> 0:05:22.520 which ends up being really useful if you're planning out 0:05:22.520 --> 0:05:25.279 PA systems for stage shows. I'm not joking. That 0:05:25.320 --> 0:05:28.080 really is the way that people plan those things out. 0:05:28.360 --> 0:05:31.080 It's pretty neat. 
Humans perceive sounds in a way that's 0:05:31.080 --> 0:05:35.200 not necessarily representational of all the sounds surrounding us. You 0:05:35.200 --> 0:05:38.600 can think of your brain as the filter between your 0:05:38.720 --> 0:05:42.679 understanding and what reality actually is. A lot of stuff 0:05:42.720 --> 0:05:45.599 goes on where it ends up getting rid of information; 0:05:45.640 --> 0:05:48.040 your brain just says, you know what, he or 0:05:48.080 --> 0:05:52.040 she doesn't need that, it's just gonna confuse things. We're 0:05:52.040 --> 0:05:55.400 gonna dump it. And that's kind of how it works. 0:05:55.440 --> 0:05:57.599 It's all on an unconscious level. It's not like you're 0:05:57.800 --> 0:06:01.919 actively working to do this. So let's say you're in 0:06:01.920 --> 0:06:04.320 a relatively busy hallway, and there could be a lot 0:06:04.360 --> 0:06:07.800 of sounds in that hallway, stuff that's going on constantly 0:06:07.800 --> 0:06:11.000 around you. Maybe there are doors opening and closing. Maybe 0:06:11.040 --> 0:06:13.960 there are footsteps going up and down the hallway. Maybe someone's 0:06:14.080 --> 0:06:17.719 shoes are squeaking against the linoleum floor. People are chattering 0:06:17.760 --> 0:06:20.839 away in there. But you are having a conversation with someone, 0:06:21.240 --> 0:06:23.960 so you turn your focus on that person and other 0:06:24.040 --> 0:06:28.200 sounds seemingly fade away. They're still present, but they're not important. 0:06:28.800 --> 0:06:31.520 So in this example, you would actually call those other 0:06:31.600 --> 0:06:35.479 sounds a distraction and you would really focus on the conversation. Uh, 0:06:35.520 --> 0:06:40.000 that also shows how we're able to consciously direct our 0:06:40.080 --> 0:06:43.719 sense, our perception of hearing. So both of these factors 0:06:43.760 --> 0:06:47.120 come into play. Now. 
One thing that MP three encoding 0:06:47.160 --> 0:06:51.080 takes advantage of is something called masking, and there are 0:06:51.080 --> 0:06:54.120 a couple of different variations of the masking effect. One 0:06:54.160 --> 0:06:57.520 of them is called frequency masking. So let's say you've 0:06:57.560 --> 0:07:00.480 got two sound frequencies that are similar; perhaps they're just 0:07:00.520 --> 0:07:04.200 a few hertz apart. Remember, frequencies are measured in hertz, 0:07:04.680 --> 0:07:08.520 which is really the number of oscillations per second. So 0:07:08.640 --> 0:07:14.040 let's say you've got a sound that's at, I don't know, uh, 0:07:14.360 --> 0:07:19.360 one thousand kilohertz, and another one that's at one 0:07:19.480 --> 0:07:23.560 thousand and ten kilohertz. Now, the human ear is 0:07:23.600 --> 0:07:26.920 precise enough to be able to tell the difference of 0:07:27.040 --> 0:07:29.840 two sounds that are at least two hertz apart from 0:07:29.840 --> 0:07:33.360 each other. That's how precise our resolution of hearing is; 0:07:33.440 --> 0:07:36.760 it's at that level. But if you get two sounds 0:07:36.840 --> 0:07:40.520 played at the same time and they are that close 0:07:40.560 --> 0:07:44.080 together in frequency, and one of those frequencies is played 0:07:44.120 --> 0:07:47.280 at a greater volume than the other, our brains will 0:07:47.280 --> 0:07:50.160 pick up on the louder sound and ignore the quieter sound, 0:07:50.240 --> 0:07:53.880 even though both of them are present. What becomes important 0:07:53.880 --> 0:07:56.520 at that point is the amplitude. Now, the further apart 0:07:56.560 --> 0:08:00.400 in frequencies you get, the less that has an effect. 0:08:00.480 --> 0:08:02.360 So if you get far enough apart where they are 0:08:02.360 --> 0:08:05.680 two pitches, one of them noticeably louder than the other, 0:08:06.040 --> 0:08:08.320 but they're far enough apart, you will hear both of them. 
0:08:08.360 --> 0:08:11.560 It only works if the two pitches are relatively close together, 0:08:12.680 --> 0:08:15.560 and there's not a universal formula for frequency masking. As 0:08:15.560 --> 0:08:18.520 you get closer to the boundaries of human hearing, frequency 0:08:18.560 --> 0:08:20.920 masking becomes easier. So if it's a really low pitch 0:08:21.000 --> 0:08:23.600 or a really high pitch, it's easier to get away 0:08:23.600 --> 0:08:26.400 with it. Once you start getting into what is 0:08:26.400 --> 0:08:28.960 thought of as the sweet spot for human hearing, which 0:08:29.000 --> 0:08:32.120 is generally considered to be between two and five kilohertz, 0:08:33.200 --> 0:08:37.200 you need a greater difference in volume or a smaller 0:08:37.240 --> 0:08:41.640 difference in frequency in order for masking to work. Frequency 0:08:41.720 --> 0:08:45.480 masking at any rate. But then there's also temporal masking, 0:08:46.600 --> 0:08:48.880 and you might say, okay, I got it. Temporal, that 0:08:48.920 --> 0:08:53.040 means time. Indeed it does, my friend. This describes the 0:08:53.040 --> 0:08:56.040 effect of a short but loud sound masking a softer 0:08:56.120 --> 0:09:00.360 sound for a short time. Weird thing is, the loud 0:09:00.360 --> 0:09:03.960 sound can actually mask sounds that precede it slightly, not 0:09:04.040 --> 0:09:06.760 by a whole lot, but a little bit. MP three 0:09:06.760 --> 0:09:10.880 compression takes advantage of both frequency and temporal masking when 0:09:10.880 --> 0:09:14.079 it's trying to determine which data needs to be included 0:09:14.160 --> 0:09:16.920 and which data can be dumped, because it won't affect 0:09:16.960 --> 0:09:19.840 your perception of whatever the audio file is in 0:09:19.840 --> 0:09:23.720 the first place. 
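NOTE (illustration, not spoken in the episode): the frequency-masking idea described above can be sketched as a toy rule in Python. This is only a rough sketch under loud assumptions: the two thresholds (`ASSUMED_MASKING_WIDTH_HZ` and `ASSUMED_MASKING_MARGIN_DB`) and the example tones are invented for illustration. As the episode says, there is no universal formula, and a real encoder's psychoacoustic model is far more involved than two fixed numbers.

```python
# Toy frequency-masking rule. Both thresholds are made up for illustration;
# real psychoacoustic models use frequency-dependent curves instead.
ASSUMED_MASKING_WIDTH_HZ = 100.0   # how close two tones must be (assumption)
ASSUMED_MASKING_MARGIN_DB = 10.0   # how much louder the masker must be (assumption)


def is_masked(tone, others):
    """Return True if some nearby tone is enough louder to hide `tone`."""
    freq, level = tone
    return any(
        abs(freq - other_freq) <= ASSUMED_MASKING_WIDTH_HZ
        and other_level - level >= ASSUMED_MASKING_MARGIN_DB
        for other_freq, other_level in others
    )


def audible_tones(tones):
    """Keep only the tones that no other tone masks under the toy rule."""
    return [
        tone
        for i, tone in enumerate(tones)
        if not is_masked(tone, tones[:i] + tones[i + 1:])
    ]


# (frequency in hertz, level in decibels); hypothetical values.
tones = [(1000.0, 70.0), (1010.0, 50.0), (4000.0, 50.0)]
print(audible_tones(tones))
```

Under these assumed thresholds the quiet tone sitting ten hertz from the loud one is dropped, while the equally quiet tone far away in frequency is kept, which mirrors the "close together and louder" condition from the episode.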
So you also probably remember I talked 0:09:23.720 --> 0:09:26.560 about the physical limitation to what we humans can hear, 0:09:26.800 --> 0:09:28.920 no matter what our brains might be up to, so 0:09:29.000 --> 0:09:31.400 this doesn't have to do with our brains, you know, 0:09:31.480 --> 0:09:34.240 filtering through the information that's coming in. This has to 0:09:34.280 --> 0:09:38.200 do with the physical limitations of the human ear. In 0:09:38.240 --> 0:09:41.199 the last episode of the series, I said typical human hearing, 0:09:41.840 --> 0:09:45.559 keep in mind typical, there are exceptions, uh, covers the 0:09:45.679 --> 0:09:48.560 range of frequencies between about twenty hertz and twenty 0:09:48.640 --> 0:09:52.000 kilohertz, or twenty thousand hertz. So twenty to twenty thousand. 0:09:52.800 --> 0:09:57.280 Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right? 0:09:58.080 --> 0:10:00.640 And as you get older, your ability to perceive those 0:10:00.720 --> 0:10:05.040 higher frequencies starts to diminish. So most adults actually have 0:10:05.320 --> 0:10:10.880 an upper range closer to sixteen kilohertz, not twenty. Uh, 0:10:11.080 --> 0:10:13.480 kids, they can hear those higher pitches. You may have 0:10:13.600 --> 0:10:17.920 heard the story about how some convenience stores experimented with 0:10:18.160 --> 0:10:23.600 getting rid of teenage loiterers by, uh, projecting out 0:10:24.000 --> 0:10:27.280 the super high pitches that adults could not hear 0:10:27.640 --> 0:10:30.600 but kids could, and it discouraged kids from hanging out 0:10:30.640 --> 0:10:35.080 at the convenience store and loitering. Um, I love that 0:10:35.200 --> 0:10:39.600 idea so much. Anyway, that's because I'm old and my 0:10:39.640 --> 0:10:43.520 hearing is terrible. 
Well, remember I also mentioned you can 0:10:43.559 --> 0:10:46.400 detect changes in pitch at two hertz increments. If you 0:10:46.440 --> 0:10:48.960 get below two hertz in change, like, if it's just 0:10:49.040 --> 0:10:54.600 a one hertz difference between two frequencies, it's too low 0:10:54.640 --> 0:10:56.800 a resolution for us to detect. To us, it will 0:10:56.800 --> 0:11:01.040 sound exactly the same. So if you were to hear 0:11:01.520 --> 0:11:06.800 a frequency at one thousand one hertz, or one point 0:11:07.000 --> 0:11:10.800 zero zero one kilohertz, and one point zero zero 0:11:10.840 --> 0:11:13.800 two kilohertz, you wouldn't notice the difference. They would 0:11:13.840 --> 0:11:16.960 sound exactly the same to you. So if you're gonna 0:11:17.000 --> 0:11:19.240 take audio and compress it, one step you could consider 0:11:19.360 --> 0:11:23.960 is eliminating anything that's outside the actual range of frequencies 0:11:24.040 --> 0:11:27.560 that we can hear, or simplifying any changes in frequency 0:11:27.640 --> 0:11:31.240 that are smaller than two hertz. If you take 0:11:31.240 --> 0:11:34.760 all that data and you say it is physically impossible 0:11:34.800 --> 0:11:38.439 for a human to perceive this, get rid of that information, 0:11:38.559 --> 0:11:41.800 then in theory it wouldn't have any effect on the 0:11:41.840 --> 0:11:46.120 rest of the recording. But how do you go further than that? Right, 0:11:46.200 --> 0:11:48.959 how do you create a method so that you can 0:11:49.000 --> 0:11:51.120 really compress this file? You want a method that will 0:11:51.120 --> 0:11:54.439 preserve the important sounds while potentially ignoring all the unimportant 0:11:54.520 --> 0:11:58.320 or incidental sounds. And you want it to be automatic, because 0:11:58.760 --> 0:12:01.440 if you have to do it manually, then that's going 0:12:01.520 --> 0:12:05.640 to take countless hours just to edit a single sound file. 
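NOTE (illustration, not spoken in the episode): the first step described above, throwing away anything outside the range we can hear, can be sketched in a few lines of Python. This is a minimal sketch that assumes the recording has already been broken into (frequency, amplitude) pairs. The `components` list and the function name `keep_audible` are hypothetical, and the twenty hertz to twenty kilohertz limits are the typical-hearing figures quoted in the episode.

```python
# Typical limits of human hearing, as quoted in the episode.
AUDIBLE_MIN_HZ = 20.0
AUDIBLE_MAX_HZ = 20_000.0


def keep_audible(components, low=AUDIBLE_MIN_HZ, high=AUDIBLE_MAX_HZ):
    """Return only the (frequency_hz, amplitude) pairs inside [low, high]."""
    return [(freq, amp) for freq, amp in components if low <= freq <= high]


# Hypothetical frequency content of a recording; values are made up.
components = [(5.0, 0.3), (440.0, 1.0), (15_000.0, 0.2), (25_000.0, 0.4)]
audible = keep_audible(components)
print(audible)
```

Everything the filter removes is, by the episode's reasoning, something a typical listener could not have heard anyway, so dropping it should not change how the recording sounds.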
0:12:06.760 --> 0:12:10.959 So that was the challenge that the MP three research 0:12:11.040 --> 0:12:16.040 team faced as a group. Now, their solution, which ultimately 0:12:16.080 --> 0:12:18.559 created even more challenges, was to come up with what 0:12:18.640 --> 0:12:22.480 was essentially a simulated human ear and brain. They needed 0:12:22.520 --> 0:12:27.880 to replicate the experience of perceiving music so that an 0:12:27.880 --> 0:12:32.160 algorithm could evaluate every sound in an audio file and 0:12:32.280 --> 0:12:35.359 judge if it in fact was relevant enough to include 0:12:35.400 --> 0:12:39.720 in the final compressed version. If a sound were imperceptible, 0:12:39.760 --> 0:12:41.600 then it wouldn't make sense to include it in the 0:12:41.720 --> 0:12:44.720 MP three file. So by leaving out all the irrelevant data, 0:12:44.760 --> 0:12:48.680 they could make the audio information take up less bandwidth. 0:12:48.679 --> 0:12:51.240 The file itself would be smaller because you just dumped 0:12:51.280 --> 0:12:54.880 everything that wasn't important. So the team used an algorithm 0:12:55.000 --> 0:13:00.000 called the low complexity adaptive transform coding, or LC- 0:13:00.160 --> 0:13:03.080 ATC, as the foundation for their research. This 0:13:03.160 --> 0:13:06.440 was kind of their starting point, and this is an 0:13:06.480 --> 0:13:10.120 approach that tries to do away with redundancy as much 0:13:10.160 --> 0:13:15.199 as possible. And it also incorporates adaptation to perceptual requirements. Also, 0:13:15.320 --> 0:13:19.199 MP threes owe a lot to the MPEG Layer two standard. 0:13:19.760 --> 0:13:23.199 So Layer two obviously came out before Layer three, 0:13:23.720 --> 0:13:26.199 and so a lot of the features of Layer three 0:13:26.320 --> 0:13:31.760 are really, um, they're legacy features from Layer two. Uh. 
0:13:31.800 --> 0:13:34.000 In other words, the MP three group kind of got stuck 0:13:34.000 --> 0:13:36.560 with them because otherwise they would have had a problem 0:13:36.559 --> 0:13:39.880 with backwards compatibility. So the result is kind of a 0:13:39.920 --> 0:13:43.400 clunky arrangement under the hood, and some of the features 0:13:43.600 --> 0:13:46.160 may make very little sense when I go through them, 0:13:46.600 --> 0:13:48.600 but some of that is because it's a holdover 0:13:48.640 --> 0:13:53.280 from an earlier compression strategy, which isn't terribly satisfying as 0:13:53.280 --> 0:13:55.559 an answer. But the reason many parts of the MP 0:13:55.640 --> 0:13:57.840 three compression algorithm are the way they are is because 0:13:57.880 --> 0:14:01.560 that's the way we've always done it. So next I'm 0:14:01.600 --> 0:14:07.760 gonna dive into the phases of compression. But before I 0:14:07.800 --> 0:14:10.680 do that, let's all take a deep breath and take 0:14:10.720 --> 0:14:22.440 a moment to thank our sponsor, and we're back. So 0:14:22.560 --> 0:14:25.080 there are two big phases we'll need to talk about 0:14:25.160 --> 0:14:29.760 with MP three compression. The first phase is analysis and 0:14:29.800 --> 0:14:33.960 the second phase is the actual compression itself. And after 0:14:34.040 --> 0:14:37.080 that there's the process of decoding an MP three for playback. 0:14:37.560 --> 0:14:40.120 But that's way simpler once we get an understanding of 0:14:40.160 --> 0:14:45.920 how the encoding process actually happens. So let's begin with analysis. Now, 0:14:45.960 --> 0:14:49.480 this is the part where the standard has to figure 0:14:49.520 --> 0:14:53.800 out which frequencies within an audio range, or recording rather, 0:14:53.920 --> 0:14:59.720 are important or perceptible. So how does a program, an 0:14:59.760 --> 0:15:02.680 encoder, figure out what we can hear and what 0:15:02.800 --> 0:15:06.160 we cannot hear? All 
Right, time to get technical. So 0:15:06.880 --> 0:15:10.440 you start off with your pulse code modulation audio file, 0:15:10.720 --> 0:15:13.480 or PCM file. And you might remember I talked about 0:15:13.480 --> 0:15:16.720 PCM audio in the first episode of this series, but 0:15:16.840 --> 0:15:20.600 just in case you don't, it's a lossless digital audio file. 0:15:20.680 --> 0:15:23.720 The actual format could be a WAV or AIFF 0:15:23.720 --> 0:15:26.480 or something along those lines, but the important thing 0:15:26.920 --> 0:15:31.080 to keep in mind is that it is uncompressed. Now, 0:15:31.120 --> 0:15:33.560 that means those files tend to be pretty big. This 0:15:33.640 --> 0:15:36.040 is our raw material that we want to take and 0:15:36.120 --> 0:15:40.560 squish down to a more manageable, transferable size. And in 0:15:40.640 --> 0:15:43.320 our last episode in this series, I also mentioned 0:15:43.320 --> 0:15:46.680 that the standard for CD audio is a sample 0:15:46.760 --> 0:15:49.880 rate of forty four point one kilohertz, and we 0:15:50.040 --> 0:15:52.680 learned that you need a sample rate twice the frequency 0:15:52.840 --> 0:15:56.800 of the highest frequency in your recording, and since human 0:15:56.840 --> 0:15:59.600 hearing tops out at around twenty kilohertz, the standard 0:15:59.600 --> 0:16:02.520 for CDs is forty four point one kilohertz. The 0:16:02.640 --> 0:16:05.640 MP three standard can support lots of different sample rates, 0:16:05.720 --> 0:16:08.160 but forty four point one kilohertz is pretty much 0:16:08.200 --> 0:16:12.600 the common standard. So you've got a number of samples 0:16:12.680 --> 0:16:15.120 with your audio file, and that number will depend upon 0:16:15.120 --> 0:16:18.320 how long the audio file is. 
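NOTE (illustration, not spoken in the episode): the sample-rate reasoning above is simple arithmetic, sketched here in Python. The sample rate and the twenty kilohertz hearing limit come from the episode. The sixteen bits per sample and two channels are the usual CD audio figures, which this episode does not state, so treat them as assumptions used only to show why uncompressed files "tend to be pretty big." The function names are hypothetical.

```python
SAMPLE_RATE_HZ = 44_100   # CD sample rate, from the episode
BITS_PER_SAMPLE = 16      # usual CD bit depth (assumption, not in the episode)
CHANNELS = 2              # stereo (assumption for the size estimate)


def highest_representable_hz(sample_rate_hz):
    """Highest frequency a given sample rate can capture: half the rate."""
    return sample_rate_hz / 2


def minimum_sample_rate_hz(highest_frequency_hz):
    """Sample rate needed for a given top frequency: twice that frequency."""
    return 2 * highest_frequency_hz


def pcm_size_bytes(seconds, sample_rate_hz=SAMPLE_RATE_HZ,
                   bits_per_sample=BITS_PER_SAMPLE, channels=CHANNELS):
    """Size of uncompressed PCM audio of the given length, in bytes."""
    return seconds * sample_rate_hz * channels * bits_per_sample // 8


print(minimum_sample_rate_hz(20_000))            # rate needed for 20 kHz hearing
print(highest_representable_hz(SAMPLE_RATE_HZ))  # top frequency at the CD rate
print(pcm_size_bytes(60))                        # bytes in one minute of audio
```

The CD rate sits a little above the bare minimum of twice twenty kilohertz, and under the assumed bit depth and channel count a single minute of uncompressed audio already runs to roughly ten megabytes.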
You've got forty four 0:16:18.320 --> 0:16:23.120 thousand one hundred samples per second, actually twice that for stereo, 0:16:23.280 --> 0:16:25.760 but for the purposes of this discussion, let's kind of 0:16:25.920 --> 0:16:28.960 stick with mono sounds so that I don't start having 0:16:29.040 --> 0:16:31.720 math coming out of my ears. And we're still in 0:16:31.720 --> 0:16:34.920 the very easy, simple part as far as math goes. 0:16:34.960 --> 0:16:37.520 We haven't gotten to the complicated stuff yet. All right, 0:16:37.600 --> 0:16:41.600 so you've got forty four thousand, one hundred samples per second. 0:16:42.160 --> 0:16:45.320 To compress it into an MP three format, the algorithm 0:16:45.360 --> 0:16:49.320 first groups all of these samples into collections called frames. 0:16:50.440 --> 0:16:53.640 So take those forty four thousand one hundred per second, and 0:16:53.640 --> 0:16:56.480 then you start saying, okay, we're gonna group you in batches. 0:16:56.960 --> 0:17:00.080 Each batch is called a frame and each frame contains 0:17:00.120 --> 0:17:04.480 one thousand one hundred fifty two samples. Now that's specifically to 0:17:04.560 --> 0:17:09.280 maintain backwards compatibility to MPEG Layer two, which established that 0:17:09.320 --> 0:17:12.119 one thousand one hundred fifty two number. But we're not 0:17:12.160 --> 0:17:16.360 talking about MPEG Layer two. We're talking about MPEG Layer three, 0:17:16.800 --> 0:17:18.400 and so that means we have to get a little 0:17:18.400 --> 0:17:25.440 more complicated. So each frame consists of two subgroups called granules. 0:17:25.440 --> 0:17:29.320 So each granule has five hundred seventy six samples. Five seventy 0:17:29.359 --> 0:17:32.639 six times two is one thousand one hundred fifty two, so five seventy 0:17:32.680 --> 0:17:36.680 six samples per granule. 
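NOTE (illustration, not spoken in the episode): the frame and granule bookkeeping above can be sketched in Python. The sizes (one thousand one hundred fifty two samples per frame, two granules of five hundred seventy six) are from the episode. The helper names are hypothetical, the "samples" are just placeholder integers, and leaving the last frame short is a simplification for this sketch rather than a claim about how real encoders handle the tail of a file.

```python
SAMPLES_PER_FRAME = 1152
GRANULES_PER_FRAME = 2
SAMPLES_PER_GRANULE = SAMPLES_PER_FRAME // GRANULES_PER_FRAME  # 576


def split_into_frames(samples, frame_size=SAMPLES_PER_FRAME):
    """Chunk a list of samples into frames; the final frame may be short."""
    return [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]


def split_into_granules(frame, granule_size=SAMPLES_PER_GRANULE):
    """Split one full frame into its two granules."""
    return frame[:granule_size], frame[granule_size:]


# One second of mono audio at 44,100 samples per second (placeholder values).
one_second = list(range(44_100))
frames = split_into_frames(one_second)
first_granule, second_granule = split_into_granules(frames[0])

print(len(frames))                              # frames needed for one second
print(len(first_granule), len(second_granule))  # samples in each granule
print(SAMPLES_PER_FRAME / 44_100)               # seconds of audio per frame
```

One second of mono audio does not divide evenly into frames: it fills thirty eight complete frames with a few hundred samples left over, and each full frame covers a bit more than twenty six thousandths of a second.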
Now, technically MP three encoders only 0:17:36.680 --> 0:17:39.000 work on one granule at a time, but they may 0:17:39.040 --> 0:17:42.879 reference the granules immediately before and immediately after the current 0:17:42.920 --> 0:17:45.520 one in order to see how the audio within the 0:17:45.560 --> 0:17:49.480 file changes over time. All right, so now you've got 0:17:49.480 --> 0:17:54.000 your granules of five hundred seventy six samples each. Then 0:17:54.040 --> 0:17:57.480 the MP three encoder runs the samples through a filter bank, 0:17:57.960 --> 0:18:01.960 which sorts the sound into thirty two frequency ranges. Are 0:18:02.000 --> 0:18:05.239 you, are you crazy about the numbers yet, Dylan? Are you? 0:18:05.720 --> 0:18:10.520 Dylan's, Dylan's nodding. Dylan, it gets worse from here. So you 0:18:10.560 --> 0:18:13.560 have thirty two frequency ranges, which is another nod to 0:18:13.560 --> 0:18:15.840 the Layer two method, which used those thirty two ranges 0:18:15.880 --> 0:18:20.240 for encoding purposes. But we're not talking about Layer two, are we? No, 0:18:20.760 --> 0:18:24.320 we're talking MP three. Gosh darn it. That means we 0:18:24.359 --> 0:18:27.159 take those thirty two ranges and we subdivide them by 0:18:27.200 --> 0:18:31.320 a factor of eighteen. That means we have five hundred 0:18:31.320 --> 0:18:36.879 seventy six bands of frequencies, each band containing one five hundred seventy sixth 0:18:37.080 --> 0:18:41.199 of the frequency range of the original sample. So what 0:18:41.280 --> 0:18:44.320 that actually means, and this, this is actually pretty easy. 0:18:44.720 --> 0:18:48.159 The bands are not limited to a specific number for 0:18:48.240 --> 0:18:53.240 their frequency range. Right. The bands don't mean that 0:18:53.280 --> 0:18:56.359 on band number one it goes from twenty hertz 0:18:56.440 --> 0:18:58.840 up to a certain range and on band five hundred 0:18:59.000 --> 0:19:02.399 seventy six it ends at twenty kilohertz. 
That's not what 0:19:02.440 --> 0:19:05.600 it means. They're dependent upon the original audio. So if 0:19:05.600 --> 0:19:09.680 the original audio contains sounds within a narrow range of frequencies, 0:19:10.040 --> 0:19:13.760 the five seventy six bands will be more precise. But if the 0:19:13.760 --> 0:19:17.600 original recording has a vast range of frequencies, the bands 0:19:17.640 --> 0:19:20.440 are less precise. So another way to think about this 0:19:21.119 --> 0:19:24.160 is with a pizza. So let's say you get an extra 0:19:24.240 --> 0:19:26.960 large pizza and you cut it into eight equal slices. 0:19:27.600 --> 0:19:30.280 And then you get a small pizza and you cut 0:19:30.320 --> 0:19:33.600 that into eight equal slices. Well, in both cases you 0:19:33.640 --> 0:19:37.760 have, with each slice, one eighth of a pizza. But 0:19:37.840 --> 0:19:42.080 the extra large pizza slice is bigger than the 0:19:42.119 --> 0:19:45.280 small pizza slice. It all depends on the size 0:19:45.280 --> 0:19:47.960 of the pizza. So in this case, it depends upon 0:19:48.000 --> 0:19:51.080 the range of frequencies. And, and Dylan, do you think 0:19:51.080 --> 0:19:53.280 we could go for some pizza, you know, just, just 0:19:53.320 --> 0:19:56.159 put the episode on hold and go get pizza? Dylan's nodding. 0:19:56.720 --> 0:20:00.879 It's great for audio. Yeah, so, uh, pizza. We'll be 0:20:00.960 --> 0:20:05.800 right back. Okay, that was good pizza. Now, um, oh man, 0:20:05.840 --> 0:20:08.400 I got a whole bunch more notes. Okay, well, let's, 0:20:08.440 --> 0:20:10.879 let's go ahead and, and do the rest of this. 0:20:10.920 --> 0:20:12.840 All right, so you've got your sound divided up into 0:20:12.880 --> 0:20:16.320 those five seventy six subbands of frequencies, you know, 0:20:16.640 --> 0:20:19.840 the thing I compared to pizza slices earlier. Now you 0:20:19.880 --> 0:20:25.359 get two different mathematical processes applied to this data. 
One 0:20:25.520 --> 0:20:28.919 is the fast Fourier transform, or FFT, and 0:20:28.960 --> 0:20:32.720 the other is the modified discrete cosine transform, or 0:20:32.800 --> 0:20:36.760 MDCT. Now I am not going to dive 0:20:36.800 --> 0:20:40.040 deeply into how these transforms work because frankly, they are 0:20:40.119 --> 0:20:44.439 beyond my mathematical understanding. But I know what they do. 0:20:44.680 --> 0:20:49.280 I just cannot explain the process, like how they do 0:20:49.400 --> 0:20:51.479 what they do. So I'm going to give you the 0:20:51.480 --> 0:20:54.720 explanation of what they do, what the outcome of each 0:20:54.760 --> 0:20:58.840 of these transform processes happens to be. But I'm not 0:20:58.920 --> 0:21:00.800 going to be able to tell you the actual mathematical 0:21:00.840 --> 0:21:03.479 steps involved in each because I don't math so good, guys. 0:21:04.640 --> 0:21:07.520 But let's start with the fast Fourier transform. So 0:21:07.640 --> 0:21:09.720 transform is kind of what it sounds like. It's all 0:21:09.720 --> 0:21:13.960 about transforming information in some way. So in this particular case, 0:21:14.119 --> 0:21:17.359 the FFT transforms the frequency bands we just 0:21:17.400 --> 0:21:22.360 talked about into data that can be further analyzed by 0:21:22.480 --> 0:21:26.600 a psychoacoustic model that's in the encoder. So this is 0:21:26.640 --> 0:21:29.960 that simulated human ear and brain we were talking about earlier. 0:21:30.840 --> 0:21:34.800 So what the encoder does is it analyzes each bit 0:21:34.920 --> 0:21:38.600 of data and looks for signs that it represents audio 0:21:38.680 --> 0:21:41.680 that wouldn't be perceived by a human. So it's 0:21:41.800 --> 0:21:46.240 looking for any potential for masking possibilities. 
So are there 0:21:46.240 --> 0:21:48.800 collections of frequencies that are grouped close together, and is 0:21:48.840 --> 0:21:51.320 one of those frequencies louder than the others? You might 0:21:51.359 --> 0:21:53.919 be able to do away with those softer frequencies because 0:21:53.960 --> 0:21:57.480 of frequency masking. The encoder will also look at whether 0:21:57.560 --> 0:21:59.879 or not the audio has a lot of complexity to it, 0:22:00.800 --> 0:22:02.960 if it has a lot of changes, or if it's 0:22:03.000 --> 0:22:07.840 just relatively steady or simple audio. Any transient sounds that 0:22:07.880 --> 0:22:11.600 are present in the audio might end up being temporal masking, 0:22:11.680 --> 0:22:14.040 so it'll analyze those as well and see if that's 0:22:14.040 --> 0:22:19.000 a possibility. So really what they're looking for is, you know, 0:22:20.280 --> 0:22:23.320 just any really loud sounds that stand out above the 0:22:23.400 --> 0:22:26.119 rest of the recording. That's what the FFT 0:22:26.280 --> 0:22:30.200 is doing. So what about the modified discrete cosine transform? Well, 0:22:30.240 --> 0:22:32.359 this is happening in parallel with the FFT, 0:22:32.800 --> 0:22:36.280 and the samples get sorted into different patterns called windows, 0:22:37.119 --> 0:22:39.679 uh, and the criterion for sorting all has to do 0:22:39.720 --> 0:22:43.719 with whether the sample represents a steady sound or varied sound. 0:22:44.240 --> 0:22:47.359 So if you have a simple steady sound, that goes 0:22:47.400 --> 0:22:51.200 into a long window. If there's a lot of variation 0:22:51.240 --> 0:22:53.960 in the sound, like there are a lot of consonants 0:22:53.960 --> 0:22:56.760 in a vocal line or it's like a drum solo 0:22:56.960 --> 0:22:59.600 or something like that, it would get sorted into a 0:22:59.720 --> 0:23:02.960 series of three short windows, and each short window 0:23:03.000 --> 0:23:09.320 contains one hundred ninety two samples. 
That amounts to four whole milliseconds, 0:23:09.440 --> 0:23:15.000 so four thousandths of a second, in three patterned windows. 0:23:15.040 --> 0:23:18.080 So you've got these windows now, either long windows for 0:23:18.119 --> 0:23:21.600 simple sounds or short windows for the more complex sounds. 0:23:21.640 --> 0:23:24.600 And then the modified discrete cosine transform kicks into gear. 0:23:24.680 --> 0:23:26.840 It looks at each long window or set of three 0:23:26.840 --> 0:23:30.920 short windows and converts them into a set of spectral values. 0:23:31.520 --> 0:23:33.800 To some of you, that probably sounds meaningless. So let's 0:23:33.840 --> 0:23:37.720 talk about spectral analysis for a second. First, I was 0:23:38.000 --> 0:23:40.919 very disappointed to learn that spectral analysis doesn't involve a 0:23:40.920 --> 0:23:46.199 psychologist talking to a ghost about its emotional state, so, bummer. 0:23:47.000 --> 0:23:50.560 But spectral analysis is when you look at a spectrum 0:23:50.600 --> 0:23:54.800 of information, like a spectrum of frequencies or related information 0:23:54.840 --> 0:23:58.399 like energy states. That's what this transform does. It takes 0:23:58.520 --> 0:24:02.119 data that originally represents a slice of time in a 0:24:02.200 --> 0:24:05.360 sound waveform. That's what a sample is. A sample is an 0:24:05.400 --> 0:24:09.280