Speaker 1: Get in touch with technology with TechStuff from howstuffworks.com. Hey everybody, and welcome to TechStuff. I'm Jonathan Strickland. I'm the host of the show, and this is a Saturday morning rerun episode where we take a classic episode of TechStuff and we present it to you guys who may have missed it. I've been talking a lot about tech and music recently. If you've been listening to the recent episodes, you know all about that, and there have been some great discussions. But it also requires a little bit of knowledge of previous episodes at times, and I know it can be tricky to dig through the archives. So in this classic episode, I talk about how the MP3 compression format works, so that you can actually understand how MP3 works as opposed to something like MIDI, and you can get an appreciation for the differences between the two formats. This episode originally published in January two thousand seventeen. That's more than a whole year ago now; we're in April two thousand eighteen as I record this. I hope you enjoy this classic episode.
I hope it gives you a deeper appreciation of the technical aspect of creating digital music, and I'll see you guys on the other side. So let's remember that the heart of digital information is the bit, which is either a zero or a one: the basic unit of information for digital formats, zeros and ones. Now we can use those zeros and ones to describe all sorts of information, from text to audio to video, and really pretty much anything you can think of that's represented digitally. Ultimately, when you get down to it, it's a bunch of zeros and ones. So let's say you start off with your uncompressed audio file. You've got this enormous audio file in front of you. It's made up of zeros and ones. How do you make that file smaller? In the physical world, we can compress stuff, right? We can apply physical pressure to things. Think about packing a suitcase. You can make sure you get that extra outfit in if you just press it down hard enough and get that zipper zipped before it can burst open.
But once you get to a certain level of compression, you cannot make things smaller, at least not without hurting yourself or whatever it is you're trying to compress. Digital files are a little different, because you cannot physically cram the zeros and ones closer together. It doesn't work like that. These are abstract things. You can't make them smaller, right? You can't decrease the font. It doesn't work that way. The numbers represent two different states. So if you want to create a smaller audio file containing the recording that was in a larger audio file, you have to start getting creative. Now, in the last part of this series, I talked about how the MP3 compression algorithm was born from an applied research institution in Germany, and the team behind the MP3 wanted to find a way to compress audio, specifically music, for transmission over phone lines. Eventually this evolved into the Moving Picture Experts Group Audio Layer III compression methodology, better known as MP3, and there are also the MPEG-2 and MPEG-4 standards.
MPEG-2, by the way, is the basis of compression on DVDs, although the actual DVD format is really a modification of MPEG-2. And MPEG-4 is a compression strategy for audio and video that's frequently used in lots of different capacities, including streaming media services. So by the late nineteen seventies, researchers began to explore the possibility of leveraging psychoacoustics to figure out how to compress audio. And psychoacoustics refers to the way we perceive sound, and also the physiological effects of sound on us. So this involves not just our physical sense of hearing, but also our brains and the way our brains interpret sound. So, for example, there's a psychoacoustic phenomenon called the Haas effect, H-A-A-S, and I think it's pretty interesting. So here's how the Haas effect works. If you hear the exact same sound coming from different directions, but the two sounds arrive within thirty to forty milliseconds of each other, your brain will be convinced that you really only heard one sound, and it came from the direction that hit you first.
So let's say a sound is coming from directly in front of you and another from your left, and you get both of them within that thirty-to-forty-millisecond range, and you hear the one coming from ahead of you first. To you, you're convinced that you only heard that sound once, and it came from dead on, straight ahead of you. Your brain kind of discounts the one that came from the left, although it can reinforce it, which ends up being really useful if you're planning out PA systems for stage shows. I'm not joking. That really is the way that people plan those things out. It's pretty neat. Humans perceive sounds in a way that's not necessarily representational of all the sounds surrounding us. You can think of your brain as the filter between your understanding and what reality actually is. A lot of stuff goes on where your brain ends up getting rid of information, where it just says, you know what, he or she doesn't need that, it's just gonna confuse things. We're gonna dump it. And that's kind of how it works. It's all on an unconscious level. It's not like you're actively working to do this.
So let's say you're in a relatively busy hallway, and there could be a lot of sounds in that hallway, stuff that's going on constantly around you. Maybe there are doors opening and closing. Maybe there are footsteps going up and down the hallway. Maybe someone's shoes are squeaking against the linoleum floor. People are chattering away in there. But you are having a conversation with someone, so you turn your focus on that person, and the other sounds seemingly fade away. They're still there, but they're not important. So in this example, you would actually call those other sounds a distraction, and you would really focus on the conversation. That also shows how we're able to consciously direct our perception of hearing. So both of these factors come into play. Now, one thing that MP3 encoding takes advantage of is something called masking, and there are a couple of different variations of the masking effect. One of them is called frequency masking. So let's say you've got two sound frequencies that are similar; perhaps they're just a few hertz apart.
Remember, frequencies are measured in hertz, which is really the number of oscillations per second. So let's say you've got a sound that's at, I don't know, one thousand hertz, and another one that's at one thousand and ten hertz. Now, the human ear is precise enough to be able to tell the difference between two sounds that are at least two hertz apart from each other. That's how precise our resolution of hearing is; it's at that level. But if you get two sounds played at the same time, and they are that close together in frequency, and one of those frequencies is played at a greater volume than the other, our brains will pick up on the louder sound and ignore the quieter sound, even though both of them are present. What becomes important at that point is the amplitude. Now, the further apart in frequency you get, the less of an effect that has. So if you get far enough apart, where there are two pitches, one of them noticeably louder than the other, but they're far enough apart, you will hear both of them.
It only works if the two pitches are relatively close together, and there's not a universal formula for frequency masking. As you get closer to the boundaries of human hearing, frequency masking becomes easier. So if it's a really low pitch or a really high pitch, it's easier to get away with it. Once you start getting into what is thought of as the sweet spot for human hearing, which is generally considered to be between two and five kilohertz, you need a greater difference in volume or a smaller difference in frequency in order for masking to work. For frequency masking, at any rate. But then there's also temporal masking, and you might say, okay, I got it, temporal, that means time. Indeed it does, my friend. This describes the effect of a short but loud sound masking a softer sound for a short time. The weird thing is, the loud sound can actually mask sounds that precede it slightly, not by a whole lot, but a little bit.
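The frequency-masking rule described here can be sketched in a few lines of Python. This is a toy illustration only, not the real MP3 psychoacoustic model: the ten-hertz window and the example tones are made-up numbers, and real encoders weigh masking against frequency-dependent thresholds.

```python
def mask_quieter_tones(tones, window_hz=10.0):
    """tones: list of (frequency_hz, amplitude_db) pairs.
    Keep a tone unless a louder tone sits within window_hz of it."""
    survivors = []
    for freq, amp in tones:
        masked = any(abs(freq - other_freq) <= window_hz and other_amp > amp
                     for other_freq, other_amp in tones)
        if not masked:
            survivors.append((freq, amp))
    return survivors

# A loud 1,000 Hz tone masks a quiet 1,010 Hz neighbor, but a quiet
# 5,000 Hz tone is far enough away in frequency to survive.
tones = [(1000.0, 60.0), (1010.0, 30.0), (5000.0, 30.0)]
print(mask_quieter_tones(tones))  # [(1000.0, 60.0), (5000.0, 30.0)]
```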
MP3 compression takes advantage of both frequency and temporal masking when it's trying to determine which data needs to be included and which data can be dumped because it won't affect your perception of whatever the audio file is in the first place. You also probably remember I talked about the physical limitation on what we humans can hear, no matter what our brains might be up to. So this doesn't have to do with our brains, you know, filtering through the information that's coming in. This has to do with the physical limitations of the human ear. In the last episode of the series, I said typical human hearing (keep in mind "typical"; there are exceptions) covers the range of frequencies between about twenty hertz and twenty kilohertz, or twenty thousand hertz. So twenty to twenty thousand. Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right? And as you get older, your ability to perceive those higher frequencies starts to diminish. So most adults actually have an upper range closer to sixteen kilohertz, not twenty. Kids, they can hear those higher pitches.
You may have heard the story about how some convenience stores experimented with getting rid of teenage loiterers by projecting these super high pitches that adults could not hear but kids could, and it discouraged kids from hanging out at the convenience store and loitering. I love that idea so much. Anyway, that's because I'm old and my hearing is terrible. Well, remember I also mentioned you can detect changes in pitch in two-hertz increments. If you get below two hertz of change, like if it's just a one-hertz difference between two frequencies, it's too low a resolution for us to detect. To us, it will sound exactly the same. So if you were to hear a frequency at one thousand one hertz, or one point zero zero one kilohertz, and one at one point zero zero two kilohertz, you wouldn't notice the difference. They would sound exactly the same to you.
So if you're gonna take audio and compress it, one step you could consider is eliminating anything that's outside the actual range of frequencies that we can hear, or simplifying any changes in frequency that are smaller than two hertz. If you take all that data and you say, it is physically impossible for a human to perceive this, get rid of that information, then in theory it wouldn't have any effect on the rest of the recording. But how do you go further than that, right? How do you create a method so that you can really compress this file? You want a method that will preserve the important sounds while potentially ignoring all the unimportant or incidental sounds. And you want it to be automatic, because if you have to do it manually, that's going to take countless hours just to edit a single sound file. So that was the challenge that the MP3 research team faced as a group. Now, their solution, which ultimately created even more challenges, was to come up with what was essentially a simulated human ear and brain.
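That first step, throwing away frequencies nobody can physically hear, is simple enough to sketch in code. A minimal, hypothetical pre-filter over a list of frequency components; the twenty hertz to twenty kilohertz limits come straight from the episode, and everything else here is illustrative:

```python
AUDIBLE_LOW_HZ = 20.0       # lower bound of typical human hearing
AUDIBLE_HIGH_HZ = 20_000.0  # upper bound of typical human hearing

def keep_audible(frequencies):
    """Discard any frequency component outside the audible range."""
    return [f for f in frequencies if AUDIBLE_LOW_HZ <= f <= AUDIBLE_HIGH_HZ]

# A 5 Hz rumble and a 30 kHz ultrasonic component both get dropped.
print(keep_audible([5.0, 440.0, 15_000.0, 30_000.0]))  # [440.0, 15000.0]
```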
They needed to replicate the experience of perceiving music so that an algorithm could evaluate every sound in an audio file and judge if it in fact was relevant enough to include in the final compressed version. If a sound were imperceptible, then it wouldn't make sense to include it in the MP3 file. So by leaving out all the irrelevant data, they could make the audio information take up less bandwidth. The file itself would be smaller, because you just dumped everything that wasn't important. So the team used an algorithm called Low Complexity Adaptive Transform Coding, or LC-ATC, as the foundation for their research. This was kind of their starting point, and this is an approach that tries to do away with redundancy as much as possible, and it also incorporates adaptation to perceptual requirements. Also, MP3 owes a lot to the MPEG Layer II standard. Layer II obviously came out before Layer III, and so a lot of the features of Layer III are really legacy features from Layer II.
In other words, the MP3 group kind of got stuck with them, because otherwise they would have had a problem with backwards compatibility. So the result is kind of a clunky arrangement under the hood, and some of the features may make very little sense when I go through them, but some of that is because it's a holdover from an earlier compression strategy, which isn't terribly satisfying as an answer. But the reason many parts of the MP3 compression algorithm are the way they are is because that's the way we've always done it. So next I'm gonna dive into the phases of compression. But before I do that, let's all take a deep breath and take a moment to thank our sponsor. And we're back. So there are two big phases we'll need to talk about with MP3 compression. The first phase is analysis, and the second phase is the actual compression itself. And after that there's the process of decoding an MP3 for playback, but that's way simpler once we get an understanding of how the encoding process actually happens. So let's begin with analysis.
This is the part where the standard has to figure out which frequencies within an audio recording are important or perceptible. So how does a program, an encoder, figure out what we can hear and what we cannot hear? All right, time to get technical. You start off with your pulse code modulation audio file, or PCM file. And you might remember I talked about PCM audio in the first episode of this series, but just in case you don't, it's a lossless digital audio file. The actual format could be a WAV or AIFF or something along those lines, but the important thing to keep in mind is that it is uncompressed. Now, that means those files tend to be pretty big. This is our raw material that we want to take and squish down to a more manageable, transferable size. And in our last episode in this series, I also mentioned that the standard for CD audio is a sample rate of forty-four point one kilohertz.
And we learned that you need a sample rate twice the frequency of the highest frequency in your recording, and since human hearing tops out at around twenty kilohertz, the standard for CDs is forty-four point one kilohertz. The MP3 standard can support lots of different sample rates, but forty-four point one kilohertz is pretty much the common standard. So you've got a number of samples with your audio file, and that number will depend upon how long the audio file is. You've got forty-four thousand one hundred samples per second, actually twice that for stereo. But for the purposes of this discussion, let's kind of stick with mono sound so that I don't start having math coming out of my ears. And we're still in the very easy, simple part as far as math goes; we haven't gotten to the complicated stuff yet. All right, so you've got forty-four thousand one hundred samples per second. To compress it into an MP3 format, the algorithm first groups all of these samples into collections called frames.
So take those forty-four thousand one hundred samples per second, and then you start saying, okay, we're gonna group you in batches. Each batch is called a frame, and each frame contains one thousand one hundred fifty-two samples. Now, that's specifically to maintain backwards compatibility with MPEG Layer II, which established that one thousand one hundred fifty-two number. But we're not talking about MPEG Layer II; we're talking about MPEG Layer III, and that means we have to get a little more complicated. So each frame consists of two subgroups called granules. Each granule has five hundred seventy-six samples (five seventy-six times two is eleven fifty-two), so five hundred seventy-six samples per granule. Now, technically MP3 encoders only work on one granule at a time, but they may reference the granules immediately before and immediately after the current one in order to see how the audio within the file changes over time. All right, so now you've got your granules of five hundred seventy-six samples each. Then the MP3 encoder runs the samples through a filter bank, which sorts the sound into thirty-two frequency ranges.
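The numbers in this stretch of the episode fit together neatly, and it can help to see the arithmetic written out. A quick sketch using the figures above (forty-four thousand one hundred samples per second, eleven-fifty-two-sample frames, two granules per frame):

```python
SAMPLE_RATE = 44_100       # samples per second (mono, CD-quality)
SAMPLES_PER_FRAME = 1_152  # frame size inherited from MPEG Layer II
GRANULES_PER_FRAME = 2     # MP3 splits each frame into two granules

samples_per_granule = SAMPLES_PER_FRAME // GRANULES_PER_FRAME
frames_per_second = SAMPLE_RATE / SAMPLES_PER_FRAME

print(samples_per_granule)          # 576
print(round(frames_per_second, 2))  # 38.28 frames per second of audio
```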
Are you crazy about the numbers yet, Dylan? Dylan's nodding. Dylan, it gets worse from here. So you have thirty-two frequency ranges, which is another nod to the Layer II method, which used those thirty-two ranges for encoding purposes. But we're not talking about Layer II, are we? No, we're talking MP3, gosh darn it. That means we take those thirty-two ranges and we subdivide them by a factor of eighteen. That means we have five hundred seventy-six bands of frequencies, each band containing one five-hundred-seventy-sixth of the frequency range of the original sample. So what that actually means, and this is actually pretty easy, is that the bands are not limited to a specific number for their frequency range, right? The bands don't mean that band number one goes from twenty hertz up to a certain range, and band five seventy-six ends at twenty kilohertz. That's not what it means. They're dependent upon the original audio. So if the original audio contains sounds within a narrow range of frequencies, the five hundred seventy-six bands will be more precise.
But if the original recording has a vast range of frequencies, the bands are less precise. So another way to think about this is with a pizza. Let's say you get an extra-large pizza and you cut it into eight equal slices, and then you get a small pizza and you cut that into eight equal slices. Well, in both cases, with each slice you have one eighth of a pizza. But the extra-large pizza slice is bigger than the small pizza slice. It all depends on the size of the pizza. So in this case, it depends upon the range of frequencies. And Dylan, do you think we could go for some pizza? You know, just put the episode on hold and go get pizza. Dylan's nodding. It's great for audio. Yeah, so, pizza. We'll be right back. Okay, that was good pizza. Now, oh man, I got a whole bunch more notes. Okay, well, let's go ahead and do the rest of this. All right, so you've got your sound divided up into those five hundred seventy-six sub-bands of frequencies, you know, the thing I compared to pizza slices earlier.
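The pizza analogy translates directly into arithmetic. Here is a sketch of the band math as described: thirty-two ranges subdivided by eighteen gives five hundred seventy-six bands, and how wide each band is depends on how wide a frequency range the original audio spans. The equal-split assumption is a simplification for illustration; real MP3 sub-bands are not laid out this uniformly.

```python
SUBBAND_RANGES = 32  # filter-bank ranges inherited from Layer II
SUBDIVISION = 18     # MP3 subdivides each range by a factor of 18
total_bands = SUBBAND_RANGES * SUBDIVISION
print(total_bands)  # 576

def band_width_hz(low_hz, high_hz):
    """Width of one band if the audio spans [low_hz, high_hz] and the
    576 bands split that span equally (an illustrative assumption)."""
    return (high_hz - low_hz) / total_bands

# Narrow-range audio (the small pizza): precise, narrow bands.
print(band_width_hz(200.0, 2_000.0))  # 3.125 Hz per band
# Full audible range (the extra-large pizza): coarser bands.
print(band_width_hz(20.0, 20_000.0))  # 34.6875 Hz per band
```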
Now you 333 00:19:52,920 --> 00:19:58,399 Speaker 1: get two different mathematical processes applied to this data. One 334 00:19:58,560 --> 00:20:01,959 Speaker 1: is the fast Fourier transform, or FFT, and 335 00:20:02,000 --> 00:20:05,720 Speaker 1: the other is the modified discrete cosine transform, or 336 00:20:05,840 --> 00:20:09,800 Speaker 1: MDCT. Now, I am not going to dive 337 00:20:09,840 --> 00:20:13,080 Speaker 1: deeply into how these transforms work, because frankly, they are 338 00:20:13,160 --> 00:20:17,480 Speaker 1: beyond my mathematical understanding. But I know what they do. 339 00:20:17,760 --> 00:20:22,320 Speaker 1: I just cannot explain the process, like how they do 340 00:20:22,440 --> 00:20:24,520 Speaker 1: what they do. So I'm going to give you the 341 00:20:24,560 --> 00:20:27,760 Speaker 1: explanation of what they do, what the outcome of each 342 00:20:27,800 --> 00:20:31,880 Speaker 1: of these transform processes happens to be, but I'm not 343 00:20:31,960 --> 00:20:33,840 Speaker 1: going to be able to tell you the actual mathematical 344 00:20:33,880 --> 00:20:36,520 Speaker 1: steps involved in each, because I don't math so good, guys. 345 00:20:37,680 --> 00:20:40,560 Speaker 1: But let's start with the fast Fourier transform. So a 346 00:20:40,680 --> 00:20:42,760 Speaker 1: transform is kind of what it sounds like. It's all 347 00:20:42,760 --> 00:20:47,000 Speaker 1: about transforming information in some way. So in this particular case, 348 00:20:47,160 --> 00:20:50,399 Speaker 1: the FFT transforms the frequency bands we just 349 00:20:50,440 --> 00:20:55,400 Speaker 1: talked about into data that can be further analyzed by 350 00:20:55,520 --> 00:20:59,639 Speaker 1: a psychoacoustic model that's in the encoder. So this is 351 00:20:59,680 --> 00:21:03,000 Speaker 1: that simulated human ear and brain we were talking about earlier.
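To make that FFT step concrete, here's a minimal NumPy sketch. This is plain FFT analysis, not an actual MP3 encoder, and the sample rate, granule length, and tones are invented for illustration:

```python
import numpy as np

# One hypothetical 576-sample granule at 44.1 kHz: a loud 440 Hz tone
# plus a much quieter neighbor at 450 Hz.
rate, n = 44100, 576
t = np.arange(n) / rate
signal = np.sin(2 * np.pi * 440 * t) + 0.05 * np.sin(2 * np.pi * 450 * t)

# The FFT turns time-domain samples into per-frequency magnitudes --
# exactly the kind of view a psychoacoustic model can scan for loud
# components that might mask their quiet neighbors.
magnitudes = np.abs(np.fft.rfft(signal))
peak_hz = np.argmax(magnitudes) * rate / n
print(f"loudest component near {peak_hz:.0f} Hz")
```

The quiet 450 Hz tone sits right next to the loud 440 Hz one in this magnitude view, which is what makes it a candidate for frequency masking.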
352 00:21:03,880 --> 00:21:07,840 Speaker 1: So what the encoder does is it analyzes each bit 353 00:21:07,960 --> 00:21:11,639 Speaker 1: of data and looks for signs that it represents audio 354 00:21:11,720 --> 00:21:14,640 Speaker 1: that wouldn't be perceived by a human. So it's 355 00:21:14,840 --> 00:21:19,280 Speaker 1: looking for any potential masking possibilities. So are there 356 00:21:19,280 --> 00:21:21,840 Speaker 1: collections of frequencies that are grouped close together, and is 357 00:21:21,880 --> 00:21:24,359 Speaker 1: one of those frequencies louder than the others? You might 358 00:21:24,400 --> 00:21:27,000 Speaker 1: be able to do away with those softer frequencies because 359 00:21:27,000 --> 00:21:30,520 Speaker 1: of frequency masking. The encoder will also look at whether 360 00:21:30,640 --> 00:21:33,000 Speaker 1: or not the audio has a lot of complexity to it, 361 00:21:33,840 --> 00:21:36,000 Speaker 1: if it has a lot of changes, or if it's 362 00:21:36,040 --> 00:21:40,879 Speaker 1: just relatively steady or simple audio. Any transient sounds that 363 00:21:40,920 --> 00:21:44,640 Speaker 1: are present in the audio might allow for temporal masking, 364 00:21:44,720 --> 00:21:47,080 Speaker 1: so it'll analyze those as well and see if that's 365 00:21:47,080 --> 00:21:52,040 Speaker 1: a possibility. So really, what it's looking for is, you know, 366 00:21:53,320 --> 00:21:56,399 Speaker 1: just any really loud sounds that stand out above the 367 00:21:56,440 --> 00:21:59,159 Speaker 1: rest of the recording. That's what the FFT 368 00:21:59,320 --> 00:22:03,240 Speaker 1: is doing. So what about the modified discrete cosine transform? Well, 369 00:22:03,280 --> 00:22:05,399 Speaker 1: this is happening in parallel with the FFT, 370 00:22:05,840 --> 00:22:10,360 Speaker 1: and the samples get sorted into different patterns called windows.
371 00:22:10,359 --> 00:22:12,920 Speaker 1: And the criterion for sorting all has to do with 372 00:22:12,920 --> 00:22:16,760 Speaker 1: whether the sample represents a steady sound or a varied sound. 373 00:22:17,280 --> 00:22:20,400 Speaker 1: So if you have a simple, steady sound, that goes 374 00:22:20,440 --> 00:22:24,240 Speaker 1: into a long window. If there's a lot of variation 375 00:22:24,280 --> 00:22:27,000 Speaker 1: in the sound, like there are a lot of consonants 376 00:22:27,000 --> 00:22:29,800 Speaker 1: in a vocal line, or it's like a drum solo 377 00:22:30,000 --> 00:22:32,720 Speaker 1: or something like that, it would get sorted into a 378 00:22:32,800 --> 00:22:36,480 Speaker 1: series of three short windows. And each short window contains 379 00:22:36,520 --> 00:22:42,560 Speaker 1: one hundred ninety two samples. That amounts to about four milliseconds, so 380 00:22:42,720 --> 00:22:48,159 Speaker 1: four thousandths of a second, in three patterned windows. So 381 00:22:48,200 --> 00:22:51,440 Speaker 1: you've got these windows now, either long windows for simple 382 00:22:51,480 --> 00:22:54,760 Speaker 1: sounds or short windows for the more complex sounds, and 383 00:22:54,760 --> 00:22:57,800 Speaker 1: then the modified discrete cosine transform kicks into gear. It 384 00:22:57,800 --> 00:23:00,200 Speaker 1: looks at each long window or set of three short 385 00:23:00,240 --> 00:23:03,960 Speaker 1: windows and converts them into a set of spectral values. 386 00:23:04,560 --> 00:23:06,840 Speaker 1: To some of you, that probably sounds meaningless. So let's 387 00:23:06,880 --> 00:23:10,760 Speaker 1: talk about spectral analysis for a second. First, I was 388 00:23:11,040 --> 00:23:13,960 Speaker 1: very disappointed to learn that spectral analysis doesn't involve a 389 00:23:13,960 --> 00:23:19,280 Speaker 1: psychologist talking to a ghost about its emotional state. So bummer.
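That long-versus-short sorting can be sketched with a toy heuristic. Real encoders make this call with a psychoacoustic measure, so the energy-jump test and the threshold of eight below are purely illustrative:

```python
import numpy as np

def choose_window(granule):
    """Pick a window type for one 576-sample granule (toy heuristic)."""
    # Compare the energies of the three 192-sample thirds -- the size
    # of one short window.
    thirds = granule.reshape(3, 192)
    energy = (thirds ** 2).sum(axis=1)
    # A sudden jump in energy suggests a transient (drum hit, hard
    # consonant): use three short windows so quantization noise stays
    # confined to a few milliseconds. Steady sound gets one long window.
    return "short" if energy.max() > 8 * (energy.min() + 1e-12) else "long"

steady = np.sin(2 * np.pi * 440 * np.arange(576) / 44100)
transient = np.zeros(576)
transient[400:] = np.random.default_rng(0).normal(size=176)
print(choose_window(steady), choose_window(transient))
```

A steady tone spreads its energy evenly across the three thirds; a drum-hit-like burst concentrates it at the end, which is what flips the decision.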
390 00:23:20,040 --> 00:23:23,600 Speaker 1: But spectral analysis is when you look at a spectrum 391 00:23:23,640 --> 00:23:27,840 Speaker 1: of information, like a spectrum of frequencies or related information 392 00:23:27,880 --> 00:23:31,480 Speaker 1: like energy states. That's what this transform does. It takes 393 00:23:31,560 --> 00:23:35,159 Speaker 1: data that originally represented a slice of time in a 394 00:23:35,240 --> 00:23:38,400 Speaker 1: sound waveform. That's what a sample is. A sample is an 395 00:23:38,440 --> 00:23:42,320 Speaker 1: instance of time in a waveform. And it converts it 396 00:23:42,359 --> 00:23:48,880 Speaker 1: into information representing sound as energy across a range of frequencies. Now, 397 00:23:48,880 --> 00:23:51,119 Speaker 1: you can plot out spectral information in a lot of 398 00:23:51,119 --> 00:23:54,040 Speaker 1: different ways, but one common method is to use brightness 399 00:23:54,080 --> 00:23:58,840 Speaker 1: to indicate energy levels. Higher energy levels are brighter patches 400 00:23:59,080 --> 00:24:03,840 Speaker 1: in your visual representation of spectral data. High frequencies 401 00:24:03,920 --> 00:24:06,720 Speaker 1: would appear at the top of a spectral view. Like, 402 00:24:06,800 --> 00:24:10,000 Speaker 1: imagine a box, and at the top of the box, 403 00:24:10,200 --> 00:24:12,440 Speaker 1: that's where you would find high frequencies. At the bottom 404 00:24:12,440 --> 00:24:14,760 Speaker 1: of the box is where you find low frequencies, and it's 405 00:24:14,800 --> 00:24:17,880 Speaker 1: just lots of patches of color. The really bright patches 406 00:24:17,880 --> 00:24:23,280 Speaker 1: of color represent very high energy frequencies, so they could 407 00:24:23,280 --> 00:24:27,080 Speaker 1: be high or low in actual frequency, but we're 408 00:24:27,080 --> 00:24:30,640 Speaker 1: talking about energy levels, not whether it's a high or low pitch.
409 00:24:32,520 --> 00:24:35,160 Speaker 1: Looking left or right represents the passing of time, and 410 00:24:35,200 --> 00:24:38,600 Speaker 1: looking along any vertical point shows you the actual frequency, 411 00:24:39,280 --> 00:24:42,840 Speaker 1: or pitch, and then the respective energy level is the brightness. 412 00:24:42,960 --> 00:24:45,119 Speaker 1: So it's kind of like looking at sound as a wave, 413 00:24:45,280 --> 00:24:47,800 Speaker 1: but instead of being a wave, you're looking at information 414 00:24:47,800 --> 00:24:52,639 Speaker 1: that indicates frequency range and energy level. That representation is 415 00:24:52,640 --> 00:24:55,520 Speaker 1: actually kind of analogous to how we hear audio. So 416 00:24:55,600 --> 00:24:58,720 Speaker 1: an encoder can analyze the spectral view and start to 417 00:24:58,720 --> 00:25:02,920 Speaker 1: filter out the data we wouldn't perceive, due to psychoacoustics. Now, 418 00:25:02,960 --> 00:25:06,960 Speaker 1: after all that processing, the encoder looks at the frequency 419 00:25:07,040 --> 00:25:10,240 Speaker 1: sub bands and the levels of spectral intensity for each, 420 00:25:10,840 --> 00:25:14,240 Speaker 1: and that information can then be used for the next phase, 421 00:25:14,840 --> 00:25:18,280 Speaker 1: which is compression. But right now I think we could 422 00:25:18,320 --> 00:25:21,800 Speaker 1: all stand a little decompression, so let's take another quick 423 00:25:21,800 --> 00:25:33,280 Speaker 1: break to thank our sponsor. All right, so now you're 424 00:25:33,320 --> 00:25:37,320 Speaker 1: ready to compress your analyzed audio. Good for you, and 425 00:25:37,359 --> 00:25:41,120 Speaker 1: by you I mean encoders.
This has to be simpler 426 00:25:41,160 --> 00:25:44,159 Speaker 1: than that analysis segment, right? I mean, that got a 427 00:25:44,160 --> 00:25:47,880 Speaker 1: little crazy with all the different bands and sub bands 428 00:25:48,040 --> 00:25:55,160 Speaker 1: and windows and frames and granules. Sadly, it gets more complicated. 429 00:25:55,160 --> 00:25:58,320 Speaker 1: All right. So there are two layers of compression going 430 00:25:58,359 --> 00:26:03,040 Speaker 1: on with MPEG Layer 3. One of those layers depends 431 00:26:03,119 --> 00:26:07,560 Speaker 1: upon the psychoacoustic analysis and the other doesn't. So why 432 00:26:07,560 --> 00:26:10,840 Speaker 1: would you use two layers with different strategies like that? Well, 433 00:26:10,880 --> 00:26:13,879 Speaker 1: the reason is that one strategy is great for complex 434 00:26:13,920 --> 00:26:16,679 Speaker 1: audio with lots of components, but not so great with 435 00:26:16,800 --> 00:26:19,679 Speaker 1: simpler sounds, and the other strategy is kind of the opposite. 436 00:26:20,160 --> 00:26:22,560 Speaker 1: So the psychoacoustic approach is the one that's really good 437 00:26:22,600 --> 00:26:26,520 Speaker 1: for complicated sounds. If you've got a lot of 438 00:26:26,720 --> 00:26:30,879 Speaker 1: volume changes, lots of different frequencies, it's just complicated and 439 00:26:31,000 --> 00:26:33,880 Speaker 1: rich sound, you've got a lot of opportunities to look 440 00:26:33,920 --> 00:26:37,280 Speaker 1: for masking and other acoustic elements that limit the actual 441 00:26:37,359 --> 00:26:41,200 Speaker 1: sounds that people perceive. So it means there are a 442 00:26:41,240 --> 00:26:44,800 Speaker 1: lot of chances for you to fudge by dropping 443 00:26:44,800 --> 00:26:49,720 Speaker 1: all the stuff that people probably wouldn't notice anyway.
And 444 00:26:49,880 --> 00:26:51,439 Speaker 1: if you take a piece that's got a lot of 445 00:26:51,440 --> 00:26:54,960 Speaker 1: elements at varying volumes, there are likely several opportunities to 446 00:26:54,960 --> 00:26:58,800 Speaker 1: do this. But if you're talking about relatively straightforward 447 00:26:59,440 --> 00:27:04,359 Speaker 1: audio with few components, few changes in volume, there's really 448 00:27:04,359 --> 00:27:06,439 Speaker 1: not a whole lot of data you can ditch without 449 00:27:06,480 --> 00:27:08,960 Speaker 1: it actually affecting the quality of the audio in a 450 00:27:09,000 --> 00:27:13,280 Speaker 1: perceptible way. And this is part of what Brandenburg, that 451 00:27:13,320 --> 00:27:15,480 Speaker 1: guy I was talking about in our first episode in 452 00:27:15,520 --> 00:27:18,439 Speaker 1: this series, discovered when he was 453 00:27:18,840 --> 00:27:22,000 Speaker 1: working on the MP3 standard and he was listening 454 00:27:22,040 --> 00:27:26,600 Speaker 1: back to that Suzanne Vega a cappella track, Tom's Diner. 455 00:27:26,720 --> 00:27:28,560 Speaker 1: He was listening to a compressed version of it, and 456 00:27:28,560 --> 00:27:31,159 Speaker 1: he said it was terrible. He said it ruined the 457 00:27:31,200 --> 00:27:34,520 Speaker 1: quality of the audio. And part of that is because 458 00:27:34,600 --> 00:27:37,919 Speaker 1: that particular song is fairly simple; there's just not a 459 00:27:37,920 --> 00:27:40,800 Speaker 1: lot of opportunity to take advantage of masking and other 460 00:27:40,920 --> 00:27:46,520 Speaker 1: tricks without potentially compromising the quality. So they decided to 461 00:27:46,560 --> 00:27:50,600 Speaker 1: also incorporate some traditional compression strategies, which worked better 462 00:27:50,760 --> 00:27:53,679 Speaker 1: with those types of recordings.
So the MP3 format 463 00:27:53,720 --> 00:27:57,800 Speaker 1: takes advantage of both the traditional approach and the psychoacoustic approach, 464 00:27:58,520 --> 00:28:01,560 Speaker 1: and that allows the encoder to compress files into a smaller 465 00:28:01,600 --> 00:28:05,720 Speaker 1: size without just following a single strategy. Like, it doesn't 466 00:28:05,720 --> 00:28:07,800 Speaker 1: have to do a one size fits all for all 467 00:28:07,880 --> 00:28:12,639 Speaker 1: elements of audio. Now, combining those two strategies requires a 468 00:28:12,640 --> 00:28:16,359 Speaker 1: little more mathematical gymnastics. So let's go back to those 469 00:28:16,480 --> 00:28:20,240 Speaker 1: five hundred seventy six frequency bins, you know, those sub bands 470 00:28:20,280 --> 00:28:24,560 Speaker 1: we talked about earlier. You've gotta quantize those suckers. What 471 00:28:24,600 --> 00:28:27,480 Speaker 1: does that mean? It means assigning a quantity 472 00:28:27,800 --> 00:28:31,639 Speaker 1: to each frequency bin. You have to give it a 473 00:28:31,720 --> 00:28:34,720 Speaker 1: quantity of some sort so that you can end up 474 00:28:34,840 --> 00:28:39,640 Speaker 1: judging how much you can get away with dropping data. 475 00:28:40,000 --> 00:28:42,840 Speaker 1: So to do this, the encoder sorts those five hundred seventy six 476 00:28:42,880 --> 00:28:46,320 Speaker 1: bins into twenty two scale factor bands. How you doing 477 00:28:46,320 --> 00:28:50,680 Speaker 1: over there, Dylan? Just checking in on you. Okay, 478 00:28:50,720 --> 00:28:53,440 Speaker 1: Dylan's got a thousand yard stare going. I hope 479 00:28:53,480 --> 00:28:55,920 Speaker 1: you guys are doing okay out there. All right. So 480 00:28:56,120 --> 00:28:58,080 Speaker 1: before smoke starts coming out of your ears, let me 481 00:28:58,120 --> 00:29:01,800 Speaker 1: explain what the scale factor bands are all about.
The 482 00:29:01,840 --> 00:29:05,400 Speaker 1: whole purpose of the scale factor bands is to determine 483 00:29:05,480 --> 00:29:10,000 Speaker 1: how the information will be stored within the compressed state. 484 00:29:10,880 --> 00:29:12,840 Speaker 1: So you want to get away with as little data 485 00:29:12,920 --> 00:29:16,080 Speaker 1: as possible before affecting sound quality. So if you can 486 00:29:16,120 --> 00:29:19,800 Speaker 1: say the same thing in a shorter space without affecting 487 00:29:19,800 --> 00:29:22,640 Speaker 1: the quality of what it is you're saying, you go 488 00:29:22,720 --> 00:29:27,720 Speaker 1: with it. Brevity is the soul of compression. So if 489 00:29:27,720 --> 00:29:31,000 Speaker 1: we were talking about language, I would say it's more 490 00:29:31,000 --> 00:29:35,920 Speaker 1: efficient to say it's raining outside, or even just it's raining, 491 00:29:36,240 --> 00:29:39,320 Speaker 1: because you would assume that it would be outside where 492 00:29:39,320 --> 00:29:41,880 Speaker 1: the rain is happening, and it would be inefficient for 493 00:29:41,920 --> 00:29:44,400 Speaker 1: me to say it's coming down like cats and dogs 494 00:29:44,440 --> 00:29:48,280 Speaker 1: out there. It's not as efficient as saying it's raining. 495 00:29:49,040 --> 00:29:53,800 Speaker 1: So you can get away with shorter statements without 496 00:29:53,880 --> 00:29:57,680 Speaker 1: affecting the actual quality. Now, you could argue that 497 00:29:57,840 --> 00:30:00,360 Speaker 1: switching from it's coming down like cats and dogs out 498 00:30:00,360 --> 00:30:03,920 Speaker 1: there to it's raining changes the quality, and that could 499 00:30:03,920 --> 00:30:05,680 Speaker 1: be a valid argument. But if you can get away 500 00:30:06,120 --> 00:30:10,440 Speaker 1: with shorter without affecting quality, you do it.
So each 501 00:30:10,480 --> 00:30:15,000 Speaker 1: scale factor band is represented by a quantity. Then the 502 00:30:15,080 --> 00:30:19,480 Speaker 1: encoder divides that quantity by a given number called the quantizer, 503 00:30:19,840 --> 00:30:23,520 Speaker 1: which is the same across the entire frequency spectrum for 504 00:30:23,600 --> 00:30:28,080 Speaker 1: that recording. The resulting number is then rounded up or 505 00:30:28,200 --> 00:30:33,320 Speaker 1: down to a whole digit. And here's an important point: 506 00:30:33,720 --> 00:30:37,200 Speaker 1: individual scale factor bands can be scaled up or down 507 00:30:37,320 --> 00:30:41,320 Speaker 1: for more or less precision to represent the actual value 508 00:30:41,480 --> 00:30:45,480 Speaker 1: of those bands. So what the heck does all that mean? Well, 509 00:30:45,560 --> 00:30:48,120 Speaker 1: the purpose of dividing and rounding is just to simplify 510 00:30:48,160 --> 00:30:50,880 Speaker 1: the data, to reduce the amount you need in order 511 00:30:50,920 --> 00:30:53,680 Speaker 1: to store the information. So let's go with a totally 512 00:30:53,760 --> 00:30:57,560 Speaker 1: hypothetical example. Let's say you've got a scale factor band 513 00:30:58,360 --> 00:31:01,240 Speaker 1: and you've decided you're representing that scale factor 514 00:31:01,320 --> 00:31:05,280 Speaker 1: band with the quantity seven thousand, 515 00:31:05,360 --> 00:31:08,880 Speaker 1: eight hundred forty, and you've chosen the number one hundred 516 00:31:08,920 --> 00:31:12,480 Speaker 1: to quantize your data, meaning that you will divide each 517 00:31:13,400 --> 00:31:18,160 Speaker 1: scale factor band's quantity by one hundred. So this 518 00:31:18,200 --> 00:31:20,560 Speaker 1: is seven thousand, eight hundred forty.
You divide it by 519 00:31:20,680 --> 00:31:24,440 Speaker 1: one hundred, and the scale factor for this particular 520 00:31:24,480 --> 00:31:28,080 Speaker 1: band, you have determined, is one point zero. That means 521 00:31:28,160 --> 00:31:31,360 Speaker 1: that once you get that result, where you've divided the 522 00:31:31,440 --> 00:31:34,560 Speaker 1: quantity by the quantizer, you multiply by one. That means 523 00:31:34,560 --> 00:31:36,880 Speaker 1: there's no change. You multiply by one, you get the 524 00:31:36,960 --> 00:31:40,080 Speaker 1: same number. More on that in a bit. Okay. So 525 00:31:40,120 --> 00:31:42,680 Speaker 1: you take that seven thousand, eight hundred forty, you divide 526 00:31:42,720 --> 00:31:46,520 Speaker 1: it by one hundred. That gives you seventy eight point four. Well, 527 00:31:46,600 --> 00:31:48,680 Speaker 1: now you have to round that number, so you round 528 00:31:48,680 --> 00:31:51,520 Speaker 1: it down to seventy eight. Now, when you have a 529 00:31:51,560 --> 00:31:54,240 Speaker 1: decoder and you're ready to play back the information, it 530 00:31:54,320 --> 00:31:59,040 Speaker 1: comes across this quantity, seventy eight, and it knows what 531 00:31:59,200 --> 00:32:02,760 Speaker 1: the quantizer number was, so it multiplies by one hundred 532 00:32:02,800 --> 00:32:05,720 Speaker 1: to get back to seven thousand, eight hundred. So the 533 00:32:05,800 --> 00:32:09,720 Speaker 1: replicated number is actually forty off from the original number. 534 00:32:09,760 --> 00:32:12,800 Speaker 1: The original number, again, was seven thousand, eight hundred forty. 535 00:32:13,080 --> 00:32:16,560 Speaker 1: The replicated number is seven thousand, eight hundred. Now, those 536 00:32:16,600 --> 00:32:21,920 Speaker 1: inconsistencies manifest as noise in the actual playback.
So if 537 00:32:21,920 --> 00:32:24,840 Speaker 1: you wanted to increase the precision of any given scale 538 00:32:24,840 --> 00:32:27,200 Speaker 1: factor band, you could do so by changing the scale 539 00:32:27,200 --> 00:32:30,080 Speaker 1: factor number. So in that example just now, I said 540 00:32:30,120 --> 00:32:32,680 Speaker 1: the number was one point zero, meaning there's no change 541 00:32:32,840 --> 00:32:36,160 Speaker 1: to that result. But I could have said it was ten, 542 00:32:36,640 --> 00:32:39,280 Speaker 1: which means we would multiply the quantized number by ten. 543 00:32:39,640 --> 00:32:41,720 Speaker 1: So we would take that seven thousand, eight hundred forty, 544 00:32:41,840 --> 00:32:44,040 Speaker 1: divide it by one hundred to get seventy eight point four, 545 00:32:44,520 --> 00:32:48,120 Speaker 1: then multiply by ten to get seven hundred eighty four. So 546 00:32:48,880 --> 00:32:52,160 Speaker 1: when the decoder decompresses the file, it would reverse this 547 00:32:52,160 --> 00:32:55,400 Speaker 1: whole thing, dividing by the scale factor of ten and multiplying by the quantizer of one hundred. 548 00:32:55,440 --> 00:32:57,720 Speaker 1: You would end up getting seven thousand, eight hundred forty again, 549 00:32:57,800 --> 00:33:00,680 Speaker 1: which means that you wouldn't introduce any noise to the file. 550 00:33:00,720 --> 00:33:04,040 Speaker 1: You would have a perfect representation. But in some cases, 551 00:33:04,040 --> 00:33:07,560 Speaker 1: the encoder may determine that any noise that you generate 552 00:33:07,880 --> 00:33:11,000 Speaker 1: wouldn't be noticed, or it wouldn't impact the quality of 553 00:33:11,000 --> 00:33:13,520 Speaker 1: the audio enough for it to be a problem, because 554 00:33:13,520 --> 00:33:16,440 Speaker 1: of other factors for that particular scale factor band. Like, 555 00:33:16,520 --> 00:33:20,000 Speaker 1: maybe it's really quiet, or maybe it's really complex.
So 556 00:33:20,040 --> 00:33:22,920 Speaker 1: in those cases, you could reduce the scale factor number 557 00:33:23,320 --> 00:33:26,120 Speaker 1: by making it something else, like point one instead of 558 00:33:26,160 --> 00:33:28,720 Speaker 1: one point zero. So that means you would multiply the 559 00:33:28,800 --> 00:33:32,400 Speaker 1: quantized number by point one. So the seventy eight point 560 00:33:32,440 --> 00:33:35,240 Speaker 1: four would become seven point eight four, and then you 561 00:33:35,280 --> 00:33:37,320 Speaker 1: have to round it to get a whole integer, so 562 00:33:37,360 --> 00:33:41,320 Speaker 1: you get eight. Seven point eight four rounds up to eight. Now, 563 00:33:41,320 --> 00:33:44,880 Speaker 1: when a decoder decompresses the audio and multiplies eight 564 00:33:44,920 --> 00:33:48,200 Speaker 1: by one hundred, that quantizer that we've talked about so much, 565 00:33:49,120 --> 00:33:51,200 Speaker 1: well, actually at this point it would have 566 00:33:51,200 --> 00:33:53,680 Speaker 1: to be eight thousand, because it's also taking into account 567 00:33:53,680 --> 00:33:57,520 Speaker 1: the scale factor, so it's effectively multiplying by a thousand, 568 00:33:57,520 --> 00:34:01,760 Speaker 1: not just a hundred. So you would get a number 569 00:34:01,800 --> 00:34:04,440 Speaker 1: that pops up to eight thousand. And remember, the 570 00:34:04,440 --> 00:34:06,800 Speaker 1: original was seven thousand, eight hundred forty. So you look 571 00:34:06,800 --> 00:34:09,640 Speaker 1: at the difference between these two, the original seven thousand, eight hundred forty 572 00:34:09,719 --> 00:34:12,240 Speaker 1: and the new number, eight thousand. There's a pretty 573 00:34:12,239 --> 00:34:15,040 Speaker 1: big difference there. That change might introduce enough noise for 574 00:34:15,040 --> 00:34:17,240 Speaker 1: it to be a problem.
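That whole worked example, the band value of seven thousand, eight hundred forty, the quantizer of one hundred, and the three scale factors, runs like this as a sketch:

```python
# Encode: divide by the quantizer, apply the scale factor, round to a
# whole number. Decode: undo the scale factor, multiply the quantizer
# back in. The rounding step is where the noise comes from.
def encode(value, quantizer, scale):
    return round(value / quantizer * scale)

def decode(stored, quantizer, scale):
    return stored / scale * quantizer

original, quantizer = 7840, 100
for scale in (1.0, 10, 0.1):
    stored = encode(original, quantizer, scale)
    restored = decode(stored, quantizer, scale)
    print(f"scale {scale}: stored {stored}, "
          f"restored {restored:.0f}, noise {abs(restored - original):.0f}")
```

Scale factor one point zero stores 78 and restores 7,800 (noise of 40); scale factor ten stores 784 and restores 7,840 exactly; scale factor point one stores just 8 but restores 8,000 (noise of 160), which is the trade the encoder is weighing.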
So how does the encoder 575 00:34:17,280 --> 00:34:20,400 Speaker 1: determine if a scale factor band is meeting the proper criteria? 576 00:34:20,440 --> 00:34:25,319 Speaker 1: How can it tell if there is too much 577 00:34:25,400 --> 00:34:28,799 Speaker 1: noise, or if the noise falls below the threshold? Well, 578 00:34:28,840 --> 00:34:32,360 Speaker 1: it goes through what's called a Huffman coding process. 579 00:34:32,440 --> 00:34:37,160 Speaker 1: At this point, Dylan is currently just staring at the 580 00:34:37,160 --> 00:34:41,480 Speaker 1: wall and drool is coming out. The Huffman coding process 581 00:34:41,520 --> 00:34:45,160 Speaker 1: converts scale factor bands into binary strings, and the process 582 00:34:45,200 --> 00:34:47,160 Speaker 1: goes through a series of tables to determine if the 583 00:34:47,239 --> 00:34:50,160 Speaker 1: data within the scale factor band requires more or less 584 00:34:50,200 --> 00:34:53,200 Speaker 1: precision to describe the sound without affecting the audio quality. 585 00:34:54,320 --> 00:34:56,719 Speaker 1: So Huffman coding is a process where you start 586 00:34:56,760 --> 00:34:58,880 Speaker 1: with a large number of possibilities and you begin to 587 00:34:58,960 --> 00:35:01,880 Speaker 1: narrow it down. Some people describe it as the 588 00:35:01,920 --> 00:35:05,719 Speaker 1: coding equivalent of twenty questions. So you ask your first 589 00:35:05,800 --> 00:35:08,960 Speaker 1: question, like animal, vegetable, or mineral. You get an answer: 590 00:35:09,080 --> 00:35:12,640 Speaker 1: animal. Well, that first answer eliminates a ton of 591 00:35:12,680 --> 00:35:16,400 Speaker 1: other possibilities and narrows the focus. Like, anything that doesn't 592 00:35:16,400 --> 00:35:20,120 Speaker 1: pertain to animal, you can automatically discount, because you already 593 00:35:20,160 --> 00:35:25,280 Speaker 1: know it can't apply to that answer.
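Here's a minimal Huffman coder to make the twenty-questions idea concrete. MP3 actually selects from fixed, predefined Huffman tables rather than building a tree per file, so this sketch only shows the core principle: frequent values get short bit strings, rare ones get long ones.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a prefix code: common symbols get the shortest bit strings."""
    heap = [(count, i, {sym: ""}) for i, (sym, count)
            in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two rarest subtrees; their codes each grow by one bit.
        c0, _, left = heapq.heappop(heap)
        c1, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c0 + c1, next_id, merged))
        next_id += 1
    return heap[0][2]

# A quantized band is often mostly zeros and small values.
band = [0] * 20 + [1] * 5 + [7, -3]
codes = huffman_codes(band)
bits = sum(len(codes[s]) for s in band)
print(f"{bits} bits, versus {len(band) * 4} at a fixed 4 bits per value")
```

The twenty zeros cost one bit each, and only the rare values pay for long codes, which is why this kind of coding shrinks quantized audio data so well.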
With MP3 compression, 594 00:35:25,320 --> 00:35:28,319 Speaker 1: this means making certain the number of bits representing a 595 00:35:28,360 --> 00:35:33,160 Speaker 1: granule, because, remember, I mentioned that in the MP3 format 596 00:35:33,280 --> 00:35:36,400 Speaker 1: you have frames, and each frame has one thousand, 597 00:35:36,400 --> 00:35:40,000 Speaker 1: one hundred fifty two samples and consists of two granules 598 00:35:40,000 --> 00:35:43,840 Speaker 1: with five hundred seventy six samples each. So when you answer the first question, 599 00:35:43,960 --> 00:35:46,640 Speaker 1: it eliminates a lot of other possibilities and narrows the focus. 600 00:35:46,640 --> 00:35:49,800 Speaker 1: So like with animal, vegetable, mineral: if I say animal, 601 00:35:49,920 --> 00:35:52,840 Speaker 1: you're not gonna ask any questions that have to do 602 00:35:52,880 --> 00:35:56,480 Speaker 1: with minerals or vegetables, only because it wouldn't make sense. 603 00:35:57,239 --> 00:35:59,400 Speaker 1: You know, those aren't gonna apply. Same thing with 604 00:35:59,440 --> 00:36:02,160 Speaker 1: MP3s, except this time it means making certain the 605 00:36:02,239 --> 00:36:05,799 Speaker 1: number of bits representing a granule. Remember, there are two granules 606 00:36:05,800 --> 00:36:09,680 Speaker 1: per frame with the MP3 layer. You want 607 00:36:09,680 --> 00:36:12,759 Speaker 1: to make sure that the number of bits representing that 608 00:36:12,800 --> 00:36:16,319 Speaker 1: granule match the chosen bit rate for compression. So 609 00:36:16,360 --> 00:36:18,640 Speaker 1: if, after going through this process, the encoder says, hey, 610 00:36:18,640 --> 00:36:21,839 Speaker 1: this granule has more bits than what's allowed, it's too 611 00:36:21,840 --> 00:36:24,680 Speaker 1: many bits.
We've gotta get rid of some of these, so 612 00:36:24,840 --> 00:36:27,200 Speaker 1: the encoder can adjust the scale factor band so that 613 00:36:27,239 --> 00:36:31,560 Speaker 1: there's less precision, meaning that multiplier, in other words, that 614 00:36:32,120 --> 00:36:35,480 Speaker 1: bit I talked about earlier, and thus reduce the amount 615 00:36:35,480 --> 00:36:40,120 Speaker 1: of data needed to represent that particular granule. If a 616 00:36:40,160 --> 00:36:44,120 Speaker 1: granule comes in under the bit rate, the encoder can 617 00:36:44,160 --> 00:36:48,320 Speaker 1: increase the precision to reduce noise and fill that granule 618 00:36:48,440 --> 00:36:55,040 Speaker 1: out properly so that it matches the actual threshold. After all this, 619 00:36:55,160 --> 00:36:58,360 Speaker 1: the pairs of granules become frames within the MP3 file, 620 00:36:58,360 --> 00:37:01,280 Speaker 1: and the only other component in an MP3 file apart 621 00:37:01,320 --> 00:37:04,719 Speaker 1: from these frames is the ID3 metadata. And 622 00:37:04,719 --> 00:37:06,799 Speaker 1: this is pretty simple. This is like a header, and 623 00:37:06,840 --> 00:37:09,080 Speaker 1: it comes before all the frames in the audio file 624 00:37:09,160 --> 00:37:13,000 Speaker 1: and contains information about the file itself, which can 625 00:37:13,000 --> 00:37:15,719 Speaker 1: include stuff like the title of a song, an artist name, 626 00:37:15,840 --> 00:37:19,640 Speaker 1: an album title, other stuff like that. It can also 627 00:37:19,680 --> 00:37:23,080 Speaker 1: include copyright information, as well as details 628 00:37:23,160 --> 00:37:25,440 Speaker 1: such as whether or not it's a stereo recording or a 629 00:37:25,440 --> 00:37:29,279 Speaker 1: mono recording.
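That fit-the-bit-rate adjustment can be sketched as a simple loop. The bit-cost model below is invented purely for illustration; a real encoder counts bits against the actual Huffman tables:

```python
# Toy rate loop: lower the scale factor (less precision, smaller stored
# integers, fewer bits) until the band fits the granule's bit budget.
def bits_needed(values, scale):
    # Crude cost model: each stored integer costs its binary bit length.
    return sum(max(1, round(abs(v) * scale)).bit_length() for v in values)

def fit_to_budget(values, budget):
    scale = 1.0
    while bits_needed(values, scale) > budget and scale > 1e-6:
        scale /= 2  # halve the precision, accepting a bit more noise
    return scale

band = [78.4, 12.0, 3.3, 0.9] * 4  # hypothetical band values
scale = fit_to_budget(band, budget=40)
print(f"scale {scale} fits: {bits_needed(band, scale)} bits")
```

Going the other direction, a granule that comes in under budget leaves room to raise the scale factor and buy back precision, which is the noise-reducing case Jonathan describes.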
So when you use a decoder, like an 630 00:37:29,360 --> 00:37:34,720 Speaker 1: MP3 player, it takes this compressed information, these 631 00:37:34,719 --> 00:37:40,960 Speaker 1: representations that the music has been reduced to, and 632 00:37:41,040 --> 00:37:44,520 Speaker 1: it converts that Huffman data back into the quantized format, 633 00:37:45,080 --> 00:37:47,759 Speaker 1: then scales the data back up to its original size, or a 634 00:37:47,800 --> 00:37:53,640 Speaker 1: close approximation. Remember, the decompressed version may actually be 635 00:37:53,719 --> 00:37:58,280 Speaker 1: off by a significant amount, depending upon each individual granule. 636 00:37:58,840 --> 00:38:01,080 Speaker 1: And all of that data gets combined into a new 637 00:38:01,160 --> 00:38:04,200 Speaker 1: PCM sample that can be played back to you. And 638 00:38:04,320 --> 00:38:07,120 Speaker 1: that's all there is to it. Nothing could be easier. 639 00:38:08,320 --> 00:38:11,920 Speaker 1: All right. That took a lot out of me. 640 00:38:11,960 --> 00:38:14,320 Speaker 1: I got really technical, and I apologize if I lost 641 00:38:14,360 --> 00:38:16,600 Speaker 1: any of you out there, or, for those of you 642 00:38:16,600 --> 00:38:19,160 Speaker 1: who have a lot of experience working on compression algorithms, 643 00:38:19,160 --> 00:38:23,040 Speaker 1: for oversimplifying in several cases. But now we've got a 644 00:38:23,040 --> 00:38:25,520 Speaker 1: full episode about this, and I hope you have a 645 00:38:25,520 --> 00:38:28,640 Speaker 1: better understanding of how a big sound file can be 646 00:38:28,719 --> 00:38:32,880 Speaker 1: reduced to a smaller sound file. Next time, I'll just 647 00:38:32,920 --> 00:38:36,160 Speaker 1: say magic. It will make everyone happier.
If you guys 648 00:38:36,200 --> 00:38:39,320 Speaker 1: have any questions for me, or comments or suggestions, anything 649 00:38:39,360 --> 00:38:42,480 Speaker 1: like that, send me a message. My email is tech 650 00:38:42,520 --> 00:38:45,520 Speaker 1: Stuff at how stuff works dot com, or you can 651 00:38:45,560 --> 00:38:48,120 Speaker 1: drop me a line on Facebook or Twitter. The handle at 652 00:38:48,239 --> 00:38:51,359 Speaker 1: both of those is tech Stuff H S W, and 653 00:38:51,400 --> 00:38:59,919 Speaker 1: I'll talk to you guys again really soon. For more 654 00:39:00,000 --> 00:39:02,279 Speaker 1: on this and thousands of other topics, visit how 655 00:39:02,320 --> 00:39:08,640 Speaker 1: stuff works dot com.