1 00:00:04,160 --> 00:00:07,160 Speaker 1: Get in touch with technology with TechStuff from HowStuffWorks 2 00:00:07,240 --> 00:00:14,040 Speaker 1: dot com. Hey there, and welcome to TechStuff. 3 00:00:14,080 --> 00:00:17,520 Speaker 1: I'm your host, Jonathan Strickland. And in a recent episode 4 00:00:17,560 --> 00:00:20,560 Speaker 1: I explored how digital audio works and gave kind of 5 00:00:20,560 --> 00:00:24,639 Speaker 1: a brief history on the MP three file format. I 6 00:00:24,760 --> 00:00:27,680 Speaker 1: warned you back then that that was part one of 7 00:00:27,720 --> 00:00:30,760 Speaker 1: a three part series, and today we're gonna explore part two. 8 00:00:31,440 --> 00:00:34,599 Speaker 1: So I hadn't forgotten about it. We're back to it, uh, 9 00:00:34,640 --> 00:00:36,440 Speaker 1: and today we're gonna do a deeper dive with MP 10 00:00:36,479 --> 00:00:39,959 Speaker 1: threes and how they compress audio. How 11 00:00:39,960 --> 00:00:42,239 Speaker 1: can you take a file filled with information and make 12 00:00:42,280 --> 00:00:44,920 Speaker 1: it a smaller size? What do you have to give 13 00:00:45,080 --> 00:00:48,159 Speaker 1: up in order to make files smaller? Today we're 14 00:00:48,159 --> 00:00:51,280 Speaker 1: gonna try and unravel the technical mystery behind the MP 15 00:00:51,400 --> 00:00:54,760 Speaker 1: three. And I am not going to lie to you people. 16 00:00:55,720 --> 00:01:01,240 Speaker 1: This is gonna get a bit, you know, mathy. 17 00:01:01,440 --> 00:01:04,759 Speaker 1: And I was an English major, so you mathematicians out there, 18 00:01:04,760 --> 00:01:07,400 Speaker 1: get ready with your corrections, because I'm probably gonna make 19 00:01:07,440 --> 00:01:10,760 Speaker 1: some overgeneralizations for the purposes of my own sanity.
20 00:01:11,280 --> 00:01:14,160 Speaker 1: There does get to be a point where, to really get 21 00:01:14,200 --> 00:01:19,000 Speaker 1: into the technical details, it would likely be, uh, impossible 22 00:01:19,040 --> 00:01:21,080 Speaker 1: for me to describe it in a way that would 23 00:01:21,080 --> 00:01:25,880 Speaker 1: make sense and be accurate. Um, and I have given 24 00:01:26,120 --> 00:01:30,399 Speaker 1: my producer Dylan the mandate that, should I get too 25 00:01:31,120 --> 00:01:36,200 Speaker 1: cryptic and incomprehensible with my explanation, he is to 26 00:01:36,240 --> 00:01:40,200 Speaker 1: intervene in a way that he sees fit. Just not 27 00:01:40,240 --> 00:01:44,120 Speaker 1: in the face, Dylan. Not in the face. It's the moneymaker, man. 28 00:01:44,240 --> 00:01:47,120 Speaker 1: I gotta take care of it. So let's remember 29 00:01:47,160 --> 00:01:52,320 Speaker 1: that the heart of digital information is the bit, which is 30 00:01:52,320 --> 00:01:56,440 Speaker 1: either a zero or a one. The basic unit of 31 00:01:56,960 --> 00:02:01,920 Speaker 1: information for digital formats: zeros and ones. Now we can 32 00:02:02,000 --> 00:02:05,160 Speaker 1: use those zeros and ones to describe all sorts of information, 33 00:02:05,800 --> 00:02:09,280 Speaker 1: from text to audio to video, and really pretty much 34 00:02:09,280 --> 00:02:12,240 Speaker 1: anything you can think of that's represented digitally. Ultimately, when 35 00:02:12,240 --> 00:02:14,000 Speaker 1: you get down to it, it's a bunch of zeros 36 00:02:14,040 --> 00:02:17,000 Speaker 1: and ones. So let's say you start off with your 37 00:02:17,080 --> 00:02:21,520 Speaker 1: uncompressed audio file. You've got this enormous audio file in 38 00:02:21,520 --> 00:02:23,560 Speaker 1: front of you. It's made up of zeros and ones. 39 00:02:24,080 --> 00:02:26,840 Speaker 1: How do you make that file smaller?
So in the 40 00:02:26,840 --> 00:02:29,560 Speaker 1: real world, we can compress stuff, right? We can apply 41 00:02:29,800 --> 00:02:33,760 Speaker 1: physical pressure to things. Think about packing a suitcase. You 42 00:02:33,760 --> 00:02:36,240 Speaker 1: can make sure you get that extra outfit in if 43 00:02:36,280 --> 00:02:38,600 Speaker 1: you just press it down hard enough and get that 44 00:02:38,680 --> 00:02:42,240 Speaker 1: zipper zipped before it can burst open. But once you 45 00:02:42,280 --> 00:02:44,920 Speaker 1: get to a certain level of compression, you cannot make 46 00:02:45,080 --> 00:02:48,600 Speaker 1: things smaller, at least not without hurting yourself or whatever 47 00:02:48,639 --> 00:02:51,720 Speaker 1: it is you're trying to compress. Digital files are a 48 00:02:51,720 --> 00:02:55,400 Speaker 1: little different because you cannot physically cram the zeros and 49 00:02:55,520 --> 00:02:58,120 Speaker 1: ones closer together. It doesn't work like that. These are 50 00:02:58,240 --> 00:03:02,600 Speaker 1: abstract things. You can't make them smaller, right? You can't 51 00:03:02,720 --> 00:03:06,000 Speaker 1: decrease the font. It doesn't work that way. The numbers 52 00:03:06,040 --> 00:03:09,240 Speaker 1: represent two different states. So if you want to create 53 00:03:09,240 --> 00:03:12,840 Speaker 1: a smaller audio file containing the recording that was in 54 00:03:12,880 --> 00:03:17,680 Speaker 1: a larger audio file, you have to start getting creative. Now,
55 00:03:17,720 --> 00:03:20,120 Speaker 1: In the last part of this series, I talked about 56 00:03:20,160 --> 00:03:22,920 Speaker 1: how the MP three compression algorithm was born from an 57 00:03:22,960 --> 00:03:26,600 Speaker 1: applied research institution in Germany, and the team behind the 58 00:03:26,720 --> 00:03:29,040 Speaker 1: MP three wanted to find a way to compress audio, 59 00:03:29,160 --> 00:03:34,800 Speaker 1: specifically music, for transmission over phone lines. Eventually, this evolved 60 00:03:34,840 --> 00:03:39,480 Speaker 1: into the Moving Picture Experts Group Audio Layer three compression methodology, 61 00:03:39,680 --> 00:03:44,560 Speaker 1: better known as the MP three, and there's also MPEG 62 00:03:44,640 --> 00:03:47,360 Speaker 1: two and MPEG four standards. MPEG two, by the way, 63 00:03:47,400 --> 00:03:50,320 Speaker 1: is the basis of compression on DVDs, although the actual 64 00:03:50,440 --> 00:03:54,720 Speaker 1: DVD format is really a modification of MPEG two, and 65 00:03:54,840 --> 00:03:57,360 Speaker 1: MPEG four is a compression strategy for audio and video 66 00:03:57,400 --> 00:04:00,840 Speaker 1: that's frequently used in lots of different capacities, including 67 00:04:00,880 --> 00:04:05,160 Speaker 1: streaming media services. So by the late nineteen seventies, researchers 68 00:04:05,200 --> 00:04:08,720 Speaker 1: began to explore the possibility of leveraging psychoacoustics to 69 00:04:08,760 --> 00:04:12,960 Speaker 1: figure out how to compress audio. And psychoacoustics refers to 70 00:04:13,200 --> 00:04:17,120 Speaker 1: the way we perceive sound, uh, and also the 71 00:04:17,120 --> 00:04:21,360 Speaker 1: physiological effects of sound on us. So this involves not 72 00:04:21,480 --> 00:04:24,640 Speaker 1: just our physical sense of hearing, but also our 73 00:04:24,680 --> 00:04:28,400 Speaker 1: brains and the way our brains interpret sound.
So, for example, 74 00:04:28,720 --> 00:04:32,480 Speaker 1: there's a psychoacoustic phenomenon that's called the Haas effect, H 75 00:04:32,640 --> 00:04:35,560 Speaker 1: A A S. And I think it's pretty interesting. So 76 00:04:35,760 --> 00:04:38,200 Speaker 1: here's how the Haas effect works. If you hear the 77 00:04:38,279 --> 00:04:43,280 Speaker 1: exact same sound coming from different directions, but the two 78 00:04:43,279 --> 00:04:46,640 Speaker 1: sounds arrive within thirty to forty milliseconds of each other, 79 00:04:47,040 --> 00:04:50,000 Speaker 1: your brain will be convinced that you really only heard 80 00:04:50,040 --> 00:04:53,440 Speaker 1: one sound and it came from the direction that hit 81 00:04:53,520 --> 00:04:57,200 Speaker 1: you first. So let's say a sound's coming from directly 82 00:04:57,240 --> 00:04:59,680 Speaker 1: in front of you and from your left, and you 83 00:05:00,080 --> 00:05:03,480 Speaker 1: get both of them within that thirty to forty millisecond range, 84 00:05:04,279 --> 00:05:06,440 Speaker 1: and you hear the one coming from ahead of you 85 00:05:06,520 --> 00:05:10,039 Speaker 1: first. You're convinced that you only heard that 86 00:05:10,120 --> 00:05:13,080 Speaker 1: sound once and it came from dead on straight ahead 87 00:05:13,080 --> 00:05:16,680 Speaker 1: of you. Your brain kind of discounts the one that 88 00:05:16,760 --> 00:05:20,159 Speaker 1: came from the left, although it can reinforce it, 89 00:05:20,279 --> 00:05:22,520 Speaker 1: which ends up being really useful if you're planning out 90 00:05:22,520 --> 00:05:25,279 Speaker 1: PA systems for stage shows. I'm not joking. That 91 00:05:25,320 --> 00:05:28,080 Speaker 1: really is the way that people plan those things out. 92 00:05:28,360 --> 00:05:31,080 Speaker 1: It's pretty neat. Humans perceive sounds in a way that's 93 00:05:31,080 --> 00:05:35,200 Speaker 1: not necessarily representational of all the sounds surrounding us.
You 94 00:05:35,200 --> 00:05:38,600 Speaker 1: can think of your brain as the filter between your 95 00:05:38,720 --> 00:05:42,679 Speaker 1: understanding and what reality actually is. A lot of stuff 96 00:05:42,720 --> 00:05:45,599 Speaker 1: goes on, and it ends up getting rid of information 97 00:05:45,640 --> 00:05:48,040 Speaker 1: when your brain just says, you know what, he or 98 00:05:48,080 --> 00:05:52,040 Speaker 1: she doesn't need that, it's just gonna confuse things. We're 99 00:05:52,040 --> 00:05:55,400 Speaker 1: gonna dump it. And that's kind of how it works. 100 00:05:55,440 --> 00:05:57,599 Speaker 1: It's all on an unconscious level. It's not like you're 101 00:05:57,800 --> 00:06:01,919 Speaker 1: actively working to do this. So let's say you're in 102 00:06:01,920 --> 00:06:04,320 Speaker 1: a relatively busy hallway, and there could be a lot 103 00:06:04,360 --> 00:06:07,800 Speaker 1: of sounds in that hallway, stuff that's going on constantly 104 00:06:07,800 --> 00:06:11,000 Speaker 1: around you. Maybe there are doors opening and closing. Maybe 105 00:06:11,040 --> 00:06:13,960 Speaker 1: there are footsteps going up and down the hallway. Maybe someone's 106 00:06:14,080 --> 00:06:17,719 Speaker 1: shoes are squeaking against the linoleum floor. People are chattering 107 00:06:17,760 --> 00:06:20,839 Speaker 1: away in there. But you are having a conversation with someone, 108 00:06:21,240 --> 00:06:23,960 Speaker 1: so you turn your focus on that person and other 109 00:06:24,040 --> 00:06:28,200 Speaker 1: sounds seemingly fade away. They're still present, but they're not important. 110 00:06:28,800 --> 00:06:31,520 Speaker 1: So in this example, you would actually call those other 111 00:06:31,600 --> 00:06:35,479 Speaker 1: sounds a distraction and you would really focus on the conversation. Uh,
112 00:06:35,520 --> 00:06:40,000 Speaker 1: That also shows how we're able to consciously direct our 113 00:06:40,080 --> 00:06:43,719 Speaker 1: sense, our perception of hearing. So both of these factors 114 00:06:43,760 --> 00:06:47,120 Speaker 1: come into play. Now, one thing that MP three encoding 115 00:06:47,160 --> 00:06:51,080 Speaker 1: takes advantage of is something called masking, and there are 116 00:06:51,080 --> 00:06:54,120 Speaker 1: a couple of different variations of the masking effect. One 117 00:06:54,160 --> 00:06:57,520 Speaker 1: of them is called frequency masking. So let's say you've 118 00:06:57,560 --> 00:07:00,480 Speaker 1: got two sound frequencies that are similar, perhaps they're just 119 00:07:00,520 --> 00:07:04,200 Speaker 1: a few hertz apart. Remember, frequencies are measured in hertz, 120 00:07:04,680 --> 00:07:08,520 Speaker 1: which is really the number of oscillations per second. So 121 00:07:08,640 --> 00:07:14,040 Speaker 1: let's say you've got a sound that's at, I don't know, uh, 122 00:07:14,360 --> 00:07:19,360 Speaker 1: one thousand hertz, and another one that's at one 123 00:07:19,480 --> 00:07:23,560 Speaker 1: thousand and ten hertz. Now, the human ear is 124 00:07:23,600 --> 00:07:26,920 Speaker 1: precise enough to be able to tell the difference between 125 00:07:27,040 --> 00:07:29,840 Speaker 1: two sounds that are at least two hertz apart from 126 00:07:29,840 --> 00:07:33,360 Speaker 1: each other. That's how precise our resolution of hearing is. 127 00:07:33,440 --> 00:07:36,760 Speaker 1: It's at that level.
But if you get two sounds 128 00:07:36,840 --> 00:07:40,520 Speaker 1: played at the same time and they are that close 129 00:07:40,560 --> 00:07:44,080 Speaker 1: together in frequency, and one of those frequencies is played 130 00:07:44,120 --> 00:07:47,280 Speaker 1: at a greater volume than the other, our brains will 131 00:07:47,280 --> 00:07:50,160 Speaker 1: pick up on the louder sound and ignore the quieter sound, 132 00:07:50,240 --> 00:07:53,880 Speaker 1: even though both of them are present. What becomes important 133 00:07:53,880 --> 00:07:56,520 Speaker 1: at that point is the amplitude. Now, the further apart 134 00:07:56,560 --> 00:08:00,400 Speaker 1: in frequencies you get, the less that has an effect. 135 00:08:00,480 --> 00:08:02,360 Speaker 1: So if you get far enough apart, where there are 136 00:08:02,360 --> 00:08:05,680 Speaker 1: two pitches, one of them noticeably louder than the other, 137 00:08:06,040 --> 00:08:08,320 Speaker 1: but they're far enough apart, you will hear both of them. 138 00:08:08,360 --> 00:08:11,560 Speaker 1: It only works if the two pitches are relatively close together, 139 00:08:12,680 --> 00:08:15,560 Speaker 1: and there's not a universal formula for frequency masking. As 140 00:08:15,560 --> 00:08:18,520 Speaker 1: you get closer to the boundaries of human hearing, frequency 141 00:08:18,560 --> 00:08:20,920 Speaker 1: masking becomes easier. So if it's a really low pitch 142 00:08:21,000 --> 00:08:23,600 Speaker 1: or a really high pitch, it's easier to get away 143 00:08:23,600 --> 00:08:26,400 Speaker 1: with it.
Once you start getting into what is 144 00:08:26,400 --> 00:08:28,960 Speaker 1: thought of as the sweet spot for human hearing, which 145 00:08:29,000 --> 00:08:32,120 Speaker 1: is generally considered to be between two and five kilohertz, 146 00:08:33,200 --> 00:08:37,200 Speaker 1: you need a greater difference in volume or a smaller 147 00:08:37,240 --> 00:08:41,640 Speaker 1: difference in frequency in order for masking to work. Frequency 148 00:08:41,720 --> 00:08:45,480 Speaker 1: masking, at any rate. But then there's also temporal masking, 149 00:08:46,600 --> 00:08:48,880 Speaker 1: and you might say, okay, I got it. Temporal, that 150 00:08:48,920 --> 00:08:53,040 Speaker 1: means time. Indeed it does, my friend. This describes the 151 00:08:53,040 --> 00:08:56,040 Speaker 1: effect of a short but loud sound masking a softer 152 00:08:56,120 --> 00:09:00,360 Speaker 1: sound for a short time. Weird thing is, the loud 153 00:09:00,360 --> 00:09:03,960 Speaker 1: sound can actually mask sounds that precede it slightly, not 154 00:09:04,040 --> 00:09:06,760 Speaker 1: by a whole lot, but a little bit. MP three 155 00:09:06,760 --> 00:09:10,880 Speaker 1: compression takes advantage of both frequency and temporal masking when 156 00:09:10,880 --> 00:09:14,079 Speaker 1: it's trying to determine which data needs to be included 157 00:09:14,160 --> 00:09:16,920 Speaker 1: and which data can be dumped, because it won't affect 158 00:09:16,960 --> 00:09:19,840 Speaker 1: your perception of whatever the audio file is in 159 00:09:19,840 --> 00:09:23,720 Speaker 1: the first place.
So you also probably remember I talked 160 00:09:23,720 --> 00:09:26,560 Speaker 1: about the physical limitation to what we humans can hear, 161 00:09:26,800 --> 00:09:28,920 Speaker 1: no matter what our brains might be up to. So 162 00:09:29,000 --> 00:09:31,400 Speaker 1: this doesn't have to do with our brains, you know, 163 00:09:31,480 --> 00:09:34,240 Speaker 1: filtering through the information that's coming in. This has to 164 00:09:34,280 --> 00:09:38,200 Speaker 1: do with the physical limitations of the human ear. In 165 00:09:38,240 --> 00:09:41,199 Speaker 1: the last episode of the series, I said typical human hearing, 166 00:09:41,840 --> 00:09:45,559 Speaker 1: keep in mind typical, there are exceptions, uh, covers the 167 00:09:45,679 --> 00:09:48,560 Speaker 1: range of frequencies between about twenty hertz and twenty 168 00:09:48,640 --> 00:09:52,000 Speaker 1: kilohertz, or twenty thousand hertz. So twenty to twenty thousand. 169 00:09:52,800 --> 00:09:57,280 Speaker 1: Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right? 170 00:09:58,080 --> 00:10:00,640 Speaker 1: And as you get older, your ability to perceive those 171 00:10:00,720 --> 00:10:05,040 Speaker 1: higher frequencies starts to diminish. So most adults actually have 172 00:10:05,320 --> 00:10:10,880 Speaker 1: an upper range closer to sixteen kilohertz, not twenty. Uh, 173 00:10:11,080 --> 00:10:13,480 Speaker 1: kids, they can hear those higher pitches.
You may have 174 00:10:13,600 --> 00:10:17,920 Speaker 1: heard the story about how some convenience stores experimented with 175 00:10:18,160 --> 00:10:23,600 Speaker 1: getting rid of teenage loiterers by, uh, projecting out 176 00:10:24,000 --> 00:10:27,280 Speaker 1: the super high pitches that adults could not hear 177 00:10:27,640 --> 00:10:30,600 Speaker 1: but kids could, and it discouraged kids from hanging out 178 00:10:30,640 --> 00:10:35,080 Speaker 1: at the convenience store and loitering. Um, I love that 179 00:10:35,200 --> 00:10:39,600 Speaker 1: idea so much. Anyway, that's because I'm old and my 180 00:10:39,640 --> 00:10:43,520 Speaker 1: hearing is terrible. Well, remember I also mentioned you can 181 00:10:43,559 --> 00:10:46,400 Speaker 1: detect changes in pitch at two hertz increments. If you 182 00:10:46,440 --> 00:10:48,960 Speaker 1: get below two hertz of change, like, if it's just 183 00:10:49,040 --> 00:10:54,600 Speaker 1: a one hertz difference between two frequencies, it's too fine 184 00:10:54,640 --> 00:10:56,800 Speaker 1: a resolution for us to detect. To us, it will 185 00:10:56,800 --> 00:11:01,040 Speaker 1: sound exactly the same. So if you were to hear 186 00:11:01,520 --> 00:11:06,800 Speaker 1: a frequency at one thousand one hertz, or one point 187 00:11:07,000 --> 00:11:10,800 Speaker 1: zero zero one kilohertz, and one point zero zero 188 00:11:10,840 --> 00:11:13,800 Speaker 1: two kilohertz, you wouldn't notice the difference. They would 189 00:11:13,840 --> 00:11:16,960 Speaker 1: sound exactly the same to you.
So if you're gonna 190 00:11:17,000 --> 00:11:19,240 Speaker 1: take audio and compress it, one step you could consider 191 00:11:19,360 --> 00:11:23,960 Speaker 1: is eliminating anything that's outside the actual range of frequencies 192 00:11:24,040 --> 00:11:27,560 Speaker 1: that we can hear, or simplifying any changes in frequency 193 00:11:27,640 --> 00:11:31,240 Speaker 1: that are smaller than two hertz. If you take 194 00:11:31,240 --> 00:11:34,760 Speaker 1: all that data and you say it is physically impossible 195 00:11:34,800 --> 00:11:38,439 Speaker 1: for a human to perceive this, get rid of that information, 196 00:11:38,559 --> 00:11:41,800 Speaker 1: then in theory it wouldn't have any effect on the 197 00:11:41,840 --> 00:11:46,120 Speaker 1: rest of the recording. But how do you go further than that? Right, 198 00:11:46,200 --> 00:11:48,959 Speaker 1: how do you create a method so that you can 199 00:11:49,000 --> 00:11:51,120 Speaker 1: really compress this file? You want a method that will 200 00:11:51,120 --> 00:11:54,439 Speaker 1: preserve the important sounds while potentially ignoring all the unimportant 201 00:11:54,520 --> 00:11:58,320 Speaker 1: or incidental sounds. And you want it to be automatic, because 202 00:11:58,760 --> 00:12:01,440 Speaker 1: if you have to do it manually, then that's going 203 00:12:01,520 --> 00:12:05,640 Speaker 1: to take countless hours just to edit a single sound file. 204 00:12:06,760 --> 00:12:10,959 Speaker 1: So that was the challenge that the MP three research 205 00:12:11,040 --> 00:12:16,040 Speaker 1: team faced as a group. Now, their solution, which ultimately 206 00:12:16,080 --> 00:12:18,559 Speaker 1: created even more challenges, was to come up with what 207 00:12:18,640 --> 00:12:22,480 Speaker 1: was essentially a simulated human ear and brain.
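As a rough illustration of that first step, dropping anything outside the range we can hear, here is a minimal Python sketch. The function name, the constants, and the list of (frequency, amplitude) pairs are all made up for illustration; a real MP three encoder does not work on a list like this, so treat it as a picture of the idea and nothing more.

```python
# Typical limits of human hearing, as quoted in the episode.
AUDIBLE_MIN_HZ = 20.0
AUDIBLE_MAX_HZ = 20_000.0


def drop_inaudible(components):
    """Keep only (frequency_hz, amplitude) pairs inside the audible range.

    `components` is a hypothetical, simplified stand-in for the frequency
    content of a recording.
    """
    return [
        (freq, amp)
        for freq, amp in components
        if AUDIBLE_MIN_HZ <= freq <= AUDIBLE_MAX_HZ
    ]


# Four made-up components: one too low, two audible, one too high.
components = [(5.0, 0.3), (440.0, 1.0), (15_000.0, 0.2), (25_000.0, 0.4)]
kept = drop_inaudible(components)
print(kept)
```

Only the two middle components survive, which is the whole point: the data we could never hear is the cheapest data to throw away.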
They needed 208 00:12:22,520 --> 00:12:27,880 Speaker 1: to replicate the experience of perceiving music so that an 209 00:12:27,880 --> 00:12:32,160 Speaker 1: algorithm could evaluate every sound in an audio file and 210 00:12:32,280 --> 00:12:35,359 Speaker 1: judge if it in fact was relevant enough to include 211 00:12:35,400 --> 00:12:39,720 Speaker 1: in the final compressed version. If a sound were imperceptible, 212 00:12:39,760 --> 00:12:41,600 Speaker 1: then it wouldn't make sense to include it in the 213 00:12:41,720 --> 00:12:44,720 Speaker 1: MP three file. So by leaving out all the irrelevant data, 214 00:12:44,760 --> 00:12:48,680 Speaker 1: they could make the audio information take up less bandwidth. 215 00:12:48,679 --> 00:12:51,240 Speaker 1: The file itself would be smaller because you just dumped 216 00:12:51,280 --> 00:12:54,880 Speaker 1: everything that wasn't important. So the team used an algorithm 217 00:12:55,000 --> 00:13:00,000 Speaker 1: called the low complexity adaptive transform coding, or LC-ATC, 218 00:13:00,160 --> 00:13:03,080 Speaker 1: as the foundation for their research. This 219 00:13:03,160 --> 00:13:06,440 Speaker 1: was kind of their starting point, and this is an 220 00:13:06,480 --> 00:13:10,120 Speaker 1: approach that tries to do away with redundancy as much 221 00:13:10,160 --> 00:13:15,199 Speaker 1: as possible. And it also incorporates adaptation to perceptual requirements. Also, 222 00:13:15,320 --> 00:13:19,199 Speaker 1: MP threes owe a lot to the MPEG Layer two standard. 223 00:13:19,760 --> 00:13:23,199 Speaker 1: So Layer two obviously came out before Layer three, 224 00:13:23,720 --> 00:13:26,199 Speaker 1: and so a lot of the features of Layer three 225 00:13:26,320 --> 00:13:31,760 Speaker 1: are really, um, they're legacy features from Layer two. Uh,
226 00:13:31,800 --> 00:13:34,000 Speaker 1: In other words, the MP three group kind of got stuck 227 00:13:34,000 --> 00:13:36,560 Speaker 1: with them because otherwise they would have had a problem 228 00:13:36,559 --> 00:13:39,880 Speaker 1: with backwards compatibility. So the result is kind of a 229 00:13:39,920 --> 00:13:43,400 Speaker 1: clunky arrangement under the hood, and some of the features 230 00:13:43,600 --> 00:13:46,160 Speaker 1: may make very little sense when I go through them, 231 00:13:46,600 --> 00:13:48,600 Speaker 1: but some of that is because it's a holdover 232 00:13:48,640 --> 00:13:53,280 Speaker 1: from an earlier compression strategy, which isn't terribly satisfying as 233 00:13:53,280 --> 00:13:55,559 Speaker 1: an answer. But the reason many parts of the MP 234 00:13:55,640 --> 00:13:57,840 Speaker 1: three compression algorithm are the way they are is because 235 00:13:57,880 --> 00:14:01,560 Speaker 1: that's the way we've always done it. So next I'm 236 00:14:01,600 --> 00:14:07,760 Speaker 1: gonna dive into the phases of compression. But before I 237 00:14:07,800 --> 00:14:10,680 Speaker 1: do that, let's all take a deep breath and take 238 00:14:10,720 --> 00:14:22,440 Speaker 1: a moment to thank our sponsor. And we're back. So 239 00:14:22,560 --> 00:14:25,080 Speaker 1: there are two big phases we'll need to talk about 240 00:14:25,160 --> 00:14:29,760 Speaker 1: with MP three compression. The first phase is analysis and 241 00:14:29,800 --> 00:14:33,960 Speaker 1: the second phase is the actual compression itself. And after 242 00:14:34,040 --> 00:14:37,080 Speaker 1: that there's the process of decoding an MP three for playback. 243 00:14:37,560 --> 00:14:40,120 Speaker 1: But that's way simpler once we get an understanding of 244 00:14:40,160 --> 00:14:45,920 Speaker 1: how the encoding process actually happens. So let's begin with analysis. Now,
245 00:14:45,960 --> 00:14:49,480 Speaker 1: This is the part where the standard has to figure 246 00:14:49,520 --> 00:14:53,800 Speaker 1: out which frequencies within an audio range, or recording rather, 247 00:14:53,920 --> 00:14:59,720 Speaker 1: are important or perceptible. So how does a program, an 248 00:14:59,760 --> 00:15:02,680 Speaker 1: encoder, figure out what we can hear and what 249 00:15:02,800 --> 00:15:06,160 Speaker 1: we cannot hear? All right, time to get technical. So 250 00:15:06,880 --> 00:15:10,440 Speaker 1: you start off with your pulse code modulation audio file, 251 00:15:10,720 --> 00:15:13,480 Speaker 1: or PCM file. And you might remember I talked about 252 00:15:13,480 --> 00:15:16,720 Speaker 1: PCM audio in the first episode of this series, but 253 00:15:16,840 --> 00:15:20,600 Speaker 1: just in case you don't, it's a lossless digital audio file. 254 00:15:20,680 --> 00:15:23,720 Speaker 1: The actual format could be a WAV or 255 00:15:23,720 --> 00:15:26,480 Speaker 1: AIFF or something along those lines, but the important thing 256 00:15:26,920 --> 00:15:31,080 Speaker 1: to keep in mind is that it is uncompressed. Now, 257 00:15:31,120 --> 00:15:33,560 Speaker 1: that means those files tend to be pretty big. This 258 00:15:33,640 --> 00:15:36,040 Speaker 1: is our raw material that we want to take and 259 00:15:36,120 --> 00:15:40,560 Speaker 1: squish down to a more manageable, transferable size. And in 260 00:15:40,640 --> 00:15:43,320 Speaker 1: our last episode in this series, I also mentioned 261 00:15:43,320 --> 00:15:46,680 Speaker 1: that the standard for CD audio is a sample 262 00:15:46,760 --> 00:15:49,880 Speaker 1: rate of forty four point one
kilohertz, and we 263 00:15:50,040 --> 00:15:52,680 Speaker 1: learned that you need a sample rate twice the frequency 264 00:15:52,840 --> 00:15:56,800 Speaker 1: of the highest frequency in your recording, and since human 265 00:15:56,840 --> 00:15:59,600 Speaker 1: hearing tops out at around twenty kilohertz, the standard 266 00:15:59,600 --> 00:16:02,520 Speaker 1: for CDs is forty four point one kilohertz. The 267 00:16:02,640 --> 00:16:05,640 Speaker 1: MP three standard can support lots of different sample rates, 268 00:16:05,720 --> 00:16:08,160 Speaker 1: but forty four point one kilohertz is pretty much 269 00:16:08,200 --> 00:16:12,600 Speaker 1: the common standard. So you've got a number of samples 270 00:16:12,680 --> 00:16:15,120 Speaker 1: with your audio file, and that number will depend upon 271 00:16:15,120 --> 00:16:18,320 Speaker 1: how long the audio file is. You've got forty four 272 00:16:18,320 --> 00:16:23,120 Speaker 1: thousand one hundred samples per second, actually twice that for stereo, 273 00:16:23,280 --> 00:16:25,760 Speaker 1: but for the purposes of this discussion, let's kind of 274 00:16:25,920 --> 00:16:28,960 Speaker 1: stick with mono sounds so that I don't start having 275 00:16:29,040 --> 00:16:31,720 Speaker 1: math coming out of my ears. And we're still in 276 00:16:31,720 --> 00:16:34,920 Speaker 1: the very easy, simple part as far as math goes. 277 00:16:34,960 --> 00:16:37,520 Speaker 1: We haven't gotten to the complicated stuff yet. All right, 278 00:16:37,600 --> 00:16:41,600 Speaker 1: so you've got forty four thousand one hundred samples per second. 279 00:16:42,160 --> 00:16:45,320 Speaker 1: To compress it into an MP three format, the algorithm 280 00:16:45,360 --> 00:16:49,320 Speaker 1: first groups all of these samples into collections called frames.
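To put numbers on the sample rate discussion, here is a small Python sketch of the arithmetic: the highest frequency a given sample rate can capture, and how many samples a recording of a given length contains. The function names and the three minute example are assumptions chosen for illustration.

```python
CD_SAMPLE_RATE_HZ = 44_100  # forty four point one kilohertz


def highest_representable_hz(sample_rate_hz):
    """The sample rate must be twice the highest frequency you want to keep."""
    return sample_rate_hz / 2


def sample_count(duration_seconds, sample_rate_hz=CD_SAMPLE_RATE_HZ, channels=1):
    """Total number of samples in an uncompressed recording."""
    return int(duration_seconds * sample_rate_hz * channels)


top_hz = highest_representable_hz(CD_SAMPLE_RATE_HZ)
mono_samples = sample_count(180)                # a three minute song, mono
stereo_samples = sample_count(180, channels=2)  # same song, stereo

print(top_hz)          # 22050.0
print(mono_samples)    # 7938000
print(stereo_samples)  # 15876000
```

The 22,050 hertz ceiling sits a little above the roughly twenty kilohertz limit of human hearing, and the sample counts show why uncompressed files get big so quickly.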
281 00:16:50,440 --> 00:16:53,640 Speaker 1: So take those forty four thousand one hundred per second, and 282 00:16:53,640 --> 00:16:56,480 Speaker 1: then you start saying, okay, we're gonna group you in batches. 283 00:16:56,960 --> 00:17:00,080 Speaker 1: Each batch is called a frame, and each frame contains 284 00:17:00,120 --> 00:17:04,480 Speaker 1: one thousand one hundred fifty two samples. Now that's specifically to 285 00:17:04,560 --> 00:17:09,280 Speaker 1: maintain backwards compatibility with MPEG Layer two, which established that 286 00:17:09,320 --> 00:17:12,119 Speaker 1: one thousand one hundred fifty two number. But we're not 287 00:17:12,160 --> 00:17:16,360 Speaker 1: talking about MPEG Layer two. We're talking about MPEG Layer three, 288 00:17:16,800 --> 00:17:18,400 Speaker 1: and so that means we have to get a little 289 00:17:18,400 --> 00:17:25,440 Speaker 1: more complicated. So each frame consists of two subgroups called granules. 290 00:17:25,440 --> 00:17:29,320 Speaker 1: So each granule has five hundred seventy six samples. Five seventy 291 00:17:29,359 --> 00:17:32,639 Speaker 1: six times two is one thousand one hundred fifty two, so five seventy 292 00:17:32,680 --> 00:17:36,680 Speaker 1: six samples per granule. Now, technically MP three encoders only 293 00:17:36,680 --> 00:17:39,000 Speaker 1: work on one granule at a time, but they may 294 00:17:39,040 --> 00:17:42,879 Speaker 1: reference the granules immediately before and immediately after the current 295 00:17:42,920 --> 00:17:45,520 Speaker 1: one in order to see how the audio within the 296 00:17:45,560 --> 00:17:49,480 Speaker 1: file changes over time. All right, so now you've got 297 00:17:49,480 --> 00:17:54,000 Speaker 1: your granules of five hundred seventy six samples each. Then 298 00:17:54,040 --> 00:17:57,480 Speaker 1: the MP three encoder runs the samples through a filter bank, 299 00:17:57,960 --> 00:18:01,960 Speaker 1: which sorts the sound into thirty two frequency ranges.
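The frame and granule grouping can be sketched in a few lines of Python. This is only an illustration of the batching described above: the function name is hypothetical, and simply dropping the leftover samples at the end is an assumption made to keep the example short, not what a real encoder does.

```python
SAMPLES_PER_FRAME = 1152
GRANULES_PER_FRAME = 2
SAMPLES_PER_GRANULE = SAMPLES_PER_FRAME // GRANULES_PER_FRAME  # 576


def split_into_granules(samples):
    """Group samples into frames of 1152, each split into two 576-sample granules.

    Leftover samples that do not fill a whole frame are dropped here
    (an assumption for brevity).
    """
    frames = []
    last_start = len(samples) - SAMPLES_PER_FRAME
    for start in range(0, last_start + 1, SAMPLES_PER_FRAME):
        frame = samples[start:start + SAMPLES_PER_FRAME]
        first = frame[:SAMPLES_PER_GRANULE]
        second = frame[SAMPLES_PER_GRANULE:]
        frames.append((first, second))
    return frames


# One second of mono audio at 44.1 kHz; the sample values are just indices.
one_second = list(range(44_100))
frames = split_into_granules(one_second)
frame_duration_ms = SAMPLES_PER_FRAME / 44_100 * 1000

print(len(frames))                   # 38 whole frames in one second
print(len(frames[0][0]))             # 576 samples in the first granule
print(round(frame_duration_ms, 2))   # 26.12 milliseconds per frame
```

So at forty four point one kilohertz each frame covers about twenty six milliseconds of sound, and each granule about thirteen.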
Are 300 00:18:02,000 --> 00:18:05,239 Speaker 1: you crazy about the numbers yet, Dylan? Are you? 301 00:18:05,720 --> 00:18:10,520 Speaker 1: Dylan's nodding. Dylan, it gets worse from here. So you 302 00:18:10,560 --> 00:18:13,560 Speaker 1: have thirty two frequency ranges, which is another nod to 303 00:18:13,560 --> 00:18:15,840 Speaker 1: the Layer two method, which used those thirty two ranges 304 00:18:15,880 --> 00:18:20,240 Speaker 1: for encoding purposes. But we're not talking about Layer two, are we? No, 305 00:18:20,760 --> 00:18:24,320 Speaker 1: we're talking MP three, gosh darn it. That means we 306 00:18:24,359 --> 00:18:27,159 Speaker 1: take those thirty two ranges and we subdivide them by 307 00:18:27,200 --> 00:18:31,320 Speaker 1: a factor of eighteen. That means we have five hundred 308 00:18:31,320 --> 00:18:36,879 Speaker 1: seventy six bands of frequencies, each band containing one five hundred seventy sixth 309 00:18:37,080 --> 00:18:41,199 Speaker 1: of the frequency range of the original sample. So what 310 00:18:41,280 --> 00:18:44,320 Speaker 1: that actually means, and this is actually pretty easy: 311 00:18:44,720 --> 00:18:48,159 Speaker 1: the bands are not limited to a specific number for 312 00:18:48,240 --> 00:18:53,240 Speaker 1: their frequency range. Right. The bands don't mean that 313 00:18:53,280 --> 00:18:56,359 Speaker 1: band number one goes from twenty hertz 314 00:18:56,440 --> 00:18:58,840 Speaker 1: up to a certain range and band five hundred 315 00:18:59,000 --> 00:19:02,399 Speaker 1: seventy six ends at twenty kilohertz. That's not what 316 00:19:02,440 --> 00:19:05,600 Speaker 1: it means. They're dependent upon the original audio. So if 317 00:19:05,600 --> 00:19:09,680 Speaker 1: the original audio contains sounds within a narrow range of frequencies, 318 00:19:10,040 --> 00:19:13,760 Speaker 1: the five seventy six bands will be more precise.
But if the 319 00:19:13,760 --> 00:19:17,600 Speaker 1: original recording has a vast range of frequencies, the bands 320 00:19:17,640 --> 00:19:20,440 Speaker 1: are less precise. So another way to think about this 321 00:19:21,119 --> 00:19:24,160 Speaker 1: is with a pizza. So let's say you get an extra 322 00:19:24,240 --> 00:19:26,960 Speaker 1: large pizza and you cut it into eight equal slices. 323 00:19:27,600 --> 00:19:30,280 Speaker 1: And then you get a small pizza and you cut 324 00:19:30,320 --> 00:19:33,600 Speaker 1: that into eight equal slices. Well, in both cases you 325 00:19:33,640 --> 00:19:37,760 Speaker 1: have with each slice one eighth of a pizza. But 326 00:19:37,840 --> 00:19:42,080 Speaker 1: the extra large pizza slice is bigger than the 327 00:19:42,119 --> 00:19:45,280 Speaker 1: small pizza slice. It all depends on the size 328 00:19:45,280 --> 00:19:47,960 Speaker 1: of the pizza. So in this case, it depends upon 329 00:19:48,000 --> 00:19:51,080 Speaker 1: the range of frequencies. And Dylan, do you think 330 00:19:51,080 --> 00:19:53,280 Speaker 1: we could go for some pizza, you know, just 331 00:19:53,320 --> 00:19:56,159 Speaker 1: put the episode on hold and go get pizza. Dylan's nodding. 332 00:19:56,720 --> 00:20:00,879 Speaker 1: It's great for audio. Yeah, so, uh, pizza. We'll be 333 00:20:00,960 --> 00:20:05,800 Speaker 1: right back. Okay, that was good pizza. Now, um, oh man, 334 00:20:05,840 --> 00:20:08,400 Speaker 1: I got a whole bunch more notes. Okay, well, let's 335 00:20:08,440 --> 00:20:10,879 Speaker 1: go ahead and do the rest of this. 336 00:20:10,920 --> 00:20:12,840 Speaker 1: All right, so you've got your sound divided up into 337 00:20:12,880 --> 00:20:16,320 Speaker 1: those five seventy six sub-bands of frequencies, you know, 338 00:20:16,640 --> 00:20:19,840 Speaker 1: the thing I compared to pizza slices earlier.
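Editor's note: the arithmetic behind the 576 bands, and the pizza comparison, can be written out as a short Python sketch. It assumes the 576 bands evenly split everything from 0 Hz up to half the sample rate. That even split and the function name `band_width_hz` are assumptions made for illustration and are not stated in the episode.

```python
FILTER_BANK_RANGES = 32   # the thirty two ranges inherited from Layer two
SUBDIVISIONS = 18         # each range is split eighteen ways
TOTAL_BANDS = FILTER_BANK_RANGES * SUBDIVISIONS


def band_width_hz(sample_rate):
    """Width of one band if the bands evenly split 0 Hz .. sample_rate / 2.

    The even split is an assumption made for this sketch.
    """
    return (sample_rate / 2) / TOTAL_BANDS


# "Extra large pizza": a recording sampled at 44,100 samples per second.
wide = band_width_hz(44100)
# "Small pizza": a recording sampled at 22,050 samples per second.
narrow = band_width_hz(22050)
```

Both recordings get cut into the same number of slices, but a slice of the smaller frequency range is half as wide, which is the sense in which the narrower recording's bands are "more precise".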
Now you 339 00:20:19,880 --> 00:20:25,359 Speaker 1: get two different mathematical processes applied to this data. One 340 00:20:25,520 --> 00:20:28,919 Speaker 1: is the fast Fourier transform or FFT, and 341 00:20:28,960 --> 00:20:32,720 Speaker 1: the other is the modified discrete cosine transform or 342 00:20:32,800 --> 00:20:36,760 Speaker 1: MDCT. Now I am not going to dive 343 00:20:36,800 --> 00:20:40,040 Speaker 1: deeply into how these transforms work because frankly, they are 344 00:20:40,119 --> 00:20:44,439 Speaker 1: beyond my mathematical understanding. But I know what they do. 345 00:20:44,680 --> 00:20:49,280 Speaker 1: I just cannot explain the process, like how they do 346 00:20:49,400 --> 00:20:51,479 Speaker 1: what they do. So I'm going to give you the 347 00:20:51,480 --> 00:20:54,720 Speaker 1: explanation of what they do, what the outcome of each 348 00:20:54,760 --> 00:20:58,840 Speaker 1: of these transform processes happens to be. But I'm not 349 00:20:58,920 --> 00:21:00,800 Speaker 1: going to be able to tell you the actual mathematical 350 00:21:00,840 --> 00:21:03,479 Speaker 1: steps involved in each because I don't math so good, guys. 351 00:21:04,640 --> 00:21:07,520 Speaker 1: But let's start with the fast Fourier transform. So a 352 00:21:07,640 --> 00:21:09,720 Speaker 1: transform is kind of what it sounds like. It's all 353 00:21:09,720 --> 00:21:13,960 Speaker 1: about transforming information in some way. So in this particular case, 354 00:21:14,119 --> 00:21:17,359 Speaker 1: the FFT transforms the frequency bands we just 355 00:21:17,400 --> 00:21:22,360 Speaker 1: talked about into data that can be further analyzed by 356 00:21:22,480 --> 00:21:26,600 Speaker 1: a psychoacoustic model that's in the encoder. So this is 357 00:21:26,640 --> 00:21:29,960 Speaker 1: that simulated human ear and brain we were talking about earlier.
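Editor's note: to give a feel for what a Fourier transform hands to the psychoacoustic model, here is a small Python sketch. It uses a plain, slow discrete Fourier transform rather than a fast one (the FFT computes the same values more efficiently), and the test signal, its two tones, and the function name are assumptions made for illustration.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Return the strength of each frequency bin in a block of samples.

    This is a naive discrete Fourier transform; an FFT produces the
    same numbers with far fewer operations.
    """
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        total = 0j
        for t in range(n):
            total += samples[t] * cmath.exp(-2j * math.pi * k * t / n)
        # Scale so a sine wave of amplitude A shows up as roughly A.
        magnitudes.append(abs(total) / (n / 2))
    return magnitudes


# A made-up block of 64 samples: a loud tone in bin 4 and a quiet one in bin 10.
n = 64
signal = [
    1.0 * math.sin(2 * math.pi * 4 * t / n)
    + 0.25 * math.sin(2 * math.pi * 10 * t / n)
    for t in range(n)
]
mags = dft_magnitudes(signal)
loudest_bin = max(range(len(mags)), key=lambda k: mags[k])
```

The samples go in as amplitudes over time and come out as a strength per frequency bin, which is the form in which a loud component standing out over quieter neighbors becomes easy to spot.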
358 00:21:30,840 --> 00:21:34,800 Speaker 1: So what the encoder does is it analyzes each bit 359 00:21:34,920 --> 00:21:38,600 Speaker 1: of data and looks for signs that it represents audio 360 00:21:38,680 --> 00:21:41,680 Speaker 1: that wouldn't be perceived by a human. So it's 361 00:21:41,800 --> 00:21:46,240 Speaker 1: looking for any potential masking possibilities. So are there 362 00:21:46,240 --> 00:21:48,800 Speaker 1: collections of frequencies that are grouped close together, and is 363 00:21:48,840 --> 00:21:51,320 Speaker 1: one of those frequencies louder than the others? You might 364 00:21:51,359 --> 00:21:53,919 Speaker 1: be able to do away with those softer frequencies because 365 00:21:53,960 --> 00:21:57,480 Speaker 1: of frequency masking. The encoder will also look at whether 366 00:21:57,560 --> 00:21:59,879 Speaker 1: or not the audio has a lot of complexity to it, 367 00:22:00,800 --> 00:22:02,960 Speaker 1: if it has a lot of changes, or if it's 368 00:22:03,000 --> 00:22:07,840 Speaker 1: just relatively steady or simple audio. Any transient sounds that 369 00:22:07,880 --> 00:22:11,600 Speaker 1: are present in the audio might end up causing temporal masking, 370 00:22:11,680 --> 00:22:14,040 Speaker 1: so it'll analyze those as well and see if that's 371 00:22:14,040 --> 00:22:19,000 Speaker 1: a possibility. So really what it's looking for is, you know, 372 00:22:20,280 --> 00:22:23,320 Speaker 1: just any really loud sounds that stand out above the 373 00:22:23,400 --> 00:22:26,119 Speaker 1: rest of the recording. That's what the FFT 374 00:22:26,280 --> 00:22:30,200 Speaker 1: is doing. So what about the modified discrete cosine transform?
Well, 375 00:22:30,240 --> 00:22:32,359 Speaker 1: this is happening in parallel with the FFT 376 00:22:32,800 --> 00:22:36,280 Speaker 1: and the samples get sorted into different patterns called windows, 377 00:22:37,119 --> 00:22:39,679 Speaker 1: uh, and the criterion for sorting all has to do 378 00:22:39,720 --> 00:22:43,719 Speaker 1: with whether the sample represents a steady sound or varied sound. 379 00:22:44,240 --> 00:22:47,359 Speaker 1: So if you have a simple steady sound, that goes 380 00:22:47,400 --> 00:22:51,200 Speaker 1: into a long window. If there's a lot of variation 381 00:22:51,240 --> 00:22:53,960 Speaker 1: in the sound, like there are a lot of consonants 382 00:22:53,960 --> 00:22:56,760 Speaker 1: in a vocal line or it's like a drum solo 383 00:22:56,960 --> 00:22:59,600 Speaker 1: or something like that, it would get sorted into a 384 00:22:59,720 --> 00:23:02,960 Speaker 1: series of three short windows, and each short window 385 00:23:03,000 --> 00:23:09,320 Speaker 1: contains one hundred ninety two samples. That amounts to four whole milliseconds, 386 00:23:09,440 --> 00:23:15,000 Speaker 1: so four thousandths of a second, in three patterned windows. 387 00:23:15,040 --> 00:23:18,080 Speaker 1: So you've got these windows now, either long windows for 388 00:23:18,119 --> 00:23:21,600 Speaker 1: simple sounds or short windows for the more complex sounds. 389 00:23:21,640 --> 00:23:24,600 Speaker 1: And then the modified discrete cosine transform kicks into gear. 390 00:23:24,680 --> 00:23:26,840 Speaker 1: It looks at each long window or set of three 391 00:23:26,840 --> 00:23:30,920 Speaker 1: short windows and converts them into a set of spectral values. 392 00:23:31,520 --> 00:23:33,800 Speaker 1: To some of you, that probably sounds meaningless. So let's 393 00:23:33,840 --> 00:23:37,720 Speaker 1: talk about spectral analysis for a second.
First, I was 394 00:23:38,000 --> 00:23:40,919 Speaker 1: very disappointed to learn that spectral analysis doesn't involve a 395 00:23:40,920 --> 00:23:46,199 Speaker 1: psychologist talking to a ghost about its emotional state, so bummer. 396 00:23:47,000 --> 00:23:50,560 Speaker 1: But spectral analysis is when you look at a spectrum 397 00:23:50,600 --> 00:23:54,800 Speaker 1: of information, like a spectrum of frequencies or related information 398 00:23:54,840 --> 00:23:58,399 Speaker 1: like energy states. That's what this transform does. It takes 399 00:23:58,520 --> 00:24:02,119 Speaker 1: data that originally represents a slice of time in a 400 00:24:02,200 --> 00:24:05,360 Speaker 1: sound waveform. That's what sample is. A sample is an 401 00:24:05,400 --> 00:24:09,280 Speaker 1: instance of time in a wave form and converts it 402 00:24:09,320 --> 00:24:15,800 Speaker 1: into information representing sound as energy across a range of frequencies. Now, 403 00:24:15,840 --> 00:24:18,080 Speaker 1: you can plot out spectral information in a lot of 404 00:24:18,080 --> 00:24:21,000 Speaker 1: different ways, but one common method is to use brightness 405 00:24:21,040 --> 00:24:25,800 Speaker 1: to indicate energy levels. Higher energy levels are brighter patches 406 00:24:26,040 --> 00:24:31,120 Speaker 1: in your visual representation of spectral data. High frequencies would 407 00:24:31,119 --> 00:24:34,160 Speaker 1: appear at the top of a spectral view, like imagine 408 00:24:34,200 --> 00:24:37,400 Speaker 1: a box, and at the top of the box that's 409 00:24:37,400 --> 00:24:39,440 Speaker 1: where you would find high frequencies, at the bottom of 410 00:24:39,440 --> 00:24:41,720 Speaker 1: the box that's where you find low frequencies, and it's 411 00:24:41,760 --> 00:24:44,840 Speaker 1: just lots of patches of color. 
The really bright patches 412 00:24:44,840 --> 00:24:50,200 Speaker 1: of color represent very high energy frequencies, so they could 413 00:24:50,240 --> 00:24:54,000 Speaker 1: be high or low in actual frequency, but we're 414 00:24:54,040 --> 00:24:57,600 Speaker 1: talking about energy levels, not whether it's a high or low pitch. 415 00:24:59,440 --> 00:25:02,120 Speaker 1: Looking left to right represents the passing of time, and 416 00:25:02,160 --> 00:25:05,560 Speaker 1: looking along any vertical point shows you the actual frequency 417 00:25:06,240 --> 00:25:09,800 Speaker 1: or pitch, and then the respective energy level is the brightness. 418 00:25:09,920 --> 00:25:12,080 Speaker 1: So it's kind of like looking at sound as a wave, 419 00:25:12,240 --> 00:25:14,760 Speaker 1: but instead of being a wave, you're looking at information 420 00:25:14,760 --> 00:25:19,600 Speaker 1: that indicates frequency range and energy level. That representation is 421 00:25:19,600 --> 00:25:22,480 Speaker 1: actually kind of analogous to how we hear audio. So 422 00:25:22,560 --> 00:25:25,679 Speaker 1: an encoder can analyze the spectral view and start to 423 00:25:25,680 --> 00:25:29,880 Speaker 1: filter out the data we wouldn't perceive due to psychoacoustics. Now, 424 00:25:29,920 --> 00:25:33,920 Speaker 1: after all that processing, the encoder looks at the frequency 425 00:25:34,000 --> 00:25:37,200 Speaker 1: sub-bands and the levels of spectral intensity for each, 426 00:25:37,800 --> 00:25:41,200 Speaker 1: and that information can then be used for the next phase, 427 00:25:41,800 --> 00:25:45,240 Speaker 1: which is compression. But right now I think we could 428 00:25:45,280 --> 00:25:48,760 Speaker 1: all stand a little decompression, so let's take another quick 429 00:25:48,760 --> 00:26:00,280 Speaker 1: break to thank our sponsor. All right, so now you're 430 00:26:00,280 --> 00:26:04,280 Speaker 1: ready to compress your analyzed audio.
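Editor's note: the modified discrete cosine transform described before the break, which turns a run of samples over time into a set of spectral values, can be sketched directly from its textbook formula. This Python version is slow, applies no window shaping, and is not how a real encoder implements it. The block sizes used (36 samples in, 18 values out for a long window; 12 in, 6 out for a short one) are background assumptions not given in the episode. They are consistent with its numbers, since 32 ranges times 18 is 576 and 32 ranges times 6 is 192.

```python
import math


def mdct(block):
    """Direct MDCT: takes 2N time samples and returns N spectral values."""
    two_n = len(block)
    n = two_n // 2
    coefficients = []
    for k in range(n):
        total = 0.0
        for t in range(two_n):
            angle = math.pi / n * (t + 0.5 + n / 2) * (k + 0.5)
            total += block[t] * math.cos(angle)
        coefficients.append(total)
    return coefficients


# A made-up "steady" block: a single cosine that lines up with spectral line 3.
N = 18
steady_block = [
    math.cos(math.pi / N * (t + 0.5 + N / 2) * (3 + 0.5))
    for t in range(2 * N)
]
long_spectrum = mdct(steady_block)     # 36 samples in, 18 values out
short_spectrum = mdct([0.0] * 12)      # 12 samples in, 6 values out
strongest_line = max(range(N), key=lambda k: abs(long_spectrum[k]))
```

For this steady tone, nearly all of the energy lands in one spectral value and the rest come out close to zero, which is why a steady sound is cheap to describe once it has been transformed.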
Good for you, and 431 00:26:04,320 --> 00:26:08,080 Speaker 1: by you I mean encoders. This has to be simpler 432 00:26:08,119 --> 00:26:11,119 Speaker 1: than that analysis segment, right, I mean that got a 433 00:26:11,119 --> 00:26:14,959 Speaker 1: little crazy with all the different bands and sub bands 434 00:26:15,000 --> 00:26:22,119 Speaker 1: and windows and frames and granules. Sadly it gets more complicated, 435 00:26:22,119 --> 00:26:25,280 Speaker 1: all right. So there are two layers of compression going 436 00:26:25,320 --> 00:26:30,000 Speaker 1: on with MPEG Layer three. One of those layers depends 437 00:26:30,080 --> 00:26:34,480 Speaker 1: upon the psychoacoustic analysis and the other doesn't. So why 438 00:26:34,520 --> 00:26:37,800 Speaker 1: would you use two layers with different strategies like that? Well, 439 00:26:37,840 --> 00:26:40,840 Speaker 1: the reason is that one strategy is great for complex 440 00:26:40,880 --> 00:26:43,639 Speaker 1: audio with lots of components, but not so great with 441 00:26:43,760 --> 00:26:46,639 Speaker 1: simpler sounds, and the other strategy is kind of the opposite. 442 00:26:47,160 --> 00:26:49,520 Speaker 1: So the psychoacoustic approach is the one that's really good 443 00:26:49,560 --> 00:26:53,480 Speaker 1: for complicated sounds. If if you've got a lot of 444 00:26:53,680 --> 00:26:57,840 Speaker 1: volume changes, lots of different frequencies, it's just complicated and 445 00:26:57,960 --> 00:27:00,840 Speaker 1: rich sound, you've got a lot of opportunity to look 446 00:27:00,880 --> 00:27:04,240 Speaker 1: for masking and other acoustic elements that limit the actual 447 00:27:04,320 --> 00:27:08,159 Speaker 1: sounds that people perceive. So it means there are a 448 00:27:08,200 --> 00:27:11,760 Speaker 1: lot of chances for you to uh fudge by dropping 449 00:27:11,760 --> 00:27:16,720 Speaker 1: all the stuff that people probably wouldn't notice anyway. 
And uh, 450 00:27:16,800 --> 00:27:18,399 Speaker 1: if you take a piece that's got a lot of 451 00:27:18,400 --> 00:27:21,879 Speaker 1: elements at varying volumes, there are likely several opportunities to 452 00:27:21,880 --> 00:27:25,760 Speaker 1: to do this. But if you're talking about relatively straightforward 453 00:27:26,440 --> 00:27:31,320 Speaker 1: audio with few components, few changes in volume, there's really 454 00:27:31,320 --> 00:27:33,399 Speaker 1: not a whole lot of data you can ditch without 455 00:27:33,440 --> 00:27:35,919 Speaker 1: it actually affecting the quality of the audio in a 456 00:27:35,960 --> 00:27:40,240 Speaker 1: perceptible way. And this is part of what Brandenburg, that 457 00:27:40,280 --> 00:27:42,439 Speaker 1: guy I was talking about in our first episode in 458 00:27:42,480 --> 00:27:45,399 Speaker 1: this series. Uh, that's what he discovered when he was 459 00:27:45,800 --> 00:27:48,960 Speaker 1: working with the MP three standard and he was listening 460 00:27:49,000 --> 00:27:53,760 Speaker 1: back to that Suzanne Vega acapella track Tom's Diner. He 461 00:27:53,840 --> 00:27:55,560 Speaker 1: was listening to a compressed version of it, and he 462 00:27:55,600 --> 00:27:58,480 Speaker 1: said it was terrible. He said it ruined the quality 463 00:27:58,480 --> 00:28:01,679 Speaker 1: of the audio. And part of that is because that 464 00:28:01,720 --> 00:28:05,040 Speaker 1: particular song is fairly simple. There's just not a lot 465 00:28:05,040 --> 00:28:08,280 Speaker 1: of opportunity to take advantage of masking and other tricks 466 00:28:08,760 --> 00:28:13,800 Speaker 1: without potentially compromising the quality. So they decided to also 467 00:28:13,840 --> 00:28:17,800 Speaker 1: incorporate some traditional compression strategies, which which work better with 468 00:28:17,880 --> 00:28:20,880 Speaker 1: those types of recordings. 
So the MP three format takes 469 00:28:20,880 --> 00:28:24,760 Speaker 1: advantage of both the traditional approach and the psychoacoustic approach, 470 00:28:25,480 --> 00:28:28,520 Speaker 1: and that allows the encoder to compress files into a smaller 471 00:28:28,560 --> 00:28:32,679 Speaker 1: size without just following a single strategy, like it doesn't 472 00:28:32,680 --> 00:28:34,760 Speaker 1: have to do a one size fits all for all 473 00:28:34,840 --> 00:28:39,600 Speaker 1: elements of audio. Now, combining those two strategies requires a 474 00:28:39,600 --> 00:28:43,320 Speaker 1: little more mathematical gymnastics. So let's go back to those 475 00:28:43,440 --> 00:28:47,200 Speaker 1: five seventy six frequency bins. You know, those sub-bands 476 00:28:47,240 --> 00:28:50,320 Speaker 1: we talked about earlier. You've got to quantize those suckers. 477 00:28:51,440 --> 00:28:54,000 Speaker 1: What does that mean? It means assigning a quantity 478 00:28:54,160 --> 00:28:58,479 Speaker 1: to each frequency bin. You have to give it 479 00:28:58,520 --> 00:29:01,400 Speaker 1: a quantity of some sort so that you can end 480 00:29:01,480 --> 00:29:06,600 Speaker 1: up judging how much you can get away with dropping data. 481 00:29:06,960 --> 00:29:09,800 Speaker 1: So to do this, the encoder sorts those five seventy six 482 00:29:09,840 --> 00:29:13,280 Speaker 1: bins into twenty two scale factor bands. How you doing 483 00:29:13,280 --> 00:29:17,640 Speaker 1: over there, Dylan? Just checking in on you. Okay, Dylan's 484 00:29:17,680 --> 00:29:20,400 Speaker 1: got a thousand-yard stare going. I hope 485 00:29:20,440 --> 00:29:22,880 Speaker 1: you guys are doing okay over there. All right, so 486 00:29:23,080 --> 00:29:25,040 Speaker 1: before smoke starts coming out of your ears, let me 487 00:29:25,080 --> 00:29:28,760 Speaker 1: explain what the scale factor bands are all about.
The 488 00:29:28,800 --> 00:29:32,360 Speaker 1: whole purpose of the scale factor bands is to determine 489 00:29:32,440 --> 00:29:36,960 Speaker 1: how the information will be stored within the compressed state. 490 00:29:37,800 --> 00:29:39,800 Speaker 1: So you want to get away with as little data 491 00:29:39,880 --> 00:29:43,040 Speaker 1: as possible before affecting sound quality. So if you can 492 00:29:43,080 --> 00:29:46,760 Speaker 1: say the same thing in a shorter space without affecting 493 00:29:46,760 --> 00:29:49,600 Speaker 1: the quality of what it is you're saying, you go 494 00:29:49,680 --> 00:29:54,680 Speaker 1: with it. Brevity is the soul of compression. So if 495 00:29:54,680 --> 00:29:57,960 Speaker 1: we were talking about language, I would say it's more 496 00:29:57,960 --> 00:30:02,880 Speaker 1: efficient to say it's raining outside, or even just it's raining, 497 00:30:03,200 --> 00:30:06,280 Speaker 1: because you would assume that it would be outside where 498 00:30:06,280 --> 00:30:08,840 Speaker 1: the rain is happening, and it would be inefficient for 499 00:30:08,840 --> 00:30:11,360 Speaker 1: me to say it's coming down like cats and dogs 500 00:30:11,360 --> 00:30:15,240 Speaker 1: out there. It's not as efficient as saying it's raining. 501 00:30:16,000 --> 00:30:20,760 Speaker 1: So if you can get away with shorter statements without 502 00:30:20,840 --> 00:30:24,680 Speaker 1: affecting the actual quality, and you could argue that by 503 00:30:24,840 --> 00:30:27,280 Speaker 1: switching from it's coming down like cats and dogs out 504 00:30:27,320 --> 00:30:30,840 Speaker 1: there and it's raining changes the quality, And that could 505 00:30:30,880 --> 00:30:32,640 Speaker 1: be a valid argument. But if you can get away 506 00:30:33,080 --> 00:30:37,400 Speaker 1: with shorter without affecting quality, you do it. 
So each 507 00:30:37,440 --> 00:30:41,960 Speaker 1: scale factor band is represented by a quantity, Then the 508 00:30:42,040 --> 00:30:46,440 Speaker 1: encoder divides that quantity by a given number called the quantizer, 509 00:30:46,800 --> 00:30:50,440 Speaker 1: which is the same across the entire frequency spectrum for 510 00:30:50,560 --> 00:30:55,040 Speaker 1: that recording. The resulting number is then rounded up or 511 00:30:55,160 --> 00:31:00,280 Speaker 1: down to a whole digit. And here's an important point. 512 00:31:00,680 --> 00:31:04,160 Speaker 1: Individual scale factor bands can be scaled up or down 513 00:31:04,280 --> 00:31:08,280 Speaker 1: for more or less precision to represent the actual value 514 00:31:08,440 --> 00:31:12,440 Speaker 1: of those bands. So what the heck does all that mean? Well, 515 00:31:12,520 --> 00:31:15,080 Speaker 1: the purpose of dividing and rounding is just to simplify 516 00:31:15,120 --> 00:31:17,840 Speaker 1: the data to reduce the amount you need in order 517 00:31:17,880 --> 00:31:20,640 Speaker 1: to store the information. So let's go with a totally 518 00:31:20,720 --> 00:31:24,520 Speaker 1: hypothetical example. Let's say you've got a scale factor band 519 00:31:25,320 --> 00:31:29,040 Speaker 1: and you've decided you're representing that scale factor band with 520 00:31:29,160 --> 00:31:33,160 Speaker 1: the quantity seven eight four zero seven thousand, eight hundred forty, 521 00:31:33,880 --> 00:31:37,200 Speaker 1: and you've chosen the number one hundred to quantize your data, 522 00:31:37,280 --> 00:31:41,719 Speaker 1: meaning that you will divide each uh scale factor bands 523 00:31:41,800 --> 00:31:45,880 Speaker 1: quantity by one hundred. So this is seven thousand, eight 524 00:31:45,960 --> 00:31:49,400 Speaker 1: hundred forty. You divide it by one hundred. 
Uh, and 525 00:31:49,440 --> 00:31:52,680 Speaker 1: the scale factor for this particular band you have determined 526 00:31:52,840 --> 00:31:56,280 Speaker 1: is one point zero. That means that once you get 527 00:31:56,320 --> 00:31:59,840 Speaker 1: that result where you've divided the quantity by the quantizer, 528 00:32:00,080 --> 00:32:03,120 Speaker 1: you multiply by one. That means there's no change. Multiply 529 00:32:03,160 --> 00:32:05,440 Speaker 1: by one, you get the same number. More on that 530 00:32:05,480 --> 00:32:07,960 Speaker 1: in a bit. Okay, so you take that seven thousand, 531 00:32:08,000 --> 00:32:11,000 Speaker 1: eight hundred forty, you divide it by one hundred. That gives 532 00:32:11,040 --> 00:32:14,000 Speaker 1: you seventy eight point four. Well, now you have to 533 00:32:14,080 --> 00:32:17,960 Speaker 1: round that number, so you round it down to seventy eight. Now, 534 00:32:17,960 --> 00:32:20,200 Speaker 1: when you have a decoder and you're ready to play 535 00:32:20,240 --> 00:32:23,960 Speaker 1: back the information, it comes across this quantity, the seventy eight, 536 00:32:24,400 --> 00:32:28,200 Speaker 1: and it knows what the quantizer number was, so it 537 00:32:28,280 --> 00:32:31,080 Speaker 1: multiplies by one hundred to get back to seven thousand, 538 00:32:31,120 --> 00:32:35,280 Speaker 1: eight hundred. So the replicated number is actually forty off 539 00:32:35,560 --> 00:32:38,760 Speaker 1: from the original number. The original number again was seven thousand, 540 00:32:38,800 --> 00:32:43,200 Speaker 1: eight hundred forty, the replicated number is seven thousand, eight hundred. Now, 541 00:32:43,240 --> 00:32:48,680 Speaker 1: those inconsistencies manifest as noise in the actual playback.
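Editor's note: the hypothetical example above (a quantity of 7,840, a quantizer of 100, and a scale factor of 1.0) can be checked with a few lines of Python. The function names are made up for this sketch, and the numbers are the episode's own illustration rather than values from a real file.

```python
def quantize(value, quantizer):
    """Divide by the quantizer and round to the nearest whole number."""
    return round(value / quantizer)


def dequantize(stored, quantizer):
    """What a decoder does: multiply the stored number back up."""
    return stored * quantizer


original = 7840
quantizer = 100

stored = quantize(original, quantizer)      # the small number that gets saved
restored = dequantize(stored, quantizer)    # what playback reconstructs
error = original - restored                 # the part that turns into noise
```

The stored value is 78, playback reconstructs 7,800, and the 40 that went missing is the rounding error that shows up as noise.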
So 542 00:32:48,720 --> 00:32:51,400 Speaker 1: if you wanted to increase the precision of any given 543 00:32:51,440 --> 00:32:53,760 Speaker 1: scale factor band, you could do so by changing the 544 00:32:53,800 --> 00:32:56,800 Speaker 1: scale factor number. So in that example, just now, I 545 00:32:56,840 --> 00:32:59,160 Speaker 1: said the number was one point zero, meaning there's no 546 00:32:59,280 --> 00:33:02,680 Speaker 1: change to that result. But I could have said it 547 00:33:02,760 --> 00:33:05,840 Speaker 1: was ten, which means we would multiply the quantized number 548 00:33:05,840 --> 00:33:07,960 Speaker 1: by ten. So we would take that seven thousand, eight 549 00:33:08,040 --> 00:33:10,520 Speaker 1: hundred forty, divide it by one hundred, you get seventy eight 550 00:33:10,520 --> 00:33:14,040 Speaker 1: point four, then multiply it by ten to get seven hundred eighty four. 551 00:33:14,760 --> 00:33:18,600 Speaker 1: So when the decoder decompresses the file, it would reverse 552 00:33:18,720 --> 00:33:21,320 Speaker 1: this whole thing. It would divide by ten and then multiply by a 553 00:33:21,400 --> 00:33:24,160 Speaker 1: hundred. You would end up getting seven thousand, eight hundred 554 00:33:24,160 --> 00:33:26,960 Speaker 1: forty again, which means that you wouldn't introduce any noise 555 00:33:27,160 --> 00:33:30,200 Speaker 1: to the file. You would have a perfect representation. But 556 00:33:30,320 --> 00:33:33,760 Speaker 1: in some cases, the encoder may determine that any noise 557 00:33:33,800 --> 00:33:37,440 Speaker 1: that you generate wouldn't be noticed or it wouldn't impact 558 00:33:37,440 --> 00:33:39,240 Speaker 1: the quality of the audio enough for it to be 559 00:33:39,240 --> 00:33:42,680 Speaker 1: a problem because of other factors for that particular scale 560 00:33:42,680 --> 00:33:45,440 Speaker 1: factor band, like maybe it's really quiet, or maybe it's 561 00:33:45,440 --> 00:33:48,800 Speaker 1: really complex.
So in those cases, you could reduce the 562 00:33:48,840 --> 00:33:52,120 Speaker 1: scale factor number by making it something else, like point 563 00:33:52,160 --> 00:33:54,920 Speaker 1: one instead of one point oh. So that means you 564 00:33:54,960 --> 00:33:58,520 Speaker 1: would multiply the quantized number by point one, so the 565 00:33:58,600 --> 00:34:01,760 Speaker 1: seventy eight point four would become seven point eight four, 566 00:34:01,880 --> 00:34:03,280 Speaker 1: and then you have to round it to get a 567 00:34:03,280 --> 00:34:06,440 Speaker 1: whole integer, so you get eight. Seven point eight four 568 00:34:06,520 --> 00:34:09,880 Speaker 1: rounds up to eight. Now, when a decoder decompresses 569 00:34:09,880 --> 00:34:14,000 Speaker 1: the audio, it multiplies eight by one hundred, that quantizer 570 00:34:14,040 --> 00:34:17,400 Speaker 1: that we've talked about so much, uh, and, actually, 571 00:34:17,440 --> 00:34:19,080 Speaker 1: at this point the result would have to be eight thousand because 572 00:34:19,080 --> 00:34:22,759 Speaker 1: it's also taking into account the scale factor, so it's 573 00:34:22,800 --> 00:34:26,879 Speaker 1: multiplying it by a thousand, not just a hundred. So 574 00:34:27,000 --> 00:34:29,480 Speaker 1: you would get a number that would pop up to 575 00:34:29,600 --> 00:34:32,520 Speaker 1: eight thousand. And remember the original was seven thousand, eight 576 00:34:32,560 --> 00:34:34,960 Speaker 1: hundred forty. So you look at the difference between these two, 577 00:34:35,000 --> 00:34:37,759 Speaker 1: the original seven thousand eight hundred forty, the new number is 578 00:34:37,840 --> 00:34:40,680 Speaker 1: eight thousand. There's a pretty big difference there. That change 579 00:34:40,760 --> 00:34:43,120 Speaker 1: might introduce enough noise for it to be a problem.
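Editor's note: the three scale factor cases walked through above (1.0, 10, and 0.1) can be compared side by side in Python. This is a sketch of the episode's simplified arithmetic, not the formula a real MP3 encoder uses. The function names are invented for illustration, and exact fractions are used only to keep floating-point rounding from muddying the comparison.

```python
from fractions import Fraction

QUANTIZER = 100


def encode(value, scale_factor):
    """Divide by the quantizer, apply the scale factor, round to a whole number."""
    return round(Fraction(value, QUANTIZER) * Fraction(scale_factor))


def decode(stored, scale_factor):
    """Undo the scale factor, then multiply by the quantizer."""
    return Fraction(stored) / Fraction(scale_factor) * QUANTIZER


original = 7840
results = {}
for label, scale in [("1.0", Fraction(1)), ("10", Fraction(10)), ("0.1", Fraction(1, 10))]:
    stored = encode(original, scale)
    restored = decode(stored, scale)
    results[label] = (stored, int(restored), abs(original - int(restored)))
```

A scale factor of 10 stores the larger number 784 and restores the value exactly, 1.0 stores 78 and is off by 40, and 0.1 stores just 8 and is off by 160. The smaller the stored number, the fewer bits it needs and the more noise it leaves behind.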
580 00:34:43,160 --> 00:34:45,440 Speaker 1: So how does the encoder determine if a scale factor 581 00:34:45,520 --> 00:34:48,120 Speaker 1: band is meeting the proper criteria? How can it tell 582 00:34:48,960 --> 00:34:53,120 Speaker 1: if there is too much noise or if the 583 00:34:53,160 --> 00:34:56,440 Speaker 1: noise falls below the threshold? Well, it goes through what's 584 00:34:56,480 --> 00:35:00,400 Speaker 1: called a Huffman coding process. At this point, Dylan 585 00:35:01,360 --> 00:35:05,000 Speaker 1: is currently just staring at the wall and drool is 586 00:35:05,040 --> 00:35:09,719 Speaker 1: coming out. Huffman coding process. It converts scale factor bands 587 00:35:09,719 --> 00:35:12,920 Speaker 1: into binary strings, and the process goes through a series 588 00:35:12,920 --> 00:35:15,120 Speaker 1: of tables to determine if the data within the scale 589 00:35:15,120 --> 00:35:18,320 Speaker 1: factor band requires more or less precision to describe the 590 00:35:18,360 --> 00:35:22,160 Speaker 1: sound without affecting the audio quality. So, Huffman coding is 591 00:35:22,160 --> 00:35:24,520 Speaker 1: a process where you start with a large number 592 00:35:24,520 --> 00:35:27,239 Speaker 1: of possibilities and you begin to narrow it down. Uh, 593 00:35:27,320 --> 00:35:30,880 Speaker 1: some people describe it as the coding equivalent of twenty questions. 594 00:35:31,560 --> 00:35:34,760 Speaker 1: So you ask your first question, like animal, vegetable or mineral. 595 00:35:35,040 --> 00:35:38,200 Speaker 1: You get an answer, say, animal. Well, that first answer 596 00:35:38,280 --> 00:35:42,200 Speaker 1: eliminates a ton of other possibilities and narrows the focus, 597 00:35:42,239 --> 00:35:45,279 Speaker 1: like anything that doesn't pertain to animal you can automatically 598 00:35:45,320 --> 00:35:49,440 Speaker 1: discount because you already know it can't apply to that answer.
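Editor's note: the twenty questions idea behind Huffman coding can be shown with a short Python sketch. It builds a code from scratch for a made-up list of quantized values, so that common values get short binary strings and rare ones get longer strings. The episode describes the encoder consulting a series of tables instead, so building the code on the fly is an assumption of this sketch, as are the sample values and the function name.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a dict mapping each symbol to a binary string."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Each heap entry: (count, tie_breaker, {symbol: code_so_far}).
    heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two rarest groups; each merge is one yes/no "question".
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Made-up quantized values: mostly zeros, a few small numbers.
values = [0, 0, 0, 0, 0, 0, 1, 1, 1, -1, -1, 2]
code = huffman_code(values)
encoded = "".join(code[v] for v in values)
```

With four distinct values, a fixed-length code would need two bits per value, or 24 bits for this list. The Huffman code spends one bit on each zero and more on the rarer values, and the whole list comes to 21 bits.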
599 00:35:51,080 --> 00:35:53,840 Speaker 1: With MP three compression, this means making certain the number 600 00:35:53,920 --> 00:35:57,840 Speaker 1: of bits representing a granule, because remember I mentioned that 601 00:35:58,480 --> 00:36:01,919 Speaker 1: in MP three formats you have frames, and 602 00:36:02,280 --> 00:36:05,200 Speaker 1: each frame has one thousand one hundred fifty two samples 603 00:36:05,239 --> 00:36:09,200 Speaker 1: and consists of two granules with five seventy six samples each. So 604 00:36:09,440 --> 00:36:11,640 Speaker 1: when you answer the first question, it eliminates a lot 605 00:36:11,680 --> 00:36:16,000 Speaker 1: of other possibilities and narrows the focus. So like with animal, vegetable, mineral, 606 00:36:16,000 --> 00:36:19,080 Speaker 1: if I say animal, you're not gonna ask any questions 607 00:36:19,320 --> 00:36:22,520 Speaker 1: that have to do with minerals or vegetables, only because 608 00:36:22,520 --> 00:36:25,520 Speaker 1: it wouldn't make sense. You know, those aren't gonna apply. 609 00:36:25,760 --> 00:36:28,120 Speaker 1: Same thing with MP threes, except this time it 610 00:36:28,120 --> 00:36:30,920 Speaker 1: means making certain the number of bits representing a granule. 611 00:36:31,080 --> 00:36:36,239 Speaker 1: Remember, there are two granules per frame with the MP three layer. Uh, 612 00:36:36,360 --> 00:36:39,120 Speaker 1: you want to make sure that the number of bits 613 00:36:39,160 --> 00:36:42,839 Speaker 1: representing that granule matches the chosen bit rate for compression. 614 00:36:43,200 --> 00:36:45,600 Speaker 1: So if after going through this process, the encoder says, hey, 615 00:36:45,600 --> 00:36:48,719 Speaker 1: this granule has more bits than what's allowed. It's too 616 00:36:48,800 --> 00:36:51,640 Speaker 1: many bits.
We gotta get rid of some of these, 617 00:36:51,800 --> 00:36:54,160 Speaker 1: the encoder can adjust the scale factor band so that 618 00:36:54,200 --> 00:36:58,560 Speaker 1: there's less precision, meaning that multiplier, in other words, that 619 00:36:59,040 --> 00:37:02,440 Speaker 1: I talked about earlier, and thus reduce the amount 620 00:37:02,440 --> 00:37:07,080 Speaker 1: of data needed to represent that particular granule. If a 621 00:37:07,120 --> 00:37:11,080 Speaker 1: granule comes in under the bit rate, the encoder can 622 00:37:11,120 --> 00:37:15,279 Speaker 1: increase the precision to reduce noise and fill that granule 623 00:37:15,400 --> 00:37:22,000 Speaker 1: out properly so it matches the actual threshold. After all this, 624 00:37:22,120 --> 00:37:25,320 Speaker 1: the pairs of granules become frames within the MP three files. 625 00:37:25,320 --> 00:37:27,839 Speaker 1: And the only other component in an MP three file 626 00:37:27,960 --> 00:37:31,399 Speaker 1: apart from these frames is the ID three metadata. 627 00:37:31,719 --> 00:37:33,759 Speaker 1: This is pretty simple. This is like a header, and 628 00:37:33,800 --> 00:37:36,040 Speaker 1: it comes before all the frames in the audio file 629 00:37:36,120 --> 00:37:39,920 Speaker 1: and contains information about the file itself, which can 630 00:37:39,960 --> 00:37:42,680 Speaker 1: include stuff like the title of a song, an artist name, 631 00:37:42,800 --> 00:37:46,600 Speaker 1: an album title, other stuff like that. It can also 632 00:37:46,640 --> 00:37:50,080 Speaker 1: include copyright information as well as information about the file itself, 633 00:37:50,120 --> 00:37:52,279 Speaker 1: such as whether or not it's a stereo recording or 634 00:37:52,320 --> 00:37:56,080 Speaker 1: a mono recording. So when you use a decoder like 635 00:37:56,120 --> 00:38:00,480 Speaker 1: an MP three player, it takes this compressed information.
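Editor's note: going back to the bit budget check described above, where a granule that needs too many bits gets less precision, here is a toy Python sketch of that loop. Nearly everything in it is an assumption made for illustration: the bit-counting rule stands in for real Huffman tables, the step doubles on each pass, a single step size is used for all values, and the example bit rate of 128,000 bits per second does not come from the episode. A real encoder also has to account for headers and other overhead that this ignores.

```python
SAMPLES_PER_GRANULE = 576


def granule_budget_bits(bitrate_bps, sample_rate):
    """Rough bits available for one granule, ignoring all overhead."""
    return bitrate_bps * SAMPLES_PER_GRANULE / sample_rate


def bits_needed(quantized):
    """Toy cost model: one sign bit plus the magnitude bits of each value."""
    return sum(1 + abs(q).bit_length() for q in quantized)


def fit_to_budget(values, budget_bits):
    """Coarsen the quantizer step until the values fit in the budget."""
    if budget_bits < len(values):
        raise ValueError("budget too small for this toy cost model")
    step = 1
    while True:
        quantized = [round(v / step) for v in values]
        if bits_needed(quantized) <= budget_bits:
            return step, quantized
        step *= 2  # less precision, fewer bits, more noise


# Made-up spectral values and a made-up 30-bit budget.
values = [7840, 1200, -300, 45, 3]
step, quantized = fit_to_budget(values, budget_bits=30)
budget = granule_budget_bits(128000, 44100)
```

With these numbers the loop settles on a step of 16. At 128,000 bits per second and 44,100 samples per second, a granule would have roughly 1,672 bits to work with before overhead.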
These 636 00:38:01,320 --> 00:38:06,560 Speaker 1: representations that the music has been reduced to, 637 00:38:07,840 --> 00:38:11,480 Speaker 1: and it converts that Huffman data back into the quantized format, 638 00:38:12,040 --> 00:38:14,719 Speaker 1: scales the data back up to its original size or 639 00:38:14,760 --> 00:38:20,560 Speaker 1: a close approximation. Remember, the uncompressed version may actually be 640 00:38:20,680 --> 00:38:25,240 Speaker 1: off by a significant amount depending upon each individual granule. 641 00:38:25,800 --> 00:38:28,040 Speaker 1: And all of that data gets recombined into a new 642 00:38:28,120 --> 00:38:30,319 Speaker 1: PCM sample that can be played back to you. 643 00:38:31,000 --> 00:38:34,080 Speaker 1: And that's all there is to it. Nothing could be easier. 644 00:38:35,280 --> 00:38:38,880 Speaker 1: All right, that took a lot out of me, so 645 00:38:38,920 --> 00:38:41,280 Speaker 1: I got really technical, and I apologize if I lost 646 00:38:41,320 --> 00:38:43,560 Speaker 1: any of you out there, or for those of you 647 00:38:43,560 --> 00:38:46,080 Speaker 1: who have a lot of experience working on compression algorithms, 648 00:38:46,120 --> 00:38:50,000 Speaker 1: for oversimplifying in several cases. But now we've got a 649 00:38:50,000 --> 00:38:52,480 Speaker 1: full episode about this, and I hope you have a 650 00:38:52,480 --> 00:38:55,600 Speaker 1: better understanding of how a big sound file can be 651 00:38:55,640 --> 00:38:59,799 Speaker 1: reduced to a smaller sound file. Next time, I'll just 652 00:38:59,800 --> 00:39:04,359 Speaker 1: say magic. It will make everyone happier. But I hope 653 00:39:04,360 --> 00:39:06,920 Speaker 1: you guys appreciated this. The next episode in this 654 00:39:07,000 --> 00:39:09,160 Speaker 1: series will be far less technical. It's going to 655 00:39:09,239 --> 00:39:12,839 Speaker 1: be more historical.
I'm going to talk about the progression 656 00:39:13,040 --> 00:39:16,279 Speaker 1: of the MP three player, how it came about, how 657 00:39:16,280 --> 00:39:19,000 Speaker 1: it evolved, and how the iPod ended up becoming the 658 00:39:19,120 --> 00:39:24,600 Speaker 1: dominant brand in a sea of MP three players, and 659 00:39:24,600 --> 00:39:27,520 Speaker 1: then maybe kind of explore where MP three players are today, 660 00:39:28,480 --> 00:39:30,600 Speaker 1: like how many are there, how big is the market? 661 00:39:30,960 --> 00:39:33,360 Speaker 1: Are people still buying them? That kind of question. 662 00:39:35,000 --> 00:39:37,280 Speaker 1: If you guys have any questions for me, or comments 663 00:39:37,400 --> 00:39:40,799 Speaker 1: or suggestions, anything like that, send me a message. My 664 00:39:40,920 --> 00:39:44,400 Speaker 1: email is tech Stuff at how stuff works dot com, 665 00:39:44,520 --> 00:39:46,680 Speaker 1: or you can drop me a line on Facebook or Twitter. 666 00:39:46,920 --> 00:39:49,279 Speaker 1: The handle at both of those is tech Stuff H 667 00:39:49,480 --> 00:39:53,000 Speaker 1: S W, and I'll talk to you guys again really 668 00:39:53,080 --> 00:40:00,960 Speaker 1: soon. For more on this and thousands of other topics, 669 00:40:01,200 --> 00:40:11,920 Speaker 1: visit how stuff works dot com.
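
For anyone who wants to poke at the numbers from this episode, here is a rough sketch of the frame arithmetic and a toy version of the bit-budget loop described above. The frame and granule sizes come from the episode; everything else is an assumption for illustration: the 128 kbps bit rate, the 44.1 kHz sample rate, the `toy_bit_cost` function (a made-up stand-in for Huffman coding), and the step growth factor. It also ignores frame headers and side information, so it is not how a real MP3 encoder is implemented.

```python
import math

SAMPLES_PER_FRAME = 1152   # from the episode
GRANULES_PER_FRAME = 2     # from the episode
SAMPLES_PER_GRANULE = SAMPLES_PER_FRAME // GRANULES_PER_FRAME


def granule_bit_budget(bitrate_bps, sample_rate_hz):
    """Average bits per granule at a constant bit rate.

    Assumption: header and side-information overhead is ignored.
    """
    frame_seconds = SAMPLES_PER_FRAME / sample_rate_hz
    return bitrate_bps * frame_seconds / GRANULES_PER_FRAME


def toy_bit_cost(q):
    """Hypothetical stand-in for Huffman coding: larger values cost more bits."""
    return 1 + 2 * abs(q).bit_length()


def quantize(values, step):
    """Divide by the step size and round; a bigger step means less precision."""
    return [round(v / step) for v in values]


def fit_granule(values, budget_bits, step=1.0, growth=1.19):
    """Coarsen the step until the toy-coded granule fits the bit budget."""
    if budget_bits < len(values):
        # Even an all-zero granule costs one bit per value in this toy model.
        raise ValueError("budget too small for this toy cost model")
    while True:
        quantized = quantize(values, step)
        bits = sum(toy_bit_cost(q) for q in quantized)
        if bits <= budget_bits:
            return step, quantized, bits
        step *= growth  # too many bits: reduce precision and try again


if __name__ == "__main__":
    # Assumed settings, not taken from the episode.
    budget = granule_bit_budget(128_000, 44_100)
    # Synthetic stand-in for one granule of frequency-domain values.
    values = [100.0 * math.sin(i) for i in range(SAMPLES_PER_GRANULE)]
    step, quantized, bits = fit_granule(values, int(budget))
    print(f"budget per granule: {budget:.1f} bits")
    print(f"chosen step: {step:.3f}, bits used: {bits}")
```

The loop only ever makes the step coarser, which mirrors the "too many bits" case; going the other way, spending spare bits on extra precision, would mean shrinking the step while the granule still fits.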