WEBVTT - Techstuff Classic: How MP3 Compression Works

0:00:04.160 --> 0:00:07.160
<v Speaker 1>Get in touch with technology with TechStuff from

0:00:07.240 --> 0:00:14.160
<v Speaker 1>HowStuffWorks dot com. Hey everybody, and welcome to TechStuff.

0:00:14.200 --> 0:00:16.680
<v Speaker 1>I'm Jonathan Strickland. I'm the host of the show, and

0:00:16.800 --> 0:00:20.439
<v Speaker 1>this is a Saturday morning rerun episode where we take

0:00:20.440 --> 0:00:23.560
<v Speaker 1>a classic episode of TechStuff and we present it

0:00:23.560 --> 0:00:25.760
<v Speaker 1>to you guys who may have missed it. I've been

0:00:25.800 --> 0:00:29.000
<v Speaker 1>talking a lot about tech and music recently. If you've

0:00:29.000 --> 0:00:31.360
<v Speaker 1>been listening to the recent episodes, you know all about that,

0:00:31.920 --> 0:00:34.640
<v Speaker 1>and there have been some great discussions. But it also

0:00:34.760 --> 0:00:38.760
<v Speaker 1>requires a little bit of uh knowledge of previous episodes

0:00:38.880 --> 0:00:40.960
<v Speaker 1>at times, and I know it can be tricky to

0:00:41.040 --> 0:00:44.559
<v Speaker 1>dig through the archives. So in this classic episode, I

0:00:44.600 --> 0:00:48.919
<v Speaker 1>talk about how the MP three compression format works, so

0:00:48.920 --> 0:00:51.559
<v Speaker 1>that you can actually understand how MP three works as

0:00:51.560 --> 0:00:54.600
<v Speaker 1>opposed to something like MIDI, and you can get an

0:00:54.600 --> 0:00:58.960
<v Speaker 1>appreciation for the differences between the two formats. This episode

0:00:58.960 --> 0:01:02.320
<v Speaker 1>originally published on January two thousand and seventeen. This is

0:01:02.360 --> 0:01:06.039
<v Speaker 1>a whole year ago more than that. Now we're in

0:01:06.080 --> 0:01:08.920
<v Speaker 1>April twenty eighteen as I record this. I hope you

0:01:09.000 --> 0:01:11.120
<v Speaker 1>enjoyed this classic episode. I hope it gives you a

0:01:11.160 --> 0:01:15.959
<v Speaker 1>deeper appreciation of the technical aspect of creating digital music

0:01:16.360 --> 0:01:19.440
<v Speaker 1>and I'll see you guys on the other side. So

0:01:19.560 --> 0:01:23.840
<v Speaker 1>let's remember that the heart of digital information is the

0:01:23.959 --> 0:01:28.080
<v Speaker 1>bit that's either a zero or a one. The basic

0:01:28.360 --> 0:01:34.720
<v Speaker 1>unit of information for digital formats zeros and ones. Now

0:01:34.720 --> 0:01:36.840
<v Speaker 1>we can use those zeros and ones to describe all

0:01:36.840 --> 0:01:41.120
<v Speaker 1>sorts of information, from text to audio, to video and

0:01:41.480 --> 0:01:45.120
<v Speaker 1>really pretty much anything you can think of that's represented digitally. Ultimately,

0:01:45.120 --> 0:01:46.680
<v Speaker 1>when you get down to it, it's a bunch of

0:01:46.760 --> 0:01:49.840
<v Speaker 1>zeros and ones. So let's say you start off with

0:01:49.880 --> 0:01:54.440
<v Speaker 1>your uncompressed audio file. You've got this enormous audio file

0:01:54.480 --> 0:01:56.600
<v Speaker 1>in front of you. It's made up of zeros and ones.

0:01:57.120 --> 0:02:00.480
<v Speaker 1>How do you make that file smaller? So in the world,

0:02:00.480 --> 0:02:04.120
<v Speaker 1>we can compress stuff, right, we can apply physical pressure

0:02:04.160 --> 0:02:07.400
<v Speaker 1>to things. Think about packing a suitcase. You can make

0:02:07.400 --> 0:02:09.640
<v Speaker 1>sure you get that extra outfit and if you just

0:02:09.919 --> 0:02:12.480
<v Speaker 1>press it down hard enough and get that zipper zipped

0:02:12.480 --> 0:02:15.600
<v Speaker 1>before it can burst open. But once you get to

0:02:15.639 --> 0:02:19.239
<v Speaker 1>a certain level of compression, you cannot make things smaller,

0:02:19.480 --> 0:02:21.880
<v Speaker 1>at least not without hurting yourself or whatever it is

0:02:21.919 --> 0:02:25.440
<v Speaker 1>you're trying to compress. Digital files are a little different

0:02:25.760 --> 0:02:29.800
<v Speaker 1>because you cannot physically cram the zeros and ones closer together.

0:02:29.880 --> 0:02:33.400
<v Speaker 1>It doesn't work like that. These are abstract things. You

0:02:33.440 --> 0:02:36.840
<v Speaker 1>can't make them smaller, right, You can't decrease the font.

0:02:36.960 --> 0:02:40.840
<v Speaker 1>It doesn't work that way. The numbers represent two different states.

0:02:41.400 --> 0:02:43.640
<v Speaker 1>So if you want to create a smaller audio file

0:02:44.080 --> 0:02:47.280
<v Speaker 1>containing the recording that was in a larger audio file,

0:02:47.760 --> 0:02:51.200
<v Speaker 1>you have to start getting creative now. In the last

0:02:51.240 --> 0:02:53.720
<v Speaker 1>part of this series, I talked about how the MP

0:02:53.800 --> 0:02:57.560
<v Speaker 1>three compression algorithm was born from an applied research institution

0:02:57.600 --> 0:03:00.240
<v Speaker 1>in Germany, and the team behind the MP three

0:03:00.280 --> 0:03:03.239
<v Speaker 1>wanted to find a way to compress audio, specifically music

0:03:03.600 --> 0:03:08.280
<v Speaker 1>for transmission over phone lines. Eventually this evolved into the

0:03:08.400 --> 0:03:13.000
<v Speaker 1>Moving Picture Experts Group Audio Layer three compression methodology, better

0:03:13.080 --> 0:03:17.960
<v Speaker 1>known as the MP three, and there's also MPEG two

0:03:18.000 --> 0:03:20.519
<v Speaker 1>and MPEG four standards. MPEG two, by the way, is

0:03:20.560 --> 0:03:23.799
<v Speaker 1>the basis of compression on DVDs, although the actual DVD

0:03:23.880 --> 0:03:28.240
<v Speaker 1>format is really a modification of MPEG two. And MPEG

0:03:28.280 --> 0:03:30.600
<v Speaker 1>four is a compression strategy for audio and video that's

0:03:30.639 --> 0:03:34.320
<v Speaker 1>frequently used in lots of different capacities, including streaming

0:03:34.320 --> 0:03:38.480
<v Speaker 1>media services. So by the late nineteen seventies, researchers began

0:03:38.560 --> 0:03:42.280
<v Speaker 1>to explore the possibility of leveraging psychoacoustics to figure out

0:03:42.320 --> 0:03:46.640
<v Speaker 1>how to compress audio. And psychoacoustics refers to the way

0:03:46.640 --> 0:03:51.360
<v Speaker 1>we perceive sound, and also the physiological effects

0:03:51.400 --> 0:03:55.080
<v Speaker 1>of sound on us. So this involves not just our

0:03:55.160 --> 0:03:58.200
<v Speaker 1>our physical sense of hearing, but also our brains and

0:03:58.240 --> 0:04:01.480
<v Speaker 1>the way our brains interpret sound. So, for example,

0:04:01.760 --> 0:04:05.520
<v Speaker 1>there's a psychoacoustic phenomenon that's called the Haas effect, H

0:04:05.680 --> 0:04:08.600
<v Speaker 1>A A S. And I think it's pretty interesting. So

0:04:08.800 --> 0:04:11.240
<v Speaker 1>here's how the Haas effect works. If you hear the

0:04:11.320 --> 0:04:16.320
<v Speaker 1>exact same sound coming from different directions, but the two

0:04:16.320 --> 0:04:19.680
<v Speaker 1>sounds arrive within thirty to forty milliseconds of each other,

0:04:20.080 --> 0:04:23.039
<v Speaker 1>your brain will be convinced that you really only heard

0:04:23.080 --> 0:04:26.479
<v Speaker 1>one sound and it came from the direction that hit

0:04:26.560 --> 0:04:30.240
<v Speaker 1>you first. So let's say a sound's coming from directly

0:04:30.279 --> 0:04:32.720
<v Speaker 1>in front of you and to your left, and you

0:04:33.080 --> 0:04:36.520
<v Speaker 1>get both of them within that thirty to forty millisecond range,

0:04:37.360 --> 0:04:39.479
<v Speaker 1>and you hear the one coming from ahead of you

0:04:39.560 --> 0:04:43.080
<v Speaker 1>first. You're convinced that you only heard that

0:04:43.160 --> 0:04:46.119
<v Speaker 1>sound once and it came from dead on straight ahead

0:04:46.160 --> 0:04:49.719
<v Speaker 1>of you. Your brain kind of discounts the one that

0:04:49.800 --> 0:04:53.200
<v Speaker 1>came off from the left, although it can reinforce it,

0:04:53.320 --> 0:04:55.560
<v Speaker 1>which ends up being really useful if you're planning out

0:04:55.560 --> 0:04:58.320
<v Speaker 1>p A systems for stage shows. I'm not joking. That

0:04:58.360 --> 0:05:01.120
<v Speaker 1>really is the way that uh people plan those things out.

0:05:01.400 --> 0:05:04.120
<v Speaker 1>It's pretty neat. Humans perceive sounds in a way that's

0:05:04.120 --> 0:05:08.240
<v Speaker 1>not necessarily representational of all the sounds surrounding us. You

0:05:08.240 --> 0:05:11.640
<v Speaker 1>can think of your brain as the filter between your

0:05:11.760 --> 0:05:15.719
<v Speaker 1>understanding and what reality actually is. A lot of stuff

0:05:15.760 --> 0:05:18.640
<v Speaker 1>goes on that it ends up getting rid of information

0:05:18.680 --> 0:05:21.080
<v Speaker 1>that your brain just says, you know what, he or

0:05:21.120 --> 0:05:25.080
<v Speaker 1>she doesn't need that, it's just gonna confuse things. We're

0:05:25.080 --> 0:05:28.440
<v Speaker 1>gonna dump it. And that's kind of how it works.

0:05:28.480 --> 0:05:30.640
<v Speaker 1>It's all on an unconscious level. It's not like you're

0:05:30.839 --> 0:05:34.960
<v Speaker 1>actively working to do this. So let's say you're in

0:05:34.960 --> 0:05:37.320
<v Speaker 1>a relatively busy hallway and there could be a lot

0:05:37.400 --> 0:05:40.839
<v Speaker 1>of sounds in that hallway. Stuff that's going on constantly

0:05:40.839 --> 0:05:44.080
<v Speaker 1>around you. Maybe there are doors opening and closing. Maybe

0:05:44.080 --> 0:05:47.000
<v Speaker 1>there are footsteps going up and down the hallway. Maybe someone's

0:05:47.120 --> 0:05:50.760
<v Speaker 1>shoes are squeaking against the linoleum floor. People are chattering

0:05:50.800 --> 0:05:53.880
<v Speaker 1>away in there. But you are having a conversation with someone,

0:05:54.279 --> 0:05:57.000
<v Speaker 1>so you turn your focus on that person and other

0:05:57.080 --> 0:06:01.240
<v Speaker 1>sounds seemingly fade away. They're still there, but they're not important.

0:06:01.839 --> 0:06:04.560
<v Speaker 1>So in this example, you would actually call those other

0:06:04.640 --> 0:06:08.520
<v Speaker 1>sounds a distraction, and you would really focus on the conversation. Uh.

0:06:08.560 --> 0:06:13.040
<v Speaker 1>That also shows how we're able to consciously direct our

0:06:13.120 --> 0:06:16.760
<v Speaker 1>sense, our perception of hearing. So both of these factors

0:06:16.800 --> 0:06:20.159
<v Speaker 1>come into play. Now. One thing that MP three encoding

0:06:20.200 --> 0:06:24.120
<v Speaker 1>takes advantage of is something called masking, and there are

0:06:24.120 --> 0:06:27.160
<v Speaker 1>a couple of different variations of the masking effect. One

0:06:27.200 --> 0:06:30.560
<v Speaker 1>of them is called frequency masking. So let's say you've

0:06:30.600 --> 0:06:33.520
<v Speaker 1>got two sound frequencies that are similar, perhaps they're just

0:06:33.560 --> 0:06:37.240
<v Speaker 1>a few hertz apart. Remember, frequencies are measured in hertz,

0:06:37.720 --> 0:06:41.560
<v Speaker 1>which is really the number of oscillations per second. So

0:06:41.680 --> 0:06:47.000
<v Speaker 1>let's say you've got a sound that's at I don't know, uh,

0:06:47.400 --> 0:06:52.400
<v Speaker 1>one thousand hertz, and another one that's at one

0:06:52.520 --> 0:06:56.599
<v Speaker 1>thousand and ten hertz. Now, the human ear is

0:06:56.640 --> 0:07:00.080
<v Speaker 1>precise enough to be able to tell the difference of

0:07:00.160 --> 0:07:02.840
<v Speaker 1>two sounds that are at least two hertz apart from

0:07:02.880 --> 0:07:06.400
<v Speaker 1>each other. That's how precise our resolution of hearing, it's

0:07:06.480 --> 0:07:09.840
<v Speaker 1>it's at that level. But if you get two sounds

0:07:09.880 --> 0:07:13.560
<v Speaker 1>played at the same time and they are that close

0:07:13.600 --> 0:07:17.160
<v Speaker 1>together in frequency, and one of those frequencies is played

0:07:17.160 --> 0:07:20.320
<v Speaker 1>at a greater volume than the other, our brains will

0:07:20.320 --> 0:07:23.200
<v Speaker 1>pick up on the louder sound and ignore the quieter sound,

0:07:23.280 --> 0:07:26.920
<v Speaker 1>even though both of them are present. What becomes important

0:07:26.920 --> 0:07:29.560
<v Speaker 1>at that point is the amplitude. Now, the further apart

0:07:29.600 --> 0:07:33.400
<v Speaker 1>in frequencies you get, the less that has an effect.

0:07:33.520 --> 0:07:35.400
<v Speaker 1>So if you get far enough apart where there are

0:07:35.400 --> 0:07:38.720
<v Speaker 1>two pitches, one of them noticeably louder than the other,

0:07:39.080 --> 0:07:41.360
<v Speaker 1>but they're far enough apart, you will hear both of them.
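
NOTE Editor's sketch, not from the episode: a toy version of the frequency masking rule described here.
A quieter tone counts as masked only when a louder tone sits close to it in pitch.
The function name and both thresholds (max_gap_hz, min_level_gap_db) are hypothetical values chosen for illustration.
A real encoder uses a far more detailed psychoacoustic model than this.
```python
def is_masked(quiet_hz, quiet_db, loud_hz, loud_db,
              max_gap_hz=50.0, min_level_gap_db=10.0):
    """Toy rule: the quieter tone is masked if it is close in pitch and much quieter."""
    # Assumed threshold: tones more than max_gap_hz apart never mask each other.
    close_in_frequency = abs(quiet_hz - loud_hz) <= max_gap_hz
    # Assumed threshold: the masker must be at least min_level_gap_db louder.
    much_quieter = (loud_db - quiet_db) >= min_level_gap_db
    return close_in_frequency and much_quieter
```
With these assumed thresholds, a quiet tone ten hertz away from a much louder one is masked, and the same tone three thousand hertz away is not.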

0:07:41.400 --> 0:07:44.600
<v Speaker 1>It only works if the two pitches are relatively close together,

0:07:45.720 --> 0:07:48.600
<v Speaker 1>and there's not a universal formula for frequency masking. As

0:07:48.600 --> 0:07:51.560
<v Speaker 1>you get closer to the boundaries of human hearing, frequency

0:07:51.600 --> 0:07:53.960
<v Speaker 1>masking becomes easier, So if it's a really low pitch

0:07:54.040 --> 0:07:56.640
<v Speaker 1>or a really high pitch, it's easier to get away

0:07:56.640 --> 0:07:59.400
<v Speaker 1>with it. Once you start getting into what is

0:07:59.440 --> 0:08:02.040
<v Speaker 1>thought of as the sweet spot for human hearing, which

0:08:02.080 --> 0:08:05.160
<v Speaker 1>is generally considered to be between two and five kilohertz,

0:08:06.240 --> 0:08:10.240
<v Speaker 1>you need a greater difference in volume or a smaller

0:08:10.280 --> 0:08:14.720
<v Speaker 1>difference in frequency in order for masking to work. Frequency

0:08:14.760 --> 0:08:18.560
<v Speaker 1>masking at any rate. But then there's also temporal masking,

0:08:19.640 --> 0:08:21.920
<v Speaker 1>and you might say, okay, I got it. Temporal that

0:08:21.960 --> 0:08:26.080
<v Speaker 1>means time. Indeed it does, my friend. This describes the

0:08:26.080 --> 0:08:29.080
<v Speaker 1>effect of a short but loud sound masking a softer

0:08:29.160 --> 0:08:33.360
<v Speaker 1>sound for a short time. Weird thing is the loud

0:08:33.400 --> 0:08:37.000
<v Speaker 1>sound can actually mask sounds that precede it slightly, not

0:08:37.080 --> 0:08:39.800
<v Speaker 1>by a whole lot, but a little bit. MP three

0:08:39.800 --> 0:08:43.920
<v Speaker 1>compression takes advantage of both frequency and temporal masking when

0:08:43.920 --> 0:08:47.120
<v Speaker 1>it's trying to determine which data needs to be included

0:08:47.200 --> 0:08:49.960
<v Speaker 1>and which data can be dumped, because it won't affect

0:08:50.000 --> 0:08:52.880
<v Speaker 1>your perception of whatever the audio file is in

0:08:52.920 --> 0:08:56.760
<v Speaker 1>the first place. So you also probably remember I talked

0:08:56.760 --> 0:08:59.600
<v Speaker 1>about the physical limitation to what we humans can hear,

0:08:59.800 --> 0:09:01.960
<v Speaker 1>no matter what our brains might be up to, so

0:09:02.040 --> 0:09:04.440
<v Speaker 1>that this doesn't have to do with our brains, you know,

0:09:04.520 --> 0:09:07.280
<v Speaker 1>filtering through the information that's coming in. This has to

0:09:07.320 --> 0:09:11.240
<v Speaker 1>do with the physical limitations of the human ear. In

0:09:11.280 --> 0:09:14.240
<v Speaker 1>the last episode of the series, I said typical human hearing.

0:09:14.880 --> 0:09:18.599
<v Speaker 1>Keep in mind, typical, there are exceptions. Uh, covers the

0:09:18.720 --> 0:09:21.600
<v Speaker 1>range of frequencies between about twenty hertz and twenty kilohertz,

0:09:21.679 --> 0:09:24.800
<v Speaker 1>or twenty thousand hertz. So twenty to twenty thousand.

0:09:25.840 --> 0:09:30.360
<v Speaker 1>Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right?

0:09:31.120 --> 0:09:33.679
<v Speaker 1>And as you get older, your ability to perceive those

0:09:33.760 --> 0:09:38.080
<v Speaker 1>higher frequencies starts to diminish. So most adults actually have

0:09:38.360 --> 0:09:44.480
<v Speaker 1>an upper range closer to sixteen kilohertz, not twenty. Uh. Kids,

0:09:44.720 --> 0:09:46.920
<v Speaker 1>they can hear those higher pitches. You may have heard

0:09:46.920 --> 0:09:51.480
<v Speaker 1>the story about how some convenience stores experimented with getting

0:09:51.559 --> 0:09:57.280
<v Speaker 1>rid of teenage loiterers by by uh projecting out these

0:09:57.280 --> 0:10:00.760
<v Speaker 1>super high pitches that adults could not hear but

0:10:00.920 --> 0:10:03.800
<v Speaker 1>kids could, and it discouraged kids from hanging out at

0:10:03.800 --> 0:10:08.600
<v Speaker 1>the convenience store and loitering. Um. I love that idea

0:10:09.559 --> 0:10:12.959
<v Speaker 1>so much. Anyway, that's because I'm old and my hearing

0:10:13.040 --> 0:10:16.920
<v Speaker 1>is terrible. Well, remember I also mentioned you can detect

0:10:17.000 --> 0:10:19.760
<v Speaker 1>changes in pitch at two hertz increments. If you get

0:10:19.880 --> 0:10:23.440
<v Speaker 1>below a two hertz change, like, if it's just a

0:10:23.520 --> 0:10:27.760
<v Speaker 1>one hertz difference between two frequencies, it's too low a

0:10:27.800 --> 0:10:30.080
<v Speaker 1>resolution for us to detect. To us, it will sound

0:10:30.160 --> 0:10:34.599
<v Speaker 1>exactly the same. So if you were to hear a

0:10:35.400 --> 0:10:40.400
<v Speaker 1>frequency at one thousand one hertz or one point zero

0:10:40.679 --> 0:10:43.960
<v Speaker 1>zero one kilohertz and one point zero zero two

0:10:44.160 --> 0:10:47.120
<v Speaker 1>kilohertz, you wouldn't notice the difference. They would sound

0:10:47.120 --> 0:10:50.199
<v Speaker 1>exactly the same to you. So if you're gonna take

0:10:50.200 --> 0:10:52.439
<v Speaker 1>audio and compress it, one step you could consider is

0:10:52.480 --> 0:10:57.240
<v Speaker 1>eliminating anything that's outside the actual range of frequencies that

0:10:57.280 --> 0:11:00.719
<v Speaker 1>we can hear, or simplifying any changes in frequency that

0:11:00.760 --> 0:11:04.439
<v Speaker 1>are smaller than two hertz. If you take all

0:11:04.440 --> 0:11:07.920
<v Speaker 1>that data and you say it is physically impossible for

0:11:08.000 --> 0:11:11.479
<v Speaker 1>a human to perceive this, get rid of that information,

0:11:11.600 --> 0:11:14.800
<v Speaker 1>then in theory it wouldn't have any effect on the

0:11:14.880 --> 0:11:19.160
<v Speaker 1>rest of the recording. But how do you go further than that, right?
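
NOTE Editor's sketch, not from the episode: the two simple pruning steps just described, applied to a list of tone frequencies.
The limits (twenty hertz, twenty kilohertz, two hertz resolution) are the figures quoted in the episode.
The function name and the keep-the-lower-tone rule are assumptions made for illustration.
This is a toy on a list of pure tones, not how an MP three encoder actually processes audio.
```python
AUDIBLE_MIN_HZ = 20.0        # lower limit quoted in the episode
AUDIBLE_MAX_HZ = 20_000.0    # upper limit quoted in the episode
PITCH_RESOLUTION_HZ = 2.0    # smallest pitch difference quoted in the episode
def prune_inaudible(frequencies_hz):
    """Drop tones outside the audible range, then merge tones closer than the resolution."""
    audible = sorted(f for f in frequencies_hz
                     if AUDIBLE_MIN_HZ <= f <= AUDIBLE_MAX_HZ)
    kept = []
    for f in audible:
        # Assumption: when two tones are indistinguishable, keep the lower one.
        if not kept or f - kept[-1] >= PITCH_RESOLUTION_HZ:
            kept.append(f)
    return kept
```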

0:11:19.240 --> 0:11:22.000
<v Speaker 1>how do you create a method so that you can

0:11:22.040 --> 0:11:24.160
<v Speaker 1>really compress this file? You want a method that will

0:11:24.160 --> 0:11:27.479
<v Speaker 1>preserve the important sounds while potentially ignoring all the unimportant

0:11:27.559 --> 0:11:31.360
<v Speaker 1>or incidental sounds. And you want it to be automatic, because

0:11:31.800 --> 0:11:34.920
<v Speaker 1>if you have to do it manually, then that's going to take

0:11:35.679 --> 0:11:40.000
<v Speaker 1>countless hours just to edit a single sound file. So

0:11:41.160 --> 0:11:44.360
<v Speaker 1>that was the challenge that the MP three research team

0:11:44.400 --> 0:11:49.480
<v Speaker 1>faced as a group. Now, their solution, which ultimately created

0:11:49.520 --> 0:11:51.800
<v Speaker 1>even more challenges was to come up with what was

0:11:51.920 --> 0:11:55.640
<v Speaker 1>essentially a simulated human ear and brain. They needed to

0:11:55.679 --> 0:12:01.559
<v Speaker 1>replicate the experience of perceiving music so that an algorithm

0:12:01.559 --> 0:12:05.720
<v Speaker 1>could evaluate every sound in an audio file and judge

0:12:05.800 --> 0:12:08.719
<v Speaker 1>if it in fact was relevant enough to include in the

0:12:08.720 --> 0:12:13.000
<v Speaker 1>final compressed version. If a sound were imperceptible, then it

0:12:13.000 --> 0:12:15.520
<v Speaker 1>wouldn't make sense to include it in the MP three file.

0:12:15.800 --> 0:12:18.080
<v Speaker 1>So by leaving out all the irrelevant data, they can

0:12:18.160 --> 0:12:22.199
<v Speaker 1>make the audio information take up less bandwidth. The file

0:12:22.240 --> 0:12:24.800
<v Speaker 1>itself would be smaller because you just dumped everything that

0:12:24.880 --> 0:12:28.400
<v Speaker 1>wasn't important. So the team used an algorithm called the

0:12:28.559 --> 0:12:33.760
<v Speaker 1>Low Complexity Adaptive Transform Coding, or LC-ATC,

0:12:34.080 --> 0:12:36.520
<v Speaker 1>as the foundation for their research. This was kind of

0:12:36.559 --> 0:12:40.319
<v Speaker 1>their starting point, and this is an approach that that

0:12:40.600 --> 0:12:43.800
<v Speaker 1>tries to do away with redundancy as much as possible,

0:12:43.840 --> 0:12:48.520
<v Speaker 1>and it also incorporates adaptation to perceptual requirements. Also, MP

0:12:48.640 --> 0:12:52.239
<v Speaker 1>three owes a lot to the MPEG Layer two standard,

0:12:52.800 --> 0:12:56.600
<v Speaker 1>So the Layer two obviously came out before Layer three,

0:12:56.760 --> 0:12:59.160
<v Speaker 1>and so a lot of the features of layer three

0:12:59.320 --> 0:13:04.800
<v Speaker 1>are really, um, they're legacy features from Layer two. Uh.

0:13:04.840 --> 0:13:07.040
<v Speaker 1>In other words, MP three group kind of got stuck

0:13:07.040 --> 0:13:09.600
<v Speaker 1>with them because otherwise they would have had a problem

0:13:09.600 --> 0:13:12.880
<v Speaker 1>with backwards compatibility. So the result is kind of a

0:13:12.960 --> 0:13:16.439
<v Speaker 1>clunky arrangement under the hood, and some of the features

0:13:16.640 --> 0:13:19.600
<v Speaker 1>may make very little sense when I go through them,

0:13:19.640 --> 0:13:21.839
<v Speaker 1>but some of that is because it's a holdover from

0:13:21.840 --> 0:13:26.840
<v Speaker 1>an earlier compression strategy, which isn't terribly satisfying as an answer.

0:13:26.880 --> 0:13:29.240
<v Speaker 1>But the reason many parts of the MP three compression

0:13:29.280 --> 0:13:31.480
<v Speaker 1>algorithm are the way they are is because that's the

0:13:31.480 --> 0:13:35.520
<v Speaker 1>way we've always done it. So next I'm gonna dive

0:13:35.600 --> 0:13:41.240
<v Speaker 1>into the phases of compression. But before I do that,

0:13:41.440 --> 0:13:44.160
<v Speaker 1>let's all take a deep breath and take a moment

0:13:44.200 --> 0:13:55.880
<v Speaker 1>to thank our sponsor, and we're back. So there are

0:13:55.920 --> 0:13:58.760
<v Speaker 1>two big phases we'll need to talk about with MP

0:13:58.920 --> 0:14:03.320
<v Speaker 1>three compression. The first phase is analysis and the second

0:14:03.320 --> 0:14:07.559
<v Speaker 1>phase is the actual compression itself. And after that there's

0:14:07.559 --> 0:14:10.680
<v Speaker 1>the process of decoding an MP three for playback. But

0:14:10.760 --> 0:14:13.520
<v Speaker 1>that's way simpler once we get an understanding of how

0:14:13.720 --> 0:14:18.959
<v Speaker 1>the encoding process actually happens. So let's begin with analysis. Now.

0:14:19.000 --> 0:14:22.560
<v Speaker 1>This is the part where the standard has to figure

0:14:22.560 --> 0:14:26.840
<v Speaker 1>out which frequencies within an audio range, or recording rather,

0:14:26.960 --> 0:14:32.760
<v Speaker 1>are important or perceptible. So how does a program, an

0:14:33.000 --> 0:14:35.920
<v Speaker 1>encoder figure out what we can hear and what we

0:14:36.000 --> 0:14:40.400
<v Speaker 1>cannot hear? Alright, time to get technical. So you start

0:14:40.440 --> 0:14:45.000
<v Speaker 1>off with your pulse code modulation audio file or PCM file.

0:14:45.160 --> 0:14:47.560
<v Speaker 1>And you might remember I talked about PCM audio in

0:14:47.600 --> 0:14:50.400
<v Speaker 1>the first episode of this series, but just in case

0:14:50.440 --> 0:14:54.160
<v Speaker 1>you don't, it's a lossless digital audio file. The actual

0:14:54.200 --> 0:14:57.040
<v Speaker 1>format could be a WAV or AIFF or

0:14:57.080 --> 0:15:00.400
<v Speaker 1>something along those lines, but the important thing to keep

0:15:00.440 --> 0:15:04.520
<v Speaker 1>in mind is that it is uncompressed. Now, that means

0:15:04.560 --> 0:15:06.880
<v Speaker 1>those files tend to be pretty big. This is our

0:15:06.960 --> 0:15:09.840
<v Speaker 1>raw material that we want to take and squish down

0:15:09.880 --> 0:15:14.120
<v Speaker 1>to a more manageable transferable size. And in our our

0:15:14.200 --> 0:15:16.640
<v Speaker 1>last episode in this series, I also mentioned that the

0:15:16.760 --> 0:15:20.120
<v Speaker 1>standard for CD audio is a sample rate of

0:15:20.160 --> 0:15:23.400
<v Speaker 1>forty four point one kilohertz. And we learned that

0:15:23.440 --> 0:15:26.120
<v Speaker 1>you need a sample rate twice the frequency of the

0:15:26.240 --> 0:15:30.520
<v Speaker 1>highest frequency in your recording, and since human hearing tops

0:15:30.520 --> 0:15:32.800
<v Speaker 1>out at around twenty kilohertz, the standard for CDs

0:15:32.960 --> 0:15:35.880
<v Speaker 1>is forty four point one kilohertz. The MP

0:15:36.000 --> 0:15:38.840
<v Speaker 1>three standard can support lots of different sample rates, but

0:15:39.000 --> 0:15:41.320
<v Speaker 1>forty four point one kilohertz is pretty much the

0:15:41.480 --> 0:15:45.800
<v Speaker 1>common standard. So you've got a number of samples with
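
NOTE Editor's sketch, not from the episode: the arithmetic behind the sample rate and why uncompressed PCM files are large.
The forty four point one kilohertz rate is from the episode.
Sixteen bits per sample and two channels are the usual CD audio values, added here as an assumption because the episode does not state the bit depth.
The function names are hypothetical.
```python
CD_SAMPLE_RATE_HZ = 44_100   # sample rate quoted in the episode
BITS_PER_SAMPLE = 16         # assumption: standard CD bit depth, not stated in the episode
CHANNELS = 2                 # stereo
def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent (half the rate)."""
    return sample_rate_hz / 2
def pcm_bytes(seconds, sample_rate_hz=CD_SAMPLE_RATE_HZ,
              bits_per_sample=BITS_PER_SAMPLE, channels=CHANNELS):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return seconds * sample_rate_hz * channels * bits_per_sample // 8
```
Under these assumptions a three minute stereo track, pcm_bytes(180), comes to a little under thirty two million bytes before any compression.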

0:15:45.880 --> 0:15:48.400
<v Speaker 1>your audio file, and that number will depend upon how

0:15:48.440 --> 0:15:53.160
<v Speaker 1>long the audio file is. You've got forty four thousand one hundred samples

0:15:53.200 --> 0:15:56.720
<v Speaker 1>per second, actually twice that for stereo. But for the

0:15:56.720 --> 0:15:59.680
<v Speaker 1>purposes of this discussion, let's kind of stick with mono

0:15:59.720 --> 0:16:02.720
<v Speaker 1>sound so that I don't start having math coming out

0:16:02.760 --> 0:16:06.040
<v Speaker 1>of my ears. And we're still in the very easy,

0:16:06.080 --> 0:16:08.480
<v Speaker 1>simple part as far as math goes. We haven't gotten

0:16:08.520 --> 0:16:11.080
<v Speaker 1>to the complicated stuff yet. All right, So you've got

0:16:11.080 --> 0:16:15.880
<v Speaker 1>forty four thousand, one hundred samples per second. To compress

0:16:15.920 --> 0:16:19.280
<v Speaker 1>it into an MP three format, the algorithm first groups

0:16:19.320 --> 0:16:24.520
<v Speaker 1>all of these samples into collections called frames. So take

0:16:24.560 --> 0:16:27.840
<v Speaker 1>those forty four thousand one hundred per second, and then you start saying, okay,

0:16:27.840 --> 0:16:30.880
<v Speaker 1>we're gonna group you in batches. Each batch is called

0:16:30.920 --> 0:16:34.800
<v Speaker 1>a frame, and each frame contains one thousand one hundred fifty

0:16:34.800 --> 0:16:39.320
<v Speaker 1>two samples. Now that's specifically to maintain backwards compatibility to

0:16:39.560 --> 0:16:43.520
<v Speaker 1>MPEG Layer two, which established that one thousand one hundred fifty

0:16:43.520 --> 0:16:46.720
<v Speaker 1>two number. But we're not talking about MPEG Layer two.

0:16:46.720 --> 0:16:50.760
<v Speaker 1>We're talking about MPEG Layer three, and so that means

0:16:50.760 --> 0:16:52.560
<v Speaker 1>we have to get a little more complicated. So each

0:16:52.600 --> 0:16:59.280
<v Speaker 1>frame consists of two subgroups called granules. So each granule

0:16:59.320 --> 0:17:04.240
<v Speaker 1>has five hundred seventy six samples, five seventy six times two is eleven fifty two,

0:17:04.400 --> 0:17:08.560
<v Speaker 1>so five seventy six samples per granule. Now, technically MP

0:17:08.640 --> 0:17:11.520
<v Speaker 1>three encoders only work on one granule at a time,

0:17:11.560 --> 0:17:15.040
<v Speaker 1>but they may reference the granules immediately before and immediately

0:17:15.160 --> 0:17:17.639
<v Speaker 1>after the current one in order to see how the

0:17:17.680 --> 0:17:21.960
<v Speaker 1>audio within the file changes over time. All right, So
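
NOTE Editor's sketch, not from the episode: grouping mono samples into frames and granules using the sizes quoted here.
The numbers (one thousand one hundred fifty two samples per frame, two granules of five hundred seventy six) are from the episode.
The function name and the zero padding of the final partial frame are assumptions made so the example is complete.
```python
SAMPLES_PER_FRAME = 1152
GRANULES_PER_FRAME = 2
SAMPLES_PER_GRANULE = SAMPLES_PER_FRAME // GRANULES_PER_FRAME  # 576
def split_into_granules(samples):
    """Return a list of frames; each frame is a list of two 576-sample granules."""
    padded = list(samples)
    remainder = len(padded) % SAMPLES_PER_FRAME
    if remainder:
        # Assumption: pad the last partial frame with silence (zeros).
        padded.extend([0] * (SAMPLES_PER_FRAME - remainder))
    frames = []
    for start in range(0, len(padded), SAMPLES_PER_FRAME):
        frame = padded[start:start + SAMPLES_PER_FRAME]
        frames.append([frame[:SAMPLES_PER_GRANULE], frame[SAMPLES_PER_GRANULE:]])
    return frames
```
One second of mono audio at forty four thousand one hundred samples fills thirty eight whole frames plus part of a thirty ninth, so each frame covers roughly twenty six milliseconds.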

0:17:22.040 --> 0:17:24.960
<v Speaker 1>now you've got your granules of five hundred seventy six

0:17:25.119 --> 0:17:29.200
<v Speaker 1>samples each. Then the MP three encoder runs the samples

0:17:29.240 --> 0:17:33.439
<v Speaker 1>through a filter bank, which sorts the sound into thirty

0:17:33.440 --> 0:17:36.359
<v Speaker 1>two frequency ranges. Are you? Are you crazy about the

0:17:36.400 --> 0:17:41.000
<v Speaker 1>numbers yet, Dylan? Are you? Dylan's nodding. Dylan, it gets

0:17:41.040 --> 0:17:45.360
<v Speaker 1>worse from here. So you have thirty two frequency ranges,

0:17:45.600 --> 0:17:47.720
<v Speaker 1>which is another nod to the layer two method, which

0:17:47.800 --> 0:17:50.880
<v Speaker 1>used those thirty two ranges for encoding purposes. But we're

0:17:50.880 --> 0:17:54.440
<v Speaker 1>not talking about layer two, are we. No, we're talking

0:17:54.560 --> 0:17:57.760
<v Speaker 1>MP three. Gosh darn it. That means we take those

0:17:57.800 --> 0:18:00.679
<v Speaker 1>thirty two ranges and we subdivide them by a factor

0:18:00.720 --> 0:18:05.240
<v Speaker 1>of eighteen. That means we have five hundred seventy six

0:18:05.440 --> 0:18:10.199
<v Speaker 1>bands of frequencies, each band containing one five hundred seventy sixth of

0:18:10.200 --> 0:18:14.439
<v Speaker 1>the frequency range of the original sample. So what that

0:18:14.520 --> 0:18:17.840
<v Speaker 1>actually means and this this is actually pretty easy. The

0:18:17.880 --> 0:18:21.440
<v Speaker 1>bands are not limited to a specific number for their

0:18:21.480 --> 0:18:26.399
<v Speaker 1>frequency range, right. The bands don't mean that on the

0:18:26.560 --> 0:18:29.640
<v Speaker 1>on band number one it goes from twenty hertz up

0:18:29.680 --> 0:18:32.359
<v Speaker 1>to a certain range, and on band five hundred seventy

0:18:32.400 --> 0:18:35.439
<v Speaker 1>six it ends at twenty kilohertz. That's not what

0:18:35.480 --> 0:18:38.639
<v Speaker 1>it means. They're dependent upon the original audio. So if

0:18:38.680 --> 0:18:42.720
<v Speaker 1>the original audio contains sounds within a narrow range of frequencies,

0:18:43.080 --> 0:18:46.680
<v Speaker 1>the five seventy six bands will be more precise. But if

0:18:46.720 --> 0:18:50.280
<v Speaker 1>the original recording has a vast range of frequencies, the

0:18:50.320 --> 0:18:53.280
<v Speaker 1>bands are less precise. So another way to think about
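
NOTE Editor's sketch, not from the episode: the band count and the width of one band, taking the description above at face value.
Thirty two ranges times eighteen subdivisions gives five hundred seventy six bands, each an equal share of whatever frequency range is being divided.
The function name and the example range of zero to half the forty four point one kilohertz sample rate are assumptions for illustration.
```python
FILTER_BANK_RANGES = 32
SUBDIVISIONS_PER_RANGE = 18
TOTAL_BANDS = FILTER_BANK_RANGES * SUBDIVISIONS_PER_RANGE  # 576
def band_width_hz(low_hz, high_hz):
    """Width of one band when the range low_hz..high_hz is cut into equal bands."""
    return (high_hz - low_hz) / TOTAL_BANDS
# Assumed example range: 0 Hz up to half of a 44,100 Hz sample rate.
example_width = band_width_hz(0, 44_100 / 2)
```
A narrower range gives narrower, more precise bands, and a wider range gives wider ones.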

0:18:53.320 --> 0:18:56.840
<v Speaker 1>this is with a pizza. So let's say you get

0:18:56.960 --> 0:19:00.000
<v Speaker 1>an extra large pizza and you cut it into eight equal slices,

0:19:00.640 --> 0:19:03.320
<v Speaker 1>and then you get a small pizza and you cut

0:19:03.359 --> 0:19:06.679
<v Speaker 1>that into eight equal slices. Well, in both cases you

0:19:06.680 --> 0:19:10.800
<v Speaker 1>have with each slice one eighth of a pizza. But

0:19:10.880 --> 0:19:15.119
<v Speaker 1>the extra large pizza pizza slice is bigger than the

0:19:15.160 --> 0:19:18.320
<v Speaker 1>small pizza pizza slice. It all depends on the size

0:19:18.320 --> 0:19:21.000
<v Speaker 1>of the pizza. So in this case, it depends upon

0:19:21.040 --> 0:19:24.120
<v Speaker 1>the range of frequencies. And Dylan, do you think

0:19:24.119 --> 0:19:26.320
<v Speaker 1>we could go for some pizza, you know, just

0:19:26.359 --> 0:19:29.199
<v Speaker 1>put the episode on hold and go get pizza. Dylan's nodding.

0:19:29.760 --> 0:19:33.919
<v Speaker 1>It's great for audio. Yeah, so, uh, pizza, We'll be

0:19:34.000 --> 0:19:38.840
<v Speaker 1>right back. Okay, that was good pizza. Now, um, oh, man,

0:19:38.880 --> 0:19:41.440
<v Speaker 1>I got a whole bunch more notes. Okay, well, let's

0:19:41.480 --> 0:19:43.919
<v Speaker 1>go ahead and do the rest of this.

0:19:43.960 --> 0:19:45.840
<v Speaker 1>All right, So you've got your sound divided up into

0:19:45.920 --> 0:19:49.359
<v Speaker 1>those five seventy six sub-bands of frequencies, you know,

0:19:49.680 --> 0:19:52.879
<v Speaker 1>the thing I compared to pizza slices earlier. Now you

0:19:52.920 --> 0:19:58.399
<v Speaker 1>get two different mathematical processes applied to this data. One

0:19:58.560 --> 0:20:01.959
<v Speaker 1>is the fast Fourier transform or FFT, and

0:20:02.000 --> 0:20:05.720
<v Speaker 1>the other is the modified discrete cosine transform or

0:20:05.840 --> 0:20:09.800
<v Speaker 1>MDCT. Now, I am not going to dive

0:20:09.840 --> 0:20:13.080
<v Speaker 1>deeply into how these transforms work, because frankly, they are

0:20:13.160 --> 0:20:17.480
<v Speaker 1>beyond my mathematical understanding. But I know what they do.

0:20:17.760 --> 0:20:22.320
<v Speaker 1>I just cannot explain the process like how they do

0:20:22.440 --> 0:20:24.520
<v Speaker 1>what they do. So I'm going to give you the

0:20:24.560 --> 0:20:27.760
<v Speaker 1>explanation of what they do. What the outcome of each

0:20:27.800 --> 0:20:31.880
<v Speaker 1>of these transform processes happens to be, but I'm not

0:20:31.960 --> 0:20:33.840
<v Speaker 1>going to be able to tell you the actual mathematical

0:20:33.880 --> 0:20:36.520
<v Speaker 1>steps involved in each because I don't math so good, guys.

0:20:37.680 --> 0:20:40.560
<v Speaker 1>But let's start with the fast Fourier transform. So

0:20:40.680 --> 0:20:42.760
<v Speaker 1>a transform is kind of what it sounds like. It's all

0:20:42.760 --> 0:20:47.000
<v Speaker 1>about transforming information in some way. So in this particular case,

0:20:47.160 --> 0:20:50.399
<v Speaker 1>the FFT transforms the frequency bands we just

0:20:50.440 --> 0:20:55.400
<v Speaker 1>talked about into data that can be further analyzed by

0:20:55.520 --> 0:20:59.639
<v Speaker 1>a psychoacoustic model that's in the encoder. So this is

0:20:59.680 --> 0:21:03.000
<v Speaker 1>that simulated human ear and brain we were talking about earlier.

0:21:03.880 --> 0:21:07.840
<v Speaker 1>So what the encoder does is it analyzes each bit

0:21:07.960 --> 0:21:11.639
<v Speaker 1>of data and looks for signs that it represents audio

0:21:11.720 --> 0:21:14.640
<v Speaker 1>that wouldn't be perceived by a human. So it's

0:21:14.840 --> 0:21:19.280
<v Speaker 1>looking for any potential for masking possibilities. So are there

0:21:19.280 --> 0:21:21.840
<v Speaker 1>collections of frequencies that are grouped close together, and is

0:21:21.880 --> 0:21:24.359
<v Speaker 1>one of those frequencies louder than the others. You might

0:21:24.400 --> 0:21:27.000
<v Speaker 1>be able to do away with those softer frequencies because

0:21:27.000 --> 0:21:30.520
<v Speaker 1>of frequency masking. The encoder will also look at whether

0:21:30.640 --> 0:21:33.000
<v Speaker 1>or not the audio has a lot of complexity to it,

0:21:33.840 --> 0:21:36.000
<v Speaker 1>if it has a lot of changes, or if it's

0:21:36.040 --> 0:21:40.879
<v Speaker 1>just relatively steady or simple audio. Any transient sounds that

0:21:40.920 --> 0:21:44.640
<v Speaker 1>are present in the audio might end up causing temporal masking,

0:21:44.720 --> 0:21:47.080
<v Speaker 1>so it'll analyze those as well and see if that's

0:21:47.080 --> 0:21:52.040
<v Speaker 1>a possibility. So really what it's looking for is, you know,

0:21:53.320 --> 0:21:56.399
<v Speaker 1>just any really loud sounds that stand out above the

0:21:56.440 --> 0:21:59.159
<v Speaker 1>rest of the recording. That's what the FFT

0:21:59.320 --> 0:22:03.240
<v Speaker 1>is doing. So what about the modified discrete cosine transform. Well,

0:22:03.280 --> 0:22:05.399
<v Speaker 1>this is happening in parallel with the FFT,

0:22:05.840 --> 0:22:10.360
<v Speaker 1>and the samples get sorted into different patterns called windows. Uh.

0:22:10.359 --> 0:22:12.920
<v Speaker 1>And the criterion for sorting all has to do with

0:22:12.920 --> 0:22:16.760
<v Speaker 1>whether the sample represents a steady sound or varied sound.

0:22:17.280 --> 0:22:20.400
<v Speaker 1>So if you have a simple steady sound, that goes

0:22:20.440 --> 0:22:24.240
<v Speaker 1>into a long window. If there's a lot of variation

0:22:24.280 --> 0:22:27.000
<v Speaker 1>in the sound, like there are a lot of consonants

0:22:27.000 --> 0:22:29.800
<v Speaker 1>in a vocal line, or it's like a drum solo

0:22:30.000 --> 0:22:32.720
<v Speaker 1>or something like that, it would get sorted into a

0:22:32.800 --> 0:22:36.480
<v Speaker 1>series of three short windows. And each short window contains

0:22:36.520 --> 0:22:42.560
<v Speaker 1>one ninety two samples. That amounts to four whole milliseconds, so

0:22:42.720 --> 0:22:48.159
<v Speaker 1>four thousandths of a second in three patterned windows. So

0:22:48.200 --> 0:22:51.440
<v Speaker 1>you've got these windows now, either long windows for simple

0:22:51.480 --> 0:22:54.760
<v Speaker 1>sounds or short windows for the more complex sounds, and

0:22:54.760 --> 0:22:57.800
<v Speaker 1>then the modified discrete cosine transform kicks into gear. It

0:22:57.800 --> 0:23:00.200
<v Speaker 1>looks at each long window or set of three short

0:23:00.240 --> 0:23:03.960
<v Speaker 1>windows and converts them into a set of spectral values.
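
NOTE
Editorial sketch of the arithmetic behind the sub-bands and windows described so far. It assumes a 44,100 Hz sample rate (the episode never states one) and 192 samples per short window; all variable names are hypothetical.
```python
# Sub-band and window arithmetic; SAMPLE_RATE_HZ is an assumed value, not from the episode.
SAMPLE_RATE_HZ = 44_100      # assumption: CD-quality audio
FILTER_BANK_RANGES = 32      # the thirty-two frequency ranges shared with Layer II
SPLITS_PER_RANGE = 18        # each range is subdivided by a factor of eighteen
SHORT_WINDOW_SAMPLES = 192   # samples in one short window
sub_bands = FILTER_BANK_RANGES * SPLITS_PER_RANGE
three_short_windows = 3 * SHORT_WINDOW_SAMPLES
short_window_ms = 1000 * SHORT_WINDOW_SAMPLES / SAMPLE_RATE_HZ
print(sub_bands)                  # 576
print(three_short_windows)        # 576, the same count as the sub-bands
print(round(short_window_ms, 2))  # 4.35, roughly the four milliseconds mentioned
```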

0:23:04.560 --> 0:23:06.840
<v Speaker 1>To some of you, that probably sounds meaningless. So let's

0:23:06.880 --> 0:23:10.760
<v Speaker 1>talk about spectral analysis for a second. First, I was

0:23:11.040 --> 0:23:13.960
<v Speaker 1>very disappointed to learn that spectral analysis doesn't involve a

0:23:13.960 --> 0:23:19.280
<v Speaker 1>psychologist talking to a ghost about its emotional state. So bummer.

0:23:20.040 --> 0:23:23.600
<v Speaker 1>But spectral analysis is when you look at a spectrum

0:23:23.640 --> 0:23:27.840
<v Speaker 1>of information, like a spectrum of frequencies or related information

0:23:27.880 --> 0:23:31.480
<v Speaker 1>like energy states. That's what this transform does. It takes

0:23:31.560 --> 0:23:35.159
<v Speaker 1>data that originally represented a slice of time in a

0:23:35.240 --> 0:23:38.400
<v Speaker 1>sound waveform. That's what a sample is. A sample is an

0:23:38.440 --> 0:23:42.320
<v Speaker 1>instance of time in a waveform and converts it

0:23:42.359 --> 0:23:48.880
<v Speaker 1>into information representing sound as energy across a range of frequencies. Now,

0:23:48.880 --> 0:23:51.119
<v Speaker 1>you can plot out spectral information in a lot of

0:23:51.119 --> 0:23:54.040
<v Speaker 1>different ways, but one common method is to use brightness

0:23:54.080 --> 0:23:58.840
<v Speaker 1>to indicate energy levels. Higher energy levels are brighter patches

0:23:59.080 --> 0:24:03.840
<v Speaker 1>in your visual representation of spectral data. High frequencies

0:24:03.920 --> 0:24:06.720
<v Speaker 1>would appear at the top of a spectral view like

0:24:06.800 --> 0:24:10.000
<v Speaker 1>imagine a box, and at the top of the box

0:24:10.200 --> 0:24:12.440
<v Speaker 1>that's where you would find high frequencies. At the bottom

0:24:12.440 --> 0:24:14.760
<v Speaker 1>of the box is where you find low frequencies, and it's

0:24:14.800 --> 0:24:17.880
<v Speaker 1>just lots of patches of color. The really bright patches

0:24:17.880 --> 0:24:23.280
<v Speaker 1>of color represent very high energy frequencies, so they could

0:24:23.280 --> 0:24:27.080
<v Speaker 1>be high or low in actual frequency, but we're

0:24:27.080 --> 0:24:30.640
<v Speaker 1>talking about energy levels, not whether it's a high or low pitch.
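
NOTE
Editorial sketch of the idea of turning time samples into energy per frequency. It uses a plain discrete Fourier transform for readability, not the FFT or MDCT a real encoder uses, and the function name dft_energy is hypothetical.
```python
import cmath
import math
def dft_energy(samples):
    """Return the energy in each frequency bin of a block of time samples (plain DFT)."""
    n = len(samples)
    energies = []
    for k in range(n // 2 + 1):
        total = sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples))
        energies.append(abs(total) ** 2)
    return energies
# A pure tone that completes 3 cycles in 32 samples should put its energy in bin 3.
tone = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
energy = dft_energy(tone)
loudest_bin = max(range(len(energy)), key=energy.__getitem__)
print(loudest_bin)  # 3
```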

0:24:32.520 --> 0:24:35.160
<v Speaker 1>Looking left or right represents the passing of time, and

0:24:35.200 --> 0:24:38.600
<v Speaker 1>looking along any vertical point shows you the actual frequency

0:24:39.280 --> 0:24:42.840
<v Speaker 1>or pitch, and then the respective energy level is the brightness.

0:24:42.960 --> 0:24:45.119
<v Speaker 1>So it's kind of like looking at sound as a wave,

0:24:45.280 --> 0:24:47.800
<v Speaker 1>but instead of being a wave, you're looking at information

0:24:47.800 --> 0:24:52.639
<v Speaker 1>that indicates frequency range and energy level. That representation is

0:24:52.640 --> 0:24:55.520
<v Speaker 1>actually kind of analogous to how we hear audio, So

0:24:55.600 --> 0:24:58.720
<v Speaker 1>an encoder can analyze the spectral view and start to

0:24:58.720 --> 0:25:02.920
<v Speaker 1>filter out the data we wouldn't perceive due to psychoacoustics. Now,

0:25:02.960 --> 0:25:06.960
<v Speaker 1>after all that processing, the encoder looks at the frequency

0:25:07.040 --> 0:25:10.240
<v Speaker 1>sub-bands and the levels of spectral intensity for each

0:25:10.840 --> 0:25:14.240
<v Speaker 1>and that information can then be used for the next phase,

0:25:14.840 --> 0:25:18.280
<v Speaker 1>which is compression. But right now I think we could

0:25:18.320 --> 0:25:21.800
<v Speaker 1>all stand a little decompression, So let's take another quick

0:25:21.800 --> 0:25:33.280
<v Speaker 1>break to thank our sponsor. All right, so now you're

0:25:33.320 --> 0:25:37.320
<v Speaker 1>ready to compress your analyzed audio. Good for you, and

0:25:37.359 --> 0:25:41.120
<v Speaker 1>by you I mean encoders. This has to be simpler

0:25:41.160 --> 0:25:44.159
<v Speaker 1>than that analysis segment, right, I mean that got a

0:25:44.160 --> 0:25:47.880
<v Speaker 1>little crazy with all the different bands and sub bands

0:25:48.040 --> 0:25:55.160
<v Speaker 1>and windows and frames and granules. Sadly it gets more complicated.

0:25:55.160 --> 0:25:58.320
<v Speaker 1>All right. So there are two layers of compression going

0:25:58.359 --> 0:26:03.040
<v Speaker 1>on with MPEG Layer three. One of those layers depends

0:26:03.119 --> 0:26:07.560
<v Speaker 1>upon the psychoacoustic analysis and the other doesn't. So why

0:26:07.560 --> 0:26:10.840
<v Speaker 1>would you use two layers with different strategies like that? Well,

0:26:10.880 --> 0:26:13.879
<v Speaker 1>the reason is that one strategy is great for complex

0:26:13.920 --> 0:26:16.679
<v Speaker 1>audio with lots of components, but not so great with

0:26:16.800 --> 0:26:19.679
<v Speaker 1>simpler sounds, and the other strategy is kind of the opposite.

0:26:20.160 --> 0:26:22.560
<v Speaker 1>So the psychoacoustic approach is the one that's really good

0:26:22.600 --> 0:26:26.520
<v Speaker 1>for complicated sounds. If you've got a lot of

0:26:26.720 --> 0:26:30.879
<v Speaker 1>volume changes, lots of different frequencies, it's just complicated and

0:26:31.000 --> 0:26:33.880
<v Speaker 1>rich sound. You've got a lot of opportunities to look

0:26:33.920 --> 0:26:37.280
<v Speaker 1>for masking and other acoustic elements that limit the actual

0:26:37.359 --> 0:26:41.200
<v Speaker 1>sounds that people perceive. So it means there are a

0:26:41.240 --> 0:26:44.800
<v Speaker 1>lot of chances for you to uh fudge by dropping

0:26:44.800 --> 0:26:49.720
<v Speaker 1>all the stuff that people probably wouldn't notice anyway. And Uh,

0:26:49.880 --> 0:26:51.439
<v Speaker 1>if you take a piece that's got a lot of

0:26:51.440 --> 0:26:54.960
<v Speaker 1>elements at varying volumes, there are likely several opportunities to

0:26:54.960 --> 0:26:58.800
<v Speaker 1>do this. But if you're talking about relatively straightforward

0:26:59.440 --> 0:27:04.359
<v Speaker 1>audio with few components, few changes in volume, there's really

0:27:04.359 --> 0:27:06.439
<v Speaker 1>not a whole lot of data you can ditch without

0:27:06.480 --> 0:27:08.960
<v Speaker 1>it actually affecting the quality of the audio in a

0:27:09.000 --> 0:27:13.280
<v Speaker 1>perceptible way. And this is part of what Brandenburg, that

0:27:13.320 --> 0:27:15.480
<v Speaker 1>guy I was talking about in our first episode in

0:27:15.520 --> 0:27:18.439
<v Speaker 1>this series, uh, that's what he discovered when he was

0:27:18.840 --> 0:27:22.000
<v Speaker 1>working with the MP three standard and he was listening

0:27:22.040 --> 0:27:26.600
<v Speaker 1>back to that Suzanne Vega a cappella track Tom's Diner. Uh,

0:27:26.720 --> 0:27:28.560
<v Speaker 1>he was listening to a compressed version of it, and

0:27:28.560 --> 0:27:31.159
<v Speaker 1>he said it was terrible. He said it ruined the

0:27:31.200 --> 0:27:34.520
<v Speaker 1>quality of the audio. And part of that is because

0:27:34.600 --> 0:27:37.919
<v Speaker 1>that particular song is fairly simple, there's just not a

0:27:37.920 --> 0:27:40.800
<v Speaker 1>lot of opportunity to take advantage of masking and other

0:27:40.920 --> 0:27:46.520
<v Speaker 1>tricks without potentially compromising the quality. So they decided to

0:27:46.560 --> 0:27:50.600
<v Speaker 1>also incorporate some traditional compression strategies which worked better

0:27:50.760 --> 0:27:53.679
<v Speaker 1>with those types of recordings. So the MP three format

0:27:53.720 --> 0:27:57.800
<v Speaker 1>takes advantage of both the traditional approach and the psychoacoustic approach,

0:27:58.520 --> 0:28:01.560
<v Speaker 1>and that allows the encoder to compress files into smaller

0:28:01.600 --> 0:28:05.720
<v Speaker 1>size without just following a single strategy, like it doesn't

0:28:05.720 --> 0:28:07.800
<v Speaker 1>have to do a one size fits all for all

0:28:07.880 --> 0:28:12.639
<v Speaker 1>elements of audio. Now, combining those two strategies requires a

0:28:12.640 --> 0:28:16.359
<v Speaker 1>little more mathematical gymnastics. So let's go back to those

0:28:16.480 --> 0:28:20.240
<v Speaker 1>five seventy six frequency bins. You know, those sub bands

0:28:20.280 --> 0:28:24.560
<v Speaker 1>we talked about earlier. You gotta quantize those suckers. What

0:28:24.600 --> 0:28:27.480
<v Speaker 1>does that mean? It means assigning a quantity to each

0:28:27.800 --> 0:28:31.639
<v Speaker 1>frequency bin. You have to give it a

0:28:31.720 --> 0:28:34.720
<v Speaker 1>quantity of some sort so that you can end up

0:28:34.840 --> 0:28:39.640
<v Speaker 1>judging how much you can get away with dropping data.

0:28:40.000 --> 0:28:42.840
<v Speaker 1>So to do this, the encoder sorts those five seventy six

0:28:42.880 --> 0:28:46.320
<v Speaker 1>bins into twenty two scale factor bands. How you doing

0:28:46.320 --> 0:28:50.680
<v Speaker 1>over there, Dylan? Just checking in on you? Okay, Dylan's

0:28:50.720 --> 0:28:53.440
<v Speaker 1>got a thousand-yard stare going. I hope

0:28:53.480 --> 0:28:55.920
<v Speaker 1>you guys are doing okay over there? All right, So

0:28:56.120 --> 0:28:58.080
<v Speaker 1>before smoke starts coming out of your ears, let me

0:28:58.120 --> 0:29:01.800
<v Speaker 1>explain what the scale factor bands are all about. The

0:29:01.840 --> 0:29:05.400
<v Speaker 1>whole purpose of the scale factor bands is to determine

0:29:05.480 --> 0:29:10.000
<v Speaker 1>how the information will be stored within the compressed state.

0:29:10.880 --> 0:29:12.840
<v Speaker 1>So you want to get away with as little data

0:29:12.920 --> 0:29:16.080
<v Speaker 1>as possible before affecting sound quality. So if you can

0:29:16.120 --> 0:29:19.800
<v Speaker 1>say the same thing in a shorter space without affecting

0:29:19.800 --> 0:29:22.640
<v Speaker 1>the quality of what it is you're saying, you go

0:29:22.720 --> 0:29:27.720
<v Speaker 1>with it. Brevity is the soul of compression. So if

0:29:27.720 --> 0:29:31.000
<v Speaker 1>we were talking about language, I would say it's more

0:29:31.000 --> 0:29:35.920
<v Speaker 1>efficient to say it's raining outside, or even just it's raining,

0:29:36.240 --> 0:29:39.320
<v Speaker 1>because you would assume that it would be outside where

0:29:39.320 --> 0:29:41.880
<v Speaker 1>the rain is happening, and it would be inefficient for

0:29:41.920 --> 0:29:44.400
<v Speaker 1>me to say it's coming down like cats and dogs

0:29:44.440 --> 0:29:48.280
<v Speaker 1>out there. It's not as efficient as saying it's raining.

0:29:49.040 --> 0:29:53.800
<v Speaker 1>So if you can get away with shorter statements without

0:29:53.880 --> 0:29:57.680
<v Speaker 1>affecting the actual quality, and you could argue that by

0:29:57.840 --> 0:30:00.360
<v Speaker 1>switching from it's coming down like cats and dogs out

0:30:00.360 --> 0:30:03.920
<v Speaker 1>there to it's raining changes the quality, and that could

0:30:03.920 --> 0:30:05.680
<v Speaker 1>be a valid argument. But if you can get away

0:30:06.120 --> 0:30:10.440
<v Speaker 1>with shorter without affecting quality, you do it. So each

0:30:10.480 --> 0:30:15.000
<v Speaker 1>scale factor band is represented by a quantity. Then the

0:30:15.080 --> 0:30:19.480
<v Speaker 1>encoder divides that quantity by a given number called the quantizer,

0:30:19.840 --> 0:30:23.520
<v Speaker 1>which is the same across the entire frequency spectrum for

0:30:23.600 --> 0:30:28.080
<v Speaker 1>that recording. The resulting number is then rounded up or

0:30:28.200 --> 0:30:33.320
<v Speaker 1>down to a whole digit. And here's an important point.

0:30:33.720 --> 0:30:37.200
<v Speaker 1>Individual scale factor bands can be scaled up or down

0:30:37.320 --> 0:30:41.320
<v Speaker 1>for more or less precision to represent the actual value

0:30:41.480 --> 0:30:45.480
<v Speaker 1>of those bands. So what the heck does all that mean? Well,

0:30:45.560 --> 0:30:48.120
<v Speaker 1>the purpose of dividing and rounding is just to simplify

0:30:48.160 --> 0:30:50.880
<v Speaker 1>the data to reduce the amount you need in order

0:30:50.920 --> 0:30:53.680
<v Speaker 1>to store the information. So let's go with a totally

0:30:53.760 --> 0:30:57.560
<v Speaker 1>hypothetical example. Let's say you've got a scale factor band

0:30:58.360 --> 0:31:01.240
<v Speaker 1>and you've decided you're representing that scale factor

0:31:01.320 --> 0:31:05.280
<v Speaker 1>band with the quantity seven eight four zero, seven thousand,

0:31:05.360 --> 0:31:08.880
<v Speaker 1>eight hundred forty, and you've chosen the number one hundred

0:31:08.920 --> 0:31:12.480
<v Speaker 1>to quantize your data, meaning that you will divide each

0:31:13.400 --> 0:31:18.160
<v Speaker 1>uh, scale factor band's quantity by one hundred. So this

0:31:18.200 --> 0:31:20.560
<v Speaker 1>is seven thousand, eight hundred forty. You divide it by

0:31:20.680 --> 0:31:24.440
<v Speaker 1>one hundred, uh, and the scale factor for this particular

0:31:24.480 --> 0:31:28.080
<v Speaker 1>band you have determined is one point zero. That means

0:31:28.160 --> 0:31:31.360
<v Speaker 1>that once you get that result where you've divided the

0:31:31.440 --> 0:31:34.560
<v Speaker 1>quantity by the quantizer, you multiply by one. That means

0:31:34.560 --> 0:31:36.880
<v Speaker 1>there's no change. You multiply by one you get the

0:31:36.960 --> 0:31:40.080
<v Speaker 1>same number. More on that in a bit. Okay, so

0:31:40.120 --> 0:31:42.680
<v Speaker 1>you take that seven thousand, eight hundred forty, you divide it

0:31:42.720 --> 0:31:46.520
<v Speaker 1>by one hundred. That gives you seventy eight point four. Well,

0:31:46.600 --> 0:31:48.680
<v Speaker 1>now you have to round that number, so you round

0:31:48.680 --> 0:31:51.520
<v Speaker 1>it down to seventy eight. Now, when you have a

0:31:51.560 --> 0:31:54.240
<v Speaker 1>decoder and you're ready to play back the information, it

0:31:54.320 --> 0:31:59.040
<v Speaker 1>comes across this quantity, seventy eight, and it knows what

0:31:59.200 --> 0:32:02.760
<v Speaker 1>the quantizer number was, so it multiplies by one hundred

0:32:02.800 --> 0:32:05.720
<v Speaker 1>to get back to seven thousand, eight hundred. So the

0:32:05.800 --> 0:32:09.720
<v Speaker 1>replicated number is actually forty off from the original number.

0:32:09.760 --> 0:32:12.800
<v Speaker 1>The original number again was seven thousand, eight hundred forty.

0:32:13.080 --> 0:32:16.560
<v Speaker 1>The replicated number is seven thousand, eight hundred. Now those

0:32:16.600 --> 0:32:21.920
<v Speaker 1>inconsistencies manifest as noise in the actual playback. So if

0:32:21.920 --> 0:32:24.840
<v Speaker 1>you wanted to increase the precision of any given scale

0:32:24.840 --> 0:32:27.200
<v Speaker 1>factor band, you could do so by changing the scale

0:32:27.200 --> 0:32:30.080
<v Speaker 1>factor number. So in that example, just now, I said

0:32:30.120 --> 0:32:32.680
<v Speaker 1>the number was one point zero, meaning there's no change

0:32:32.840 --> 0:32:36.160
<v Speaker 1>to that result. But I could have said it was ten,

0:32:36.640 --> 0:32:39.280
<v Speaker 1>which means we would multiply the quantized number by ten.

0:32:39.640 --> 0:32:41.720
<v Speaker 1>So we would take that seven thousand, eight hundred forty

0:32:41.840 --> 0:32:44.040
<v Speaker 1>divided by one hundred, you get seventy eight point four,

0:32:44.520 --> 0:32:48.120
<v Speaker 1>then multiplied by ten to get seven eight four. So

0:32:48.880 --> 0:32:52.160
<v Speaker 1>when the decoder decompresses the file, it would reverse this

0:32:52.160 --> 0:32:55.400
<v Speaker 1>this whole thing. It would just multiply by a hundred um.

0:32:55.440 --> 0:32:57.720
<v Speaker 1>You would end up getting seven thousand, eight hundred forty again,

0:32:57.800 --> 0:33:00.680
<v Speaker 1>which means that you wouldn't introduce any noise to the file.

0:33:00.720 --> 0:33:04.040
<v Speaker 1>You would have a perfect representation. But in some cases

0:33:04.040 --> 0:33:07.560
<v Speaker 1>the encoder may determine that any noise that you generate

0:33:07.880 --> 0:33:11.000
<v Speaker 1>wouldn't be noticed or it wouldn't impact the quality of

0:33:11.000 --> 0:33:13.520
<v Speaker 1>the audio enough for it to be a problem because

0:33:13.520 --> 0:33:16.440
<v Speaker 1>of other factors for that particular scale factor band, like

0:33:16.520 --> 0:33:20.000
<v Speaker 1>maybe it's really quiet, or maybe it's really complex. So

0:33:20.040 --> 0:33:22.920
<v Speaker 1>in those cases, you could reduce the scale factor number

0:33:23.320 --> 0:33:26.120
<v Speaker 1>by making it something else, like point one instead of

0:33:26.160 --> 0:33:28.720
<v Speaker 1>one point oh. So that means you would multiply the

0:33:28.800 --> 0:33:32.400
<v Speaker 1>quantized number by point one, So the seventy eight point

0:33:32.440 --> 0:33:35.240
<v Speaker 1>four would become seven point eight four, and then you

0:33:35.280 --> 0:33:37.320
<v Speaker 1>have to round it to get a whole integer, so

0:33:37.360 --> 0:33:41.320
<v Speaker 1>you get eight seven point eight four rounds up to eight. Now,

0:33:41.320 --> 0:33:44.880
<v Speaker 1>when a decoder decompresses the audio and multiplies eight

0:33:44.920 --> 0:33:48.200
<v Speaker 1>by one hundred, that quantizer that we've talked about so much.

0:33:49.120 --> 0:33:51.200
<v Speaker 1>Uh and uh. Actually at this point it would have

0:33:51.200 --> 0:33:53.680
<v Speaker 1>to be eight thousand because it's also taking into account

0:33:53.680 --> 0:33:57.520
<v Speaker 1>the scale factor, so it's multiplying it by a thousand,

0:33:57.520 --> 0:34:01.760
<v Speaker 1>not just a hundred. So you would get a number

0:34:01.800 --> 0:34:04.440
<v Speaker 1>that would pop up to eight thousand. And remember the

0:34:04.440 --> 0:34:06.800
<v Speaker 1>original was seven thousand, eight hundred forty. So you look

0:34:06.800 --> 0:34:09.640
<v Speaker 1>at the difference between these two, the original seven thousand, eight hundred forty,

0:34:09.719 --> 0:34:12.240
<v Speaker 1>the new number is eight thousand. There's a pretty

0:34:12.239 --> 0:34:15.040
<v Speaker 1>big difference there. That change might introduce enough noise for

0:34:15.040 --> 0:34:17.240
<v Speaker 1>it to be a problem. So how does the encoder

0:34:17.280 --> 0:34:20.400
<v Speaker 1>determine if a scale factor band is meeting the proper criteria?
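
NOTE
Editorial sketch of the divide, scale, and round walk-through above, using the episode's own hypothetical numbers (7840, a quantizer of 100, and scale factors of 1.0, 10 and 0.1). The encode and decode function names are made up for illustration, and real encoders use a more involved formula than this simplified arithmetic.
```python
def encode(value, quantizer, scale_factor):
    """Divide by the quantizer, apply the band's scale factor, round to a whole number."""
    return round(value / quantizer * scale_factor)
def decode(stored, quantizer, scale_factor):
    """Reverse the steps: undo the scale factor, then multiply by the quantizer."""
    return stored / scale_factor * quantizer
ORIGINAL = 7840
QUANTIZER = 100
# Stored values come out as 78, 784 and 8; restored values as 7800, 7840 and 8000.
for scale_factor in (1.0, 10.0, 0.1):
    stored = encode(ORIGINAL, QUANTIZER, scale_factor)
    restored = round(decode(stored, QUANTIZER, scale_factor))
    print(scale_factor, stored, restored, abs(restored - ORIGINAL))
```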

0:34:20.440 --> 0:34:25.319
<v Speaker 1>How can it tell if there is uh too much

0:34:25.400 --> 0:34:28.799
<v Speaker 1>noise or if the noise falls below the threshold. Well,

0:34:28.840 --> 0:34:32.360
<v Speaker 1>it goes through what's called a Huffman coding process.

0:34:32.440 --> 0:34:37.160
<v Speaker 1>At this point, Dylan is currently just staring at the

0:34:37.160 --> 0:34:41.480
<v Speaker 1>wall and drool is coming out. Huffman coding process. It

0:34:41.520 --> 0:34:45.160
<v Speaker 1>converts scale factor bands into binary strings, and the process

0:34:45.200 --> 0:34:47.160
<v Speaker 1>goes through a series of tables to determine if the

0:34:47.239 --> 0:34:50.160
<v Speaker 1>data within the scale factor band requires more or less

0:34:50.200 --> 0:34:53.200
<v Speaker 1>precision to describe the sound without affecting the audio quality.
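
NOTE
Editorial sketch of textbook Huffman coding, where frequent values get shorter bit strings. It builds a code from the data itself; MP3 encoders pick from predefined tables instead, so treat this as an illustration of the idea only. The function name and the sample values are hypothetical.
```python
import heapq
from collections import Counter
def huffman_codes(symbols):
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    counts = Counter(symbols)
    # Each heap entry is (total count, tie-breaker, {symbol: code so far}).
    heap = [(count, index, {symbol: ""}) for index, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        count_a, _, codes_a = heapq.heappop(heap)
        count_b, _, codes_b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes_a.items()}
        merged.update({s: "1" + c for s, c in codes_b.items()})
        heapq.heappush(heap, (count_a + count_b, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]
# Made-up quantized values: mostly zeros, a few small numbers, one larger one.
quantized = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 8]
codes = huffman_codes(quantized)
encoded = "".join(codes[value] for value in quantized)
print(codes)
print(len(encoded), "bits")  # 21 bits
```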

0:34:54.320 --> 0:34:56.719
<v Speaker 1>So Huffman coding is a process where you start

0:34:56.760 --> 0:34:58.880
<v Speaker 1>with a large number of possibilities and you begin to

0:34:58.960 --> 0:35:01.880
<v Speaker 1>narrow it down. Uh. Some people describe it as the

0:35:01.920 --> 0:35:05.719
<v Speaker 1>coding equivalent of twenty questions. So you ask your first

0:35:05.800 --> 0:35:08.960
<v Speaker 1>question like animal, vegetable, or mineral. You get an answer

0:35:09.080 --> 0:35:12.640
<v Speaker 1>say, animal. Well, that first answer eliminates a ton of

0:35:12.680 --> 0:35:16.400
<v Speaker 1>other possibilities and narrows the focus, like anything that doesn't

0:35:16.400 --> 0:35:20.120
<v Speaker 1>pertain to animal, you can automatically discount because you already

0:35:20.160 --> 0:35:25.280
<v Speaker 1>know it can't apply to that answer. With MP three compression,

0:35:25.320 --> 0:35:28.319
<v Speaker 1>this means making certain the number of bits representing a

0:35:28.360 --> 0:35:33.160
<v Speaker 1>granule, because remember I mentioned that in the MP three format

0:35:33.280 --> 0:35:36.400
<v Speaker 1>you have frames, and each frame has a thousand,

0:35:36.400 --> 0:35:40.000
<v Speaker 1>one hundred fifty two samples and consists of two granules

0:35:40.000 --> 0:35:43.840
<v Speaker 1>with five seventy six samples each. So when you answer the first question,

0:35:43.960 --> 0:35:46.640
<v Speaker 1>it eliminates a lot of other possibilities and narrows the focus.

0:35:46.640 --> 0:35:49.800
<v Speaker 1>So like with animal, vegetable, mineral, if I say animal,

0:35:49.920 --> 0:35:52.840
<v Speaker 1>you're gonna not ask any questions that have to do

0:35:52.880 --> 0:35:56.480
<v Speaker 1>with minerals or vegetables only because it wouldn't make sense.

0:35:57.239 --> 0:35:59.400
<v Speaker 1>You know, those aren't gonna apply. Same thing with m

0:35:59.440 --> 0:36:02.160
<v Speaker 1>P three's, except this time it means making certain the

0:36:02.239 --> 0:36:05.799
<v Speaker 1>number of bits representing a granule. Remember, there are two granules

0:36:05.800 --> 0:36:09.680
<v Speaker 1>per frame with the MP three layer, Uh, you want

0:36:09.680 --> 0:36:12.759
<v Speaker 1>to make sure that the number of bits representing that

0:36:12.800 --> 0:36:16.319
<v Speaker 1>granule matches the chosen bit rate for compression. So

0:36:16.360 --> 0:36:18.640
<v Speaker 1>if after going through this process, the encoder says, hey,

0:36:18.640 --> 0:36:21.839
<v Speaker 1>this granule has more bits than what's allowed. It's too

0:36:21.840 --> 0:36:24.680
<v Speaker 1>many bits. We gotta get rid of some of these,

0:36:24.840 --> 0:36:27.200
<v Speaker 1>the encoder can adjust the scale factor band so that

0:36:27.239 --> 0:36:31.560
<v Speaker 1>there's less precision, meaning that multiplier, in other words, that

0:36:32.120 --> 0:36:35.480
<v Speaker 1>bit I talked about earlier, and thus reduce the amount

0:36:35.480 --> 0:36:40.120
<v Speaker 1>of data needed to represent that particular granule. If a

0:36:40.160 --> 0:36:44.120
<v Speaker 1>granule comes in under the bit rate, the encoder can

0:36:44.160 --> 0:36:48.320
<v Speaker 1>increase the precision to reduce noise and fill that granule

0:36:48.440 --> 0:36:55.040
<v Speaker 1>out properly so that it matches the actual threshold. After all this,

0:36:55.160 --> 0:36:58.360
<v Speaker 1>the pairs of granules become frames within the MP three file,

0:36:58.360 --> 0:37:01.280
<v Speaker 1>and the only other component in an MP three file apart

0:37:01.320 --> 0:37:04.719
<v Speaker 1>from these frames is the I D three metadata. And

0:37:04.719 --> 0:37:06.799
<v Speaker 1>this is pretty simple. This is like a header and

0:37:06.840 --> 0:37:09.080
<v Speaker 1>it comes before all the frames in the audio file

0:37:09.160 --> 0:37:13.000
<v Speaker 1>and contains information about the file itself, which can

0:37:13.000 --> 0:37:15.719
<v Speaker 1>include stuff like the title of a song, an artist name,

0:37:15.840 --> 0:37:19.640
<v Speaker 1>an album title, other stuff like that. It can also

0:37:19.680 --> 0:37:23.080
<v Speaker 1>include copyright information as well as information about the file itself,

0:37:23.160 --> 0:37:25.440
<v Speaker 1>such as whether or not it's a stereo recording or a

0:37:25.440 --> 0:37:29.279
<v Speaker 1>mono recording. So when you use a decoder like an

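NOTE
A small sketch of reading the header that sits in front of the audio frames, assuming the
tag is ID3v2 (the older ID3v1 tag lives at the end of the file instead). The function name
parse_id3v2_header and the sample bytes are made up for illustration. As far as I know,
the stereo or mono flag is carried in each audio frame's own header rather than in the
ID3 tag, so this sketch does not look for it.
```python
def parse_id3v2_header(data):
    """Return (major, revision, flags, tag_size), or None if no ID3v2 tag is found."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    size_bytes = data[6:10]
    if any(b & 0x80 for b in size_bytes):
        return None  # size bytes are "synchsafe": the top bit is always zero
    tag_size = 0
    for b in size_bytes:
        tag_size = (tag_size << 7) | b  # 7 useful bits per byte
    return major, revision, flags, tag_size
# Hand-built example header: version 2.3, no flags, 257-byte tag body.
header = b"ID3" + bytes([3, 0, 0]) + bytes([0, 0, 2, 1])
print(parse_id3v2_header(header))
```
The size counts the tag body only, so the first audio frame would start 10 bytes plus
tag_size into the file in this simple case.
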
0:37:29.360 --> 0:37:34.720
<v Speaker 1>MP three player, it takes this compressed information, these

0:37:34.719 --> 0:37:40.960
<v Speaker 1>representations that the music has been reduced to, and

0:37:41.040 --> 0:37:44.520
<v Speaker 1>it converts that Huffman data back into the quantized format,

0:37:45.080 --> 0:37:47.759
<v Speaker 1>scales the data back up to its original size or

0:37:47.800 --> 0:37:53.640
<v Speaker 1>a close approximation. Remember, the uncompressed version may actually be

0:37:53.719 --> 0:37:58.280
<v Speaker 1>off by a significant amount depending upon each individual granule.

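NOTE
A simplified sketch of the "scale it back up" step and why the result is only an
approximation. MP3 requantization uses a 4/3 power law; the real formula also folds in a
global gain and per-band scale factors, which are collapsed here into one hypothetical
gain parameter. The function names and sample values are assumptions for illustration.
```python
import math
def quantize(x, gain):
    """Simplified encoder side: scale down, apply a 3/4 power law, round."""
    scaled = abs(x) / 2 ** (gain / 4)
    return int(math.copysign(round(scaled ** 0.75), x))
def dequantize(q, gain):
    """Decoder side: undo the power law with 4/3 and scale back up."""
    return math.copysign(abs(q) ** (4 / 3) * 2 ** (gain / 4), q)
original = [812.0, -455.5, 120.25, -33.0, 9.5, -2.0]
gain = 8  # hypothetical step parameter; larger means coarser
restored = [dequantize(quantize(x, gain), gain) for x in original]
for x, y in zip(original, restored):
    print(f"{x:9.2f} became {y:9.2f} (error {abs(x - y):.2f})")
```
Rounding in quantize throws information away, so dequantize can only land near the
original value, and a larger gain widens that gap.
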
0:37:58.840 --> 0:38:01.080
<v Speaker 1>And all of that data gets combined into a new

0:38:01.160 --> 0:38:04.200
<v Speaker 1>PCM sample that can be played back to you. And

0:38:04.320 --> 0:38:07.120
<v Speaker 1>that's all there is to it. Nothing could be easier.

0:38:08.320 --> 0:38:11.920
<v Speaker 1>All right. That took a lot out of me. So

0:38:11.960 --> 0:38:14.320
<v Speaker 1>I got really technical, and I apologize if I lost

0:38:14.360 --> 0:38:16.600
<v Speaker 1>any of you out there, or for those of you

0:38:16.600 --> 0:38:19.160
<v Speaker 1>who have a lot of experience working on compression algorithms,

0:38:19.160 --> 0:38:23.040
<v Speaker 1>for oversimplifying in several cases. But now we've got a

0:38:23.040 --> 0:38:25.520
<v Speaker 1>full episode about this, and I hope you have a

0:38:25.520 --> 0:38:28.640
<v Speaker 1>better understanding of how a big sound file can be

0:38:28.719 --> 0:38:32.880
<v Speaker 1>reduced to a smaller sound file. Next time, I'll just

0:38:32.920 --> 0:38:36.160
<v Speaker 1>say magic. It will make everyone happier. If you guys

0:38:36.200 --> 0:38:39.320
<v Speaker 1>have any questions for me, or comments or suggestions, anything

0:38:39.360 --> 0:38:42.480
<v Speaker 1>like that, send me a message. My email is tech

0:38:42.520 --> 0:38:45.520
<v Speaker 1>Stuff at how stuff works dot com, or you can

0:38:45.560 --> 0:38:48.120
<v Speaker 1>drop me a line on Facebook or Twitter. The handle at

0:38:48.239 --> 0:38:51.359
<v Speaker 1>both of those is tech Stuff H S W. And

0:38:51.400 --> 0:38:59.919
<v Speaker 1>I'll talk to you guys again really soon. For more

0:39:00.000 --> 0:39:02.279
<v Speaker 1>on this and thousands of other topics, visit how

0:39:02.320 --> 0:39:08.640
<v Speaker 1>stuff works dot com, wh