WEBVTT - How MP3 Compression Works

0:00:04.160 --> 0:00:07.160
<v Speaker 1>Get in touch with technology with TechStuff from

0:00:07.240 --> 0:00:14.040
<v Speaker 1>HowStuffWorks.com. Hey there, and welcome to TechStuff.

0:00:14.080 --> 0:00:17.520
<v Speaker 1>I'm your host, Jonathan Strickland. And in a recent episode

0:00:17.560 --> 0:00:20.560
<v Speaker 1>I explored how digital audio works and gave kind of

0:00:20.560 --> 0:00:24.639
<v Speaker 1>a brief history on the MP3 file format. I

0:00:24.760 --> 0:00:27.680
<v Speaker 1>warned you back then that that was part one of

0:00:27.720 --> 0:00:30.760
<v Speaker 1>a three part series, and today we're gonna explore part two.

0:00:31.440 --> 0:00:34.599
<v Speaker 1>So I hadn't forgotten about it. We're back to it, uh,

0:00:34.640 --> 0:00:36.440
<v Speaker 1>And today we're gonna do a deeper dive with

0:00:36.479 --> 0:00:39.959
<v Speaker 1>MP3s: how do they compress audio? And how

0:00:39.960 --> 0:00:42.239
<v Speaker 1>can you take a file filled with information and make

0:00:42.280 --> 0:00:44.920
<v Speaker 1>it a smaller size? What do you have to give

0:00:45.080 --> 0:00:48.159
<v Speaker 1>up in order to make files smaller? And today we're

0:00:48.159 --> 0:00:51.280
<v Speaker 1>gonna try and unravel the technical mystery behind the

0:00:51.400 --> 0:00:54.760
<v Speaker 1>MP3. And I am not going to lie to you people.

0:00:55.720 --> 0:01:01.240
<v Speaker 1>This is gonna get a bit, you know, mathy.

0:01:01.440 --> 0:01:04.759
<v Speaker 1>And I was an English major, so you mathematicians out there,

0:01:04.760 --> 0:01:07.400
<v Speaker 1>get ready with your corrections because I'm probably gonna make

0:01:07.440 --> 0:01:10.760
<v Speaker 1>some overgeneralizations for the purposes of my own sanity.

0:01:11.280 --> 0:01:14.160
<v Speaker 1>There does get to a point where to really get

0:01:14.200 --> 0:01:19.000
<v Speaker 1>into the technical details, it would likely be uh impossible

0:01:19.040 --> 0:01:21.080
<v Speaker 1>for me to describe it in a way that would

0:01:21.080 --> 0:01:25.880
<v Speaker 1>make sense and be accurate. Um, and I have given

0:01:26.120 --> 0:01:30.399
<v Speaker 1>my producer Dylan the mandate that, should I get too

0:01:31.120 --> 0:01:36.200
<v Speaker 1>cryptic and incomprehensible with my explanation, that he is to

0:01:36.240 --> 0:01:40.200
<v Speaker 1>intervene in a way that he sees fit. Just not

0:01:40.240 --> 0:01:44.120
<v Speaker 1>in the face, Dylan. It's not in the face. It's moneymaker, man.

0:01:44.240 --> 0:01:47.120
<v Speaker 1>I gotta gotta take care of it. So let's remember

0:01:47.160 --> 0:01:52.320
<v Speaker 1>that the heart of digital information is the bit that's

0:01:52.320 --> 0:01:56.440
<v Speaker 1>either a zero or a one. The basic unit of

0:01:56.960 --> 0:02:01.920
<v Speaker 1>information for digital formats: zeros and ones. Now we can

0:02:02.000 --> 0:02:05.160
<v Speaker 1>use those zeros and ones to describe all sorts of information,

0:02:05.800 --> 0:02:09.280
<v Speaker 1>from text to audio, to video and really pretty much

0:02:09.280 --> 0:02:12.240
<v Speaker 1>anything you can think of that's represented digitally. Ultimately, when

0:02:12.240 --> 0:02:14.000
<v Speaker 1>you get down to it, it's a bunch of zeros

0:02:14.040 --> 0:02:17.000
<v Speaker 1>and ones. So let's say you start off with your

0:02:17.080 --> 0:02:21.520
<v Speaker 1>uncompressed audio file. You've got this enormous audio file in

0:02:21.520 --> 0:02:23.560
<v Speaker 1>front of you. It's made up of zeros and ones.

0:02:24.080 --> 0:02:26.840
<v Speaker 1>How do you make that file smaller? So in the

0:02:26.840 --> 0:02:29.560
<v Speaker 1>real world, we can compress stuff, right, we can apply

0:02:29.800 --> 0:02:33.760
<v Speaker 1>physical pressure to things. Think about packing a suitcase. You

0:02:33.760 --> 0:02:36.240
<v Speaker 1>can make sure you get that extra outfit in if

0:02:36.280 --> 0:02:38.600
<v Speaker 1>you just press it down hard enough and get that

0:02:38.680 --> 0:02:42.240
<v Speaker 1>zipper zipped before it can burst open. But once you

0:02:42.280 --> 0:02:44.920
<v Speaker 1>get to a certain level of compression, you cannot make

0:02:45.080 --> 0:02:48.600
<v Speaker 1>things smaller, at least not without hurting yourself or whatever

0:02:48.639 --> 0:02:51.720
<v Speaker 1>it is you're trying to compress. Digital files are a

0:02:51.720 --> 0:02:55.400
<v Speaker 1>little different because you cannot physically cram the zeros and

0:02:55.520 --> 0:02:58.120
<v Speaker 1>ones closer together. It doesn't work like that. These are

0:02:58.240 --> 0:03:02.600
<v Speaker 1>abstract things. You can't make them smaller, right. You can't

0:03:02.720 --> 0:03:06.000
<v Speaker 1>decrease the font. It doesn't work that way. The numbers

0:03:06.040 --> 0:03:09.240
<v Speaker 1>represent two different states. So if you want to create

0:03:09.240 --> 0:03:12.840
<v Speaker 1>a smaller audio file containing the recording that was in

0:03:12.880 --> 0:03:17.680
<v Speaker 1>a larger audio file, you have to start getting creative. Now,

0:03:17.720 --> 0:03:20.120
<v Speaker 1>in the last part of this series, I talked about

0:03:20.160 --> 0:03:22.920
<v Speaker 1>how the MP3 compression algorithm was born from an

0:03:22.960 --> 0:03:26.600
<v Speaker 1>applied research institution in Germany and the team behind the

0:03:26.720 --> 0:03:29.040
<v Speaker 1>MP3 wanted to find a way to compress audio,

0:03:29.160 --> 0:03:34.800
<v Speaker 1>specifically music, for transmission over phone lines. Eventually, this evolved

0:03:34.840 --> 0:03:39.480
<v Speaker 1>into the Moving Picture Experts Group Audio Layer 3 compression methodology,

0:03:39.680 --> 0:03:44.560
<v Speaker 1>better known as the MP3, and there's also

0:03:44.640 --> 0:03:47.360
<v Speaker 1>MPEG-2 and MPEG-4 standards. MPEG-2, by the way,

0:03:47.400 --> 0:03:50.320
<v Speaker 1>is the basis of compression on DVDs, although the actual

0:03:50.440 --> 0:03:54.720
<v Speaker 1>DVD format is really a modification of MPEG-2, and

0:03:54.840 --> 0:03:57.360
<v Speaker 1>MPEG-4 is a compression strategy for audio and video

0:03:57.400 --> 0:04:00.840
<v Speaker 1>that's frequently used in lots of different capacities, including

0:04:00.880 --> 0:04:05.160
<v Speaker 1>streaming media services. So by the late nineteen seventies, researchers

0:04:05.200 --> 0:04:08.720
<v Speaker 1>began to explore the possibility of leveraging psychoacoustics to

0:04:08.760 --> 0:04:12.960
<v Speaker 1>figure out how to compress audio. And psychoacoustics refers to

0:04:13.200 --> 0:04:17.120
<v Speaker 1>the way we perceive sound, and also the

0:04:17.120 --> 0:04:21.360
<v Speaker 1>physiological effects of sound on us. So this involves not

0:04:21.480 --> 0:04:24.640
<v Speaker 1>just our physical sense of hearing, but also our

0:04:24.680 --> 0:04:28.400
<v Speaker 1>brains and the way our brains interpret sound. So, for example,

0:04:28.720 --> 0:04:32.480
<v Speaker 1>there's a psychoacoustic phenomenon that's called the Haas effect, H

0:04:32.640 --> 0:04:35.560
<v Speaker 1>A A S. And I think it's pretty interesting. So

0:04:35.760 --> 0:04:38.200
<v Speaker 1>here's how the Haas effect works. If you hear the

0:04:38.279 --> 0:04:43.280
<v Speaker 1>exact same sound coming from different directions, but the two

0:04:43.279 --> 0:04:46.640
<v Speaker 1>sounds arrive within thirty to forty milliseconds of each other,

0:04:47.040 --> 0:04:50.000
<v Speaker 1>your brain will be convinced that you really only heard

0:04:50.040 --> 0:04:53.440
<v Speaker 1>one sound and it came from the direction that hit

0:04:53.520 --> 0:04:57.200
<v Speaker 1>you first. So let's say a sound's coming from directly

0:04:57.240 --> 0:04:59.680
<v Speaker 1>in front of you and to your left, and you

0:05:00.080 --> 0:05:03.480
<v Speaker 1>get both of them within that thirty to forty millisecond range,

0:05:04.279 --> 0:05:06.440
<v Speaker 1>and you hear the one coming from ahead of you

0:05:06.520 --> 0:05:10.039
<v Speaker 1>first, you're convinced that you only heard that

0:05:10.120 --> 0:05:13.080
<v Speaker 1>sound once and it came from dead on straight ahead

0:05:13.080 --> 0:05:16.680
<v Speaker 1>of you. Your brain kind of discounts the one that

0:05:16.760 --> 0:05:20.159
<v Speaker 1>came off from the left, although it can reinforce it,

0:05:20.279 --> 0:05:22.520
<v Speaker 1>which ends up being really useful if you're planning out

0:05:22.520 --> 0:05:25.279
<v Speaker 1>PA systems for stage shows. I'm not joking. That

0:05:25.320 --> 0:05:28.080
<v Speaker 1>really is the way that people plan those things out.

0:05:28.360 --> 0:05:31.080
<v Speaker 1>It's pretty neat. Humans perceive sounds in a way that's

0:05:31.080 --> 0:05:35.200
<v Speaker 1>not necessarily representational of all the sounds surrounding us. You

0:05:35.200 --> 0:05:38.600
<v Speaker 1>can think of your brain as the filter between your

0:05:38.720 --> 0:05:42.679
<v Speaker 1>understanding and what reality actually is. A lot of stuff

0:05:42.720 --> 0:05:45.599
<v Speaker 1>goes on that it ends up getting rid of information

0:05:45.640 --> 0:05:48.040
<v Speaker 1>that your brain just says, you know what, he or

0:05:48.080 --> 0:05:52.040
<v Speaker 1>she doesn't need that, it's just gonna confuse things. We're

0:05:52.040 --> 0:05:55.400
<v Speaker 1>gonna dump it. And that's kind of how it works.

0:05:55.440 --> 0:05:57.599
<v Speaker 1>It's all on an unconscious level. It's not like you're

0:05:57.800 --> 0:06:01.919
<v Speaker 1>actively working to do this. So let's say you're in

0:06:01.920 --> 0:06:04.320
<v Speaker 1>a relatively busy hallway, and there could be a lot

0:06:04.360 --> 0:06:07.800
<v Speaker 1>of sounds in that hallway, stuff that's going on constantly

0:06:07.800 --> 0:06:11.000
<v Speaker 1>around you. Maybe there are doors opening and closing. Maybe

0:06:11.040 --> 0:06:13.960
<v Speaker 1>there are footsteps going up and down the hallway. Maybe someone's

0:06:14.080 --> 0:06:17.719
<v Speaker 1>shoes are squeaking against the linoleum floor. People are chattering

0:06:17.760 --> 0:06:20.839
<v Speaker 1>away in there. But you are having a conversation with someone,

0:06:21.240 --> 0:06:23.960
<v Speaker 1>so you turn your focus on that person and other

0:06:24.040 --> 0:06:28.200
<v Speaker 1>sounds seemingly fade away. They're still present, but they're not important.

0:06:28.800 --> 0:06:31.520
<v Speaker 1>So in this example, you would actually call those other

0:06:31.600 --> 0:06:35.479
<v Speaker 1>sounds a distraction, and you would really focus on the conversation. Uh,

0:06:35.520 --> 0:06:40.000
<v Speaker 1>That also shows how we're able to consciously direct our

0:06:40.080 --> 0:06:43.719
<v Speaker 1>sense, our perception of hearing. So both of these factors

0:06:43.760 --> 0:06:47.120
<v Speaker 1>come into play. Now, one thing that MP3 encoding

0:06:47.160 --> 0:06:51.080
<v Speaker 1>takes advantage of is something called masking, and there are

0:06:51.080 --> 0:06:54.120
<v Speaker 1>a couple of different variations of the masking effect. One

0:06:54.160 --> 0:06:57.520
<v Speaker 1>of them is called frequency masking. So let's say you've

0:06:57.560 --> 0:07:00.480
<v Speaker 1>got two sound frequencies that are similar. Perhaps they're just

0:07:00.520 --> 0:07:04.200
<v Speaker 1>a few hertz apart. Remember, frequencies are measured in hertz,

0:07:04.680 --> 0:07:08.520
<v Speaker 1>which is really the number of oscillations per second. So

0:07:08.640 --> 0:07:14.040
<v Speaker 1>let's say you've got a sound that's at I don't know, uh,

0:07:14.360 --> 0:07:19.360
<v Speaker 1>one thousand kilohertz, and another one that's at one

0:07:19.480 --> 0:07:23.560
<v Speaker 1>thousand and ten kilohertz. Now, the human ear is

0:07:23.600 --> 0:07:26.920
<v Speaker 1>precise enough to be able to tell the difference of

0:07:27.040 --> 0:07:29.840
<v Speaker 1>two sounds that are at least two hertz apart from

0:07:29.840 --> 0:07:33.360
<v Speaker 1>each other. That's how precise our resolution of hearing is;

0:07:33.440 --> 0:07:36.760
<v Speaker 1>it's at that level. But if you get two sounds

0:07:36.840 --> 0:07:40.520
<v Speaker 1>played at the same time and they are that close

0:07:40.560 --> 0:07:44.080
<v Speaker 1>together in frequency, and one of those frequencies is played

0:07:44.120 --> 0:07:47.280
<v Speaker 1>at a greater volume than the other, our brains will

0:07:47.280 --> 0:07:50.160
<v Speaker 1>pick up on the louder sound and ignore the quieter sound,

0:07:50.240 --> 0:07:53.880
<v Speaker 1>even though both of them are present. What becomes important

0:07:53.880 --> 0:07:56.520
<v Speaker 1>at that point is the amplitude. Now, the further apart

0:07:56.560 --> 0:08:00.400
<v Speaker 1>in frequencies you get, the less that has an effect.

0:08:00.480 --> 0:08:02.360
<v Speaker 1>So if you get far enough apart where they are

0:08:02.360 --> 0:08:05.680
<v Speaker 1>two pitches, one of them noticeably louder than the other,

0:08:06.040 --> 0:08:08.320
<v Speaker 1>but they're far enough apart, you will hear both of them.
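
NOTE
The frequency-masking idea in this passage can be sketched as a toy rule. It is only an illustration, not the psychoacoustic model an MP3 encoder uses: the function name is_masked and both thresholds (max_gap_hz, min_level_gap_db) are made-up assumptions, and as the host says, there is no universal formula for frequency masking.
```python
def is_masked(loud_hz, loud_db, quiet_hz, quiet_db, max_gap_hz=50.0, min_level_gap_db=10.0):
    """Toy rule: the quieter tone counts as masked when it is close in frequency to a sufficiently louder tone. Both thresholds are hypothetical."""
    close_in_frequency = abs(loud_hz - quiet_hz) <= max_gap_hz
    much_quieter = (loud_db - quiet_db) >= min_level_gap_db
    return close_in_frequency and much_quieter
# Two tones 10 Hz apart, one clearly louder: the quiet one is dropped in this toy model.
print(is_masked(1000.0, 70.0, 1010.0, 50.0))
# Same loudness gap, but the tones are far apart in pitch: both would be kept.
print(is_masked(1000.0, 70.0, 4000.0, 50.0))
```
Widening the gap in frequency or shrinking the gap in loudness flips the answer, which is the trade-off the host describes.
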

0:08:08.360 --> 0:08:11.560
<v Speaker 1>It only works if the two pitches are relatively close together,

0:08:12.680 --> 0:08:15.560
<v Speaker 1>and there's not a universal formula for frequency masking. As

0:08:15.560 --> 0:08:18.520
<v Speaker 1>you get closer to the boundaries of human hearing, frequency

0:08:18.560 --> 0:08:20.920
<v Speaker 1>masking becomes easier. So if it's a really low pitch

0:08:21.000 --> 0:08:23.600
<v Speaker 1>or a really high pitch, it's easier to get away

0:08:23.600 --> 0:08:26.400
<v Speaker 1>with it. Once you start getting into what is

0:08:26.400 --> 0:08:28.960
<v Speaker 1>thought of as the sweet spot for human hearing, which

0:08:29.000 --> 0:08:32.120
<v Speaker 1>is generally considered to be between two and five kilohertz,

0:08:33.200 --> 0:08:37.200
<v Speaker 1>you need a greater difference in volume or a smaller

0:08:37.240 --> 0:08:41.640
<v Speaker 1>difference in frequency in order for masking to work. Frequency

0:08:41.720 --> 0:08:45.480
<v Speaker 1>masking at any rate. But then there's also temporal masking,

0:08:46.600 --> 0:08:48.880
<v Speaker 1>and you might say, okay, I got it. Temporal that

0:08:48.920 --> 0:08:53.040
<v Speaker 1>means time. Indeed it does, my friend. This describes the

0:08:53.040 --> 0:08:56.040
<v Speaker 1>effect of a short but loud sound masking a softer

0:08:56.120 --> 0:09:00.360
<v Speaker 1>sound for a short time. Weird thing is the loud

0:09:00.360 --> 0:09:03.960
<v Speaker 1>sound can actually mask sounds that precede it slightly, not

0:09:04.040 --> 0:09:06.760
<v Speaker 1>by a whole lot, but a little bit. MP three

0:09:06.760 --> 0:09:10.880
<v Speaker 1>compression takes advantage of both frequency and temporal masking when

0:09:10.880 --> 0:09:14.079
<v Speaker 1>it's trying to determine which data needs to be included

0:09:14.160 --> 0:09:16.920
<v Speaker 1>and which data can be dumped, because it won't affect

0:09:16.960 --> 0:09:19.840
<v Speaker 1>your perception of whatever the the audio file is in

0:09:19.840 --> 0:09:23.720
<v Speaker 1>the first place. So you also probably remember I talked

0:09:23.720 --> 0:09:26.560
<v Speaker 1>about the physical limitation to what we humans can hear,

0:09:26.800 --> 0:09:28.920
<v Speaker 1>no matter what our brains might be up to, so

0:09:29.000 --> 0:09:31.400
<v Speaker 1>that this doesn't have to do with our brains, you know,

0:09:31.480 --> 0:09:34.240
<v Speaker 1>filtering through the information that's coming in. This has to

0:09:34.280 --> 0:09:38.200
<v Speaker 1>do with the physical limitations of the human ear. In

0:09:38.240 --> 0:09:41.199
<v Speaker 1>the last episode of the series, I said typical human hearing.

0:09:41.840 --> 0:09:45.559
<v Speaker 1>Keep in mind typical there are exceptions. UH covers the

0:09:45.679 --> 0:09:48.560
<v Speaker 1>range of frequencies between about twenty hurts and twenty killer

0:09:48.640 --> 0:09:52.000
<v Speaker 1>hurts or twenty thousand hurts. So twenty to twenty thousand

0:09:52.800 --> 0:09:57.280
<v Speaker 1>higher frequencies represent higher pitches and sound lower frequencies lower pitches, right,

0:09:58.080 --> 0:10:00.640
<v Speaker 1>And as you get older, your ability to perceive those

0:10:00.720 --> 0:10:05.040
<v Speaker 1>higher frequencies starts to diminish. So most adults actually have

0:10:05.320 --> 0:10:10.880
<v Speaker 1>an upper range closer to sixteen killer hurts, not twenty. UH.

0:10:11.080 --> 0:10:13.480
<v Speaker 1>Kids they can hear those higher pitches. You may have

0:10:13.600 --> 0:10:17.920
<v Speaker 1>heard the story about how some convenience stores experimented with

0:10:18.160 --> 0:10:23.600
<v Speaker 1>getting rid of teenage loiterers by by UH projecting out

0:10:24.000 --> 0:10:27.280
<v Speaker 1>the super high pitches that that adults could not hear

0:10:27.640 --> 0:10:30.600
<v Speaker 1>but kids could, and it discouraged kids from hanging out

0:10:30.640 --> 0:10:35.080
<v Speaker 1>at the convenience store and loitering. UM. I love that

0:10:35.200 --> 0:10:39.600
<v Speaker 1>idea so much. Anyway, that's because I'm old and my

0:10:39.640 --> 0:10:43.520
<v Speaker 1>hearing is terrible. Well, remember I also mentioned you can

0:10:43.559 --> 0:10:46.400
<v Speaker 1>detect changes in pitch at two hurts increments if you

0:10:46.440 --> 0:10:48.960
<v Speaker 1>get below two hurts and change, Like, if it's just

0:10:49.040 --> 0:10:54.600
<v Speaker 1>a one hurts difference between two frequencies, it's too low

0:10:54.640 --> 0:10:56.800
<v Speaker 1>a resolution for us to detect. To us, it will

0:10:56.800 --> 0:11:01.040
<v Speaker 1>sound exactly the same. So if you were to hear

0:11:01.520 --> 0:11:06.800
<v Speaker 1>a frequency at one thousand one hurts or one point

0:11:07.000 --> 0:11:10.800
<v Speaker 1>zero zero one killer hurts and one point zero zero

0:11:10.840 --> 0:11:13.800
<v Speaker 1>to killer hurts, you wouldn't notice the difference. They would

0:11:13.840 --> 0:11:16.960
<v Speaker 1>sound exactly the same to you. So if you're gonna

0:11:17.000 --> 0:11:19.240
<v Speaker 1>take audio and compress it, one step you could consider

0:11:19.360 --> 0:11:23.960
<v Speaker 1>is eliminating anything that's outside the actual range of frequencies

0:11:24.040 --> 0:11:27.560
<v Speaker 1>that we can hear, or simplifying any changes in frequency

0:11:27.640 --> 0:11:31.240
<v Speaker 1>that are smaller than two hurts. If you get take

0:11:31.240 --> 0:11:34.760
<v Speaker 1>all that data and you say it is physically impossible

0:11:34.800 --> 0:11:38.439
<v Speaker 1>for a human to perceive this, get rid of that information,

0:11:38.559 --> 0:11:41.800
<v Speaker 1>then in theory it wouldn't have any effect on the

0:11:41.840 --> 0:11:46.120
<v Speaker 1>rest of the recording. But how you go further than that? Right,

0:11:46.200 --> 0:11:48.959
<v Speaker 1>how do you create a method so that you can

0:11:49.000 --> 0:11:51.120
<v Speaker 1>really compress this file? You want a method that will

0:11:51.120 --> 0:11:54.439
<v Speaker 1>preserve the important sounds while potentially ignoring all the unimportant

0:11:54.520 --> 0:11:58.320
<v Speaker 1>or incidel sounds. And you want to be automatic because

0:11:58.760 --> 0:12:01.440
<v Speaker 1>if you have a man you really then that's going

0:12:01.520 --> 0:12:05.640
<v Speaker 1>to take countless hours just to edit a single sound file.

0:12:06.760 --> 0:12:10.959
<v Speaker 1>So that was the challenge that the MP three research

0:12:11.040 --> 0:12:16.040
<v Speaker 1>team faced as a group. Now, their solution, which ultimately

0:12:16.080 --> 0:12:18.559
<v Speaker 1>created even more challenges, was to come up with what

0:12:18.640 --> 0:12:22.480
<v Speaker 1>was essentially a simulated human ear and brain. They needed

0:12:22.520 --> 0:12:27.880
<v Speaker 1>to replicate the experience of perceiving music so that an

0:12:27.880 --> 0:12:32.160
<v Speaker 1>algorithm could evaluate every sound in an audio file and

0:12:32.280 --> 0:12:35.359
<v Speaker 1>judge if an in fact was relevant enough to include

0:12:35.400 --> 0:12:39.720
<v Speaker 1>in the final compressed version. If a sound were imperceptible,

0:12:39.760 --> 0:12:41.600
<v Speaker 1>then it wouldn't make sense to include it in the

0:12:41.720 --> 0:12:44.720
<v Speaker 1>MP three file. So by leaving out all the irrelevant data,

0:12:44.760 --> 0:12:48.680
<v Speaker 1>they can make the audio information take up less bandwidth.

0:12:48.679 --> 0:12:51.240
<v Speaker 1>The file itself would be smaller because you just dumped

0:12:51.280 --> 0:12:54.880
<v Speaker 1>everything that wasn't important. So the team used an algorithm

0:12:55.000 --> 0:13:00.000
<v Speaker 1>called the low complexity adaptive transform coding or lc DASH

0:13:00.160 --> 0:13:03.080
<v Speaker 1>a t C as the foundation for their research. This

0:13:03.160 --> 0:13:06.440
<v Speaker 1>was kind of their starting point, and this is an

0:13:06.480 --> 0:13:10.120
<v Speaker 1>approach that tries to do away with redundancy as much

0:13:10.160 --> 0:13:15.199
<v Speaker 1>as possible. And it also incorporates adaptation to perceptual requirements. Also,

0:13:15.320 --> 0:13:19.199
<v Speaker 1>MP three's oh a lot to the IMPEG Layer two standard,

0:13:19.760 --> 0:13:23.199
<v Speaker 1>So the layer two obviously came out before Layer three,

0:13:23.720 --> 0:13:26.199
<v Speaker 1>and so a lot of the features of layer three

0:13:26.320 --> 0:13:31.760
<v Speaker 1>are really um their legacy features from layer two. Uh.

0:13:31.800 --> 0:13:34.000
<v Speaker 1>In other words, MP three group kind of got stuck

0:13:34.000 --> 0:13:36.560
<v Speaker 1>with them because otherwise they would have had a problem

0:13:36.559 --> 0:13:39.880
<v Speaker 1>with backwards compatibility. So the result is kind of a

0:13:39.920 --> 0:13:43.400
<v Speaker 1>clunky arrangement under the hood, and some of the features

0:13:43.600 --> 0:13:46.160
<v Speaker 1>may make very little sense when I go through them,

0:13:46.600 --> 0:13:48.600
<v Speaker 1>but some of that is because it's a hold over

0:13:48.640 --> 0:13:53.280
<v Speaker 1>from an earlier compression strategy, which isn't terribly satisfying as

0:13:53.280 --> 0:13:55.559
<v Speaker 1>an answer. But the reason many parts of the MP

0:13:55.640 --> 0:13:57.840
<v Speaker 1>three compression algorithm are the way they are is because

0:13:57.880 --> 0:14:01.560
<v Speaker 1>that's the way we've always done it. So next I'm

0:14:01.600 --> 0:14:07.760
<v Speaker 1>gonna dive into the phases of compression. But before I

0:14:07.800 --> 0:14:10.680
<v Speaker 1>do that, let's all take a deep breath and take

0:14:10.720 --> 0:14:22.440
<v Speaker 1>a moment to thank our sponsor, and we're back. So

0:14:22.560 --> 0:14:25.080
<v Speaker 1>there are two big phases we'll need to talk about

0:14:25.160 --> 0:14:29.760
<v Speaker 1>with MP three compression. The first phase is analysis and

0:14:29.800 --> 0:14:33.960
<v Speaker 1>the second phase is the actual compression itself. And after

0:14:34.040 --> 0:14:37.080
<v Speaker 1>that there's the process of decoding and MP three for playback.

0:14:37.560 --> 0:14:40.120
<v Speaker 1>But that's way simpler once we get an understanding of

0:14:40.160 --> 0:14:45.920
<v Speaker 1>how the encoding process actually happens. So let's begin with analysis. Now.

0:14:45.960 --> 0:14:49.480
<v Speaker 1>This is the part where the standard has to figure

0:14:49.520 --> 0:14:53.800
<v Speaker 1>out which frequencies within an audio range are recording rather

0:14:53.920 --> 0:14:59.720
<v Speaker 1>are important or perceptible. So how does a program and

0:14:59.760 --> 0:15:02.680
<v Speaker 1>in coder figure out what we can hear and what

0:15:02.800 --> 0:15:06.160
<v Speaker 1>we cannot hear? All? Right, time to get technical. So

0:15:06.880 --> 0:15:10.440
<v Speaker 1>you start off with your pulse code modulation audio file

0:15:10.720 --> 0:15:13.480
<v Speaker 1>or PCM file. And you might remember I talked about

0:15:13.480 --> 0:15:16.720
<v Speaker 1>PCM audio in the first episode of this series, but

0:15:16.840 --> 0:15:20.600
<v Speaker 1>just in case you don't, it's a lossless digital audio file.

0:15:20.680 --> 0:15:23.720
<v Speaker 1>The actual format could be a wave or ai f

0:15:23.720 --> 0:15:26.480
<v Speaker 1>F or something along those lines, but the important thing

0:15:26.920 --> 0:15:31.080
<v Speaker 1>to keep in mind is that it is uncompressed. Now,

0:15:31.120 --> 0:15:33.560
<v Speaker 1>that means those files tend to be pretty big. This

0:15:33.640 --> 0:15:36.040
<v Speaker 1>is our raw material that we want to take and

0:15:36.120 --> 0:15:40.560
<v Speaker 1>squish down to a more manageable, transferable size. And in

0:15:40.640 --> 0:15:43.320
<v Speaker 1>our our last episode in this series, I also mentioned

0:15:43.320 --> 0:15:46.680
<v Speaker 1>that the standard for c D audio is a sample

0:15:46.760 --> 0:15:49.880
<v Speaker 1>rate of forty four point one. Killer hurts and we

0:15:50.040 --> 0:15:52.680
<v Speaker 1>learned that you need a sample rate twice the frequency

0:15:52.840 --> 0:15:56.800
<v Speaker 1>of the highest frequency in your recording, and since human

0:15:56.840 --> 0:15:59.600
<v Speaker 1>hearing tops out at around twenty kill hurts, the standard

0:15:59.600 --> 0:16:02.520
<v Speaker 1>for CDs is forty four point one killer hurts. The

0:16:02.640 --> 0:16:05.640
<v Speaker 1>MP three standard can support lots of different sample rates,

0:16:05.720 --> 0:16:08.160
<v Speaker 1>but forty four point one killer Hurts is pretty much

0:16:08.200 --> 0:16:12.600
<v Speaker 1>the common standard. So you've got a number of samples

0:16:12.680 --> 0:16:15.120
<v Speaker 1>with your audio file, and that number will depend upon

0:16:15.120 --> 0:16:18.320
<v Speaker 1>how long the audio file is. You've got forty four

0:16:18.320 --> 0:16:23.120
<v Speaker 1>thousand one samples per second, actually twice that for stereo,

0:16:23.280 --> 0:16:25.760
<v Speaker 1>but for the purposes of this discussion, let's kind of

0:16:25.920 --> 0:16:28.960
<v Speaker 1>stick with mono sounds so that I don't start having

0:16:29.040 --> 0:16:31.720
<v Speaker 1>math coming out of my ears. And we're still in

0:16:31.720 --> 0:16:34.920
<v Speaker 1>the very easy, simple part as far as math goes.

0:16:34.960 --> 0:16:37.520
<v Speaker 1>We haven't gotten to the complicated stuff yet, all right,

0:16:37.600 --> 0:16:41.600
<v Speaker 1>So you've got forty four thousand, one hundred samples per second.
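
NOTE
The sample-rate arithmetic from this stretch can be written out as a small sketch. The constant and function names are illustrative assumptions, and the 60-second clip length is just an example value, not a figure from the episode.
```python
CD_SAMPLE_RATE_HZ = 44_100  # samples per second, per channel (figure from the episode)
def nyquist_limit_hz(sample_rate_hz):
    # The highest frequency a sample rate can represent is half that rate.
    return sample_rate_hz / 2
def total_samples(duration_s, sample_rate_hz=CD_SAMPLE_RATE_HZ, channels=1):
    # Mono by default, as in the episode; stereo doubles the count.
    return int(duration_s * sample_rate_hz) * channels
print(nyquist_limit_hz(CD_SAMPLE_RATE_HZ))
print(total_samples(60))
print(total_samples(60, channels=2))
```
Half of 44.1 kHz sits a little above the roughly 20 kHz ceiling of human hearing, which is why that sample rate is enough for CD audio.
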

0:16:42.160 --> 0:16:45.320
<v Speaker 1>To compress it into an MP3 format, the algorithm

0:16:45.360 --> 0:16:49.320
<v Speaker 1>first groups all of these samples into collections called frames.

0:16:50.440 --> 0:16:53.640
<v Speaker 1>So take those forty-four thousand one hundred per second, and

0:16:53.640 --> 0:16:56.480
<v Speaker 1>then you start saying, okay, we're gonna group you in batches.

0:16:56.960 --> 0:17:00.080
<v Speaker 1>Each batch is called a frame and each frame contains

0:17:00.120 --> 0:17:04.480
<v Speaker 1>one thousand one hundred fifty-two samples. Now that's specifically to

0:17:04.560 --> 0:17:09.280
<v Speaker 1>maintain backwards compatibility to MPEG Layer 2, which established that

0:17:09.320 --> 0:17:12.119
<v Speaker 1>one thousand one hundred fifty-two number. But we're not

0:17:12.160 --> 0:17:16.360
<v Speaker 1>talking about MPEG Layer 2. We're talking about MPEG Layer 3,

0:17:16.800 --> 0:17:18.400
<v Speaker 1>and so that means we have to get a little

0:17:18.400 --> 0:17:25.440
<v Speaker 1>more complicated. So each frame consists of two subgroups called granules.
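
NOTE
The frame and granule bookkeeping can be sketched in a few lines. The names below are illustrative assumptions, and the silent one-second buffer is stand-in data; a real encoder also has to decide what to do with the short final frame, which this sketch simply leaves short.
```python
SAMPLES_PER_FRAME = 1152   # per the episode, inherited from MPEG Layer 2
GRANULES_PER_FRAME = 2     # each frame is made of two granules
SAMPLE_RATE_HZ = 44_100
samples_per_granule = SAMPLES_PER_FRAME // GRANULES_PER_FRAME
frame_duration_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ
def split_into_frames(samples, frame_size=SAMPLES_PER_FRAME):
    # Chop a flat list of samples into consecutive frames; the last one may be short.
    return [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
one_second = [0] * SAMPLE_RATE_HZ  # one second of silent mono audio
frames = split_into_frames(one_second)
print(samples_per_granule)
print(round(frame_duration_ms, 2))
print(len(frames), len(frames[-1]))
```
At 44.1 kHz each frame covers only about 26 milliseconds of audio, so one second of sound is spread over roughly 38 frames.
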

0:17:25.440 --> 0:17:29.320
<v Speaker 1>So each granule has five hundred seventy-six samples. Five hundred seventy

0:17:29.359 --> 0:17:32.639
<v Speaker 1>six times two is one thousand one hundred fifty-two, so five seventy

0:17:32.680 --> 0:17:36.680
<v Speaker 1>six samples per granule. Now, technically MP3 encoders only

0:17:36.680 --> 0:17:39.000
<v Speaker 1>work on one granule at a time, but they may

0:17:39.040 --> 0:17:42.879
<v Speaker 1>reference the granules immediately before and immediately after the current

0:17:42.920 --> 0:17:45.520
<v Speaker 1>one in order to see how the audio within the

0:17:45.560 --> 0:17:49.480
<v Speaker 1>file changes over time. All right, so now you've got

0:17:49.480 --> 0:17:54.000
<v Speaker 1>your granules of five hundred seventy six samples each. Then

0:17:54.040 --> 0:17:57.480
<v Speaker 1>the MP3 encoder runs the samples through a filter bank,

0:17:57.960 --> 0:18:01.960
<v Speaker 1>which sorts the sound into thirty two frequency ranges. Are

0:18:02.000 --> 0:18:05.239
<v Speaker 1>you are you crazy about the numbers yet, Dylan? Are you?

0:18:05.720 --> 0:18:10.520
<v Speaker 1>Dylan's nodding. Dylan, it gets worse from here. So you

0:18:10.560 --> 0:18:13.560
<v Speaker 1>have thirty-two frequency ranges, which is another nod to

0:18:13.560 --> 0:18:15.840
<v Speaker 1>the Layer 2 method, which used those thirty-two ranges

0:18:15.880 --> 0:18:20.240
<v Speaker 1>for encoding purposes. But we're not talking about Layer 2, are we? No,

0:18:20.760 --> 0:18:24.320
<v Speaker 1>we're talking MP3, gosh darn it. That means we

0:18:24.359 --> 0:18:27.159
<v Speaker 1>take those thirty two ranges and we subdivide them by

0:18:27.200 --> 0:18:31.320
<v Speaker 1>a factor of eighteen. That means we have five hundred

0:18:31.320 --> 0:18:36.879
<v Speaker 1>seventy-six bands of frequencies, each band containing one five-hundred-seventy-sixth

0:18:37.080 --> 0:18:41.199
<v Speaker 1>of the frequency range of the original sample. So what

0:18:41.280 --> 0:18:44.320
<v Speaker 1>that actually means, and this is actually pretty easy.
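
NOTE
The band arithmetic is easy to check with a short sketch. The function names are illustrative assumptions, and the 22,050 Hz total range is only an example input (half of a 44.1 kHz sample rate), not a figure the host gives.
```python
SUBBANDS = 32            # frequency ranges produced by the filter bank, per the episode
LINES_PER_SUBBAND = 18   # each range is subdivided by a factor of eighteen
def band_count(subbands=SUBBANDS, lines_per_subband=LINES_PER_SUBBAND):
    return subbands * lines_per_subband
def band_width_hz(total_range_hz, bands):
    # Each band gets an equal share of whatever range is being divided.
    return total_range_hz / bands
bands = band_count()
print(bands)
print(band_width_hz(22_050, bands))
```
With that example range, each of the 576 bands would be a little over 38 Hz wide; a different total range gives a different width.
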

0:18:44.720 --> 0:18:48.159
<v Speaker 1>The bands are not limited to a specific number for

0:18:48.240 --> 0:18:53.240
<v Speaker 1>their frequency range. Right. The bands don't mean that on

0:18:53.280 --> 0:18:56.359
<v Speaker 1>band number one it goes from twenty hertz

0:18:56.440 --> 0:18:58.840
<v Speaker 1>up to a certain range and on band five hundred

0:18:59.000 --> 0:19:02.399
<v Speaker 1>seventy-six it ends at twenty kilohertz. That's not what

0:19:02.440 --> 0:19:05.600
<v Speaker 1>it means. They're dependent upon the original audio. So if

0:19:05.600 --> 0:19:09.680
<v Speaker 1>the original audio contains sounds within a narrow range of frequencies,

0:19:10.040 --> 0:19:13.760
<v Speaker 1>the five seventy six bands will be more precise. But if the

0:19:13.760 --> 0:19:17.600
<v Speaker 1>original recording has a vast range of frequencies, the bands

0:19:17.640 --> 0:19:20.440
<v Speaker 1>are less precise. So another way to think about this

0:19:21.119 --> 0:19:24.160
<v Speaker 1>is with a pizza. So let's say you get extra

0:19:24.240 --> 0:19:26.960
<v Speaker 1>large pizza and you cut it into eight equal slices.

0:19:27.600 --> 0:19:30.280
<v Speaker 1>And then you get a small pizza and you cut

0:19:30.320 --> 0:19:33.600
<v Speaker 1>that into eight equal slices. Well, in both cases you

0:19:33.640 --> 0:19:37.760
<v Speaker 1>have with each slice one eighth of a pizza. But

0:19:37.840 --> 0:19:42.080
<v Speaker 1>the extra large pizza pizza slice is bigger than the

0:19:42.119 --> 0:19:45.280
<v Speaker 1>small pizza pizza slice. It all depends on the size

0:19:45.280 --> 0:19:47.960
<v Speaker 1>of the pizza. So in this case, it depends upon

0:19:48.000 --> 0:19:51.080
<v Speaker 1>the range of frequencies. And and Dylan, do you think

0:19:51.080 --> 0:19:53.280
<v Speaker 1>we could go for some pizza, you know, just just

0:19:53.320 --> 0:19:56.159
<v Speaker 1>put the episode on hold and go get pizza. Dylan's nodding.

0:19:56.720 --> 0:20:00.879
<v Speaker 1>It's great for audio. Yeah, so, uh, pizza, We'll be

0:20:00.960 --> 0:20:05.800
<v Speaker 1>right back. Okay, that was good pizza. Now um oh man,

0:20:05.840 --> 0:20:08.400
<v Speaker 1>I got a whole bunch more notes. Okay, well, let's

0:20:08.440 --> 0:20:10.879
<v Speaker 1>let's go ahead and and do the rest of this.

0:20:10.920 --> 0:20:12.840
<v Speaker 1>All right, So you've got your sound divided up into

0:20:12.880 --> 0:20:16.320
<v Speaker 1>those five seventy six sub bands of frequencies, you know,

0:20:16.640 --> 0:20:19.840
<v Speaker 1>the thing I compared to pizza slices earlier. Now you

0:20:19.880 --> 0:20:25.359
<v Speaker 1>get two different mathematical processes applied to this data. One

0:20:25.520 --> 0:20:28.919
<v Speaker 1>is the fast Fourier transform or f f T, and

0:20:28.960 --> 0:20:32.720
<v Speaker 1>the other is the modified discrete cosine transform or m

0:20:32.800 --> 0:20:36.760
<v Speaker 1>d c T. Now I am not going to dive

0:20:36.800 --> 0:20:40.040
<v Speaker 1>deeply into how these transforms work because frankly, they are

0:20:40.119 --> 0:20:44.439
<v Speaker 1>beyond my mathematical understanding. But I know what they do.

0:20:44.680 --> 0:20:49.280
<v Speaker 1>I just cannot explain the process like how they do

0:20:49.400 --> 0:20:51.479
<v Speaker 1>what they do. So I'm going to give you the

0:20:51.480 --> 0:20:54.720
<v Speaker 1>explanation of what they do what the outcome of each

0:20:54.760 --> 0:20:58.840
<v Speaker 1>of these transformed processes happens to be. But I'm not

0:20:58.920 --> 0:21:00.800
<v Speaker 1>going to be able to tell you the actual mathematical

0:21:00.840 --> 0:21:03.479
<v Speaker 1>steps involved in each because I don't math so good, guys.

0:21:04.640 --> 0:21:07.520
<v Speaker 1>But let's start with a fast Fourier transform. So a

0:21:07.640 --> 0:21:09.720
<v Speaker 1>transform is kind of what it sounds like. It's all

0:21:09.720 --> 0:21:13.960
<v Speaker 1>about transforming information in some way. So in this particular case,

0:21:14.119 --> 0:21:17.359
<v Speaker 1>the f f T transforms the frequency bands we just

0:21:17.400 --> 0:21:22.360
<v Speaker 1>talked about into data that can be further analyzed by

0:21:22.480 --> 0:21:26.600
<v Speaker 1>a psychoacoustic model that's in the encoder. So this is

0:21:26.640 --> 0:21:29.960
<v Speaker 1>that simulated human ear and brain we were talking about earlier.

0:21:30.840 --> 0:21:34.800
<v Speaker 1>So what the encoder does is it analyzes each bit

0:21:34.920 --> 0:21:38.600
<v Speaker 1>of data and looks for signs that it represents audio

0:21:38.680 --> 0:21:41.680
<v Speaker 1>that wouldn't be perceived by a human. So it's

0:21:41.800 --> 0:21:46.240
<v Speaker 1>looking for any potential for masking possibilities. So are there

0:21:46.240 --> 0:21:48.800
<v Speaker 1>collections of frequencies that are grouped close together, and is

0:21:48.840 --> 0:21:51.320
<v Speaker 1>one of those frequencies louder than the others? You might

0:21:51.359 --> 0:21:53.919
<v Speaker 1>be able to do away with those softer frequencies because

0:21:53.960 --> 0:21:57.480
<v Speaker 1>of frequency masking. The encoder will also look at whether

0:21:57.560 --> 0:21:59.879
<v Speaker 1>or not the audio has a lot of complexity to it,

0:22:00.800 --> 0:22:02.960
<v Speaker 1>if it has a lot of changes, or if it's

0:22:03.000 --> 0:22:07.840
<v Speaker 1>just relatively steady or simple audio. Any transient sounds that

0:22:07.880 --> 0:22:11.600
<v Speaker 1>are present in the audio might end up causing temporal masking,

0:22:11.680 --> 0:22:14.040
<v Speaker 1>so it'll analyze those as well and see if that's

0:22:14.040 --> 0:22:19.000
<v Speaker 1>a possibility. So really what it's looking for is, you know,

0:22:20.280 --> 0:22:23.320
<v Speaker 1>just any really loud sounds that stand out above the

0:22:23.400 --> 0:22:26.119
<v Speaker 1>rest of the recording. That's what the f f T

0:22:26.280 --> 0:22:30.200
<v Speaker 1>is doing. So what about the modified discrete cosine transform? Well,

0:22:30.240 --> 0:22:32.359
<v Speaker 1>this is happening in parallel with the f f T

0:22:32.800 --> 0:22:36.280
<v Speaker 1>and the samples get sorted into different patterns called windows

0:22:37.119 --> 0:22:39.679
<v Speaker 1>uh and the criterion for sorting all has to do

0:22:39.720 --> 0:22:43.719
<v Speaker 1>with whether the sample represents a steady sound or varied sound.

0:22:44.240 --> 0:22:47.359
<v Speaker 1>So if you have a simple steady sound that goes

0:22:47.400 --> 0:22:51.200
<v Speaker 1>into a long window. If there's a lot of variation

0:22:51.240 --> 0:22:53.960
<v Speaker 1>in the sound, like there are a lot of consonants

0:22:53.960 --> 0:22:56.760
<v Speaker 1>in a vocal line or it's like a drum solo

0:22:56.960 --> 0:22:59.600
<v Speaker 1>or something like that, it would get sorted into a

0:22:59.720 --> 0:23:02.960
<v Speaker 1>series of three short windows, and each short window

0:23:03.000 --> 0:23:09.320
<v Speaker 1>contains one hundred ninety two samples. That amounts to four whole milliseconds,

0:23:09.440 --> 0:23:15.000
<v Speaker 1>so four thousandths of a second, in three patterned windows.
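
NOTE
The window sizes mentioned above can be checked with a little arithmetic. This is a
sketch under two assumptions that are not stated in the episode: a 44,100 Hz sample
rate, and treating each short window as simply one third of a granule (real encoders
overlap their windows, which this ignores).
```python
GRANULE_SAMPLES = 576     # samples per granule, from the episode
SHORT_WINDOWS = 3         # short windows per granule, from the episode
SAMPLE_RATE_HZ = 44_100   # assumed CD-quality sample rate
short_window_samples = GRANULE_SAMPLES // SHORT_WINDOWS
short_window_ms = 1000 * short_window_samples / SAMPLE_RATE_HZ
granule_ms = 1000 * GRANULE_SAMPLES / SAMPLE_RATE_HZ
# Shows the samples per short window and both durations in milliseconds.
print(short_window_samples, round(short_window_ms, 2), round(granule_ms, 2))
```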

0:23:15.040 --> 0:23:18.080
<v Speaker 1>So you've got these windows now, either long windows for

0:23:18.119 --> 0:23:21.600
<v Speaker 1>simple sounds or short windows for the more complex sounds.

0:23:21.640 --> 0:23:24.600
<v Speaker 1>And then the modified discrete cosine transform kicks into gear.

0:23:24.680 --> 0:23:26.840
<v Speaker 1>It looks at each long window or set of three

0:23:26.840 --> 0:23:30.920
<v Speaker 1>short windows and converts them into a set of spectral values.

0:23:31.520 --> 0:23:33.800
<v Speaker 1>To some of you, that probably sounds meaningless. So let's

0:23:33.840 --> 0:23:37.720
<v Speaker 1>talk about spectral analysis for a second. First, I was

0:23:38.000 --> 0:23:40.919
<v Speaker 1>very disappointed to learn that spectral analysis doesn't involve a

0:23:40.920 --> 0:23:46.199
<v Speaker 1>psychologist talking to a ghost about its emotional state, so bummer.

0:23:47.000 --> 0:23:50.560
<v Speaker 1>But spectral analysis is when you look at a spectrum

0:23:50.600 --> 0:23:54.800
<v Speaker 1>of information, like a spectrum of frequencies or related information

0:23:54.840 --> 0:23:58.399
<v Speaker 1>like energy states. That's what this transform does. It takes

0:23:58.520 --> 0:24:02.119
<v Speaker 1>data that originally represents a slice of time in a

0:24:02.200 --> 0:24:05.360
<v Speaker 1>sound waveform. That's what a sample is. A sample is an

0:24:05.400 --> 0:24:09.280
<v Speaker 1>instance of time in a wave form and converts it

0:24:09.320 --> 0:24:15.800
<v Speaker 1>into information representing sound as energy across a range of frequencies. Now,

0:24:15.840 --> 0:24:18.080
<v Speaker 1>you can plot out spectral information in a lot of

0:24:18.080 --> 0:24:21.000
<v Speaker 1>different ways, but one common method is to use brightness

0:24:21.040 --> 0:24:25.800
<v Speaker 1>to indicate energy levels. Higher energy levels are brighter patches

0:24:26.040 --> 0:24:31.120
<v Speaker 1>in your visual representation of spectral data. High frequencies would

0:24:31.119 --> 0:24:34.160
<v Speaker 1>appear at the top of a spectral view, like imagine

0:24:34.200 --> 0:24:37.400
<v Speaker 1>a box, and at the top of the box that's

0:24:37.400 --> 0:24:39.440
<v Speaker 1>where you would find high frequencies, at the bottom of

0:24:39.440 --> 0:24:41.720
<v Speaker 1>the box that's where you find low frequencies, and it's

0:24:41.760 --> 0:24:44.840
<v Speaker 1>just lots of patches of color. The really bright patches

0:24:44.840 --> 0:24:50.200
<v Speaker 1>of color represent very high energy frequencies, so they could

0:24:50.240 --> 0:24:54.000
<v Speaker 1>be high or low in in actual frequency, but we're

0:24:54.040 --> 0:24:57.600
<v Speaker 1>talking about energy levels, not whether it's a high or low pitch.

0:24:59.440 --> 0:25:02.120
<v Speaker 1>Looking left to right represents the passing of time, and

0:25:02.160 --> 0:25:05.560
<v Speaker 1>looking along any vertical points shows you the actual frequency

0:25:06.240 --> 0:25:09.800
<v Speaker 1>or pitch, and then the respective energy level is the brightness.
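
NOTE
To make the time-to-frequency idea concrete, here is a minimal sketch that turns a
short run of samples into energy per frequency bin. It uses a plain discrete Fourier
transform, which is a stand-in chosen for simplicity and is not the MDCT an MP3
encoder uses. The test tone, the window length and the function name are all made up
for illustration.
```python
import cmath
import math
def dft_energy(samples):
    """Energy in each frequency bin of a plain discrete Fourier transform."""
    n = len(samples)
    energies = []
    for k in range(n // 2):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        energies.append(abs(total) ** 2)
    return energies
N = 64        # samples in the window
TONE_BIN = 5  # the test tone completes 5 cycles within the window
samples = [math.sin(2 * math.pi * TONE_BIN * t / N) for t in range(N)]
energies = dft_energy(samples)
# The brightest patch in a spectral view would sit at this bin.
loudest_bin = max(range(len(energies)), key=lambda k: energies[k])
print(loudest_bin)
```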

0:25:09.920 --> 0:25:12.080
<v Speaker 1>So it's kind of like looking at sound as a wave,

0:25:12.240 --> 0:25:14.760
<v Speaker 1>but instead of being a wave, you're looking at information

0:25:14.760 --> 0:25:19.600
<v Speaker 1>that indicates frequency range and energy level. That representation is

0:25:19.600 --> 0:25:22.480
<v Speaker 1>actually kind of analogous to how we hear audio. So

0:25:22.560 --> 0:25:25.679
<v Speaker 1>an encoder can analyze the spectral view and start to

0:25:25.680 --> 0:25:29.880
<v Speaker 1>filter out the data we wouldn't perceive due to psychoacoustics. Now,

0:25:29.920 --> 0:25:33.920
<v Speaker 1>after all that processing, the encoder looks at the frequency

0:25:34.000 --> 0:25:37.200
<v Speaker 1>sub bands and the levels of spectral intensity for each

0:25:37.800 --> 0:25:41.200
<v Speaker 1>and that information can then be used for the next phase,

0:25:41.800 --> 0:25:45.240
<v Speaker 1>which is compression. But right now I think we could

0:25:45.280 --> 0:25:48.760
<v Speaker 1>all stand a little decompression, So let's take another quick

0:25:48.760 --> 0:26:00.280
<v Speaker 1>break to thank our sponsor. All right, so now you're

0:26:00.280 --> 0:26:04.280
<v Speaker 1>ready to compress your analyzed audio. Good for you, and

0:26:04.320 --> 0:26:08.080
<v Speaker 1>by you I mean encoders. This has to be simpler

0:26:08.119 --> 0:26:11.119
<v Speaker 1>than that analysis segment, right, I mean that got a

0:26:11.119 --> 0:26:14.959
<v Speaker 1>little crazy with all the different bands and sub bands

0:26:15.000 --> 0:26:22.119
<v Speaker 1>and windows and frames and granules. Sadly it gets more complicated,

0:26:22.119 --> 0:26:25.280
<v Speaker 1>all right. So there are two layers of compression going

0:26:25.320 --> 0:26:30.000
<v Speaker 1>on with MPEG Layer three. One of those layers depends

0:26:30.080 --> 0:26:34.480
<v Speaker 1>upon the psychoacoustic analysis and the other doesn't. So why

0:26:34.520 --> 0:26:37.800
<v Speaker 1>would you use two layers with different strategies like that? Well,

0:26:37.840 --> 0:26:40.840
<v Speaker 1>the reason is that one strategy is great for complex

0:26:40.880 --> 0:26:43.639
<v Speaker 1>audio with lots of components, but not so great with

0:26:43.760 --> 0:26:46.639
<v Speaker 1>simpler sounds, and the other strategy is kind of the opposite.

0:26:47.160 --> 0:26:49.520
<v Speaker 1>So the psychoacoustic approach is the one that's really good

0:26:49.560 --> 0:26:53.480
<v Speaker 1>for complicated sounds. If if you've got a lot of

0:26:53.680 --> 0:26:57.840
<v Speaker 1>volume changes, lots of different frequencies, it's just complicated and

0:26:57.960 --> 0:27:00.840
<v Speaker 1>rich sound, you've got a lot of opportunity to look

0:27:00.880 --> 0:27:04.240
<v Speaker 1>for masking and other acoustic elements that limit the actual

0:27:04.320 --> 0:27:08.159
<v Speaker 1>sounds that people perceive. So it means there are a

0:27:08.200 --> 0:27:11.760
<v Speaker 1>lot of chances for you to uh fudge by dropping

0:27:11.760 --> 0:27:16.720
<v Speaker 1>all the stuff that people probably wouldn't notice anyway. And uh,

0:27:16.800 --> 0:27:18.399
<v Speaker 1>if you take a piece that's got a lot of

0:27:18.400 --> 0:27:21.879
<v Speaker 1>elements at varying volumes, there are likely several opportunities to

0:27:21.880 --> 0:27:25.760
<v Speaker 1>to do this. But if you're talking about relatively straightforward

0:27:26.440 --> 0:27:31.320
<v Speaker 1>audio with few components, few changes in volume, there's really

0:27:31.320 --> 0:27:33.399
<v Speaker 1>not a whole lot of data you can ditch without

0:27:33.440 --> 0:27:35.919
<v Speaker 1>it actually affecting the quality of the audio in a

0:27:35.960 --> 0:27:40.240
<v Speaker 1>perceptible way. And this is part of what Brandenburg, that

0:27:40.280 --> 0:27:42.439
<v Speaker 1>guy I was talking about in our first episode in

0:27:42.480 --> 0:27:45.399
<v Speaker 1>this series. Uh, that's what he discovered when he was

0:27:45.800 --> 0:27:48.960
<v Speaker 1>working with the MP three standard and he was listening

0:27:49.000 --> 0:27:53.760
<v Speaker 1>back to that Suzanne Vega acapella track Tom's Diner. He

0:27:53.840 --> 0:27:55.560
<v Speaker 1>was listening to a compressed version of it, and he

0:27:55.600 --> 0:27:58.480
<v Speaker 1>said it was terrible. He said it ruined the quality

0:27:58.480 --> 0:28:01.679
<v Speaker 1>of the audio. And part of that is because that

0:28:01.720 --> 0:28:05.040
<v Speaker 1>particular song is fairly simple. There's just not a lot

0:28:05.040 --> 0:28:08.280
<v Speaker 1>of opportunity to take advantage of masking and other tricks

0:28:08.760 --> 0:28:13.800
<v Speaker 1>without potentially compromising the quality. So they decided to also

0:28:13.840 --> 0:28:17.800
<v Speaker 1>incorporate some traditional compression strategies, which which work better with

0:28:17.880 --> 0:28:20.880
<v Speaker 1>those types of recordings. So the MP three format takes

0:28:20.880 --> 0:28:24.760
<v Speaker 1>advantage of both the traditional approach and the psychoacoustic approach,

0:28:25.480 --> 0:28:28.520
<v Speaker 1>and that allows the encoder to compress files into a smaller

0:28:28.560 --> 0:28:32.679
<v Speaker 1>size without just following a single strategy, like it doesn't

0:28:32.680 --> 0:28:34.760
<v Speaker 1>have to do a one size fits all for all

0:28:34.840 --> 0:28:39.600
<v Speaker 1>elements of audio. Now, combining those two strategies requires a

0:28:39.600 --> 0:28:43.320
<v Speaker 1>little more mathematical gymnastics. So let's go back to those

0:28:43.440 --> 0:28:47.200
<v Speaker 1>five seventy six frequency bins. You know, those sub bands

0:28:47.240 --> 0:28:50.320
<v Speaker 1>we talked about earlier. You've got to quantize those suckers.

0:28:51.440 --> 0:28:54.000
<v Speaker 1>What does that mean. It means assigning a quantity to

0:28:54.160 --> 0:28:58.479
<v Speaker 1>each to each frequency bin, you have to give it

0:28:58.520 --> 0:29:01.400
<v Speaker 1>a quantity of some sort so that you can end

0:29:01.480 --> 0:29:06.600
<v Speaker 1>up judging how much you can get away with dropping data.

0:29:06.960 --> 0:29:09.800
<v Speaker 1>So to do this, the encoder sorts those five seventy six

0:29:09.840 --> 0:29:13.280
<v Speaker 1>bins into twenty two scale factor bands. How you doing

0:29:13.280 --> 0:29:17.640
<v Speaker 1>over there? Dylan just checking in on you? Okay, Dylan's

0:29:17.680 --> 0:29:20.400
<v Speaker 1>got, Dylan's got a thousand yard stare going. I hope

0:29:20.440 --> 0:29:22.880
<v Speaker 1>you guys are doing okay over there? All right, So

0:29:23.080 --> 0:29:25.040
<v Speaker 1>before smoke starts coming out of your ears, let me

0:29:25.080 --> 0:29:28.760
<v Speaker 1>explain what the scale factor bands are all about. The

0:29:28.800 --> 0:29:32.360
<v Speaker 1>whole purpose of the scale factor bands is to determine

0:29:32.440 --> 0:29:36.960
<v Speaker 1>how the information will be stored within the compressed state.

0:29:37.800 --> 0:29:39.800
<v Speaker 1>So you want to get away with as little data

0:29:39.880 --> 0:29:43.040
<v Speaker 1>as possible before affecting sound quality. So if you can

0:29:43.080 --> 0:29:46.760
<v Speaker 1>say the same thing in a shorter space without affecting

0:29:46.760 --> 0:29:49.600
<v Speaker 1>the quality of what it is you're saying, you go

0:29:49.680 --> 0:29:54.680
<v Speaker 1>with it. Brevity is the soul of compression. So if

0:29:54.680 --> 0:29:57.960
<v Speaker 1>we were talking about language, I would say it's more

0:29:57.960 --> 0:30:02.880
<v Speaker 1>efficient to say it's raining outside, or even just it's raining,

0:30:03.200 --> 0:30:06.280
<v Speaker 1>because you would assume that it would be outside where

0:30:06.280 --> 0:30:08.840
<v Speaker 1>the rain is happening, and it would be inefficient for

0:30:08.840 --> 0:30:11.360
<v Speaker 1>me to say it's coming down like cats and dogs

0:30:11.360 --> 0:30:15.240
<v Speaker 1>out there. It's not as efficient as saying it's raining.

0:30:16.000 --> 0:30:20.760
<v Speaker 1>So if you can get away with shorter statements without

0:30:20.840 --> 0:30:24.680
<v Speaker 1>affecting the actual quality, and you could argue that by

0:30:24.840 --> 0:30:27.280
<v Speaker 1>switching from it's coming down like cats and dogs out

0:30:27.320 --> 0:30:30.840
<v Speaker 1>there to it's raining changes the quality. And that could

0:30:30.880 --> 0:30:32.640
<v Speaker 1>be a valid argument. But if you can get away

0:30:33.080 --> 0:30:37.400
<v Speaker 1>with shorter without affecting quality, you do it. So each

0:30:37.440 --> 0:30:41.960
<v Speaker 1>scale factor band is represented by a quantity. Then the

0:30:42.040 --> 0:30:46.440
<v Speaker 1>encoder divides that quantity by a given number called the quantizer,

0:30:46.800 --> 0:30:50.440
<v Speaker 1>which is the same across the entire frequency spectrum for

0:30:50.560 --> 0:30:55.040
<v Speaker 1>that recording. The resulting number is then rounded up or

0:30:55.160 --> 0:31:00.280
<v Speaker 1>down to a whole digit. And here's an important point.

0:31:00.680 --> 0:31:04.160
<v Speaker 1>Individual scale factor bands can be scaled up or down

0:31:04.280 --> 0:31:08.280
<v Speaker 1>for more or less precision to represent the actual value

0:31:08.440 --> 0:31:12.440
<v Speaker 1>of those bands. So what the heck does all that mean? Well,

0:31:12.520 --> 0:31:15.080
<v Speaker 1>the purpose of dividing and rounding is just to simplify

0:31:15.120 --> 0:31:17.840
<v Speaker 1>the data to reduce the amount you need in order

0:31:17.880 --> 0:31:20.640
<v Speaker 1>to store the information. So let's go with a totally

0:31:20.720 --> 0:31:24.520
<v Speaker 1>hypothetical example. Let's say you've got a scale factor band

0:31:25.320 --> 0:31:29.040
<v Speaker 1>and you've decided you're representing that scale factor band with

0:31:29.160 --> 0:31:33.160
<v Speaker 1>the quantity seven eight four zero, seven thousand eight hundred forty,

0:31:33.880 --> 0:31:37.200
<v Speaker 1>and you've chosen the number one hundred to quantize your data,

0:31:37.280 --> 0:31:41.719
<v Speaker 1>meaning that you will divide each, uh, scale factor band's

0:31:41.800 --> 0:31:45.880
<v Speaker 1>quantity by one hundred. So this is seven thousand, eight

0:31:45.960 --> 0:31:49.400
<v Speaker 1>hundred forty. You divide it by one hundred. Uh and

0:31:49.440 --> 0:31:52.680
<v Speaker 1>the scale factor for this particular band you have determined

0:31:52.840 --> 0:31:56.280
<v Speaker 1>is one point zero. That means that once you get

0:31:56.320 --> 0:31:59.840
<v Speaker 1>that result where you've divided the quantity by the quantizer,

0:32:00.080 --> 0:32:03.120
<v Speaker 1>you multiply by one. That means there's no change. Multiply

0:32:03.160 --> 0:32:05.440
<v Speaker 1>by one you get the same number. More on that

0:32:05.480 --> 0:32:07.960
<v Speaker 1>in a bit. Okay, So you take that seven thousand,

0:32:08.000 --> 0:32:11.000
<v Speaker 1>eight hundred forty, you divide it by one hundred. That gives

0:32:11.040 --> 0:32:14.000
<v Speaker 1>you seventy eight point four. Well, now you have to

0:32:14.080 --> 0:32:17.960
<v Speaker 1>round that number, so you round it down to seventy eight. Now,

0:32:17.960 --> 0:32:20.200
<v Speaker 1>when you have a decoder and you're ready to play

0:32:20.240 --> 0:32:23.960
<v Speaker 1>back the information, it comes across this quantity the seventy eight,

0:32:24.400 --> 0:32:28.200
<v Speaker 1>and it knows what the quantizer number was, so it

0:32:28.280 --> 0:32:31.080
<v Speaker 1>multiplies by one hundred to get back to seven thousand,

0:32:31.120 --> 0:32:35.280
<v Speaker 1>eight hundred. So the replicated number is actually forty off

0:32:35.560 --> 0:32:38.760
<v Speaker 1>from the original number. The original number, again, was seven thousand,

0:32:38.800 --> 0:32:43.200
<v Speaker 1>eight hundred forty, the replicated number is seven thousand, eight hundred. Now,

0:32:43.240 --> 0:32:48.680
<v Speaker 1>those inconsistencies manifest as noise in the actual playback. So

0:32:48.720 --> 0:32:51.400
<v Speaker 1>if you wanted to increase the precision of any given

0:32:51.440 --> 0:32:53.760
<v Speaker 1>scale factor band, you could do so by changing the

0:32:53.800 --> 0:32:56.800
<v Speaker 1>scale factor number. So in that example, just now, I

0:32:56.840 --> 0:32:59.160
<v Speaker 1>said the number was one point zero, meaning there's no

0:32:59.280 --> 0:33:02.680
<v Speaker 1>change to that result. But I could have said it

0:33:02.760 --> 0:33:05.840
<v Speaker 1>was ten, which means we would multiply the quantized number

0:33:05.840 --> 0:33:07.960
<v Speaker 1>by ten. So we would take that seven thousand, eight

0:33:08.040 --> 0:33:10.520
<v Speaker 1>hundred forty divided by one hundred you get seventy eight

0:33:10.520 --> 0:33:14.040
<v Speaker 1>point four, then multiply by ten to get seven hundred eighty four.

0:33:14.760 --> 0:33:18.600
<v Speaker 1>So when the decoder decompresses the file, it would reverse

0:33:18.720 --> 0:33:21.320
<v Speaker 1>this whole thing. It would multiply by a

0:33:21.400 --> 0:33:24.160
<v Speaker 1>hundred and divide by ten, um. You would end up getting seven thousand, eight hundred

0:33:24.160 --> 0:33:26.960
<v Speaker 1>forty again, which means that you wouldn't introduce any noise

0:33:27.160 --> 0:33:30.200
<v Speaker 1>to the file. You would have a perfect representation. But

0:33:30.320 --> 0:33:33.760
<v Speaker 1>in some cases, the encoder may determine that any noise

0:33:33.800 --> 0:33:37.440
<v Speaker 1>that you generate wouldn't be noticed or it wouldn't impact

0:33:37.440 --> 0:33:39.240
<v Speaker 1>the quality of the audio enough for it to be

0:33:39.240 --> 0:33:42.680
<v Speaker 1>a problem because of other factors for that particular scale

0:33:42.680 --> 0:33:45.440
<v Speaker 1>factor band, like maybe it's really quiet, or maybe it's

0:33:45.440 --> 0:33:48.800
<v Speaker 1>really complex. So in those cases, you could reduce the

0:33:48.840 --> 0:33:52.120
<v Speaker 1>scale factor number by making it something else like point

0:33:52.160 --> 0:33:54.920
<v Speaker 1>one instead of one point oh. So that means you

0:33:54.960 --> 0:33:58.520
<v Speaker 1>would multiply the quantized number by point one, So the

0:33:58.600 --> 0:34:01.760
<v Speaker 1>seventy eight point four would become seven point eight four,

0:34:01.880 --> 0:34:03.280
<v Speaker 1>and then you have to round it to get a

0:34:03.280 --> 0:34:06.440
<v Speaker 1>whole integer, so you get eight. Seven point eight four

0:34:06.520 --> 0:34:09.880
<v Speaker 1>rounds up to eight. Now, when a decoder decompresses

0:34:09.880 --> 0:34:14.000
<v Speaker 1>the audio, it multiplies eight by one hundred. That quantizer

0:34:14.040 --> 0:34:17.400
<v Speaker 1>that we've talked about so much, uh and uh, actually

0:34:17.440 --> 0:34:19.080
<v Speaker 1>at this point would have to be eight thousand because

0:34:19.080 --> 0:34:22.759
<v Speaker 1>it's also taking into account the scale factor, so it's

0:34:22.800 --> 0:34:26.879
<v Speaker 1>multiplying it by a thousand, not just a hundred. So

0:34:27.000 --> 0:34:29.480
<v Speaker 1>you would get a number that would pop up to

0:34:29.600 --> 0:34:32.520
<v Speaker 1>eight thousand. And remember the original was seven thousand, eight

0:34:32.560 --> 0:34:34.960
<v Speaker 1>hundred forty. So you look at the difference between these two,

0:34:35.000 --> 0:34:37.759
<v Speaker 1>the original seven thousand eight hundred forty, the new number is

0:34:37.840 --> 0:34:40.680
<v Speaker 1>eight thousand. There's a pretty big difference there. That change

0:34:40.760 --> 0:34:43.120
<v Speaker 1>might introduce enough noise for it to be a problem.
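
NOTE
The worked example above (7,840 with a quantizer of 100 and scale factors of 1, 10
and 0.1) can be replayed as a small sketch. The function names are hypothetical, and
a real MP3 encoder's quantizer is more involved than a plain divide and round. This
only mirrors the arithmetic from the episode.
```python
def encode(value: float, quantizer: float, scale: float) -> int:
    """Divide by the quantizer, apply the band's scale factor, round to a whole number."""
    return round(value / quantizer * scale)
def decode(stored: int, quantizer: float, scale: float) -> float:
    """Undo the steps in reverse: remove the scale factor, then multiply by the quantizer."""
    return stored / scale * quantizer
ORIGINAL = 7840
QUANTIZER = 100
results = {}
for scale in (1.0, 10.0, 0.1):
    stored = encode(ORIGINAL, QUANTIZER, scale)
    restored = decode(stored, QUANTIZER, scale)
    # Each entry holds (stored number, restored value, size of the error).
    results[scale] = (stored, round(restored), round(abs(restored - ORIGINAL)))
    print(scale, results[scale])
```
A bigger scale factor stores a bigger number and loses less; a smaller one stores a
smaller number and loses more, which is the trade the encoder is weighing.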

0:34:43.160 --> 0:34:45.440
<v Speaker 1>So how does the encoder determine if a scale factor

0:34:45.520 --> 0:34:48.120
<v Speaker 1>band is meeting the proper criteria? How can it tell

0:34:48.960 --> 0:34:53.120
<v Speaker 1>if there is ah too much noise or if the

0:34:53.160 --> 0:34:56.440
<v Speaker 1>noise falls below the threshold? Well, it goes through what

0:34:56.480 --> 0:35:00.400
<v Speaker 1>is called a Huffman coding process. At this point, Dylan

0:35:01.360 --> 0:35:05.000
<v Speaker 1>is currently just staring at the wall and drool is

0:35:05.040 --> 0:35:09.719
<v Speaker 1>coming out. Huffman coding process. It converts scale factor bands

0:35:09.719 --> 0:35:12.920
<v Speaker 1>into binary strings, and the process goes through a series

0:35:12.920 --> 0:35:15.120
<v Speaker 1>of tables to determine if the data within the scale

0:35:15.120 --> 0:35:18.320
<v Speaker 1>factor band requires more or less precision to describe the

0:35:18.360 --> 0:35:22.160
<v Speaker 1>sound without affecting the audio quality. So, Huffman coding is

0:35:22.160 --> 0:35:24.520
<v Speaker 1>a process where you start with a large number

0:35:24.520 --> 0:35:27.239
<v Speaker 1>of possibilities and you begin to narrow it down, uh.

0:35:27.320 --> 0:35:30.880
<v Speaker 1>Some people describe it as the coding equivalent of twenty questions.

0:35:31.560 --> 0:35:34.760
<v Speaker 1>So you ask your first question like animal, vegetable or mineral.

0:35:35.040 --> 0:35:38.200
<v Speaker 1>You get an answer, say animal. Well, that first answer

0:35:38.280 --> 0:35:42.200
<v Speaker 1>eliminates a ton of other possibilities and narrows the focus

0:35:42.239 --> 0:35:45.279
<v Speaker 1>like anything that doesn't pertain to animal, you can automatically

0:35:45.320 --> 0:35:49.440
<v Speaker 1>discount because you already know it can't apply to that answer.
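
NOTE
Here is a minimal sketch of Huffman coding in general, where each bit in a code works
like one yes or no question. It builds its own code table from the data, which is an
assumption made for illustration; the episode only says the encoder works through a
series of tables. The input numbers are made up.
```python
import heapq
from collections import Counter
def huffman_codes(symbols):
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    counts = Counter(symbols)
    # Heap entries: (count, tie_breaker, {symbol: code_so_far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Merge the two rarest groups; one more leading bit tells them apart.
        count_a, _, codes_a = heapq.heappop(heap)
        count_b, _, codes_b = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in codes_a.items()}
        merged.update({sym: "1" + code for sym, code in codes_b.items()})
        heapq.heappush(heap, (count_a + count_b, next_id, merged))
        next_id += 1
    return heap[0][2]
quantized = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 8]  # made-up quantized values
codes = huffman_codes(quantized)
encoded = "".join(codes[v] for v in quantized)
print(codes, len(encoded))
```
With four distinct values, a fixed two bits each would take 24 bits for this list;
giving the most common value a one bit code brings it down to 21.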

0:35:51.080 --> 0:35:53.840
<v Speaker 1>With MP three compression, this means making certain the number

0:35:53.920 --> 0:35:57.840
<v Speaker 1>of bits representing a granule because remember I mentioned that

0:35:58.480 --> 0:36:01.919
<v Speaker 1>in MP three formats you have frames, and each frame.

0:36:02.280 --> 0:36:05.200
<v Speaker 1>Each frame has a thousand one hundred fifty two samples

0:36:05.239 --> 0:36:09.200
<v Speaker 1>and consists of two granules with five seventy six samples each. So

0:36:09.440 --> 0:36:11.640
<v Speaker 1>when you answer the first question, it eliminates a lot

0:36:11.680 --> 0:36:16.000
<v Speaker 1>of other possibilities and narrows the focus. So like with animal, vegetable, mineral,

0:36:16.000 --> 0:36:19.080
<v Speaker 1>if I say animal, you're gonna not ask any questions

0:36:19.320 --> 0:36:22.520
<v Speaker 1>that have to do with minerals or vegetables only because

0:36:22.520 --> 0:36:25.520
<v Speaker 1>it wouldn't make sense. You know, those aren't gonna apply.

0:36:25.760 --> 0:36:28.120
<v Speaker 1>Same thing with m P three's except this time it

0:36:28.120 --> 0:36:30.920
<v Speaker 1>means making certain the number of bits representing a granule.

0:36:31.080 --> 0:36:36.239
<v Speaker 1>Remember, there are two granules per frame with the MP three layer. Uh,

0:36:36.360 --> 0:36:39.120
<v Speaker 1>you want to make sure that the number of bits

0:36:39.160 --> 0:36:42.839
<v Speaker 1>representing that granule matches the chosen bit rate for compression.

0:36:43.200 --> 0:36:45.600
<v Speaker 1>So if after going through this process, the encoder says, hey,

0:36:45.600 --> 0:36:48.719
<v Speaker 1>this granule has more bits than what's allowed. It's too

0:36:48.800 --> 0:36:51.640
<v Speaker 1>many bits. We gotta get rid of some of these,

0:36:51.800 --> 0:36:54.160
<v Speaker 1>the encoder can adjust the scale factor band so that

0:36:54.200 --> 0:36:58.560
<v Speaker 1>there's less precision, meaning that multiplier, in other words, that

0:36:59.040 --> 0:37:02.440
<v Speaker 1>I talked about earlier, and thus reduce the amount

0:37:02.440 --> 0:37:07.080
<v Speaker 1>of data needed to represent that particular granule. If a

0:37:07.120 --> 0:37:11.080
<v Speaker 1>granule comes in under the bit rate, the encoder can

0:37:11.120 --> 0:37:15.279
<v Speaker 1>increase the precision to reduce noise and fill that granule

0:37:15.400 --> 0:37:22.000
<v Speaker 1>out properly so it matches the actual threshold. After all this,

0:37:22.120 --> 0:37:25.320
<v Speaker 1>the pairs of granules become frames within the MP three files.
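
NOTE
The fit-the-budget adjustment described above can be sketched as a simple loop. This
is a toy under loud assumptions: it uses one step size for the whole granule and not
per-band scale factors, and it estimates size by counting the bits in each rounded
value, with no real Huffman tables. The values, the 40 bit budget and the function
names are invented for illustration.
```python
def bits_needed(values, step):
    """Rough stand-in for the coded size: bits to write each rounded value."""
    return sum(abs(round(v / step)).bit_length() + 1 for v in values)
def fit_to_budget(values, budget_bits, step=1.0):
    """Coarsen the step size until the granule fits within its bit budget."""
    while bits_needed(values, step) > budget_bits:
        step *= 2  # less precision, fewer bits, more noise
    return step
granule = [7840, 1200, -310, 45, 12, -3, 0, 0]  # made-up spectral values
step = fit_to_budget(granule, budget_bits=40)
print(step, bits_needed(granule, step))
```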

0:37:25.320 --> 0:37:27.839
<v Speaker 1>And the only other component in an MP three file

0:37:27.960 --> 0:37:31.399
<v Speaker 1>apart from these frames is the I D three metadata.

0:37:31.719 --> 0:37:33.759
<v Speaker 1>This is pretty simple. This is like a header, and

0:37:33.800 --> 0:37:36.040
<v Speaker 1>it comes before all the frames in the audio file

0:37:36.120 --> 0:37:39.920
<v Speaker 1>and contains information about about the file itself, which can

0:37:39.960 --> 0:37:42.680
<v Speaker 1>include stuff like the title of a song, an artist name,

0:37:42.800 --> 0:37:46.600
<v Speaker 1>an album title, other stuff like that. It can also

0:37:46.640 --> 0:37:50.080
<v Speaker 1>include copyright information as well as information about the file itself,

0:37:50.120 --> 0:37:52.279
<v Speaker 1>such as whether or not it's a stereo recording or

0:37:52.320 --> 0:37:56.080
<v Speaker 1>a mono recording. So when you use a decoder like

0:37:56.120 --> 0:38:00.480
<v Speaker 1>an MP three player, it takes this compressed information. These

0:38:01.320 --> 0:38:06.560
<v Speaker 1>representations that the music has been reduced to,

0:38:07.840 --> 0:38:11.480
<v Speaker 1>and it converts that Huffman data back into the quantized format,

0:38:12.040 --> 0:38:14.719
<v Speaker 1>scales the data back up to its original size or

0:38:14.760 --> 0:38:20.560
<v Speaker 1>close approximation. Remember, the uncompressed version may actually be

0:38:20.680 --> 0:38:25.240
<v Speaker 1>off by a significant amount depending upon each individual granule.

0:38:25.800 --> 0:38:28.040
<v Speaker 1>And all of that data gets recombined into a new

0:38:28.120 --> 0:38:30.319
<v Speaker 1>PCM sample that can be played back to you.

0:38:31.000 --> 0:38:34.080
<v Speaker 1>And that's all there is to it. Nothing could be easier.
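
NOTE
Editor's sketch, not from the episode: a toy look at the "scale the data back
up to a close approximation" step. It assumes a plain linear step size, which
is a simplification; real MP3 decoding uses a more involved rule. All names
here are hypothetical and chosen only for illustration.
```python
# Toy decoder step: scale quantized integers back up and measure what was lost.
# A plain linear step size is an assumption; real MP3 decoding is more involved.
def quantize(values, step):
    """Encoder side: round each value to the nearest multiple of `step`."""
    return [round(v / step) for v in values]
#
def dequantize(quantized, step):
    """Decoder side: multiply each integer by the step size."""
    return [q * step for q in quantized]
#
def max_error(original, restored):
    """Largest absolute difference between original and restored values."""
    return max(abs(a - b) for a, b in zip(original, restored))
#
if __name__ == "__main__":
    original = [12.7, -3.2, 40.1, 0.4, -18.9, 7.5]
    for step in (0.5, 2.0, 8.0):
        restored = dequantize(quantize(original, step), step)
        print(step, restored, round(max_error(original, restored), 3))
```
With rounding to the nearest step, the error per value stays within half a
step, so a coarser step (fewer bits) means a restored signal that can sit
further from the original.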

0:38:35.280 --> 0:38:38.880
<v Speaker 1>All right, that took a lot out of me, so

0:38:38.920 --> 0:38:41.280
<v Speaker 1>I got really technical, and I apologize if I lost

0:38:41.320 --> 0:38:43.560
<v Speaker 1>any of you out there, or for those of you

0:38:43.560 --> 0:38:46.080
<v Speaker 1>who have a lot of experience working on compression algorithms,

0:38:46.120 --> 0:38:50.000
<v Speaker 1>for oversimplifying in several cases. But now we've got a

0:38:50.000 --> 0:38:52.480
<v Speaker 1>full episode about this, and I hope you have a

0:38:52.480 --> 0:38:55.600
<v Speaker 1>better understanding of how a big sound file can be

0:38:55.640 --> 0:38:59.799
<v Speaker 1>reduced to a smaller sound file. Next time, I'll just

0:38:59.800 --> 0:39:04.359
<v Speaker 1>say magic. It will make everyone happier. But I hope

0:39:04.360 --> 0:39:06.920
<v Speaker 1>you guys appreciated this. In the next episode in this

0:39:07.000 --> 0:39:09.160
<v Speaker 1>series it will be far less technical. I'm going to

0:39:09.239 --> 0:39:12.839
<v Speaker 1>be more historical. I'm going to talk about the progression

0:39:13.040 --> 0:39:16.279
<v Speaker 1>of the MP three player, how it came about, how

0:39:16.280 --> 0:39:19.000
<v Speaker 1>it evolved, and how the iPod ended up becoming the

0:39:19.120 --> 0:39:24.600
<v Speaker 1>dominant brand in a sea of MP three players, and

0:39:24.600 --> 0:39:27.520
<v Speaker 1>then maybe kind of explore where MP three players are today,

0:39:28.480 --> 0:39:30.600
<v Speaker 1>like how many are there, how big is the market?

0:39:30.960 --> 0:39:33.360
<v Speaker 1>Are people still buying them? That kind of question.

0:39:35.000 --> 0:39:37.280
<v Speaker 1>If you guys have any questions for me, or comments

0:39:37.400 --> 0:39:40.799
<v Speaker 1>or suggestions anything like that, send me a message. My

0:39:40.920 --> 0:39:44.400
<v Speaker 1>email is tech Stuff at how stuff works dot com,

0:39:44.520 --> 0:39:46.680
<v Speaker 1>or you can drop me a line on Facebook or Twitter,

0:39:46.920 --> 0:39:49.279
<v Speaker 1>the handle at both of those is tech stuff H

0:39:49.480 --> 0:39:53.000
<v Speaker 1>S W, and I'll talk to you guys again really

0:39:53.080 --> 0:40:00.960
<v Speaker 1>soon. For more on this and thousands of other topics,

0:40:01.200 --> 0:40:11.920
<v Speaker 1>visit how stuff works dot com.