1
00:00:04,160 --> 00:00:07,160
Speaker 1: Get in touch with technology with TechStuff from How

2
00:00:07,240 --> 00:00:14,040
Speaker 1: StuffWorks dot com. Hey there, and welcome to TechStuff.

3
00:00:14,080 --> 00:00:17,520
Speaker 1: I'm your host, Jonathan Strickland. And in a recent episode

4
00:00:17,560 --> 00:00:20,560
Speaker 1: I explored how digital audio works and gave kind of

5
00:00:20,560 --> 00:00:24,639
Speaker 1: a brief history on the MP three file format. I

6
00:00:24,760 --> 00:00:27,680
Speaker 1: warned you back then that that was part one of

7
00:00:27,720 --> 00:00:30,760
Speaker 1: a three part series, and today we're gonna explore part two.

8
00:00:31,440 --> 00:00:34,599
Speaker 1: So I hadn't forgotten about it. We're back to it, uh,

9
00:00:34,640 --> 00:00:36,440
Speaker 1: And today we're gonna do a deeper dive with MP

10
00:00:36,479 --> 00:00:39,959
Speaker 1: threes: how do they compress audio? And how

11
00:00:39,960 --> 00:00:42,239
Speaker 1: can you take a file filled with information and make

12
00:00:42,280 --> 00:00:44,920
Speaker 1: it a smaller size? What do you have to give

13
00:00:45,080 --> 00:00:48,159
Speaker 1: up in order to make files smaller? And today we're

14
00:00:48,159 --> 00:00:51,280
Speaker 1: gonna try and unravel the technical mystery behind the MP

15
00:00:51,400 --> 00:00:54,760
Speaker 1: three. And I am not going to lie to you people.

16
00:00:55,720 --> 00:01:01,240
Speaker 1: This is gonna get a bit, you know, mathy.

17
00:01:01,440 --> 00:01:04,759
Speaker 1: And I was an English major. So you mathematicians out there,

18
00:01:04,760 --> 00:01:07,400
Speaker 1: get ready with your corrections because I'm probably gonna make

19
00:01:07,440 --> 00:01:10,760
Speaker 1: some over generalizations for the purposes of my own sanity.

20
00:01:11,280 --> 00:01:14,160
Speaker 1: There does get to a point where to really get

21
00:01:14,200 --> 00:01:19,000
Speaker 1: into the technical details, it would likely be uh impossible

22
00:01:19,040 --> 00:01:21,080
Speaker 1: for me to describe it in a way that would

23
00:01:21,080 --> 00:01:25,880
Speaker 1: make sense and be accurate. Um, and I have given

24
00:01:26,120 --> 00:01:30,399
Speaker 1: my producer Dylan the mandate that, should I get too

25
00:01:31,120 --> 00:01:36,200
Speaker 1: cryptic and incomprehensible with my explanation, that he is to

26
00:01:36,240 --> 00:01:40,200
Speaker 1: intervene in a way that he sees fit. Just not

27
00:01:40,240 --> 00:01:44,120
Speaker 1: in the face, Dylan. Just not in the face. It's the moneymaker, man.

28
00:01:44,240 --> 00:01:47,120
Speaker 1: I gotta gotta take care of it. So let's remember

29
00:01:47,160 --> 00:01:52,320
Speaker 1: that the heart of digital information is the bit that's

30
00:01:52,320 --> 00:01:56,440
Speaker 1: either a zero or a one. The basic unit of

31
00:01:56,960 --> 00:02:01,920
Speaker 1: information for digital formats: zeros and ones. Now we can

32
00:02:02,000 --> 00:02:05,160
Speaker 1: use those zeros and ones to describe all sorts of information,

33
00:02:05,800 --> 00:02:09,280
Speaker 1: from text to audio, to video and really pretty much

34
00:02:09,280 --> 00:02:12,240
Speaker 1: anything you can think of that's represented digitally. Ultimately, when

35
00:02:12,240 --> 00:02:14,000
Speaker 1: you get down to it, it's a bunch of zeros

36
00:02:14,040 --> 00:02:17,000
Speaker 1: and ones. So let's say you start off with your

37
00:02:17,080 --> 00:02:21,520
Speaker 1: uncompressed audio file. You've got this enormous audio file in

38
00:02:21,520 --> 00:02:23,560
Speaker 1: front of you. It's made up of zeros and ones.

39
00:02:24,080 --> 00:02:26,840
Speaker 1: How do you make that file smaller? So in the

40
00:02:26,840 --> 00:02:29,560
Speaker 1: real world, we can compress stuff, right, we can apply

41
00:02:29,800 --> 00:02:33,760
Speaker 1: physical pressure to things. Think about packing a suitcase. You

42
00:02:33,760 --> 00:02:36,240
Speaker 1: can make sure you get that extra outfit in if

43
00:02:36,280 --> 00:02:38,600
Speaker 1: you just press it down hard enough and get that

44
00:02:38,680 --> 00:02:42,240
Speaker 1: zipper zipped before it can burst open. But once you

45
00:02:42,280 --> 00:02:44,920
Speaker 1: get to a certain level of compression, you cannot make

46
00:02:45,080 --> 00:02:48,600
Speaker 1: things smaller, at least not without hurting yourself or whatever

47
00:02:48,639 --> 00:02:51,720
Speaker 1: it is you're trying to compress. Digital files are a

48
00:02:51,720 --> 00:02:55,400
Speaker 1: little different because you cannot physically cram the zeros and

49
00:02:55,520 --> 00:02:58,120
Speaker 1: ones closer together. It doesn't work like that. These are

50
00:02:58,240 --> 00:03:02,600
Speaker 1: abstract things. You can't make them smaller, right. You can't

51
00:03:02,720 --> 00:03:06,000
Speaker 1: decrease the font. It doesn't work that way. The numbers

52
00:03:06,040 --> 00:03:09,240
Speaker 1: represent two different states. So if you want to create

53
00:03:09,240 --> 00:03:12,840
Speaker 1: a smaller audio file containing the recording that was in

54
00:03:12,880 --> 00:03:17,680
Speaker 1: a larger audio file, you have to start getting creative. Now,

55
00:03:17,720 --> 00:03:20,120
Speaker 1: in the last part of this series, I talked about

56
00:03:20,160 --> 00:03:22,920
Speaker 1: how the MP three compression algorithm was born from an

57
00:03:22,960 --> 00:03:26,600
Speaker 1: applied research institution in Germany and the team behind the

58
00:03:26,720 --> 00:03:29,040
Speaker 1: MP three wanted to find a way to compress audio,

59
00:03:29,160 --> 00:03:34,800
Speaker 1: specifically music for transmission over phone lines. Eventually, this evolved

60
00:03:34,840 --> 00:03:39,480
Speaker 1: into the Moving Picture Experts Group Audio Layer three compression methodology,

61
00:03:39,680 --> 00:03:44,560
Speaker 1: better known as the MP three, and there's also MPEG

62
00:03:44,640 --> 00:03:47,360
Speaker 1: two and MPEG four standards. MPEG two, by the way,

63
00:03:47,400 --> 00:03:50,320
Speaker 1: is the basis of compression on DVDs, although the actual

64
00:03:50,440 --> 00:03:54,720
Speaker 1: DVD format is really a modification of MPEG two, and

65
00:03:54,840 --> 00:03:57,360
Speaker 1: MPEG four is a compression strategy for audio and video

66
00:03:57,400 --> 00:04:00,840
Speaker 1: that's frequently used in lots of different capacities, including

67
00:04:00,880 --> 00:04:05,160
Speaker 1: streaming media services. So by the late nineteen seventies, researchers

68
00:04:05,200 --> 00:04:08,720
Speaker 1: began to explore the possibility of leveraging psychoacoustics to

69
00:04:08,760 --> 00:04:12,960
Speaker 1: figure out how to compress audio. And psychoacoustics refers to

70
00:04:13,200 --> 00:04:17,120
Speaker 1: the way we perceive sound, uh, and also the

71
00:04:17,120 --> 00:04:21,360
Speaker 1: physiological effects of sound on us. So this involves not

72
00:04:21,480 --> 00:04:24,640
Speaker 1: just our physical sense of hearing, but also our

73
00:04:24,680 --> 00:04:28,400
Speaker 1: brains and the way our brains interpret sound. So, for example,

74
00:04:28,720 --> 00:04:32,480
Speaker 1: there's a psychoacoustic phenomenon that's called the Haas effect,

75
00:04:32,640 --> 00:04:35,560
Speaker 1: H-A-A-S. And I think it's pretty interesting. So

76
00:04:35,760 --> 00:04:38,200
Speaker 1: here's how the Haas effect works. If you hear the

77
00:04:38,279 --> 00:04:43,280
Speaker 1: exact same sound coming from different directions, but the two

78
00:04:43,279 --> 00:04:46,640
Speaker 1: sounds arrive within thirty to forty milliseconds of each other,

79
00:04:47,040 --> 00:04:50,000
Speaker 1: your brain will be convinced that you really only heard

80
00:04:50,040 --> 00:04:53,440
Speaker 1: one sound and it came from the direction that hit

81
00:04:53,520 --> 00:04:57,200
Speaker 1: you first. So let's say a sound's coming from directly

82
00:04:57,240 --> 00:04:59,680
Speaker 1: in front of you and to your left, and you

83
00:05:00,080 --> 00:05:03,480
Speaker 1: get both of them within that thirty to forty millisecond range,

84
00:05:04,279 --> 00:05:06,440
Speaker 1: and you hear the one coming from ahead of you

85
00:05:06,520 --> 00:05:10,039
Speaker 1: first, you're convinced that you only heard that

86
00:05:10,120 --> 00:05:13,080
Speaker 1: sound once and it came from dead on straight ahead

87
00:05:13,080 --> 00:05:16,680
Speaker 1: of you. Your brain kind of discounts the one that

88
00:05:16,760 --> 00:05:20,159
Speaker 1: came off from the left, although it can reinforce it,

89
00:05:20,279 --> 00:05:22,520
Speaker 1: which ends up being really useful if you're planning out

90
00:05:22,520 --> 00:05:25,279
Speaker 1: PA systems for stage shows. I'm not joking. That

91
00:05:25,320 --> 00:05:28,080
Speaker 1: really is the way that people plan those things out.

92
00:05:28,360 --> 00:05:31,080
Speaker 1: It's pretty neat. Humans perceive sounds in a way that's

93
00:05:31,080 --> 00:05:35,200
Speaker 1: not necessarily representational of all the sounds surrounding us. You

94
00:05:35,200 --> 00:05:38,600
Speaker 1: can think of your brain as the filter between your

95
00:05:38,720 --> 00:05:42,679
Speaker 1: understanding and what reality actually is. A lot of stuff

96
00:05:42,720 --> 00:05:45,599
Speaker 1: goes on that it ends up getting rid of information

97
00:05:45,640 --> 00:05:48,040
Speaker 1: that your brain just says, you know what, he or

98
00:05:48,080 --> 00:05:52,040
Speaker 1: she doesn't need that, it's just gonna confuse things. We're

99
00:05:52,040 --> 00:05:55,400
Speaker 1: gonna dump it. And that's kind of how it works.

100
00:05:55,440 --> 00:05:57,599
Speaker 1: It's all on an unconscious level. It's not like you're

101
00:05:57,800 --> 00:06:01,919
Speaker 1: actively working to do this. So let's say you're in

102
00:06:01,920 --> 00:06:04,320
Speaker 1: a relatively busy hallway, and there could be a lot

103
00:06:04,360 --> 00:06:07,800
Speaker 1: of sounds in that hallway, stuff that's going on constantly

104
00:06:07,800 --> 00:06:11,000
Speaker 1: around you. Maybe there are doors opening and closing. Maybe

105
00:06:11,040 --> 00:06:13,960
Speaker 1: there are footsteps going up and down the hallway. Maybe someone's

106
00:06:14,080 --> 00:06:17,719
Speaker 1: shoes are squeaking against the linoleum floor. People are chattering

107
00:06:17,760 --> 00:06:20,839
Speaker 1: away in there. But you are having a conversation with someone,

108
00:06:21,240 --> 00:06:23,960
Speaker 1: so you turn your focus on that person and other

109
00:06:24,040 --> 00:06:28,200
Speaker 1: sounds seemingly fade away. They're still present, but they're not important.

110
00:06:28,800 --> 00:06:31,520
Speaker 1: So in this example, you would actually call those other

111
00:06:31,600 --> 00:06:35,479
Speaker 1: sounds a distraction, and you would really focus on the conversation. Uh,

112
00:06:35,520 --> 00:06:40,000
Speaker 1: That also shows how we're able to consciously direct our

113
00:06:40,080 --> 00:06:43,719
Speaker 1: sense, our perception, of hearing. So both of these factors

114
00:06:43,760 --> 00:06:47,120
Speaker 1: come into play. Now, one thing that MP three encoding

115
00:06:47,160 --> 00:06:51,080
Speaker 1: takes advantage of is something called masking, and there are

116
00:06:51,080 --> 00:06:54,120
Speaker 1: a couple of different variations of the masking effect. One

117
00:06:54,160 --> 00:06:57,520
Speaker 1: of them is called frequency masking. So let's say you've

118
00:06:57,560 --> 00:07:00,480
Speaker 1: got two sound frequencies that are similar. Perhaps they're just

119
00:07:00,520 --> 00:07:04,200
Speaker 1: a few hertz apart. Remember, frequencies are measured in hertz,

120
00:07:04,680 --> 00:07:08,520
Speaker 1: which is really the number of oscillations per second. So

121
00:07:08,640 --> 00:07:14,040
Speaker 1: let's say you've got a sound that's at I don't know, uh,

122
00:07:14,360 --> 00:07:19,360
Speaker 1: one thousand hertz, and another one that's at one

123
00:07:19,480 --> 00:07:23,560
Speaker 1: thousand and ten hertz. Now, the human ear is

124
00:07:23,600 --> 00:07:26,920
Speaker 1: precise enough to be able to tell the difference of

125
00:07:27,040 --> 00:07:29,840
Speaker 1: two sounds that are at least two hertz apart from

126
00:07:29,840 --> 00:07:33,360
Speaker 1: each other. That's how precise our resolution of hearing is.

127
00:07:33,440 --> 00:07:36,760
Speaker 1: It's at that level. But if you get two sounds

128
00:07:36,840 --> 00:07:40,520
Speaker 1: played at the same time and they are that close

129
00:07:40,560 --> 00:07:44,080
Speaker 1: together in frequency, and one of those frequencies is played

130
00:07:44,120 --> 00:07:47,280
Speaker 1: at a greater volume than the other, our brains will

131
00:07:47,280 --> 00:07:50,160
Speaker 1: pick up on the louder sound and ignore the quieter sound,

132
00:07:50,240 --> 00:07:53,880
Speaker 1: even though both of them are present. What becomes important

133
00:07:53,880 --> 00:07:56,520
Speaker 1: at that point is the amplitude. Now, the further apart

134
00:07:56,560 --> 00:08:00,400
Speaker 1: in frequencies you get, the less that has an effect.

135
00:08:00,480 --> 00:08:02,360
Speaker 1: So if you get far enough apart where they are

136
00:08:02,360 --> 00:08:05,680
Speaker 1: two pitches, one of them noticeably louder than the other,

137
00:08:06,040 --> 00:08:08,320
Speaker 1: but they're far enough apart, you will hear both of them.
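
The frequency-masking idea described here can be sketched as a toy rule. This is only a minimal illustration, not the psychoacoustic model a real MP three encoder uses: the function name `is_masked` and both thresholds (`max_gap_hz`, `min_level_gap_db`) are made-up assumptions chosen for the example, and real masking curves vary across the hearing range.

```python
def is_masked(quiet_hz, quiet_db, loud_hz, loud_db,
              max_gap_hz=50.0, min_level_gap_db=10.0):
    """Toy rule: the quieter tone counts as masked only when it is
    close in frequency to the louder tone AND sufficiently quieter.

    Both thresholds are illustrative assumptions, not measured values.
    """
    close_in_pitch = abs(quiet_hz - loud_hz) <= max_gap_hz
    much_quieter = (loud_db - quiet_db) >= min_level_gap_db
    return close_in_pitch and much_quieter


# Two tones ten hertz apart, one much louder: the quiet one is masked.
near_and_quiet = is_masked(1010, 40, 1000, 70)

# Same loudness gap, but the tones are far apart: both are heard.
far_apart = is_masked(4000, 40, 1000, 70)
```

Under these assumed thresholds, `near_and_quiet` is `True` and `far_apart` is `False`, which mirrors the point that masking only works when the two pitches are relatively close together.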

138
00:08:08,360 --> 00:08:11,560
Speaker 1: It only works if the two pitches are relatively close together,

139
00:08:12,680 --> 00:08:15,560
Speaker 1: and there's not a universal formula for frequency masking. As

140
00:08:15,560 --> 00:08:18,520
Speaker 1: you get closer to the boundaries of human hearing, frequency

141
00:08:18,560 --> 00:08:20,920
Speaker 1: masking becomes easier. So if it's a really low pitch

142
00:08:21,000 --> 00:08:23,600
Speaker 1: or a really high pitch, it's easier to get away

143
00:08:23,600 --> 00:08:26,400
Speaker 1: with it. Once you start getting into what is

144
00:08:26,400 --> 00:08:28,960
Speaker 1: thought of as the sweet spot for human hearing, which

145
00:08:29,000 --> 00:08:32,120
Speaker 1: is generally considered to be between two and five kilohertz,

146
00:08:33,200 --> 00:08:37,200
Speaker 1: you need a greater difference in volume or a smaller

147
00:08:37,240 --> 00:08:41,640
Speaker 1: difference in frequency in order for masking to work. Frequency

148
00:08:41,720 --> 00:08:45,480
Speaker 1: masking at any rate. But then there's also temporal masking,

149
00:08:46,600 --> 00:08:48,880
Speaker 1: and you might say, okay, I got it. Temporal that

150
00:08:48,920 --> 00:08:53,040
Speaker 1: means time. Indeed it does, my friend. This describes the

151
00:08:53,040 --> 00:08:56,040
Speaker 1: effect of a short but loud sound masking a softer

152
00:08:56,120 --> 00:09:00,360
Speaker 1: sound for a short time. Weird thing is the loud

153
00:09:00,360 --> 00:09:03,960
Speaker 1: sound can actually mask sounds that precede it slightly, not

154
00:09:04,040 --> 00:09:06,760
Speaker 1: by a whole lot, but a little bit. MP three

155
00:09:06,760 --> 00:09:10,880
Speaker 1: compression takes advantage of both frequency and temporal masking when

156
00:09:10,880 --> 00:09:14,079
Speaker 1: it's trying to determine which data needs to be included

157
00:09:14,160 --> 00:09:16,920
Speaker 1: and which data can be dumped, because it won't affect

158
00:09:16,960 --> 00:09:19,840
Speaker 1: your perception of whatever the audio file is in

159
00:09:19,840 --> 00:09:23,720
Speaker 1: the first place. So you also probably remember I talked

160
00:09:23,720 --> 00:09:26,560
Speaker 1: about the physical limitation to what we humans can hear,

161
00:09:26,800 --> 00:09:28,920
Speaker 1: no matter what our brains might be up to, so

162
00:09:29,000 --> 00:09:31,400
Speaker 1: that this doesn't have to do with our brains, you know,

163
00:09:31,480 --> 00:09:34,240
Speaker 1: filtering through the information that's coming in. This has to

164
00:09:34,280 --> 00:09:38,200
Speaker 1: do with the physical limitations of the human ear. In

165
00:09:38,240 --> 00:09:41,199
Speaker 1: the last episode of the series, I said typical human hearing,

166
00:09:41,840 --> 00:09:45,559
Speaker 1: keep in mind typical, there are exceptions, uh, covers the

167
00:09:45,679 --> 00:09:48,560
Speaker 1: range of frequencies between about twenty hertz and twenty kilohertz,

168
00:09:48,640 --> 00:09:52,000
Speaker 1: or twenty thousand hertz. So twenty to twenty thousand.

169
00:09:52,800 --> 00:09:57,280
Speaker 1: Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right?

170
00:09:58,080 --> 00:10:00,640
Speaker 1: And as you get older, your ability to perceive those

171
00:10:00,720 --> 00:10:05,040
Speaker 1: higher frequencies starts to diminish. So most adults actually have

172
00:10:05,320 --> 00:10:10,880
Speaker 1: an upper range closer to sixteen kilohertz, not twenty. Uh,

173
00:10:11,080 --> 00:10:13,480
Speaker 1: kids, they can hear those higher pitches. You may have

174
00:10:13,600 --> 00:10:17,920
Speaker 1: heard the story about how some convenience stores experimented with

175
00:10:18,160 --> 00:10:23,600
Speaker 1: getting rid of teenage loiterers by, uh, projecting out

176
00:10:24,000 --> 00:10:27,280
Speaker 1: the super high pitches that adults could not hear

177
00:10:27,640 --> 00:10:30,600
Speaker 1: but kids could, and it discouraged kids from hanging out

178
00:10:30,640 --> 00:10:35,080
Speaker 1: at the convenience store and loitering. Um, I love that

179
00:10:35,200 --> 00:10:39,600
Speaker 1: idea so much. Anyway, that's because I'm old and my

180
00:10:39,640 --> 00:10:43,520
Speaker 1: hearing is terrible. Well, remember I also mentioned you can

181
00:10:43,559 --> 00:10:46,400
Speaker 1: detect changes in pitch at two hertz increments. If you

182
00:10:46,440 --> 00:10:48,960
Speaker 1: get below two hertz in change, like if it's just

183
00:10:49,040 --> 00:10:54,600
Speaker 1: a one hertz difference between two frequencies, it's too low

184
00:10:54,640 --> 00:10:56,800
Speaker 1: a resolution for us to detect. To us, it will

185
00:10:56,800 --> 00:11:01,040
Speaker 1: sound exactly the same. So if you were to hear

186
00:11:01,520 --> 00:11:06,800
Speaker 1: a frequency at one thousand one hertz, or one point

187
00:11:07,000 --> 00:11:10,800
Speaker 1: zero zero one kilohertz, and one point zero zero

188
00:11:10,840 --> 00:11:13,800
Speaker 1: two kilohertz, you wouldn't notice the difference. They would

189
00:11:13,840 --> 00:11:16,960
Speaker 1: sound exactly the same to you. So if you're gonna

190
00:11:17,000 --> 00:11:19,240
Speaker 1: take audio and compress it, one step you could consider

191
00:11:19,360 --> 00:11:23,960
Speaker 1: is eliminating anything that's outside the actual range of frequencies

192
00:11:24,040 --> 00:11:27,560
Speaker 1: that we can hear, or simplifying any changes in frequency

193
00:11:27,640 --> 00:11:31,240
Speaker 1: that are smaller than two hertz. If you take

194
00:11:31,240 --> 00:11:34,760
Speaker 1: all that data and you say it is physically impossible

195
00:11:34,800 --> 00:11:38,439
Speaker 1: for a human to perceive this, get rid of that information,

196
00:11:38,559 --> 00:11:41,800
Speaker 1: then in theory it wouldn't have any effect on the

197
00:11:41,840 --> 00:11:46,120
Speaker 1: rest of the recording. But how do you go further than that? Right,

198
00:11:46,200 --> 00:11:48,959
Speaker 1: how do you create a method so that you can

199
00:11:49,000 --> 00:11:51,120
Speaker 1: really compress this file? You want a method that will

200
00:11:51,120 --> 00:11:54,439
Speaker 1: preserve the important sounds while potentially ignoring all the unimportant

201
00:11:54,520 --> 00:11:58,320
Speaker 1: or incidental sounds. And you want it to be automatic, because

202
00:11:58,760 --> 00:12:01,440
Speaker 1: if you have to do it manually, then that's going

203
00:12:01,520 --> 00:12:05,640
Speaker 1: to take countless hours just to edit a single sound file.

204
00:12:06,760 --> 00:12:10,959
Speaker 1: So that was the challenge that the MP three research

205
00:12:11,040 --> 00:12:16,040
Speaker 1: team faced as a group. Now, their solution, which ultimately

206
00:12:16,080 --> 00:12:18,559
Speaker 1: created even more challenges, was to come up with what

207
00:12:18,640 --> 00:12:22,480
Speaker 1: was essentially a simulated human ear and brain. They needed

208
00:12:22,520 --> 00:12:27,880
Speaker 1: to replicate the experience of perceiving music so that an

209
00:12:27,880 --> 00:12:32,160
Speaker 1: algorithm could evaluate every sound in an audio file and

210
00:12:32,280 --> 00:12:35,359
Speaker 1: judge if it in fact was relevant enough to include

211
00:12:35,400 --> 00:12:39,720
Speaker 1: in the final compressed version. If a sound were imperceptible,

212
00:12:39,760 --> 00:12:41,600
Speaker 1: then it wouldn't make sense to include it in the

213
00:12:41,720 --> 00:12:44,720
Speaker 1: MP three file. So by leaving out all the irrelevant data,

214
00:12:44,760 --> 00:12:48,680
Speaker 1: they can make the audio information take up less bandwidth.

215
00:12:48,679 --> 00:12:51,240
Speaker 1: The file itself would be smaller because you just dumped

216
00:12:51,280 --> 00:12:54,880
Speaker 1: everything that wasn't important. So the team used an algorithm

217
00:12:55,000 --> 00:13:00,000
Speaker 1: called the low complexity adaptive transform coding, or LC-ATC,

218
00:13:00,160 --> 00:13:03,080
Speaker 1: as the foundation for their research. This

219
00:13:03,160 --> 00:13:06,440
Speaker 1: was kind of their starting point, and this is an

220
00:13:06,480 --> 00:13:10,120
Speaker 1: approach that tries to do away with redundancy as much

221
00:13:10,160 --> 00:13:15,199
Speaker 1: as possible. And it also incorporates adaptation to perceptual requirements. Also,

222
00:13:15,320 --> 00:13:19,199
Speaker 1: MP threes owe a lot to the MPEG Layer two standard.

223
00:13:19,760 --> 00:13:23,199
Speaker 1: So the layer two obviously came out before Layer three,

224
00:13:23,720 --> 00:13:26,199
Speaker 1: and so a lot of the features of layer three

225
00:13:26,320 --> 00:13:31,760
Speaker 1: are really, um, they're legacy features from layer two. Uh,

226
00:13:31,800 --> 00:13:34,000
Speaker 1: in other words, the MP three group kind of got stuck

227
00:13:34,000 --> 00:13:36,560
Speaker 1: with them because otherwise they would have had a problem

228
00:13:36,559 --> 00:13:39,880
Speaker 1: with backwards compatibility. So the result is kind of a

229
00:13:39,920 --> 00:13:43,400
Speaker 1: clunky arrangement under the hood, and some of the features

230
00:13:43,600 --> 00:13:46,160
Speaker 1: may make very little sense when I go through them,

231
00:13:46,600 --> 00:13:48,600
Speaker 1: but some of that is because it's a holdover

232
00:13:48,640 --> 00:13:53,280
Speaker 1: from an earlier compression strategy, which isn't terribly satisfying as

233
00:13:53,280 --> 00:13:55,559
Speaker 1: an answer. But the reason many parts of the MP

234
00:13:55,640 --> 00:13:57,840
Speaker 1: three compression algorithm are the way they are is because

235
00:13:57,880 --> 00:14:01,560
Speaker 1: that's the way we've always done it. So next I'm

236
00:14:01,600 --> 00:14:07,760
Speaker 1: gonna dive into the phases of compression. But before I

237
00:14:07,800 --> 00:14:10,680
Speaker 1: do that, let's all take a deep breath and take

238
00:14:10,720 --> 00:14:22,440
Speaker 1: a moment to thank our sponsor, and we're back. So

239
00:14:22,560 --> 00:14:25,080
Speaker 1: there are two big phases we'll need to talk about

240
00:14:25,160 --> 00:14:29,760
Speaker 1: with MP three compression. The first phase is analysis and

241
00:14:29,800 --> 00:14:33,960
Speaker 1: the second phase is the actual compression itself. And after

242
00:14:34,040 --> 00:14:37,080
Speaker 1: that there's the process of decoding an MP three for playback.

243
00:14:37,560 --> 00:14:40,120
Speaker 1: But that's way simpler once we get an understanding of

244
00:14:40,160 --> 00:14:45,920
Speaker 1: how the encoding process actually happens. So let's begin with analysis. Now.

245
00:14:45,960 --> 00:14:49,480
Speaker 1: This is the part where the standard has to figure

246
00:14:49,520 --> 00:14:53,800
Speaker 1: out which frequencies within an audio range, or recording rather,

248
00:14:53,920 --> 00:14:59,720
Speaker 1: are important or perceptible. So how does a program, an

248
00:14:59,760 --> 00:15:02,680
Speaker 1: encoder, figure out what we can hear and what

249
00:15:02,800 --> 00:15:06,160
Speaker 1: we cannot hear? All right, time to get technical. So

250
00:15:06,880 --> 00:15:10,440
Speaker 1: you start off with your pulse code modulation audio file

251
00:15:10,720 --> 00:15:13,480
Speaker 1: or PCM file. And you might remember I talked about

252
00:15:13,480 --> 00:15:16,720
Speaker 1: PCM audio in the first episode of this series, but

253
00:15:16,840 --> 00:15:20,600
Speaker 1: just in case you don't, it's a lossless digital audio file.

254
00:15:20,680 --> 00:15:23,720
Speaker 1: The actual format could be a WAV or AIFF

255
00:15:23,720 --> 00:15:26,480
Speaker 1: or something along those lines, but the important thing

256
00:15:26,920 --> 00:15:31,080
Speaker 1: to keep in mind is that it is uncompressed. Now,

257
00:15:31,120 --> 00:15:33,560
Speaker 1: that means those files tend to be pretty big. This

258
00:15:33,640 --> 00:15:36,040
Speaker 1: is our raw material that we want to take and

259
00:15:36,120 --> 00:15:40,560
Speaker 1: squish down to a more manageable, transferable size. And in

260
00:15:40,640 --> 00:15:43,320
Speaker 1: our last episode in this series, I also mentioned

261
00:15:43,320 --> 00:15:46,680
Speaker 1: that the standard for CD audio is a sample

262
00:15:46,760 --> 00:15:49,880
Speaker 1: rate of forty four point one kilohertz, and we

263
00:15:50,040 --> 00:15:52,680
Speaker 1: learned that you need a sample rate twice the frequency

264
00:15:52,840 --> 00:15:56,800
Speaker 1: of the highest frequency in your recording, and since human

265
00:15:56,840 --> 00:15:59,600
Speaker 1: hearing tops out at around twenty kilohertz, the standard

266
00:15:59,600 --> 00:16:02,520
Speaker 1: for CDs is forty four point one kilohertz. The

267
00:16:02,640 --> 00:16:05,640
Speaker 1: MP three standard can support lots of different sample rates,

268
00:16:05,720 --> 00:16:08,160
Speaker 1: but forty four point one kilohertz is pretty much

269
00:16:08,200 --> 00:16:12,600
Speaker 1: the common standard. So you've got a number of samples

270
00:16:12,680 --> 00:16:15,120
Speaker 1: with your audio file, and that number will depend upon

271
00:16:15,120 --> 00:16:18,320
Speaker 1: how long the audio file is. You've got forty four

272
00:16:18,320 --> 00:16:23,120
Speaker 1: thousand one hundred samples per second, actually twice that for stereo,

273
00:16:23,280 --> 00:16:25,760
Speaker 1: but for the purposes of this discussion, let's kind of

274
00:16:25,920 --> 00:16:28,960
Speaker 1: stick with mono sounds so that I don't start having

275
00:16:29,040 --> 00:16:31,720
Speaker 1: math coming out of my ears. And we're still in

276
00:16:31,720 --> 00:16:34,920
Speaker 1: the very easy, simple part as far as math goes.

277
00:16:34,960 --> 00:16:37,520
Speaker 1: We haven't gotten to the complicated stuff yet, all right,

278
00:16:37,600 --> 00:16:41,600
Speaker 1: So you've got forty four thousand, one hundred samples per second.
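
The sampling arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the constant and function names (`SAMPLE_RATE_HZ`, `min_sample_rate`, `sample_count`) are invented for illustration, and the only figures used are the ones mentioned in the episode (a twenty kilohertz hearing limit and the forty four point one kilohertz CD rate).

```python
SAMPLE_RATE_HZ = 44_100       # CD-audio sample rate mentioned above
HIGHEST_AUDIBLE_HZ = 20_000   # rough upper limit of typical human hearing


def min_sample_rate(highest_hz):
    """Sample rate needed to capture a given highest frequency:
    twice that frequency, per the rule of thumb described above."""
    return 2 * highest_hz


def sample_count(seconds, channels=1):
    """Number of samples in a recording of the given length."""
    return SAMPLE_RATE_HZ * seconds * channels


needed = min_sample_rate(HIGHEST_AUDIBLE_HZ)   # 40,000 samples per second
mono_second = sample_count(1)                  # one second of mono audio
stereo_second = sample_count(1, channels=2)    # twice that for stereo
```

The CD rate sits a little above the forty thousand samples per second that the rule of thumb asks for, and the stereo figure is simply double the mono one.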

279
00:16:42,160 --> 00:16:45,320
Speaker 1: To compress it into an MP three format, the algorithm

280
00:16:45,360 --> 00:16:49,320
Speaker 1: first groups all of these samples into collections called frames.

281
00:16:50,440 --> 00:16:53,640
Speaker 1: So take those forty four thousand one hundred per second, and

282
00:16:53,640 --> 00:16:56,480
Speaker 1: then you start saying, okay, we're gonna group you in batches.

283
00:16:56,960 --> 00:17:00,080
Speaker 1: Each batch is called a frame and each frame contains

284
00:17:00,120 --> 00:17:04,480
Speaker 1: one thousand, one hundred fifty two samples. Now that's specifically to

285
00:17:04,560 --> 00:17:09,280
Speaker 1: maintain backwards compatibility with MPEG Layer two, which established that

286
00:17:09,320 --> 00:17:12,119
Speaker 1: one thousand, one hundred fifty two number. But we're not

287
00:17:12,160 --> 00:17:16,360
Speaker 1: talking about MPEG Layer two. We're talking about MPEG Layer three,

288
00:17:16,800 --> 00:17:18,400
Speaker 1: and so that means we have to get a little

289
00:17:18,400 --> 00:17:25,440
Speaker 1: more complicated. So each frame consists of two subgroups called granules.
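
The frame and granule bookkeeping can be worked out directly. This is a small sketch under stated assumptions: the names (`SAMPLES_PER_FRAME`, `frame_duration_ms`, `frames_needed`) are hypothetical, mono audio is assumed as in the discussion, and the last partial frame is assumed to be padded out to a full frame, which is a simplification for the example.

```python
SAMPLES_PER_FRAME = 1152    # frame size carried over from Layer two
GRANULES_PER_FRAME = 2      # each Layer three frame holds two granules
SAMPLE_RATE_HZ = 44_100     # CD-audio sample rate, mono for simplicity

# Each granule gets half of the frame's samples.
samples_per_granule = SAMPLES_PER_FRAME // GRANULES_PER_FRAME


def frame_duration_ms(sample_rate_hz):
    """How much audio time one frame covers, in milliseconds."""
    return 1000 * SAMPLES_PER_FRAME / sample_rate_hz


def frames_needed(total_samples):
    """Frames required to hold the samples, rounding up so a
    partial final frame still counts as a whole frame."""
    return -(-total_samples // SAMPLES_PER_FRAME)


one_second_frames = frames_needed(SAMPLE_RATE_HZ)
```

At this sample rate a frame covers roughly twenty six milliseconds of audio, so one second of mono sound spans a few dozen frames.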

290
00:17:25,440 --> 00:17:29,320
Speaker 1: So each granule has five hundred seventy six samples. Five seventy

291
00:17:29,359 --> 00:17:32,639
Speaker 1: six times two, one thousand one hundred fifty two. So five seventy

292
00:17:32,680 --> 00:17:36,680
Speaker 1: six samples per granule. Now, technically MP three encoders only

293
00:17:36,680 --> 00:17:39,000
Speaker 1: work on one granule at a time, but they may

294
00:17:39,040 --> 00:17:42,879
Speaker 1: reference the granules immediately before and immediately after the current

295
00:17:42,920 --> 00:17:45,520
Speaker 1: one in order to see how the audio within the

296
00:17:45,560 --> 00:17:49,480
Speaker 1: file changes over time. All right, so now you've got

297
00:17:49,480 --> 00:17:54,000
Speaker 1: your granules of five hundred seventy six samples each. Then

298
00:17:54,040 --> 00:17:57,480
Speaker 1: the MP three encoder runs the samples through a filter bank,

299
00:17:57,960 --> 00:18:01,960
Speaker 1: which sorts the sound into thirty two frequency ranges. Are

300
00:18:02,000 --> 00:18:05,239
Speaker 1: you are you crazy about the numbers yet, Dylan? Are you?

301
00:18:05,720 --> 00:18:10,520
Speaker 1: Dylan's nodding. Dylan, it gets worse from here. So you

302
00:18:10,560 --> 00:18:13,560
Speaker 1: have thirty two frequency ranges, which is another nod to

303
00:18:13,560 --> 00:18:15,840
Speaker 1: the layer two method, which used those thirty-two ranges

304
00:18:15,880 --> 00:18:20,240
Speaker 1: for encoding purposes. But we're not talking about layer two, are we? No,

305
00:18:20,760 --> 00:18:24,320
Speaker 1: we're talking MP three. Gosh darn it. That means we

306
00:18:24,359 --> 00:18:27,159
Speaker 1: take those thirty two ranges and we subdivide them by

307
00:18:27,200 --> 00:18:31,320
Speaker 1: a factor of eighteen. That means we have five hundred

308
00:18:31,320 --> 00:18:36,879
Speaker 1: seventy-six bands of frequencies, each band containing one five hundred seventy-sixth

309
00:18:37,080 --> 00:18:41,199
Speaker 1: of the frequency range of the original sample. So what

310
00:18:41,280 --> 00:18:44,320
Speaker 1: that actually means, and this this is actually pretty easy.

311
00:18:44,720 --> 00:18:48,159
Speaker 1: The bands are not limited to a specific number for

312
00:18:48,240 --> 00:18:53,240
Speaker 1: their frequency range. Right. The bands don't mean that on

313
00:18:53,280 --> 00:18:56,359
Speaker 1: band number one it goes from twenty hertz

314
00:18:56,440 --> 00:18:58,840
Speaker 1: up to a certain range and on band five hundred

315
00:18:59,000 --> 00:19:02,399
Speaker 1: seventy-six it ends at twenty kilohertz. That's not what

316
00:19:02,440 --> 00:19:05,600
Speaker 1: it means. They're dependent upon the original audio. So if

317
00:19:05,600 --> 00:19:09,680
Speaker 1: the original audio contains sounds within a narrow range of frequencies,

318
00:19:10,040 --> 00:19:13,760
Speaker 1: the five hundred seventy-six bands will be more precise. But if the

319
00:19:13,760 --> 00:19:17,600
Speaker 1: original recording has a vast range of frequencies, the bands

320
00:19:17,640 --> 00:19:20,440
Speaker 1: are less precise. So another way to think about this

321
00:19:21,119 --> 00:19:24,160
Speaker 1: is with a pizza. So let's say you get an extra

322
00:19:24,240 --> 00:19:26,960
Speaker 1: large pizza and you cut it into eight equal slices.

323
00:19:27,600 --> 00:19:30,280
Speaker 1: And then you get a small pizza and you cut

324
00:19:30,320 --> 00:19:33,600
Speaker 1: that into eight equal slices. Well, in both cases you

325
00:19:33,640 --> 00:19:37,760
Speaker 1: have with each slice one eighth of a pizza. But

326
00:19:37,840 --> 00:19:42,080
Speaker 1: the extra large pizza slice is bigger than the

327
00:19:42,119 --> 00:19:45,280
Speaker 1: small pizza slice. It all depends on the size

328
00:19:45,280 --> 00:19:47,960
Speaker 1: of the pizza. So in this case, it depends upon

329
00:19:48,000 --> 00:19:51,080
Speaker 1: the range of frequencies. And Dylan, do you think

330
00:19:51,080 --> 00:19:53,280
Speaker 1: we could go for some pizza, you know, just

331
00:19:53,320 --> 00:19:56,159
Speaker 1: put the episode on hold and go get pizza. Dylan's nodding.

332
00:19:56,720 --> 00:20:00,879
Speaker 1: It's great for audio. Yeah, so, uh, pizza, We'll be

333
00:20:00,960 --> 00:20:05,800
Speaker 1: right back. Okay, that was good pizza. Now um oh man,

334
00:20:05,840 --> 00:20:08,400
Speaker 1: I got a whole bunch more notes. Okay, well, let's

335
00:20:08,440 --> 00:20:10,879
Speaker 1: go ahead and do the rest of this.

336
00:20:10,920 --> 00:20:12,840
Speaker 1: All right, So you've got your sound divided up into

337
00:20:12,880 --> 00:20:16,320
Speaker 1: those five seventy-six sub-bands of frequencies, you know,

338
00:20:16,640 --> 00:20:19,840
Speaker 1: the thing I compared to pizza slices earlier. Now you

339
00:20:19,880 --> 00:20:25,359
Speaker 1: get two different mathematical processes applied to this data. One

340
00:20:25,520 --> 00:20:28,919
Speaker 1: is the fast Fourier transform, or FFT, and

341
00:20:28,960 --> 00:20:32,720
Speaker 1: the other is the modified discrete cosine transform, or

342
00:20:32,800 --> 00:20:36,760
Speaker 1: MDCT. Now I am not going to dive

343
00:20:36,800 --> 00:20:40,040
Speaker 1: deeply into how these transforms work because frankly, they are

344
00:20:40,119 --> 00:20:44,439
Speaker 1: beyond my mathematical understanding. But I know what they do.

345
00:20:44,680 --> 00:20:49,280
Speaker 1: I just cannot explain the process like how they do

346
00:20:49,400 --> 00:20:51,479
Speaker 1: what they do. So I'm going to give you the

347
00:20:51,480 --> 00:20:54,720
Speaker 1: explanation of what they do, what the outcome of each

348
00:20:54,760 --> 00:20:58,840
Speaker 1: of these transform processes happens to be. But I'm not

349
00:20:58,920 --> 00:21:00,800
Speaker 1: going to be able to tell you the actual mathematical

350
00:21:00,840 --> 00:21:03,479
Speaker 1: steps involved in each because I don't math so good, guys.

351
00:21:04,640 --> 00:21:07,520
Speaker 1: But let's start with the fast Fourier transform. So a

352
00:21:07,640 --> 00:21:09,720
Speaker 1: transform is kind of what it sounds like. It's all

353
00:21:09,720 --> 00:21:13,960
Speaker 1: about transforming information in some way. So in this particular case,

354
00:21:14,119 --> 00:21:17,359
Speaker 1: the FFT transforms the frequency bands we just

355
00:21:17,400 --> 00:21:22,360
Speaker 1: talked about into data that can be further analyzed by

356
00:21:22,480 --> 00:21:26,600
Speaker 1: a psychoacoustic model that's in the encoder. So this is

357
00:21:26,640 --> 00:21:29,960
Speaker 1: that simulated human ear and brain we were talking about earlier.

358
00:21:30,840 --> 00:21:34,800
Speaker 1: So what the encoder does is it analyzes each band

359
00:21:34,920 --> 00:21:38,600
Speaker 1: of data and looks for signs that it represents audio

360
00:21:38,680 --> 00:21:41,680
Speaker 1: that wouldn't be perceived by a human. So it's

361
00:21:41,800 --> 00:21:46,240
Speaker 1: looking for any potential for masking possibilities. So are there

362
00:21:46,240 --> 00:21:48,800
Speaker 1: collections of frequencies that are grouped close together, and is

363
00:21:48,840 --> 00:21:51,320
Speaker 1: one of those frequencies louder than the others? You might

364
00:21:51,359 --> 00:21:53,919
Speaker 1: be able to do away with those softer frequencies because

365
00:21:53,960 --> 00:21:57,480
Speaker 1: of frequency masking. The encoder will also look at whether

366
00:21:57,560 --> 00:21:59,879
Speaker 1: or not the audio has a lot of complexity to it,

367
00:22:00,800 --> 00:22:02,960
Speaker 1: if it has a lot of changes, or if it's

368
00:22:03,000 --> 00:22:07,840
Speaker 1: just relatively steady or simple audio. Any transient sounds that

369
00:22:07,880 --> 00:22:11,600
Speaker 1: are present in the audio might end up being temporal masking,

370
00:22:11,680 --> 00:22:14,040
Speaker 1: so it'll analyze those as well and see if that's

371
00:22:14,040 --> 00:22:19,000
Speaker 1: a possibility. So really what it's looking for is, you know,

372
00:22:20,280 --> 00:22:23,320
Speaker 1: just any really loud sounds that stand out above the

373
00:22:23,400 --> 00:22:26,119
Speaker 1: rest of the recording. That's what the FFT

374
00:22:26,280 --> 00:22:30,200
Speaker 1: is doing. So what about the modified discrete cosine transform? Well,

375
00:22:30,240 --> 00:22:32,359
Speaker 1: this is happening in parallel with the FFT

376
00:22:32,800 --> 00:22:36,280
Speaker 1: and the samples get sorted into different patterns called windows

377
00:22:37,119 --> 00:22:39,679
Speaker 1: uh and the criterion for sorting all has to do

378
00:22:39,720 --> 00:22:43,719
Speaker 1: with whether the sample represents a steady sound or varied sound.

379
00:22:44,240 --> 00:22:47,359
Speaker 1: So if you have a simple steady sound that goes

380
00:22:47,400 --> 00:22:51,200
Speaker 1: into a long window. If there's a lot of variation

381
00:22:51,240 --> 00:22:53,960
Speaker 1: in the sound, like there are a lot of consonants

382
00:22:53,960 --> 00:22:56,760
Speaker 1: in a vocal line or it's like a drum solo

383
00:22:56,960 --> 00:22:59,600
Speaker 1: or something like that, it would get sorted into a

384
00:22:59,720 --> 00:23:02,960
Speaker 1: series of three short windows, and each short window

385
00:23:03,000 --> 00:23:09,320
Speaker 1: contains one hundred ninety-two samples. That amounts to four whole milliseconds,

386
00:23:09,440 --> 00:23:15,000
Speaker 1: so four thousandths of a second in three patterned windows.

387
00:23:15,040 --> 00:23:18,080
Speaker 1: So you've got these windows now, either long windows for

388
00:23:18,119 --> 00:23:21,600
Speaker 1: simple sounds or short windows for the more complex sounds.
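The block sizes quoted so far fit together with simple arithmetic, and a short Python sketch can multiply them out. Two things here are assumptions made for illustration rather than anything stated in this passage: the 44,100 Hz sample rate, and the idea that a short window is exactly one third of a granule.

```python
# Block-size arithmetic for one MP3 frame, using the numbers quoted in the episode.
SAMPLES_PER_GRANULE = 576
GRANULES_PER_FRAME = 2
FILTER_BANK_RANGES = 32
SUBDIVISIONS_PER_RANGE = 18
SHORT_WINDOWS_PER_GRANULE = 3   # assumption: three equal short windows per granule
SAMPLE_RATE_HZ = 44_100         # assumption: CD-quality source audio

samples_per_frame = SAMPLES_PER_GRANULE * GRANULES_PER_FRAME
frequency_bands = FILTER_BANK_RANGES * SUBDIVISIONS_PER_RANGE
short_window_samples = SAMPLES_PER_GRANULE // SHORT_WINDOWS_PER_GRANULE
short_window_ms = 1000 * short_window_samples / SAMPLE_RATE_HZ

print(samples_per_frame, frequency_bands, short_window_samples, round(short_window_ms, 2))
```

Under those assumptions the short window comes out a little over four milliseconds, which lines up with the figure given in the episode.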

389
00:23:21,640 --> 00:23:24,600
Speaker 1: And then the modified discrete cosine transform kicks into gear.

390
00:23:24,680 --> 00:23:26,840
Speaker 1: It looks at each long window or set of three

391
00:23:26,840 --> 00:23:30,920
Speaker 1: short windows and converts them into a set of spectral values.

392
00:23:31,520 --> 00:23:33,800
Speaker 1: To some of you, that probably sounds meaningless. So let's

393
00:23:33,840 --> 00:23:37,720
Speaker 1: talk about spectral analysis for a second. First, I was

394
00:23:38,000 --> 00:23:40,919
Speaker 1: very disappointed to learn that spectral analysis doesn't involve a

395
00:23:40,920 --> 00:23:46,199
Speaker 1: psychologist talking to a ghost about its emotional state, so bummer.

396
00:23:47,000 --> 00:23:50,560
Speaker 1: But spectral analysis is when you look at a spectrum

397
00:23:50,600 --> 00:23:54,800
Speaker 1: of information, like a spectrum of frequencies or related information

398
00:23:54,840 --> 00:23:58,399
Speaker 1: like energy states. That's what this transform does. It takes

399
00:23:58,520 --> 00:24:02,119
Speaker 1: data that originally represents a slice of time in a

400
00:24:02,200 --> 00:24:05,360
Speaker 1: sound waveform. That's what sample is. A sample is an

401
00:24:05,400 --> 00:24:09,280
Speaker 1: instance of time in a wave form and converts it

402
00:24:09,320 --> 00:24:15,800
Speaker 1: into information representing sound as energy across a range of frequencies. Now,

403
00:24:15,840 --> 00:24:18,080
Speaker 1: you can plot out spectral information in a lot of

404
00:24:18,080 --> 00:24:21,000
Speaker 1: different ways, but one common method is to use brightness

405
00:24:21,040 --> 00:24:25,800
Speaker 1: to indicate energy levels. Higher energy levels are brighter patches

406
00:24:26,040 --> 00:24:31,120
Speaker 1: in your visual representation of spectral data. High frequencies would

407
00:24:31,119 --> 00:24:34,160
Speaker 1: appear at the top of a spectral view, like imagine

408
00:24:34,200 --> 00:24:37,400
Speaker 1: a box, and at the top of the box that's

409
00:24:37,400 --> 00:24:39,440
Speaker 1: where you would find high frequencies, at the bottom of

410
00:24:39,440 --> 00:24:41,720
Speaker 1: the box that's where you find low frequencies, and it's

411
00:24:41,760 --> 00:24:44,840
Speaker 1: just lots of patches of color. The really bright patches

412
00:24:44,840 --> 00:24:50,200
Speaker 1: of color represent very high energy frequencies, so they could

413
00:24:50,240 --> 00:24:54,000
Speaker 1: be high or low in in actual frequency, but we're

414
00:24:54,040 --> 00:24:57,600
Speaker 1: talking about energy levels, not whether it's a higher low pitch.

415
00:24:59,440 --> 00:25:02,120
Speaker 1: Looking left to write represents the passing of time, and

416
00:25:02,160 --> 00:25:05,560
Speaker 1: looking along any vertical points shows you the actual frequency

417
00:25:06,240 --> 00:25:09,800
Speaker 1: or pitch, and then the respective energy level is the brightness.

418
00:25:09,920 --> 00:25:12,080
Speaker 1: So it's kind of like looking at sound as a wave,

419
00:25:12,240 --> 00:25:14,760
Speaker 1: but instead of being a wave, you're looking at information

420
00:25:14,760 --> 00:25:19,600
Speaker 1: that indicates frequency range and energy level. That representation is

421
00:25:19,600 --> 00:25:22,480
Speaker 1: actually kind of analogous to how we hear audio. So

422
00:25:22,560 --> 00:25:25,679
Speaker 1: an encoder can analyze the spectral view and start to

423
00:25:25,680 --> 00:25:29,880
Speaker 1: filter out the data we wouldn't perceive due to psychoacoustics. Now,

424
00:25:29,920 --> 00:25:33,920
Speaker 1: after all that processing, the encoder looks at the frequency

425
00:25:34,000 --> 00:25:37,200
Speaker 1: sub-bands and the levels of spectral intensity for each

426
00:25:37,800 --> 00:25:41,200
Speaker 1: and that information can then be used for the next phase,

427
00:25:41,800 --> 00:25:45,240
Speaker 1: which is compression. But right now I think we could

428
00:25:45,280 --> 00:25:48,760
Speaker 1: all stand a little decompression, So let's take another quick

429
00:25:48,760 --> 00:26:00,280
Speaker 1: break to thank our sponsor. All right, so now you're

430
00:26:00,280 --> 00:26:04,280
Speaker 1: ready to compress your analyzed audio. Good for you, and

431
00:26:04,320 --> 00:26:08,080
Speaker 1: by you I mean encoders. This has to be simpler

432
00:26:08,119 --> 00:26:11,119
Speaker 1: than that analysis segment, right, I mean that got a

433
00:26:11,119 --> 00:26:14,959
Speaker 1: little crazy with all the different bands and sub bands

434
00:26:15,000 --> 00:26:22,119
Speaker 1: and windows and frames and granules. Sadly it gets more complicated,

435
00:26:22,119 --> 00:26:25,280
Speaker 1: all right. So there are two layers of compression going

436
00:26:25,320 --> 00:26:30,000
Speaker 1: on with MPEG Layer three. One of those layers depends

437
00:26:30,080 --> 00:26:34,480
Speaker 1: upon the psychoacoustic analysis and the other doesn't. So why

438
00:26:34,520 --> 00:26:37,800
Speaker 1: would you use two layers with different strategies like that? Well,

439
00:26:37,840 --> 00:26:40,840
Speaker 1: the reason is that one strategy is great for complex

440
00:26:40,880 --> 00:26:43,639
Speaker 1: audio with lots of components, but not so great with

441
00:26:43,760 --> 00:26:46,639
Speaker 1: simpler sounds, and the other strategy is kind of the opposite.

442
00:26:47,160 --> 00:26:49,520
Speaker 1: So the psychoacoustic approach is the one that's really good

443
00:26:49,560 --> 00:26:53,480
Speaker 1: for complicated sounds. If you've got a lot of

444
00:26:53,680 --> 00:26:57,840
Speaker 1: volume changes, lots of different frequencies, it's just complicated and

445
00:26:57,960 --> 00:27:00,840
Speaker 1: rich sound, you've got a lot of opportunity to look

446
00:27:00,880 --> 00:27:04,240
Speaker 1: for masking and other acoustic elements that limit the actual

447
00:27:04,320 --> 00:27:08,159
Speaker 1: sounds that people perceive. So it means there are a

448
00:27:08,200 --> 00:27:11,760
Speaker 1: lot of chances for you to uh fudge by dropping

449
00:27:11,760 --> 00:27:16,720
Speaker 1: all the stuff that people probably wouldn't notice anyway. And uh,

450
00:27:16,800 --> 00:27:18,399
Speaker 1: if you take a piece that's got a lot of

451
00:27:18,400 --> 00:27:21,879
Speaker 1: elements at varying volumes, there are likely several opportunities to

452
00:27:21,880 --> 00:27:25,760
Speaker 1: do this. But if you're talking about relatively straightforward

453
00:27:26,440 --> 00:27:31,320
Speaker 1: audio with few components, few changes in volume, there's really

454
00:27:31,320 --> 00:27:33,399
Speaker 1: not a whole lot of data you can ditch without

455
00:27:33,440 --> 00:27:35,919
Speaker 1: it actually affecting the quality of the audio in a

456
00:27:35,960 --> 00:27:40,240
Speaker 1: perceptible way. And this is part of what Brandenburg, that

457
00:27:40,280 --> 00:27:42,439
Speaker 1: guy I was talking about in our first episode in

458
00:27:42,480 --> 00:27:45,399
Speaker 1: this series. Uh, that's what he discovered when he was

459
00:27:45,800 --> 00:27:48,960
Speaker 1: working with the MP three standard and he was listening

460
00:27:49,000 --> 00:27:53,760
Speaker 1: back to that Suzanne Vega a cappella track, Tom's Diner. He

461
00:27:53,840 --> 00:27:55,560
Speaker 1: was listening to a compressed version of it, and he

462
00:27:55,600 --> 00:27:58,480
Speaker 1: said it was terrible. He said it ruined the quality

463
00:27:58,480 --> 00:28:01,679
Speaker 1: of the audio. And part of that is because that

464
00:28:01,720 --> 00:28:05,040
Speaker 1: particular song is fairly simple. There's just not a lot

465
00:28:05,040 --> 00:28:08,280
Speaker 1: of opportunity to take advantage of masking and other tricks

466
00:28:08,760 --> 00:28:13,800
Speaker 1: without potentially compromising the quality. So they decided to also

467
00:28:13,840 --> 00:28:17,800
Speaker 1: incorporate some traditional compression strategies, which work better with

468
00:28:17,880 --> 00:28:20,880
Speaker 1: those types of recordings. So the MP three format takes

469
00:28:20,880 --> 00:28:24,760
Speaker 1: advantage of both the traditional approach and the psychoacoustic approach,

470
00:28:25,480 --> 00:28:28,520
Speaker 1: and that allows the encoder to compress files into a smaller

471
00:28:28,560 --> 00:28:32,679
Speaker 1: size without just following a single strategy, like it doesn't

472
00:28:32,680 --> 00:28:34,760
Speaker 1: have to do a one size fits all for all

473
00:28:34,840 --> 00:28:39,600
Speaker 1: elements of audio. Now, combining those two strategies requires a

474
00:28:39,600 --> 00:28:43,320
Speaker 1: little more mathematical gymnastics. So let's go back to those

475
00:28:43,440 --> 00:28:47,200
Speaker 1: five seventy six frequency bins. You know, those sub bands

476
00:28:47,240 --> 00:28:50,320
Speaker 1: we talked about earlier. You've got to quantize those suckers.

477
00:28:51,440 --> 00:28:54,000
Speaker 1: What does that mean? It means assigning a quantity to

478
00:28:54,160 --> 00:28:58,479
Speaker 1: each frequency bin. You have to give it

479
00:28:58,520 --> 00:29:01,400
Speaker 1: a quantity of some sort so that you can end

480
00:29:01,480 --> 00:29:06,600
Speaker 1: up judging how much you can get away with dropping data.

481
00:29:06,960 --> 00:29:09,800
Speaker 1: So to do this, the encoder sorts those five hundred seventy-six

482
00:29:09,840 --> 00:29:13,280
Speaker 1: bins into twenty two scale factor bands. How you doing

483
00:29:13,280 --> 00:29:17,640
Speaker 1: over there, Dylan? Just checking in on you. Okay, Dylan's

484
00:29:17,680 --> 00:29:20,400
Speaker 1: got a thousand-yard stare going. I hope

485
00:29:20,440 --> 00:29:22,880
Speaker 1: you guys are doing okay over there? All right, So

486
00:29:23,080 --> 00:29:25,040
Speaker 1: before smoke starts coming out of your ears, let me

487
00:29:25,080 --> 00:29:28,760
Speaker 1: explain what the scale factor bands are all about. The

488
00:29:28,800 --> 00:29:32,360
Speaker 1: whole purpose of the scale factor bands is to determine

489
00:29:32,440 --> 00:29:36,960
Speaker 1: how the information will be stored within the compressed state.

490
00:29:37,800 --> 00:29:39,800
Speaker 1: So you want to get away with as little data

491
00:29:39,880 --> 00:29:43,040
Speaker 1: as possible before affecting sound quality. So if you can

492
00:29:43,080 --> 00:29:46,760
Speaker 1: say the same thing in a shorter space without affecting

493
00:29:46,760 --> 00:29:49,600
Speaker 1: the quality of what it is you're saying, you go

494
00:29:49,680 --> 00:29:54,680
Speaker 1: with it. Brevity is the soul of compression. So if

495
00:29:54,680 --> 00:29:57,960
Speaker 1: we were talking about language, I would say it's more

496
00:29:57,960 --> 00:30:02,880
Speaker 1: efficient to say it's raining outside, or even just it's raining,

497
00:30:03,200 --> 00:30:06,280
Speaker 1: because you would assume that it would be outside where

498
00:30:06,280 --> 00:30:08,840
Speaker 1: the rain is happening, and it would be inefficient for

499
00:30:08,840 --> 00:30:11,360
Speaker 1: me to say it's coming down like cats and dogs

500
00:30:11,360 --> 00:30:15,240
Speaker 1: out there. It's not as efficient as saying it's raining.

501
00:30:16,000 --> 00:30:20,760
Speaker 1: So if you can get away with shorter statements without

502
00:30:20,840 --> 00:30:24,680
Speaker 1: affecting the actual quality, and you could argue that by

503
00:30:24,840 --> 00:30:27,280
Speaker 1: switching from it's coming down like cats and dogs out

504
00:30:27,320 --> 00:30:30,840
Speaker 1: there to it's raining changes the quality, and that could

505
00:30:30,880 --> 00:30:32,640
Speaker 1: be a valid argument. But if you can get away

506
00:30:33,080 --> 00:30:37,400
Speaker 1: with shorter without affecting quality, you do it. So each

507
00:30:37,440 --> 00:30:41,960
Speaker 1: scale factor band is represented by a quantity. Then the

508
00:30:42,040 --> 00:30:46,440
Speaker 1: encoder divides that quantity by a given number called the quantizer,

509
00:30:46,800 --> 00:30:50,440
Speaker 1: which is the same across the entire frequency spectrum for

510
00:30:50,560 --> 00:30:55,040
Speaker 1: that recording. The resulting number is then rounded up or

511
00:30:55,160 --> 00:31:00,280
Speaker 1: down to a whole digit. And here's an important point.

512
00:31:00,680 --> 00:31:04,160
Speaker 1: Individual scale factor bands can be scaled up or down

513
00:31:04,280 --> 00:31:08,280
Speaker 1: for more or less precision to represent the actual value

514
00:31:08,440 --> 00:31:12,440
Speaker 1: of those bands. So what the heck does all that mean? Well,

515
00:31:12,520 --> 00:31:15,080
Speaker 1: the purpose of dividing and rounding is just to simplify

516
00:31:15,120 --> 00:31:17,840
Speaker 1: the data to reduce the amount you need in order

517
00:31:17,880 --> 00:31:20,640
Speaker 1: to store the information. So let's go with a totally

518
00:31:20,720 --> 00:31:24,520
Speaker 1: hypothetical example. Let's say you've got a scale factor band

519
00:31:25,320 --> 00:31:29,040
Speaker 1: and you've decided you're representing that scale factor band with

520
00:31:29,160 --> 00:31:33,160
Speaker 1: the quantity seven eight four zero, seven thousand, eight hundred forty,

521
00:31:33,880 --> 00:31:37,200
Speaker 1: and you've chosen the number one hundred to quantize your data,

522
00:31:37,280 --> 00:31:41,719
Speaker 1: meaning that you will divide each, uh, scale factor band's

523
00:31:41,800 --> 00:31:45,880
Speaker 1: quantity by one hundred. So this is seven thousand, eight

524
00:31:45,960 --> 00:31:49,400
Speaker 1: hundred forty. You divide it by one hundred. Uh and

525
00:31:49,440 --> 00:31:52,680
Speaker 1: the scale factor for this particular band you have determined

526
00:31:52,840 --> 00:31:56,280
Speaker 1: is one point zero. That means that once you get

527
00:31:56,320 --> 00:31:59,840
Speaker 1: that result where you've divided the quantity by the quantizer,

528
00:32:00,080 --> 00:32:03,120
Speaker 1: you multiply by one. That means there's no change. Multiply

529
00:32:03,160 --> 00:32:05,440
Speaker 1: by one you get the same number. More on that

530
00:32:05,480 --> 00:32:07,960
Speaker 1: in a bit. Okay, so you take that seven thousand,

531
00:32:08,000 --> 00:32:11,000
Speaker 1: eight hundred forty, you divide it by one hundred. That gives

532
00:32:11,040 --> 00:32:14,000
Speaker 1: you seventy eight point four. Well, now you have to

533
00:32:14,080 --> 00:32:17,960
Speaker 1: round that number, so you round it down to seventy eight. Now,

534
00:32:17,960 --> 00:32:20,200
Speaker 1: when you have a decoder and you're ready to play

535
00:32:20,240 --> 00:32:23,960
Speaker 1: back the information, it comes across this quantity the seventy eight,

536
00:32:24,400 --> 00:32:28,200
Speaker 1: and it knows what the quantizer number was, so it

537
00:32:28,280 --> 00:32:31,080
Speaker 1: multiplies by one hundred to get back to seven thousand,

538
00:32:31,120 --> 00:32:35,280
Speaker 1: eight hundred. So the replicated number is actually forty off

539
00:32:35,560 --> 00:32:38,760
Speaker 1: from the original number. The original number again was seven thousand,

540
00:32:38,800 --> 00:32:43,200
Speaker 1: eight hundred forty, the replicated number is seven thousand, eight hundred. Now,

541
00:32:43,240 --> 00:32:48,680
Speaker 1: those inconsistencies manifest as noise in the actual playback. So

542
00:32:48,720 --> 00:32:51,400
Speaker 1: if you wanted to increase the precision of any given

543
00:32:51,440 --> 00:32:53,760
Speaker 1: scale factor band, you could do so by changing the

544
00:32:53,800 --> 00:32:56,800
Speaker 1: scale factor number. So in that example, just now, I

545
00:32:56,840 --> 00:32:59,160
Speaker 1: said the number was one point zero, meaning there's no

546
00:32:59,280 --> 00:33:02,680
Speaker 1: change to that result. But I could have said it

547
00:33:02,760 --> 00:33:05,840
Speaker 1: was ten, which means we would multiply the quantized number

548
00:33:05,840 --> 00:33:07,960
Speaker 1: by ten. So we would take that seven thousand, eight

549
00:33:08,040 --> 00:33:10,520
Speaker 1: hundred forty divided by one hundred you get seventy eight

550
00:33:10,520 --> 00:33:14,040
Speaker 1: point four, then multiply by ten to get seven hundred eighty-four.

551
00:33:14,760 --> 00:33:18,600
Speaker 1: So when the decoder decompresses the file, it would reverse

552
00:33:18,720 --> 00:33:21,320
Speaker 1: this whole thing. It would just multiply by a

553
00:33:21,400 --> 00:33:24,160
Speaker 1: hundred, um, you would end up getting seven thousand, eight hundred

554
00:33:24,160 --> 00:33:26,960
Speaker 1: forty again, which means that you wouldn't introduce any noise

555
00:33:27,160 --> 00:33:30,200
Speaker 1: to the file. You would have a perfect representation. But

556
00:33:30,320 --> 00:33:33,760
Speaker 1: in some cases, the encoder may determine that any noise

557
00:33:33,800 --> 00:33:37,440
Speaker 1: that you generate wouldn't be noticed or it wouldn't impact

558
00:33:37,440 --> 00:33:39,240
Speaker 1: the quality of the audio enough for it to be

559
00:33:39,240 --> 00:33:42,680
Speaker 1: a problem because of other factors for that particular scale

560
00:33:42,680 --> 00:33:45,440
Speaker 1: factor band, like maybe it's really quiet, or maybe it's

561
00:33:45,440 --> 00:33:48,800
Speaker 1: really complex. So in those cases, you could reduce the

562
00:33:48,840 --> 00:33:52,120
Speaker 1: scale factor number by making it something else like point

563
00:33:52,160 --> 00:33:54,920
Speaker 1: one instead of one point oh. So that means you

564
00:33:54,960 --> 00:33:58,520
Speaker 1: would multiply the quantized number by point one, So the

565
00:33:58,600 --> 00:34:01,760
Speaker 1: seventy eight point four would become seven point eight four,

566
00:34:01,880 --> 00:34:03,280
Speaker 1: and then you have to round it to get a

567
00:34:03,280 --> 00:34:06,440
Speaker 1: whole integer, so you get eight. Seven point eight four

568
00:34:06,520 --> 00:34:09,880
Speaker 1: rounds up to eight. Now, when a decoder decompresses

569
00:34:09,880 --> 00:34:14,000
Speaker 1: the audio, it multiplies eight by one hundred. That quantizer

570
00:34:14,040 --> 00:34:17,400
Speaker 1: that we've talked about so much, uh and uh, actually

571
00:34:17,440 --> 00:34:19,080
Speaker 1: at this point would have to be eight thousand because

572
00:34:19,080 --> 00:34:22,759
Speaker 1: it's also taking into account the scale factor, so it's

573
00:34:22,800 --> 00:34:26,879
Speaker 1: multiplying it by a thousand, not just a hundred. So

574
00:34:27,000 --> 00:34:29,480
Speaker 1: you would get a number that would pop up to

575
00:34:29,600 --> 00:34:32,520
Speaker 1: eight thousand. And remember the original was seven thousand, eight

576
00:34:32,560 --> 00:34:34,960
Speaker 1: hundred forty. So you look at the difference between these two,

577
00:34:35,000 --> 00:34:37,759
Speaker 1: the original seven thousand, eight hundred forty, the new number is

578
00:34:37,840 --> 00:34:40,680
Speaker 1: eight thousand. There's a pretty big difference there. That change

579
00:34:40,760 --> 00:34:43,120
Speaker 1: might introduce enough noise for it to be a problem.
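The three versions of that hypothetical example can be replayed in a few lines of Python. This is only a sketch of the divide, scale, and round idea as described here, not the actual MP3 quantizer; the `encode` and `decode` helpers are made-up names, and exact fractions are used so that floating-point rounding doesn't muddy the comparison.

```python
from fractions import Fraction

def encode(value, quantizer, scale):
    """Divide by the quantizer, apply the band's scale factor, round to a whole number."""
    return round(Fraction(value, quantizer) * scale)

def decode(code, quantizer, scale):
    """Undo the scale factor and the quantizer to approximate the original value."""
    return code * quantizer / Fraction(scale)

ORIGINAL = 7840
QUANTIZER = 100

# Try the three scale factors from the example: 1.0, 10, and 0.1.
for scale in (Fraction(1), Fraction(10), Fraction(1, 10)):
    code = encode(ORIGINAL, QUANTIZER, scale)
    restored = decode(code, QUANTIZER, scale)
    error = abs(restored - ORIGINAL)
    print(f"scale={float(scale)} stored={code} restored={int(restored)} error={int(error)}")
```

A bigger scale factor means a bigger stored number and less error on playback, and a smaller scale factor means a smaller stored number and more error, which is the trade the encoder is making band by band.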

580
00:34:43,160 --> 00:34:45,440
Speaker 1: So how does the encoder determine if a scale factor

581
00:34:45,520 --> 00:34:48,120
Speaker 1: band is meeting the proper criteria? How can it tell

582
00:34:48,960 --> 00:34:53,120
Speaker 1: if there is, uh, too much noise or if the

583
00:34:53,160 --> 00:34:56,440
Speaker 1: noise falls below the threshold? Well, it goes through what

584
00:34:56,480 --> 00:35:00,400
Speaker 1: is called a Huffman coding process. At this point, Dylan

585
00:35:01,360 --> 00:35:05,000
Speaker 1: is currently just staring at the wall and drool is

586
00:35:05,040 --> 00:35:09,719
Speaker 1: coming out. Huffman coding process. It converts scale factor bands

587
00:35:09,719 --> 00:35:12,920
Speaker 1: into binary strings, and the process goes through a series

588
00:35:12,920 --> 00:35:15,120
Speaker 1: of tables to determine if the data within the scale

589
00:35:15,120 --> 00:35:18,320
Speaker 1: factor band requires more or less precision to describe the

590
00:35:18,360 --> 00:35:22,160
Speaker 1: sound without affecting the audio quality. So, Huffman coding is

591
00:35:22,160 --> 00:35:24,520
Speaker 1: a process where you start with a large number

592
00:35:24,520 --> 00:35:27,239
Speaker 1: of possibilities and you begin to narrow it down.

593
00:35:27,320 --> 00:35:30,880
Speaker 1: Some people describe it as the coding equivalent of twenty questions.

594
00:35:31,560 --> 00:35:34,760
Speaker 1: So you ask your first question like animal, vegetable or mineral.

595
00:35:35,040 --> 00:35:38,200
Speaker 1: You get an answer, say, animal. Well, that first answer

596
00:35:38,280 --> 00:35:42,200
Speaker 1: eliminates a ton of other possibilities and narrows the focus

597
00:35:42,239 --> 00:35:45,279
Speaker 1: like anything that doesn't pertain to animal, you can automatically

598
00:35:45,320 --> 00:35:49,440
Speaker 1: discount because you already know it can't apply to that answer.
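For a rough feel of the twenty-questions idea, here is a small textbook Huffman coder in Python. It builds its code table from whatever values it is handed, whereas MP3 picks from predefined tables, so treat it as an illustration of the general technique only. The `huffman_code` function and the list of quantized values are made up for this sketch.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code in which common symbols get shorter bit strings."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Each heap entry: (count, tie-breaker, {symbol: bits so far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)    # the two rarest groups...
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in left.items()}
        merged.update({s: "1" + bits for s, bits in right.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))  # ...become one yes/no question
        tie += 1
    return heap[0][2]

# Hypothetical quantized values: mostly zeros, a few small numbers, one big one.
quantized = [0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 78, 0, 0, 1]
table = huffman_code(quantized)
bits = "".join(table[v] for v in quantized)
print(table, len(bits))
```

Each bit in a code word works like one yes-or-no answer, and the values that turn up most often get identified in the fewest questions, so the whole list takes fewer bits than giving every value the same length.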

599
00:35:51,080 --> 00:35:53,840
Speaker 1: With MP three compression, this means making certain the number

600
00:35:53,920 --> 00:35:57,840
Speaker 1: of bits representing a granule because remember I mentioned that

601
00:35:58,480 --> 00:36:01,919
Speaker 1: in MP three formats you have frames, and each frame.

602
00:36:02,280 --> 00:36:05,200
Speaker 1: Each frame has one thousand, one hundred fifty-two samples

603
00:36:05,239 --> 00:36:09,200
Speaker 1: and consists of two granules with five hundred seventy six samples each. So

604
00:36:09,440 --> 00:36:11,640
Speaker 1: when you answer the first question, it eliminates a lot

605
00:36:11,680 --> 00:36:16,000
Speaker 1: of other possibilities and narrows the focus. So like with animal, vegetable, mineral,

606
00:36:16,000 --> 00:36:19,080
Speaker 1: if I say animal, you're gonna not ask any questions

607
00:36:19,320 --> 00:36:22,520
Speaker 1: that have to do with minerals or vegetables only because

608
00:36:22,520 --> 00:36:25,520
Speaker 1: it wouldn't make sense. You know, those aren't gonna apply.

609
00:36:25,760 --> 00:36:28,120
Speaker 1: Same thing with MP threes, except this time it

610
00:36:28,120 --> 00:36:30,920
Speaker 1: means making certain the number of bits representing a granule.

611
00:36:31,080 --> 00:36:36,239
Speaker 1: Remember, there are two granules per frame with the MP three layer. Uh,

612
00:36:36,360 --> 00:36:39,120
Speaker 1: you want to make sure that the number of bits

613
00:36:39,160 --> 00:36:42,839
Speaker 1: representing that granule matches the chosen bit rate for compression.
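
As a rough sketch of what matching the bit rate means, the Python below works out how many bits a granule gets on average and then trades precision for size until a granule fits. Several things here are assumptions for illustration: the 128 kbps and 44.1 kHz figures, the made-up sample values, the toy cost model in `estimated_bits`, and the step multiplier. It also ignores the bits a real frame spends on its header and side information.

```python
SAMPLES_PER_GRANULE = 576  # two granules make one 1152-sample frame


def granule_bit_budget(bitrate_bps, sample_rate_hz):
    """Average number of bits one granule may use at a constant bit rate."""
    return bitrate_bps * SAMPLES_PER_GRANULE // sample_rate_hz


def estimated_bits(values, step):
    """Toy cost model (an assumption): a sign bit plus magnitude bits per value."""
    return sum(1 + abs(round(v / step)).bit_length() for v in values)


def fit_to_budget(values, budget_bits, step=1.0, max_rounds=200):
    """Start precise, then coarsen the step until the granule fits the budget."""
    for _ in range(max_rounds):
        if estimated_bits(values, step) <= budget_bits:
            return step
        step *= 2 ** 0.25  # a slightly bigger step means slightly less precision
    raise ValueError("budget is too small for this granule")


budget = granule_bit_budget(128_000, 44_100)
# Hypothetical frequency-domain values for one granule.
values = [((i * 37) % 200) - 100 for i in range(SAMPLES_PER_GRANULE)]
step = fit_to_budget(values, budget)
print(budget, "bits available, step", round(step, 2),
      "uses", estimated_bits(values, step))
```

Because the search starts from the finest step and stops at the first one that fits, a granule that is already under budget keeps its full precision, and one that is over budget loses only as much as it has to.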

614
00:36:43,200 --> 00:36:45,600
Speaker 1: So if after going through this process, the encoder says, hey,

615
00:36:45,600 --> 00:36:48,719
Speaker 1: this granule has more bits than what's allowed. It's too

616
00:36:48,800 --> 00:36:51,640
Speaker 1: many bits. We gotta get rid of some of these,

617
00:36:51,800 --> 00:36:54,160
Speaker 1: the encoder can adjust the scale factor band so that

618
00:36:54,200 --> 00:36:58,560
Speaker 1: there's less precision, meaning that multiplier, in other words, that

619
00:36:59,040 --> 00:37:02,440
Speaker 1: bit I talked about earlier, and thus reduce the amount

620
00:37:02,440 --> 00:37:07,080
Speaker 1: of data needed to represent that particular granule. If a

621
00:37:07,120 --> 00:37:11,080
Speaker 1: granule comes in under the bit rate, the encoder can

622
00:37:11,120 --> 00:37:15,279
Speaker 1: increase the precision to reduce noise and fill that granule

623
00:37:15,400 --> 00:37:22,000
Speaker 1: out properly so it matches the actual threshold. After all this,

624
00:37:22,120 --> 00:37:25,320
Speaker 1: the pairs of granules become frames within the MP three file.

625
00:37:25,320 --> 00:37:27,839
Speaker 1: And the only other component in an MP three file

626
00:37:27,960 --> 00:37:31,399
Speaker 1: apart from these frames is the ID three metadata.
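
For a sense of what that metadata looks like on disk, here is a small Python sketch that reads the ten-byte header found at the start of an ID3 version 2 tag. It assumes the version 2 layout, where the tag sits in front of the audio frames, and it ignores the optional footer and everything inside the tag such as title or artist. The function name `parse_id3v2_header` and the example bytes are made up for illustration.

```python
def parse_id3v2_header(data: bytes):
    """Return basic facts from an ID3v2 header, or None if there isn't one."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    # The size is stored in four bytes that each carry only seven bits.
    size = 0
    for b in data[6:10]:
        size = (size << 7) | (b & 0x7F)
    return {
        "version": (major, revision),
        "flags": flags,
        "tag_size": size,
        # Assumes no footer: audio frames start right after the tag.
        "audio_offset": 10 + size,
    }


# Hypothetical header: version 2.3, no flags, tag body of 257 bytes.
example = b"ID3" + bytes([3, 0, 0]) + bytes([0, 0, 2, 1])
print(parse_id3v2_header(example))
```

A player can use the offset to skip straight past the tag to the first audio frame.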

627
00:37:31,719 --> 00:37:33,759
Speaker 1: This is pretty simple. This is like a header, and

628
00:37:33,800 --> 00:37:36,040
Speaker 1: it comes before all the frames in the audio file

629
00:37:36,120 --> 00:37:39,920
Speaker 1: and contains information about the file itself, which can

630
00:37:39,960 --> 00:37:42,680
Speaker 1: include stuff like the title of a song, an artist name,

631
00:37:42,800 --> 00:37:46,600
Speaker 1: an album title, other stuff like that. It can also

632
00:37:46,640 --> 00:37:50,080
Speaker 1: include copyright information as well as information about the file itself,

633
00:37:50,120 --> 00:37:52,279
Speaker 1: such as whether or not it's a stereo recording or

634
00:37:52,320 --> 00:37:56,080
Speaker 1: a mono recording. So when you use a decoder like

635
00:37:56,120 --> 00:38:00,480
Speaker 1: an MP three player, it takes this compressed information. These

636
00:38:01,320 --> 00:38:06,560
Speaker 1: representations that the music has been reduced to,

637
00:38:07,840 --> 00:38:11,480
Speaker 1: and it converts that Huffman data back into the quantized format,

638
00:38:12,040 --> 00:38:14,719
Speaker 1: scales the data back up to its original size or

639
00:38:14,760 --> 00:38:20,560
Speaker 1: a close approximation. Remember, the uncompressed version may actually be

640
00:38:20,680 --> 00:38:25,240
Speaker 1: off by a significant amount depending upon each individual granule.

641
00:38:25,800 --> 00:38:28,040
Speaker 1: And all of that data gets recombined into a new

642
00:38:28,120 --> 00:38:30,319
Speaker 1: PCM sample that can be played back to you.
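
The scaling back up step, and why the result is only an approximation, can be sketched in a few lines of Python. This is a simplification under stated assumptions: it uses a plain uniform step, whereas a real MP3 decoder uses a nonlinear rescaling and then has to turn the frequency data back into a waveform, and the sample values and step size are invented for the example.

```python
def quantize(values, step):
    """Encoder side: keep only which step each value is nearest to."""
    return [round(v / step) for v in values]


def dequantize(indices, step):
    """Decoder side: scale the stored whole numbers back up."""
    return [i * step for i in indices]


original = [0.82, -0.31, 0.07, 0.55]  # hypothetical values before compression
step = 0.25                           # hypothetical precision chosen by the encoder

stored = quantize(original, step)
restored = dequantize(stored, step)
errors = [abs(a - b) for a, b in zip(original, restored)]

print(stored)
print(restored)
print(max(errors))
```

The restored numbers are close to the originals but not identical, and the worst-case miss is half a step, so a coarser step means a smaller file and a bigger possible error.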

643
00:38:31,000 --> 00:38:34,080
Speaker 1: And that's all there is to it. Nothing could be easier.

644
00:38:35,280 --> 00:38:38,880
Speaker 1: All right, that took a lot out of me, so

645
00:38:38,920 --> 00:38:41,280
Speaker 1: I got really technical, and I apologize if I lost

646
00:38:41,320 --> 00:38:43,560
Speaker 1: any of you out there, or for those of you

647
00:38:43,560 --> 00:38:46,080
Speaker 1: who have a lot of experience working on compression algorithms,

648
00:38:46,120 --> 00:38:50,000
Speaker 1: for oversimplifying in several cases. But now we've got a

649
00:38:50,000 --> 00:38:52,480
Speaker 1: full episode about this, and I hope you have a

650
00:38:52,480 --> 00:38:55,600
Speaker 1: better understanding of how a big sound file can be

651
00:38:55,640 --> 00:38:59,799
Speaker 1: reduced to a smaller sound file. Next time, I'll just

652
00:38:59,800 --> 00:39:04,359
Speaker 1: say magic. It will make everyone happier. But I hope

653
00:39:04,360 --> 00:39:06,920
Speaker 1: you guys appreciated this. In the next episode in this

654
00:39:07,000 --> 00:39:09,160
Speaker 1: series it will be far less technical. I'm going to

655
00:39:09,239 --> 00:39:12,839
Speaker 1: be more historical. I'm going to talk about the progression

656
00:39:13,040 --> 00:39:16,279
Speaker 1: of the MP three player, how it came about, how

657
00:39:16,280 --> 00:39:19,000
Speaker 1: it evolved, and how the iPod ended up becoming the

658
00:39:19,120 --> 00:39:24,600
Speaker 1: dominant brand in a sea of MP three players, and

659
00:39:24,600 --> 00:39:27,520
Speaker 1: then maybe kind of explore where MP three players are today,

660
00:39:28,480 --> 00:39:30,600
Speaker 1: like how many are there, how big is the market?

661
00:39:30,960 --> 00:39:33,360
Speaker 1: Are people still buying them? That kind of question.

662
00:39:35,000 --> 00:39:37,280
Speaker 1: If you guys have any questions for me, or comments

663
00:39:37,400 --> 00:39:40,799
Speaker 1: or suggestions anything like that, send me a message. My

664
00:39:40,920 --> 00:39:44,400
Speaker 1: email is tech Stuff at how stuff works dot com,

665
00:39:44,520 --> 00:39:46,680
Speaker 1: or you can drop me a line on Facebook or Twitter,

666
00:39:46,920 --> 00:39:49,279
Speaker 1: the handle at both of those is tech stuff H

667
00:39:49,480 --> 00:39:53,000
Speaker 1: S W, and I'll talk to you guys again really

668
00:39:53,080 --> 00:40:00,960
Speaker 1: soon. For more on this and thousands of other topics,

669
00:40:01,200 --> 00:40:11,920
Speaker 1: visit how stuff works dot com.