Speaker 1: Get in touch with technology with TechStuff from howstuffworks.com. Hey everybody, and welcome to TechStuff. I'm Jonathan Strickland. I'm the host of the show, and this is a Saturday morning rerun episode where we take a classic episode of TechStuff and we present it to you guys who may have missed it. I've been talking a lot about tech and music recently. If you've been listening to the recent episodes, you know all about that, and there have been some great discussions. But it also requires a little bit of knowledge of previous episodes at times, and I know it can be tricky to dig through the archives. So in this classic episode, I talk about how the MP3 compression format works, so that you can actually understand how MP3 works as opposed to something like MIDI, and you can get an appreciation for the differences between the two formats. This episode originally published in January two thousand seventeen. That's more than a whole year ago now; we're in April two thousand eighteen as I record this. I hope you enjoy this classic episode.
I hope it gives you a deeper appreciation of the technical aspect of creating digital music, and I'll see you guys on the other side. So let's remember that the heart of digital information is the bit, which is either a zero or a one: the basic unit of information for digital formats, zeros and ones. Now we can use those zeros and ones to describe all sorts of information, from text to audio to video, and really pretty much anything you can think of that's represented digitally. Ultimately, when you get down to it, it's a bunch of zeros and ones. So let's say you start off with your uncompressed audio file. You've got this enormous audio file in front of you. It's made up of zeros and ones. How do you make that file smaller? In the physical world, we can compress stuff, right? We can apply physical pressure to things. Think about packing a suitcase. You can make sure you get that extra outfit in if you just press it down hard enough and get that zipper zipped before it can burst open.
But once you get to a certain level of compression, you cannot make things smaller, at least not without hurting yourself or whatever it is you're trying to compress. Digital files are a little different, because you cannot physically cram the zeros and ones closer together. It doesn't work like that. These are abstract things. You can't make them smaller, right? You can't decrease the font. It doesn't work that way. The numbers represent two different states. So if you want to create a smaller audio file containing the recording that was in a larger audio file, you have to start getting creative. Now, in the last part of this series, I talked about how the MP3 compression algorithm was born from an applied research institution in Germany, and the team behind the MP3 wanted to find a way to compress audio, specifically music, for transmission over phone lines. Eventually this evolved into the Moving Picture Experts Group Audio Layer III compression methodology, better known as MP3, and there are also the MPEG-2 and MPEG-4 standards.
MPEG-2, by the way, is the basis of compression on DVDs, although the actual DVD format is really a modification of MPEG-2. And MPEG-4 is a compression strategy for audio and video that's frequently used in lots of different capacities, including streaming media services. So by the late nineteen seventies, researchers began to explore the possibility of leveraging psychoacoustics to figure out how to compress audio. And psychoacoustics refers to the way we perceive sound, and also the physiological effects of sound on us. So this involves not just our physical sense of hearing, but also our brains and the way our brains interpret sound. So, for example, there's a psychoacoustic phenomenon called the Haas effect, H-A-A-S, and I think it's pretty interesting. So here's how the Haas effect works. If you hear the exact same sound coming from different directions, but the two sounds arrive within thirty to forty milliseconds of each other, your brain will be convinced that you really only heard one sound, and it came from the direction that hit you first.
So let's say a sound is coming from directly in front of you and another from your left, and you get both of them within that thirty-to-forty-millisecond range, and you hear the one coming from ahead of you first. To you, you're convinced that you only heard that sound once, and it came from dead on, straight ahead of you. Your brain kind of discounts the one that came from the left, although it can reinforce it, which ends up being really useful if you're planning out PA systems for stage shows. I'm not joking. That really is the way that people plan those things out. It's pretty neat. Humans perceive sounds in a way that's not necessarily representational of all the sounds surrounding us. You can think of your brain as the filter between your understanding and what reality actually is. A lot of stuff goes on where your brain ends up getting rid of information, where it just says, you know what, he or she doesn't need that, it's just gonna confuse things. We're gonna dump it. And that's kind of how it works. It's all on an unconscious level. It's not like you're actively working to do this.
So let's say you're in a relatively busy hallway, and there could be a lot of sounds in that hallway, stuff that's going on constantly around you. Maybe there are doors opening and closing. Maybe there are footsteps going up and down the hallway. Maybe someone's shoes are squeaking against the linoleum floor. People are chattering away in there. But you are having a conversation with someone, so you turn your focus on that person, and the other sounds seemingly fade away. They're still there, but they're not important. So in this example, you would actually call those other sounds a distraction, and you would really focus on the conversation. That also shows how we're able to consciously direct our perception of hearing. So both of these factors come into play. Now, one thing that MP3 encoding takes advantage of is something called masking, and there are a couple of different variations of the masking effect. One of them is called frequency masking. So let's say you've got two sound frequencies that are similar; perhaps they're just a few hertz apart.
Remember, frequencies are measured in hertz, which is really the number of oscillations per second. So let's say you've got a sound that's at, I don't know, one thousand hertz, and another one that's at one thousand and ten hertz. Now, the human ear is precise enough to be able to tell the difference between two sounds that are at least two hertz apart from each other. That's how precise our resolution of hearing is; it's at that level. But if you get two sounds played at the same time, and they are that close together in frequency, and one of those frequencies is played at a greater volume than the other, our brains will pick up on the louder sound and ignore the quieter sound, even though both of them are present. What becomes important at that point is the amplitude. Now, the further apart in frequency you get, the less of an effect that has. So if you get far enough apart, where there are two pitches, one of them noticeably louder than the other, but they're far enough apart, you will hear both of them.
It only works if the two pitches are relatively close together, and there's not a universal formula for frequency masking. As you get closer to the boundaries of human hearing, frequency masking becomes easier. So if it's a really low pitch or a really high pitch, it's easier to get away with it. Once you start getting into what is thought of as the sweet spot for human hearing, which is generally considered to be between two and five kilohertz, you need a greater difference in volume or a smaller difference in frequency in order for masking to work. For frequency masking, at any rate. But then there's also temporal masking, and you might say, okay, I got it, temporal, that means time. Indeed it does, my friend. This describes the effect of a short but loud sound masking a softer sound for a short time. The weird thing is, the loud sound can actually mask sounds that precede it slightly, not by a whole lot, but a little bit.
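The frequency-masking rule described here can be sketched in a few lines of Python. This is a toy illustration only, not the real MP3 psychoacoustic model: the ten-hertz window and the example tones are made-up numbers, and real encoders weigh masking against frequency-dependent thresholds.

```python
def mask_quieter_tones(tones, window_hz=10.0):
    """tones: list of (frequency_hz, amplitude_db) pairs.
    Keep a tone unless a louder tone sits within window_hz of it."""
    survivors = []
    for freq, amp in tones:
        masked = any(abs(freq - other_freq) <= window_hz and other_amp > amp
                     for other_freq, other_amp in tones)
        if not masked:
            survivors.append((freq, amp))
    return survivors

# A loud 1,000 Hz tone masks a quiet 1,010 Hz neighbor, but a quiet
# 5,000 Hz tone is far enough away in frequency to survive.
tones = [(1000.0, 60.0), (1010.0, 30.0), (5000.0, 30.0)]
print(mask_quieter_tones(tones))  # [(1000.0, 60.0), (5000.0, 30.0)]
```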
MP3 compression takes advantage of both frequency and temporal masking when it's trying to determine which data needs to be included and which data can be dumped because it won't affect your perception of whatever the audio file is in the first place. You also probably remember I talked about the physical limitation on what we humans can hear, no matter what our brains might be up to. So this doesn't have to do with our brains, you know, filtering through the information that's coming in. This has to do with the physical limitations of the human ear. In the last episode of the series, I said typical human hearing (keep in mind "typical"; there are exceptions) covers the range of frequencies between about twenty hertz and twenty kilohertz, or twenty thousand hertz. So twenty to twenty thousand. Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right? And as you get older, your ability to perceive those higher frequencies starts to diminish. So most adults actually have an upper range closer to sixteen kilohertz, not twenty. Kids, they can hear those higher pitches.
You may have heard the story about how some convenience stores experimented with getting rid of teenage loiterers by projecting these super high pitches that adults could not hear but kids could, and it discouraged kids from hanging out at the convenience store and loitering. I love that idea so much. Anyway, that's because I'm old and my hearing is terrible. Well, remember I also mentioned you can detect changes in pitch in two-hertz increments. If you get below two hertz of change, like if it's just a one-hertz difference between two frequencies, it's too low a resolution for us to detect. To us, it will sound exactly the same. So if you were to hear a frequency at one thousand one hertz, or one point zero zero one kilohertz, and one at one point zero zero two kilohertz, you wouldn't notice the difference. They would sound exactly the same to you.
So if you're gonna take audio and compress it, one step you could consider is eliminating anything that's outside the actual range of frequencies that we can hear, or simplifying any changes in frequency that are smaller than two hertz. If you take all that data and you say, it is physically impossible for a human to perceive this, get rid of that information, then in theory it wouldn't have any effect on the rest of the recording. But how do you go further than that, right? How do you create a method so that you can really compress this file? You want a method that will preserve the important sounds while potentially ignoring all the unimportant or incidental sounds. And you want it to be automatic, because if you have to do it manually, that's going to take countless hours just to edit a single sound file. So that was the challenge that the MP3 research team faced as a group. Now, their solution, which ultimately created even more challenges, was to come up with what was essentially a simulated human ear and brain.
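That first step, throwing away frequencies nobody can physically hear, is simple enough to sketch in code. A minimal, hypothetical pre-filter over a list of frequency components; the twenty hertz to twenty kilohertz limits come straight from the episode, and everything else here is illustrative:

```python
AUDIBLE_LOW_HZ = 20.0       # lower bound of typical human hearing
AUDIBLE_HIGH_HZ = 20_000.0  # upper bound of typical human hearing

def keep_audible(frequencies):
    """Discard any frequency component outside the audible range."""
    return [f for f in frequencies if AUDIBLE_LOW_HZ <= f <= AUDIBLE_HIGH_HZ]

# A 5 Hz rumble and a 30 kHz ultrasonic component both get dropped.
print(keep_audible([5.0, 440.0, 15_000.0, 30_000.0]))  # [440.0, 15000.0]
```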
They needed to replicate the experience of perceiving music so that an algorithm could evaluate every sound in an audio file and judge if it in fact was relevant enough to include in the final compressed version. If a sound were imperceptible, then it wouldn't make sense to include it in the MP3 file. So by leaving out all the irrelevant data, they could make the audio information take up less bandwidth. The file itself would be smaller, because you just dumped everything that wasn't important. So the team used an algorithm called Low Complexity Adaptive Transform Coding, or LC-ATC, as the foundation for their research. This was kind of their starting point, and this is an approach that tries to do away with redundancy as much as possible, and it also incorporates adaptation to perceptual requirements. Also, MP3 owes a lot to the MPEG Layer II standard. Layer II obviously came out before Layer III, and so a lot of the features of Layer III are really legacy features from Layer II.
In other words, the MP3 group kind of got stuck with them, because otherwise they would have had a problem with backwards compatibility. So the result is kind of a clunky arrangement under the hood, and some of the features may make very little sense when I go through them, but some of that is because it's a holdover from an earlier compression strategy, which isn't terribly satisfying as an answer. But the reason many parts of the MP3 compression algorithm are the way they are is because that's the way we've always done it. So next I'm gonna dive into the phases of compression. But before I do that, let's all take a deep breath and take a moment to thank our sponsor. And we're back. So there are two big phases we'll need to talk about with MP3 compression. The first phase is analysis, and the second phase is the actual compression itself. And after that there's the process of decoding an MP3 for playback, but that's way simpler once we get an understanding of how the encoding process actually happens. So let's begin with analysis.
This is the part where the standard has to figure out which frequencies within an audio recording are important or perceptible. So how does a program, an encoder, figure out what we can hear and what we cannot hear? All right, time to get technical. You start off with your pulse code modulation audio file, or PCM file. And you might remember I talked about PCM audio in the first episode of this series, but just in case you don't, it's a lossless digital audio file. The actual format could be a WAV or AIFF or something along those lines, but the important thing to keep in mind is that it is uncompressed. Now, that means those files tend to be pretty big. This is our raw material that we want to take and squish down to a more manageable, transferable size. And in our last episode in this series, I also mentioned that the standard for CD audio is a sample rate of forty-four point one kilohertz.
And we learned that you need a sample rate twice the frequency of the highest frequency in your recording, and since human hearing tops out at around twenty kilohertz, the standard for CDs is forty-four point one kilohertz. The MP3 standard can support lots of different sample rates, but forty-four point one kilohertz is pretty much the common standard. So you've got a number of samples with your audio file, and that number will depend upon how long the audio file is. You've got forty-four thousand one hundred samples per second, actually twice that for stereo. But for the purposes of this discussion, let's kind of stick with mono sound so that I don't start having math coming out of my ears. And we're still in the very easy, simple part as far as math goes; we haven't gotten to the complicated stuff yet. All right, so you've got forty-four thousand one hundred samples per second. To compress it into an MP3 format, the algorithm first groups all of these samples into collections called frames.
So take those forty-four thousand one hundred samples per second, and then you start saying, okay, we're gonna group you in batches. Each batch is called a frame, and each frame contains one thousand one hundred fifty-two samples. Now, that's specifically to maintain backwards compatibility with MPEG Layer II, which established that one thousand one hundred fifty-two number. But we're not talking about MPEG Layer II; we're talking about MPEG Layer III, and that means we have to get a little more complicated. So each frame consists of two subgroups called granules. Each granule has five hundred seventy-six samples (five seventy-six times two is eleven fifty-two), so five hundred seventy-six samples per granule. Now, technically MP3 encoders only work on one granule at a time, but they may reference the granules immediately before and immediately after the current one in order to see how the audio within the file changes over time. All right, so now you've got your granules of five hundred seventy-six samples each. Then the MP3 encoder runs the samples through a filter bank, which sorts the sound into thirty-two frequency ranges.
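The numbers in this stretch of the episode fit together neatly, and it can help to see the arithmetic written out. A quick sketch using the figures above (forty-four thousand one hundred samples per second, eleven-fifty-two-sample frames, two granules per frame):

```python
SAMPLE_RATE = 44_100       # samples per second (mono, CD-quality)
SAMPLES_PER_FRAME = 1_152  # frame size inherited from MPEG Layer II
GRANULES_PER_FRAME = 2     # MP3 splits each frame into two granules

samples_per_granule = SAMPLES_PER_FRAME // GRANULES_PER_FRAME
frames_per_second = SAMPLE_RATE / SAMPLES_PER_FRAME

print(samples_per_granule)          # 576
print(round(frames_per_second, 2))  # 38.28 frames per second of audio
```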
Are you crazy about the numbers yet, Dylan? Dylan's nodding. Dylan, it gets worse from here. So you have thirty-two frequency ranges, which is another nod to the Layer II method, which used those thirty-two ranges for encoding purposes. But we're not talking about Layer II, are we? No, we're talking MP3, gosh darn it. That means we take those thirty-two ranges and we subdivide them by a factor of eighteen. That means we have five hundred seventy-six bands of frequencies, each band containing one five-hundred-seventy-sixth of the frequency range of the original sample. So what that actually means, and this is actually pretty easy, is that the bands are not limited to a specific number for their frequency range, right? The bands don't mean that band number one goes from twenty hertz up to a certain range, and band five seventy-six ends at twenty kilohertz. That's not what it means. They're dependent upon the original audio. So if the original audio contains sounds within a narrow range of frequencies, the five hundred seventy-six bands will be more precise.
But if the original recording has a vast range of frequencies, the bands are less precise. So another way to think about this is with a pizza. Let's say you get an extra-large pizza and you cut it into eight equal slices, and then you get a small pizza and you cut that into eight equal slices. Well, in both cases, with each slice you have one eighth of a pizza. But the extra-large pizza slice is bigger than the small pizza slice. It all depends on the size of the pizza. So in this case, it depends upon the range of frequencies. And Dylan, do you think we could go for some pizza? You know, just put the episode on hold and go get pizza. Dylan's nodding. It's great for audio. Yeah, so, pizza. We'll be right back. Okay, that was good pizza. Now, oh man, I got a whole bunch more notes. Okay, well, let's go ahead and do the rest of this. All right, so you've got your sound divided up into those five hundred seventy-six sub-bands of frequencies, you know, the thing I compared to pizza slices earlier.
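The pizza analogy translates directly into arithmetic. Here is a sketch of the band math as described: thirty-two ranges subdivided by eighteen gives five hundred seventy-six bands, and how wide each band is depends on how wide a frequency range the original audio spans. The equal-split assumption is a simplification for illustration; real MP3 sub-bands are not laid out this uniformly.

```python
SUBBAND_RANGES = 32  # filter-bank ranges inherited from Layer II
SUBDIVISION = 18     # MP3 subdivides each range by a factor of 18
total_bands = SUBBAND_RANGES * SUBDIVISION
print(total_bands)  # 576

def band_width_hz(low_hz, high_hz):
    """Width of one band if the audio spans [low_hz, high_hz] and the
    576 bands split that span equally (an illustrative assumption)."""
    return (high_hz - low_hz) / total_bands

# Narrow-range audio (the small pizza): precise, narrow bands.
print(band_width_hz(200.0, 2_000.0))  # 3.125 Hz per band
# Full audible range (the extra-large pizza): coarser bands.
print(band_width_hz(20.0, 20_000.0))  # 34.6875 Hz per band
```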
Now you 333 00:19:52,920 --> 00:19:58,399 Speaker 1: get two different mathematical processes applied to this data. One 334 00:19:58,560 --> 00:20:01,959 Speaker 1: is the fast Fourier transform, or FFT, and 335 00:20:02,000 --> 00:20:05,720 Speaker 1: the other is the modified discrete cosine transform, or 336 00:20:05,840 --> 00:20:09,800 Speaker 1: MDCT. Now, I am not going to dive 337 00:20:09,840 --> 00:20:13,080 Speaker 1: deeply into how these transforms work, because frankly, they are 338 00:20:13,160 --> 00:20:17,480 Speaker 1: beyond my mathematical understanding. But I know what they do. 339 00:20:17,760 --> 00:20:22,320 Speaker 1: I just cannot explain the process, like how they do 340 00:20:22,440 --> 00:20:24,520 Speaker 1: what they do. So I'm going to give you the 341 00:20:24,560 --> 00:20:27,760 Speaker 1: explanation of what they do, what the outcome of each 342 00:20:27,800 --> 00:20:31,880 Speaker 1: of these transform processes happens to be, but I'm not 343 00:20:31,960 --> 00:20:33,840 Speaker 1: going to be able to tell you the actual mathematical 344 00:20:33,880 --> 00:20:36,520 Speaker 1: steps involved in each, because I don't math so good, guys. 345 00:20:37,680 --> 00:20:40,560 Speaker 1: But let's start with the fast Fourier transform. So a 346 00:20:40,680 --> 00:20:42,760 Speaker 1: transform is kind of what it sounds like. It's all 347 00:20:42,760 --> 00:20:47,000 Speaker 1: about transforming information in some way. So in this particular case, 348 00:20:47,160 --> 00:20:50,399 Speaker 1: the FFT transforms the frequency bands we just 349 00:20:50,440 --> 00:20:55,400 Speaker 1: talked about into data that can be further analyzed by 350 00:20:55,520 --> 00:20:59,639 Speaker 1: a psychoacoustic model that's in the encoder. So this is 351 00:20:59,680 --> 00:21:03,000 Speaker 1: that simulated human ear and brain we were talking about earlier.
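To make that FFT step concrete, here's a minimal NumPy sketch. This is plain FFT analysis, not an actual MP3 encoder, and the sample rate, granule length, and tones are invented for illustration:

```python
import numpy as np

# One hypothetical 576-sample granule at 44.1 kHz: a loud 440 Hz tone
# plus a much quieter neighbor at 450 Hz.
rate, n = 44100, 576
t = np.arange(n) / rate
signal = np.sin(2 * np.pi * 440 * t) + 0.05 * np.sin(2 * np.pi * 450 * t)

# The FFT turns time-domain samples into per-frequency magnitudes --
# exactly the kind of view a psychoacoustic model can scan for loud
# components that might mask their quiet neighbors.
magnitudes = np.abs(np.fft.rfft(signal))
peak_hz = np.argmax(magnitudes) * rate / n
print(f"loudest component near {peak_hz:.0f} Hz")
```

The quiet 450 Hz tone sits right next to the loud 440 Hz one in this magnitude view, which is what makes it a candidate for frequency masking.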
352 00:21:03,880 --> 00:21:07,840 Speaker 1: So what the encoder does is it analyzes each bit 353 00:21:07,960 --> 00:21:11,639 Speaker 1: of data and looks for signs that it represents audio 354 00:21:11,720 --> 00:21:14,640 Speaker 1: that wouldn't be perceived by a human. So it's 355 00:21:14,840 --> 00:21:19,280 Speaker 1: looking for any potential masking possibilities. So are there 356 00:21:19,280 --> 00:21:21,840 Speaker 1: collections of frequencies that are grouped close together, and is 357 00:21:21,880 --> 00:21:24,359 Speaker 1: one of those frequencies louder than the others? You might 358 00:21:24,400 --> 00:21:27,000 Speaker 1: be able to do away with those softer frequencies because 359 00:21:27,000 --> 00:21:30,520 Speaker 1: of frequency masking. The encoder will also look at whether 360 00:21:30,640 --> 00:21:33,000 Speaker 1: or not the audio has a lot of complexity to it, 361 00:21:33,840 --> 00:21:36,000 Speaker 1: if it has a lot of changes, or if it's 362 00:21:36,040 --> 00:21:40,879 Speaker 1: just relatively steady or simple audio. Any transient sounds that 363 00:21:40,920 --> 00:21:44,640 Speaker 1: are present in the audio might allow for temporal masking, 364 00:21:44,720 --> 00:21:47,080 Speaker 1: so it'll analyze those as well and see if that's 365 00:21:47,080 --> 00:21:52,040 Speaker 1: a possibility. So really, what it's looking for is, you know, 366 00:21:53,320 --> 00:21:56,399 Speaker 1: just any really loud sounds that stand out above the 367 00:21:56,440 --> 00:21:59,159 Speaker 1: rest of the recording. That's what the FFT 368 00:21:59,320 --> 00:22:03,240 Speaker 1: is doing. So what about the modified discrete cosine transform? Well, 369 00:22:03,280 --> 00:22:05,399 Speaker 1: this is happening in parallel with the FFT, 370 00:22:05,840 --> 00:22:10,360 Speaker 1: and the samples get sorted into different patterns called windows.
371 00:22:10,359 --> 00:22:12,920 Speaker 1: And the criterion for sorting all has to do with 372 00:22:12,920 --> 00:22:16,760 Speaker 1: whether the sample represents a steady sound or a varied sound. 373 00:22:17,280 --> 00:22:20,400 Speaker 1: So if you have a simple, steady sound, that goes 374 00:22:20,440 --> 00:22:24,240 Speaker 1: into a long window. If there's a lot of variation 375 00:22:24,280 --> 00:22:27,000 Speaker 1: in the sound, like there are a lot of consonants 376 00:22:27,000 --> 00:22:29,800 Speaker 1: in a vocal line, or it's like a drum solo 377 00:22:30,000 --> 00:22:32,720 Speaker 1: or something like that, it would get sorted into a 378 00:22:32,800 --> 00:22:36,480 Speaker 1: series of three short windows. And each short window contains 379 00:22:36,520 --> 00:22:42,560 Speaker 1: one hundred ninety two samples. That amounts to about four milliseconds, so 380 00:22:42,720 --> 00:22:48,159 Speaker 1: four thousandths of a second, in three patterned windows. So 381 00:22:48,200 --> 00:22:51,440 Speaker 1: you've got these windows now, either long windows for simple 382 00:22:51,480 --> 00:22:54,760 Speaker 1: sounds or short windows for the more complex sounds, and 383 00:22:54,760 --> 00:22:57,800 Speaker 1: then the modified discrete cosine transform kicks into gear. It 384 00:22:57,800 --> 00:23:00,200 Speaker 1: looks at each long window or set of three short 385 00:23:00,240 --> 00:23:03,960 Speaker 1: windows and converts them into a set of spectral values. 386 00:23:04,560 --> 00:23:06,840 Speaker 1: To some of you, that probably sounds meaningless. So let's 387 00:23:06,880 --> 00:23:10,760 Speaker 1: talk about spectral analysis for a second. First, I was 388 00:23:11,040 --> 00:23:13,960 Speaker 1: very disappointed to learn that spectral analysis doesn't involve a 389 00:23:13,960 --> 00:23:19,280 Speaker 1: psychologist talking to a ghost about its emotional state. So bummer.
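That long-versus-short sorting can be sketched with a toy heuristic. Real encoders make this call with a psychoacoustic measure, so the energy-jump test and the threshold of eight below are purely illustrative:

```python
import numpy as np

def choose_window(granule):
    """Pick a window type for one 576-sample granule (toy heuristic)."""
    # Compare the energies of the three 192-sample thirds -- the size
    # of one short window.
    thirds = granule.reshape(3, 192)
    energy = (thirds ** 2).sum(axis=1)
    # A sudden jump in energy suggests a transient (drum hit, hard
    # consonant): use three short windows so quantization noise stays
    # confined to a few milliseconds. Steady sound gets one long window.
    return "short" if energy.max() > 8 * (energy.min() + 1e-12) else "long"

steady = np.sin(2 * np.pi * 440 * np.arange(576) / 44100)
transient = np.zeros(576)
transient[400:] = np.random.default_rng(0).normal(size=176)
print(choose_window(steady), choose_window(transient))
```

A steady tone spreads its energy evenly across the three thirds; a drum-hit-like burst concentrates it at the end, which is what flips the decision.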
390 00:23:20,040 --> 00:23:23,600 Speaker 1: But spectral analysis is when you look at a spectrum 391 00:23:23,640 --> 00:23:27,840 Speaker 1: of information, like a spectrum of frequencies or related information 392 00:23:27,880 --> 00:23:31,480 Speaker 1: like energy states. That's what this transform does. It takes 393 00:23:31,560 --> 00:23:35,159 Speaker 1: data that originally represented a slice of time in a 394 00:23:35,240 --> 00:23:38,400 Speaker 1: sound waveform. That's what a sample is. A sample is an 395 00:23:38,440 --> 00:23:42,320 Speaker 1: instance of time in a waveform. And it converts it 396 00:23:42,359 --> 00:23:48,880 Speaker 1: into information representing sound as energy across a range of frequencies. Now, 397 00:23:48,880 --> 00:23:51,119 Speaker 1: you can plot out spectral information in a lot of 398 00:23:51,119 --> 00:23:54,040 Speaker 1: different ways, but one common method is to use brightness 399 00:23:54,080 --> 00:23:58,840 Speaker 1: to indicate energy levels. Higher energy levels are brighter patches 400 00:23:59,080 --> 00:24:03,840 Speaker 1: in your visual representation of spectral data. High frequencies 401 00:24:03,920 --> 00:24:06,720 Speaker 1: would appear at the top of a spectral view. Like, 402 00:24:06,800 --> 00:24:10,000 Speaker 1: imagine a box, and at the top of the box, 403 00:24:10,200 --> 00:24:12,440 Speaker 1: that's where you would find high frequencies. At the bottom 404 00:24:12,440 --> 00:24:14,760 Speaker 1: of the box is where you find low frequencies, and it's 405 00:24:14,800 --> 00:24:17,880 Speaker 1: just lots of patches of color. The really bright patches 406 00:24:17,880 --> 00:24:23,280 Speaker 1: of color represent very high energy frequencies, so they could 407 00:24:23,280 --> 00:24:27,080 Speaker 1: be high or low in actual frequency, but we're 408 00:24:27,080 --> 00:24:30,640 Speaker 1: talking about energy levels, not whether it's a high or low pitch.
409 00:24:32,520 --> 00:24:35,160 Speaker 1: Looking left or right represents the passing of time, and 410 00:24:35,200 --> 00:24:38,600 Speaker 1: looking along any vertical point shows you the actual frequency, 411 00:24:39,280 --> 00:24:42,840 Speaker 1: or pitch, and then the respective energy level is the brightness. 412 00:24:42,960 --> 00:24:45,119 Speaker 1: So it's kind of like looking at sound as a wave, 413 00:24:45,280 --> 00:24:47,800 Speaker 1: but instead of being a wave, you're looking at information 414 00:24:47,800 --> 00:24:52,639 Speaker 1: that indicates frequency range and energy level. That representation is 415 00:24:52,640 --> 00:24:55,520 Speaker 1: actually kind of analogous to how we hear audio. So 416 00:24:55,600 --> 00:24:58,720 Speaker 1: an encoder can analyze the spectral view and start to 417 00:24:58,720 --> 00:25:02,920 Speaker 1: filter out the data we wouldn't perceive, due to psychoacoustics. Now, 418 00:25:02,960 --> 00:25:06,960 Speaker 1: after all that processing, the encoder looks at the frequency 419 00:25:07,040 --> 00:25:10,240 Speaker 1: sub bands and the levels of spectral intensity for each, 420 00:25:10,840 --> 00:25:14,240 Speaker 1: and that information can then be used for the next phase, 421 00:25:14,840 --> 00:25:18,280 Speaker 1: which is compression. But right now I think we could 422 00:25:18,320 --> 00:25:21,800 Speaker 1: all stand a little decompression, so let's take another quick 423 00:25:21,800 --> 00:25:33,280 Speaker 1: break to thank our sponsor. All right, so now you're 424 00:25:33,320 --> 00:25:37,320 Speaker 1: ready to compress your analyzed audio. Good for you, and 425 00:25:37,359 --> 00:25:41,120 Speaker 1: by you I mean encoders.
This has to be simpler 426 00:25:41,160 --> 00:25:44,159 Speaker 1: than that analysis segment, right? I mean, that got a 427 00:25:44,160 --> 00:25:47,880 Speaker 1: little crazy with all the different bands and sub bands 428 00:25:48,040 --> 00:25:55,160 Speaker 1: and windows and frames and granules. Sadly, it gets more complicated. 429 00:25:55,160 --> 00:25:58,320 Speaker 1: All right. So there are two layers of compression going 430 00:25:58,359 --> 00:26:03,040 Speaker 1: on with MPEG Layer 3. One of those layers depends 431 00:26:03,119 --> 00:26:07,560 Speaker 1: upon the psychoacoustic analysis and the other doesn't. So why 432 00:26:07,560 --> 00:26:10,840 Speaker 1: would you use two layers with different strategies like that? Well, 433 00:26:10,880 --> 00:26:13,879 Speaker 1: the reason is that one strategy is great for complex 434 00:26:13,920 --> 00:26:16,679 Speaker 1: audio with lots of components, but not so great with 435 00:26:16,800 --> 00:26:19,679 Speaker 1: simpler sounds, and the other strategy is kind of the opposite. 436 00:26:20,160 --> 00:26:22,560 Speaker 1: So the psychoacoustic approach is the one that's really good 437 00:26:22,600 --> 00:26:26,520 Speaker 1: for complicated sounds. If you've got a lot of 438 00:26:26,720 --> 00:26:30,879 Speaker 1: volume changes, lots of different frequencies, it's just complicated and 439 00:26:31,000 --> 00:26:33,880 Speaker 1: rich sound, you've got a lot of opportunities to look 440 00:26:33,920 --> 00:26:37,280 Speaker 1: for masking and other acoustic elements that limit the actual 441 00:26:37,359 --> 00:26:41,200 Speaker 1: sounds that people perceive. So it means there are a 442 00:26:41,240 --> 00:26:44,800 Speaker 1: lot of chances for you to fudge by dropping 443 00:26:44,800 --> 00:26:49,720 Speaker 1: all the stuff that people probably wouldn't notice anyway.
And 444 00:26:49,880 --> 00:26:51,439 Speaker 1: if you take a piece that's got a lot of 445 00:26:51,440 --> 00:26:54,960 Speaker 1: elements at varying volumes, there are likely several opportunities to 446 00:26:54,960 --> 00:26:58,800 Speaker 1: do this. But if you're talking about relatively straightforward 447 00:26:59,440 --> 00:27:04,359 Speaker 1: audio with few components, few changes in volume, there's really 448 00:27:04,359 --> 00:27:06,439 Speaker 1: not a whole lot of data you can ditch without 449 00:27:06,480 --> 00:27:08,960 Speaker 1: it actually affecting the quality of the audio in a 450 00:27:09,000 --> 00:27:13,280 Speaker 1: perceptible way. And this is part of what Brandenburg, that 451 00:27:13,320 --> 00:27:15,480 Speaker 1: guy I was talking about in our first episode in 452 00:27:15,520 --> 00:27:18,439 Speaker 1: this series, discovered when he was 453 00:27:18,840 --> 00:27:22,000 Speaker 1: working on the MP3 standard and he was listening 454 00:27:22,040 --> 00:27:26,600 Speaker 1: back to that Suzanne Vega a cappella track, Tom's Diner. 455 00:27:26,720 --> 00:27:28,560 Speaker 1: He was listening to a compressed version of it, and 456 00:27:28,560 --> 00:27:31,159 Speaker 1: he said it was terrible. He said it ruined the 457 00:27:31,200 --> 00:27:34,520 Speaker 1: quality of the audio. And part of that is because 458 00:27:34,600 --> 00:27:37,919 Speaker 1: that particular song is fairly simple; there's just not a 459 00:27:37,920 --> 00:27:40,800 Speaker 1: lot of opportunity to take advantage of masking and other 460 00:27:40,920 --> 00:27:46,520 Speaker 1: tricks without potentially compromising the quality. So they decided to 461 00:27:46,560 --> 00:27:50,600 Speaker 1: also incorporate some traditional compression strategies, which worked better 462 00:27:50,760 --> 00:27:53,679 Speaker 1: with those types of recordings.
So the MP3 format 463 00:27:53,720 --> 00:27:57,800 Speaker 1: takes advantage of both the traditional approach and the psychoacoustic approach, 464 00:27:58,520 --> 00:28:01,560 Speaker 1: and that allows the encoder to compress files into a smaller 465 00:28:01,600 --> 00:28:05,720 Speaker 1: size without just following a single strategy. Like, it doesn't 466 00:28:05,720 --> 00:28:07,800 Speaker 1: have to do a one size fits all for all 467 00:28:07,880 --> 00:28:12,639 Speaker 1: elements of audio. Now, combining those two strategies requires a 468 00:28:12,640 --> 00:28:16,359 Speaker 1: little more mathematical gymnastics. So let's go back to those 469 00:28:16,480 --> 00:28:20,240 Speaker 1: five hundred seventy six frequency bins, you know, those sub bands 470 00:28:20,280 --> 00:28:24,560 Speaker 1: we talked about earlier. You've gotta quantize those suckers. What 471 00:28:24,600 --> 00:28:27,480 Speaker 1: does that mean? It means assigning a quantity 472 00:28:27,800 --> 00:28:31,639 Speaker 1: to each frequency bin. You have to give it a 473 00:28:31,720 --> 00:28:34,720 Speaker 1: quantity of some sort so that you can end up 474 00:28:34,840 --> 00:28:39,640 Speaker 1: judging how much you can get away with dropping data. 475 00:28:40,000 --> 00:28:42,840 Speaker 1: So to do this, the encoder sorts those five hundred seventy six 476 00:28:42,880 --> 00:28:46,320 Speaker 1: bins into twenty two scale factor bands. How you doing 477 00:28:46,320 --> 00:28:50,680 Speaker 1: over there, Dylan? Just checking in on you. Okay, 478 00:28:50,720 --> 00:28:53,440 Speaker 1: Dylan's got a thousand yard stare going. I hope 479 00:28:53,480 --> 00:28:55,920 Speaker 1: you guys are doing okay out there. All right. So 480 00:28:56,120 --> 00:28:58,080 Speaker 1: before smoke starts coming out of your ears, let me 481 00:28:58,120 --> 00:29:01,800 Speaker 1: explain what the scale factor bands are all about.
The 482 00:29:01,840 --> 00:29:05,400 Speaker 1: whole purpose of the scale factor bands is to determine 483 00:29:05,480 --> 00:29:10,000 Speaker 1: how the information will be stored within the compressed state. 484 00:29:10,880 --> 00:29:12,840 Speaker 1: So you want to get away with as little data 485 00:29:12,920 --> 00:29:16,080 Speaker 1: as possible before affecting sound quality. So if you can 486 00:29:16,120 --> 00:29:19,800 Speaker 1: say the same thing in a shorter space without affecting 487 00:29:19,800 --> 00:29:22,640 Speaker 1: the quality of what it is you're saying, you go 488 00:29:22,720 --> 00:29:27,720 Speaker 1: with it. Brevity is the soul of compression. So if 489 00:29:27,720 --> 00:29:31,000 Speaker 1: we were talking about language, I would say it's more 490 00:29:31,000 --> 00:29:35,920 Speaker 1: efficient to say it's raining outside, or even just it's raining, 491 00:29:36,240 --> 00:29:39,320 Speaker 1: because you would assume that it would be outside where 492 00:29:39,320 --> 00:29:41,880 Speaker 1: the rain is happening, and it would be inefficient for 493 00:29:41,920 --> 00:29:44,400 Speaker 1: me to say it's coming down like cats and dogs 494 00:29:44,440 --> 00:29:48,280 Speaker 1: out there. It's not as efficient as saying it's raining. 495 00:29:49,040 --> 00:29:53,800 Speaker 1: So you can get away with shorter statements without 496 00:29:53,880 --> 00:29:57,680 Speaker 1: affecting the actual quality. Now, you could argue that 497 00:29:57,840 --> 00:30:00,360 Speaker 1: switching from it's coming down like cats and dogs out 498 00:30:00,360 --> 00:30:03,920 Speaker 1: there to it's raining changes the quality, and that could 499 00:30:03,920 --> 00:30:05,680 Speaker 1: be a valid argument. But if you can get away 500 00:30:06,120 --> 00:30:10,440 Speaker 1: with shorter without affecting quality, you do it.
So each 501 00:30:10,480 --> 00:30:15,000 Speaker 1: scale factor band is represented by a quantity. Then the 502 00:30:15,080 --> 00:30:19,480 Speaker 1: encoder divides that quantity by a given number called the quantizer, 503 00:30:19,840 --> 00:30:23,520 Speaker 1: which is the same across the entire frequency spectrum for 504 00:30:23,600 --> 00:30:28,080 Speaker 1: that recording. The resulting number is then rounded up or 505 00:30:28,200 --> 00:30:33,320 Speaker 1: down to a whole digit. And here's an important point: 506 00:30:33,720 --> 00:30:37,200 Speaker 1: individual scale factor bands can be scaled up or down 507 00:30:37,320 --> 00:30:41,320 Speaker 1: for more or less precision to represent the actual value 508 00:30:41,480 --> 00:30:45,480 Speaker 1: of those bands. So what the heck does all that mean? Well, 509 00:30:45,560 --> 00:30:48,120 Speaker 1: the purpose of dividing and rounding is just to simplify 510 00:30:48,160 --> 00:30:50,880 Speaker 1: the data, to reduce the amount you need in order 511 00:30:50,920 --> 00:30:53,680 Speaker 1: to store the information. So let's go with a totally 512 00:30:53,760 --> 00:30:57,560 Speaker 1: hypothetical example. Let's say you've got a scale factor band 513 00:30:58,360 --> 00:31:01,240 Speaker 1: and you've decided you're representing that scale factor 514 00:31:01,320 --> 00:31:05,280 Speaker 1: band with the quantity seven thousand, 515 00:31:05,360 --> 00:31:08,880 Speaker 1: eight hundred forty, and you've chosen the number one hundred 516 00:31:08,920 --> 00:31:12,480 Speaker 1: to quantize your data, meaning that you will divide each 517 00:31:13,400 --> 00:31:18,160 Speaker 1: scale factor band's quantity by one hundred. So this 518 00:31:18,200 --> 00:31:20,560 Speaker 1: is seven thousand, eight hundred forty.
You divide it by 519 00:31:20,680 --> 00:31:24,440 Speaker 1: one hundred, and the scale factor for this particular 520 00:31:24,480 --> 00:31:28,080 Speaker 1: band, you have determined, is one point zero. That means 521 00:31:28,160 --> 00:31:31,360 Speaker 1: that once you get that result, where you've divided the 522 00:31:31,440 --> 00:31:34,560 Speaker 1: quantity by the quantizer, you multiply by one. That means 523 00:31:34,560 --> 00:31:36,880 Speaker 1: there's no change. You multiply by one, you get the 524 00:31:36,960 --> 00:31:40,080 Speaker 1: same number. More on that in a bit. Okay. So 525 00:31:40,120 --> 00:31:42,680 Speaker 1: you take that seven thousand, eight hundred forty, you divide 526 00:31:42,720 --> 00:31:46,520 Speaker 1: it by one hundred. That gives you seventy eight point four. Well, 527 00:31:46,600 --> 00:31:48,680 Speaker 1: now you have to round that number, so you round 528 00:31:48,680 --> 00:31:51,520 Speaker 1: it down to seventy eight. Now, when you have a 529 00:31:51,560 --> 00:31:54,240 Speaker 1: decoder and you're ready to play back the information, it 530 00:31:54,320 --> 00:31:59,040 Speaker 1: comes across this quantity, seventy eight, and it knows what 531 00:31:59,200 --> 00:32:02,760 Speaker 1: the quantizer number was, so it multiplies by one hundred 532 00:32:02,800 --> 00:32:05,720 Speaker 1: to get back to seven thousand, eight hundred. So the 533 00:32:05,800 --> 00:32:09,720 Speaker 1: replicated number is actually forty off from the original number. 534 00:32:09,760 --> 00:32:12,800 Speaker 1: The original number, again, was seven thousand, eight hundred forty. 535 00:32:13,080 --> 00:32:16,560 Speaker 1: The replicated number is seven thousand, eight hundred. Now, those 536 00:32:16,600 --> 00:32:21,920 Speaker 1: inconsistencies manifest as noise in the actual playback.
So if 537 00:32:21,920 --> 00:32:24,840 Speaker 1: you wanted to increase the precision of any given scale 538 00:32:24,840 --> 00:32:27,200 Speaker 1: factor band, you could do so by changing the scale 539 00:32:27,200 --> 00:32:30,080 Speaker 1: factor number. So in that example just now, I said 540 00:32:30,120 --> 00:32:32,680 Speaker 1: the number was one point zero, meaning there's no change 541 00:32:32,840 --> 00:32:36,160 Speaker 1: to that result. But I could have said it was ten, 542 00:32:36,640 --> 00:32:39,280 Speaker 1: which means we would multiply the quantized number by ten. 543 00:32:39,640 --> 00:32:41,720 Speaker 1: So we would take that seven thousand, eight hundred forty, 544 00:32:41,840 --> 00:32:44,040 Speaker 1: divide it by one hundred to get seventy eight point four, 545 00:32:44,520 --> 00:32:48,120 Speaker 1: then multiply by ten to get seven hundred eighty four. So 546 00:32:48,880 --> 00:32:52,160 Speaker 1: when the decoder decompresses the file, it would reverse this 547 00:32:52,160 --> 00:32:55,400 Speaker 1: whole thing, dividing by the scale factor of ten and multiplying by the quantizer of one hundred. 548 00:32:55,440 --> 00:32:57,720 Speaker 1: You would end up getting seven thousand, eight hundred forty again, 549 00:32:57,800 --> 00:33:00,680 Speaker 1: which means that you wouldn't introduce any noise to the file. 550 00:33:00,720 --> 00:33:04,040 Speaker 1: You would have a perfect representation. But in some cases, 551 00:33:04,040 --> 00:33:07,560 Speaker 1: the encoder may determine that any noise that you generate 552 00:33:07,880 --> 00:33:11,000 Speaker 1: wouldn't be noticed, or it wouldn't impact the quality of 553 00:33:11,000 --> 00:33:13,520 Speaker 1: the audio enough for it to be a problem, because 554 00:33:13,520 --> 00:33:16,440 Speaker 1: of other factors for that particular scale factor band. Like, 555 00:33:16,520 --> 00:33:20,000 Speaker 1: maybe it's really quiet, or maybe it's really complex.
So 556 00:33:20,040 --> 00:33:22,920 Speaker 1: in those cases, you could reduce the scale factor number 557 00:33:23,320 --> 00:33:26,120 Speaker 1: by making it something else, like point one instead of 558 00:33:26,160 --> 00:33:28,720 Speaker 1: one point zero. So that means you would multiply the 559 00:33:28,800 --> 00:33:32,400 Speaker 1: quantized number by point one. So the seventy eight point 560 00:33:32,440 --> 00:33:35,240 Speaker 1: four would become seven point eight four, and then you 561 00:33:35,280 --> 00:33:37,320 Speaker 1: have to round it to get a whole integer, so 562 00:33:37,360 --> 00:33:41,320 Speaker 1: you get eight. Seven point eight four rounds up to eight. Now, 563 00:33:41,320 --> 00:33:44,880 Speaker 1: when a decoder decompresses the audio and multiplies eight 564 00:33:44,920 --> 00:33:48,200 Speaker 1: by one hundred, that quantizer that we've talked about so much, 565 00:33:49,120 --> 00:33:51,200 Speaker 1: well, actually at this point it would have 566 00:33:51,200 --> 00:33:53,680 Speaker 1: to be eight thousand, because it's also taking into account 567 00:33:53,680 --> 00:33:57,520 Speaker 1: the scale factor, so it's effectively multiplying by a thousand, 568 00:33:57,520 --> 00:34:01,760 Speaker 1: not just a hundred. So you would get a number 569 00:34:01,800 --> 00:34:04,440 Speaker 1: that pops up to eight thousand. And remember, the 570 00:34:04,440 --> 00:34:06,800 Speaker 1: original was seven thousand, eight hundred forty. So you look 571 00:34:06,800 --> 00:34:09,640 Speaker 1: at the difference between these two, the original seven thousand, eight hundred forty 572 00:34:09,719 --> 00:34:12,240 Speaker 1: and the new number, eight thousand. There's a pretty 573 00:34:12,239 --> 00:34:15,040 Speaker 1: big difference there. That change might introduce enough noise for 574 00:34:15,040 --> 00:34:17,240 Speaker 1: it to be a problem.
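That whole worked example, the band value of seven thousand, eight hundred forty, the quantizer of one hundred, and the three scale factors, runs like this as a sketch:

```python
# Encode: divide by the quantizer, apply the scale factor, round to a
# whole number. Decode: undo the scale factor, multiply the quantizer
# back in. The rounding step is where the noise comes from.
def encode(value, quantizer, scale):
    return round(value / quantizer * scale)

def decode(stored, quantizer, scale):
    return stored / scale * quantizer

original, quantizer = 7840, 100
for scale in (1.0, 10, 0.1):
    stored = encode(original, quantizer, scale)
    restored = decode(stored, quantizer, scale)
    print(f"scale {scale}: stored {stored}, "
          f"restored {restored:.0f}, noise {abs(restored - original):.0f}")
```

Scale factor one point zero stores 78 and restores 7,800 (noise of 40); scale factor ten stores 784 and restores 7,840 exactly; scale factor point one stores just 8 but restores 8,000 (noise of 160), which is the trade the encoder is weighing.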
So how does the encoder 575 00:34:17,280 --> 00:34:20,400 Speaker 1: determine if a scale factor band is meeting the proper criteria? 576 00:34:20,440 --> 00:34:25,319 Speaker 1: How can it tell if there is too much 577 00:34:25,400 --> 00:34:28,799 Speaker 1: noise, or if the noise falls below the threshold? Well, 578 00:34:28,840 --> 00:34:32,360 Speaker 1: it goes through what's called a Huffman coding process. 579 00:34:32,440 --> 00:34:37,160 Speaker 1: At this point, Dylan is currently just staring at the 580 00:34:37,160 --> 00:34:41,480 Speaker 1: wall and drool is coming out. The Huffman coding process 581 00:34:41,520 --> 00:34:45,160 Speaker 1: converts scale factor bands into binary strings, and the process 582 00:34:45,200 --> 00:34:47,160 Speaker 1: goes through a series of tables to determine if the 583 00:34:47,239 --> 00:34:50,160 Speaker 1: data within the scale factor band requires more or less 584 00:34:50,200 --> 00:34:53,200 Speaker 1: precision to describe the sound without affecting the audio quality. 585 00:34:54,320 --> 00:34:56,719 Speaker 1: So Huffman coding is a process where you start 586 00:34:56,760 --> 00:34:58,880 Speaker 1: with a large number of possibilities and you begin to 587 00:34:58,960 --> 00:35:01,880 Speaker 1: narrow it down. Some people describe it as the 588 00:35:01,920 --> 00:35:05,719 Speaker 1: coding equivalent of twenty questions. So you ask your first 589 00:35:05,800 --> 00:35:08,960 Speaker 1: question, like animal, vegetable, or mineral. You get an answer: 590 00:35:09,080 --> 00:35:12,640 Speaker 1: animal. Well, that first answer eliminates a ton of 591 00:35:12,680 --> 00:35:16,400 Speaker 1: other possibilities and narrows the focus. Like, anything that doesn't 592 00:35:16,400 --> 00:35:20,120 Speaker 1: pertain to animal, you can automatically discount, because you already 593 00:35:20,160 --> 00:35:25,280 Speaker 1: know it can't apply to that answer.
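Here's a minimal Huffman coder to make the twenty-questions idea concrete. MP3 actually selects from fixed, predefined Huffman tables rather than building a tree per file, so this sketch only shows the core principle: frequent values get short bit strings, rare ones get long ones.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a prefix code: common symbols get the shortest bit strings."""
    heap = [(count, i, {sym: ""}) for i, (sym, count)
            in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two rarest subtrees; their codes each grow by one bit.
        c0, _, left = heapq.heappop(heap)
        c1, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c0 + c1, next_id, merged))
        next_id += 1
    return heap[0][2]

# A quantized band is often mostly zeros and small values.
band = [0] * 20 + [1] * 5 + [7, -3]
codes = huffman_codes(band)
bits = sum(len(codes[s]) for s in band)
print(f"{bits} bits, versus {len(band) * 4} at a fixed 4 bits per value")
```

The twenty zeros cost one bit each, and only the rare values pay for long codes, which is why this kind of coding shrinks quantized audio data so well.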
With MP3 compression, 594 00:35:25,320 --> 00:35:28,319 Speaker 1: this means making certain the number of bits representing a 595 00:35:28,360 --> 00:35:33,160 Speaker 1: granule, because, remember, I mentioned that in the MP3 format 596 00:35:33,280 --> 00:35:36,400 Speaker 1: you have frames, and each frame has one thousand, 597 00:35:36,400 --> 00:35:40,000 Speaker 1: one hundred fifty two samples and consists of two granules 598 00:35:40,000 --> 00:35:43,840 Speaker 1: with five hundred seventy six samples each. So when you answer the first question, 599 00:35:43,960 --> 00:35:46,640 Speaker 1: it eliminates a lot of other possibilities and narrows the focus. 600 00:35:46,640 --> 00:35:49,800 Speaker 1: So like with animal, vegetable, mineral: if I say animal, 601 00:35:49,920 --> 00:35:52,840 Speaker 1: you're not gonna ask any questions that have to do 602 00:35:52,880 --> 00:35:56,480 Speaker 1: with minerals or vegetables, only because it wouldn't make sense. 603 00:35:57,239 --> 00:35:59,400 Speaker 1: You know, those aren't gonna apply. Same thing with 604 00:35:59,440 --> 00:36:02,160 Speaker 1: MP3s, except this time it means making certain the 605 00:36:02,239 --> 00:36:05,799 Speaker 1: number of bits representing a granule. Remember, there are two granules 606 00:36:05,800 --> 00:36:09,680 Speaker 1: per frame with the MP3 layer. You want 607 00:36:09,680 --> 00:36:12,759 Speaker 1: to make sure that the number of bits representing that 608 00:36:12,800 --> 00:36:16,319 Speaker 1: granule match the chosen bit rate for compression. So 609 00:36:16,360 --> 00:36:18,640 Speaker 1: if, after going through this process, the encoder says, hey, 610 00:36:18,640 --> 00:36:21,839 Speaker 1: this granule has more bits than what's allowed, it's too 611 00:36:21,840 --> 00:36:24,680 Speaker 1: many bits.
We've gotta get rid of some of these, so 612 00:36:24,840 --> 00:36:27,200 Speaker 1: the encoder can adjust the scale factor band so that 613 00:36:27,239 --> 00:36:31,560 Speaker 1: there's less precision, meaning that multiplier, in other words, that 614 00:36:32,120 --> 00:36:35,480 Speaker 1: bit I talked about earlier, and thus reduce the amount 615 00:36:35,480 --> 00:36:40,120 Speaker 1: of data needed to represent that particular granule. If a 616 00:36:40,160 --> 00:36:44,120 Speaker 1: granule comes in under the bit rate, the encoder can 617 00:36:44,160 --> 00:36:48,320 Speaker 1: increase the precision to reduce noise and fill that granule 618 00:36:48,440 --> 00:36:55,040 Speaker 1: out properly so that it matches the actual threshold. After all this, 619 00:36:55,160 --> 00:36:58,360 Speaker 1: the pairs of granules become frames within the MP3 file, 620 00:36:58,360 --> 00:37:01,280 Speaker 1: and the only other component in an MP3 file apart 621 00:37:01,320 --> 00:37:04,719 Speaker 1: from these frames is the ID3 metadata. And 622 00:37:04,719 --> 00:37:06,799 Speaker 1: this is pretty simple. This is like a header, and 623 00:37:06,840 --> 00:37:09,080 Speaker 1: it comes before all the frames in the audio file 624 00:37:09,160 --> 00:37:13,000 Speaker 1: and contains information about the file itself, which can 625 00:37:13,000 --> 00:37:15,719 Speaker 1: include stuff like the title of a song, an artist name, 626 00:37:15,840 --> 00:37:19,640 Speaker 1: an album title, other stuff like that. It can also 627 00:37:19,680 --> 00:37:23,080 Speaker 1: include copyright information, as well as details 628 00:37:23,160 --> 00:37:25,440 Speaker 1: such as whether or not it's a stereo recording or a 629 00:37:25,440 --> 00:37:29,279 Speaker 1: mono recording.
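That fit-the-bit-rate adjustment can be sketched as a simple loop. The bit-cost model below is invented purely for illustration; a real encoder counts bits against the actual Huffman tables:

```python
# Toy rate loop: lower the scale factor (less precision, smaller stored
# integers, fewer bits) until the band fits the granule's bit budget.
def bits_needed(values, scale):
    # Crude cost model: each stored integer costs its binary bit length.
    return sum(max(1, round(abs(v) * scale)).bit_length() for v in values)

def fit_to_budget(values, budget):
    scale = 1.0
    while bits_needed(values, scale) > budget and scale > 1e-6:
        scale /= 2  # halve the precision, accepting a bit more noise
    return scale

band = [78.4, 12.0, 3.3, 0.9] * 4  # hypothetical band values
scale = fit_to_budget(band, budget=40)
print(f"scale {scale} fits: {bits_needed(band, scale)} bits")
```

Going the other direction, a granule that comes in under budget leaves room to raise the scale factor and buy back precision, which is the noise-reducing case Jonathan describes.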
So when you use a decoder, like an 630 00:37:29,360 --> 00:37:34,720 Speaker 1: MP3 player, it takes this compressed information, these 631 00:37:34,719 --> 00:37:40,960 Speaker 1: representations that the music has been reduced to, and 632 00:37:41,040 --> 00:37:44,520 Speaker 1: it converts that Huffman data back into the quantized format, 633 00:37:45,080 --> 00:37:47,759 Speaker 1: then scales the data back up to its original size, or a 634 00:37:47,800 --> 00:37:53,640 Speaker 1: close approximation. Remember, the decompressed version may actually be 635 00:37:53,719 --> 00:37:58,280 Speaker 1: off by a significant amount, depending upon each individual granule. 636 00:37:58,840 --> 00:38:01,080 Speaker 1: And all of that data gets combined into a new 637 00:38:01,160 --> 00:38:04,200 Speaker 1: PCM sample that can be played back to you. And 638 00:38:04,320 --> 00:38:07,120 Speaker 1: that's all there is to it. Nothing could be easier. 639 00:38:08,320 --> 00:38:11,920 Speaker 1: All right. That took a lot out of me. 640 00:38:11,960 --> 00:38:14,320 Speaker 1: I got really technical, and I apologize if I lost 641 00:38:14,360 --> 00:38:16,600 Speaker 1: any of you out there, or, for those of you 642 00:38:16,600 --> 00:38:19,160 Speaker 1: who have a lot of experience working on compression algorithms, 643 00:38:19,160 --> 00:38:23,040 Speaker 1: for oversimplifying in several cases. But now we've got a 644 00:38:23,040 --> 00:38:25,520 Speaker 1: full episode about this, and I hope you have a 645 00:38:25,520 --> 00:38:28,640 Speaker 1: better understanding of how a big sound file can be 646 00:38:28,719 --> 00:38:32,880 Speaker 1: reduced to a smaller sound file. Next time, I'll just 647 00:38:32,920 --> 00:38:36,160 Speaker 1: say magic. It will make everyone happier.
If you guys 648 00:38:36,200 --> 00:38:39,320 Speaker 1: have any questions for me, or comments or suggestions, anything 649 00:38:39,360 --> 00:38:42,480 Speaker 1: like that, send me a message. My email is tech 650 00:38:42,520 --> 00:38:45,520 Speaker 1: Stuff at how stuff works dot com, or you can 651 00:38:45,560 --> 00:38:48,120 Speaker 1: drop me a line on Facebook or Twitter. The handle at 652 00:38:48,239 --> 00:38:51,359 Speaker 1: both of those is tech Stuff H S W, and 653 00:38:51,400 --> 00:38:59,919 Speaker 1: I'll talk to you guys again really soon. For more 654 00:39:00,000 --> 00:39:02,279 Speaker 1: on this and thousands of other topics, visit how 655 00:39:02,320 --> 00:39:08,640 Speaker 1: stuff works dot com.