Speaker 1: You mentioned that you weren't really releasing music. Can you tell me about that decision?

Speaker 2: I discovered that if you typed in, across Suno and Udio, "make music like The Flashbulb," it would just sound like crappy versions of my music.

Speaker 1: Ben Jordan is a musician and YouTuber, and he releases music under the name The Flashbulb. You might not have heard his music yet, but if you've tried an AI music generator like Suno or Udio, you might have used his music. AI music generators work kind of like an audio version of ChatGPT: you can type in something like "make me a techno song with upbeat vocals and pianos," and it does it, based on a massive library of music that it's scraped. And when Ben tried messing around with one of these, he realized that his music had been scraped into that library too, without his consent.

Speaker 2: It's one of those things where I feel like a lot of people could type in their name and it might guess something similar to a song that they made. But this was undeniable. This was just like, oh, this is literally just everything that you would expect in a song of mine, even the weird things, except for the "me" part.

Speaker 1: So Ben came up with a solution: a program that adds imperceptible noise to a music track, confusing AI models and preventing them from replicating the track. This is a technique called poison pilling.

Speaker 2: Poison pilling started with images. There's one called Nightshade, and what it did is it essentially just generated some stuff in the images that was mostly invisible to humans, and then the AI would see it as something else, or it would confuse it.

Speaker 1: A couple of years ago, as LLMs were really taking off, a group of researchers at the University of Chicago developed Nightshade and Glaze. These are programs that take an image and make tiny changes to it. These changes are basically imperceptible to the human eye, but they confuse an AI model. The thinking was that if artists applied Nightshade to their images, and those images were then scraped to train AI models, it would not only prevent the models from learning anything from the individual artist's work, but would also, quote, "poison" the data sets and make those models less reliable.
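A quick aside for the technically curious: Nightshade and Glaze use far more sophisticated, targeted optimization than anything shown here, and real poisoning is optimized against the training process itself. But the bare mechanism, a tiny bounded perturbation computed with a surrogate model, can be sketched in a few lines. Everything below (the surrogate network, the epsilon budget) is a hypothetical stand-in, not the Chicago team's method.

```python
# Illustrative sketch only, not Nightshade: a tiny, bounded image change that
# misleads a model while staying faint to human eyes. The surrogate classifier
# is a hypothetical stand-in for a real vision model.
import torch
import torch.nn as nn

surrogate = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

def poison(image: torch.Tensor, label: int, eps: float = 4 / 255) -> torch.Tensor:
    """One FGSM-style step: nudge every pixel by at most eps in the direction
    that increases the surrogate's loss on the true label."""
    image = image.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(
        surrogate(image.unsqueeze(0)), torch.tensor([label])
    )
    loss.backward()
    # eps = 4/255 per channel is essentially invisible at a glance
    return (image + eps * image.grad.sign()).clamp(0, 1).detach()

poisoned = poison(torch.rand(3, 64, 64), label=3)
```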
Speaker 2: And so I was like, okay, well, how possible is this with music? Because I know that adversarial noise attacks are possible on things like your Google Home or Alexa or Siri. And it turns out that it is totally possible.

[theme music]

Speaker 1: From Kaleidoscope and iHeart Podcasts, this is Kill Switch. I'm Dexter Thomas.

Speaker 1: Sorry, when was the first time that you really started feeling AI impacting your music, personally?

Speaker 2: The beginning would be almost in a positive, exploratory way. In, like, twenty sixteen, Google released Magenta, and on one of my albums I used it to sort of generate this weird morphing sound between three different instruments. It was unlike anything I had really heard before; it was this new type of synthesis. So of course I jumped all over it, and I was fascinated with it. And then, you know, moving on from there, it's just the current landscape that we're in that makes it so bad. So, for example, with Spotify refusing to pay out on songs with less than a thousand streams: you have, like, a year to get a thousand streams, and if you don't get that, they don't pay you. So you already have these low royalties being paid out by the digital streaming platforms, and you have way too many artists already, you know, for the system to actually work in a way where people would be making a living. And now you have people who aren't musicians who are just using these services to generate as many songs as they possibly can within their monthly subscription.

Speaker 1: AI-generated music is starting to creep into music platforms like Spotify and even YouTube.
Now, you might have come across it at this point; maybe you recognized it, and maybe you didn't. Aside from being really annoying for people who actually care about music, for musicians this is taking away attention, and thus money, because a lot of artists' income depends on the number of streams they get. And it also doesn't help that the CEO of Suno, which is one of the most popular AI music companies right now, doesn't seem to really appreciate the music creation process.

Speaker 3: It's not really enjoyable to make music now. It takes a lot of time, it takes a lot of practice. You need to get really good at an instrument or really good at a piece of production software. I think the majority of people don't enjoy the majority of the time they spend making music.

Speaker 2: That's, like, one of the most absurd things I've ever heard in my life. I mean, the way I hear that is as a CEO trying to justify the existence of a company in a practical business sense.

Speaker 1: So before he went the poison pill route, Ben's first idea was to make something that would detect whether music was generated by AI. That way, a platform like Spotify could use it to just reject any AI-generated music that someone tried to upload.

Speaker 2: Basically, when you put something on Spotify, or really anywhere where you listen to music, obviously file size and bandwidth are giant considerations, and so you have to compress it all. And within that, you use techniques like the inverse discrete cosine transform and a bunch of smart-sounding things, and you can detect a discrete cosine transform. And so the thing is that Suno and Udio, they allegedly went on YouTube and Spotify and they just scraped and scraped and scraped, and learned and learned and learned, and so it's quite easy to detect it.

Speaker 1: I see what you're saying. So you were able to basically detect: okay, this was downloaded from Spotify, because it's basically a sound signature. Even if a person doesn't hear it, computer-wise, you can tell.

Speaker 2: Yep.
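Ben doesn't spell out his detector in this conversation, but one cheap fingerprint in the same spirit is the spectral shelf that lossy codecs leave behind: MP3- and AAC-style compression discards most energy above roughly 16 kHz at common bitrates, and that absence tends to survive re-encoding. A crude sketch, with the cutoff, threshold, and input file as assumptions rather than anything from Ben's actual tool:

```python
# Crude compression-fingerprint check: near-silence above ~16 kHz suggests the
# audio passed through a lossy codec at some point. This is a toy proxy, not
# Ben's detector, and the cutoff/threshold values are assumptions.
import numpy as np
from scipy.io import wavfile

def looks_lossy(path, cutoff_hz=16000.0, threshold=1e-4):
    sr, audio = wavfile.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)            # mix to mono
    audio = audio.astype(np.float64)
    audio /= np.abs(audio).max() + 1e-12      # normalize
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    ratio = power[freqs >= cutoff_hz].sum() / (power.sum() + 1e-12)
    return ratio < threshold                  # tiny high-band energy: codec shelf

print(looks_lossy("track.wav"))               # hypothetical input file
```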
Speaker 2: And so, after I announced that, I got a lot of people saying, "Yeah, but now they're just going to use raw wave files, they're just going to use the masters." And it's like: good. Then they have to negotiate with the artist. That opens a conversation.

Speaker 1: The masters we're talking about here are the original, highest-quality tracks, which would be owned by the artist or the label. But in order to use those master files, you need to get them directly from the person who owns them, and that would mean you'd probably need to pay them. The goal isn't necessarily to stop AI, just to stop AI that the artist isn't getting paid for. And Ben thought this project could dissuade AI companies from scraping data, because platforms could use this tool to detect, and then reject, that AI music. But Spotify hasn't implemented this, and as far as we know, it hasn't stopped AI companies from continuing to scrape music.

Speaker 2: I mean, tell Sam Altman to pay for all the data that he's training everything with, then. So I guess that's what led to the next step, right? It's like, okay, well, how do we prevent it from being trained?

Speaker 1: Enter the poison pill. Ben knew that you could do this with images, but how would this work in audio? It turns out the process is actually pretty similar.

Speaker 2: It's actually not all that different from a technical standpoint, because the majority of AI music sites, how they're really working is, it's all based on the original U-Net model that was built for microscopic imaging. That's sort of what changed this whole generative AI thing and made it so much easier to train things than it used to be. But if you've ever seen an audio spectrogram: with a violin, you would see the note, or a line, slowly get thicker and thicker as the violin got louder, whereas a guitar or a piano would be an instant start, and a snare drum would be an instant start and then maybe a little bit of fade-out on the end, depending on how it's mixed. You know, there are programs, for example, where you can listen to audio that you draw, and so it's basically doing that. It's just reading the spectrogram of the audio, learning from that, then re-encoding another spectrogram and converting that back into audio. So it's kind of funny, because that's a little bit of a flawed way of generating audio to begin with. You can usually hear it if something's been converted to a spectrogram and back, and you can hear that in almost all AI music. That's kind of why it sounds a little glitchy or squeaky or... it's hard to describe.

Speaker 1: I didn't realize that. The way that these models are interpreting music is, they're really interpreting it as images.
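Here is a minimal, concrete version of the round trip Ben is describing, using librosa's Griffin-Lim inversion as a stand-in for whatever vocoder a given generator actually uses. Converting audio to a magnitude spectrogram throws away phase, and re-drawing audio from that image is one source of the glitchy, watery texture he mentions.

```python
# Audio -> magnitude spectrogram ("the image") -> audio again. Griffin-Lim here
# is a stand-in: real generators use learned vocoders, but all of them face the
# same image-to-audio conversion step Ben describes.
import librosa
import soundfile as sf

y, sr = librosa.load(librosa.example("trumpet"), sr=None)    # bundled demo clip

magnitude = abs(librosa.stft(y, n_fft=2048, hop_length=512))
y_roundtrip = librosa.griffinlim(magnitude, hop_length=512)  # phase is gone; estimate it

sf.write("roundtrip.wav", y_roundtrip, sr)  # same notes, audibly degraded texture
```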
Speaker 1: Using this technique, Ben created a process he calls Poisonify. When you run a track through Poisonify, it'll add noise that's imperceptible to us but visible in the audio spectrogram. This confuses the AI training on it, to the point where it can't identify instruments.

Speaker 2: So Poisonify is essentially preventing what Magenta initially did, where it learns, primarily, to identify instruments and identify style and things like that. It cloaks the track so that the model just thinks it's hearing something else. You could have targeted attacks, where you can say, "I want my piano to sound like a harmonica" or something, or you could have untargeted attacks, where it'll just kind of go with whatever's easiest. And when you use them in a particular way, you can successfully make these instruments and these styles unidentifiable.
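Poisonify itself isn't public, so the following only sketches the targeted-versus-untargeted distinction Ben describes, run against a hypothetical instrument classifier that takes spectrograms. The model, shapes, and budgets are illustrative stand-ins.

```python
# Sketch of targeted vs. untargeted adversarial noise on a spectrogram, in the
# spirit of (but not equal to) Poisonify. `classifier` is a hypothetical
# instrument classifier over batched spectrograms of shape (1, bins, frames).
import torch
import torch.nn.functional as F

def perturb(spec, classifier, eps=0.01, steps=40, target_class=None):
    delta = torch.zeros_like(spec, requires_grad=True)
    current = classifier(spec).argmax(dim=-1)          # what it hears right now
    for _ in range(steps):
        logits = classifier(spec + delta)
        if target_class is not None:
            # targeted: "make my piano read as a harmonica" -> minimize loss toward it
            loss, sign = F.cross_entropy(logits, torch.tensor([target_class])), -1.0
        else:
            # untargeted: just push away from the current prediction
            loss, sign = F.cross_entropy(logits, current), 1.0
        loss.backward()
        with torch.no_grad():
            delta += sign * (eps / steps) * delta.grad.sign()
            delta.clamp_(-eps, eps)                    # keep the noise tiny
            delta.grad.zero_()
    return (spec + delta).detach()
```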
Speaker 2: So Suno or Udio would get confused, and then, when it came time to draw a new spectrogram to convert into audio, they would probably draw some of the wrong ones.

Speaker 1: So this means that you could put, say, an EDM track that's treated with Poisonify into Suno, for example, and if you ask Suno to generate something similar or extend the song, it'll spit out something totally unrelated, like acoustic guitar music. Here's a clip from Ben testing that out with his own music, in his YouTube video about this process. So, here we go.

Speaker 2: We can upload my original song here... and now here is Suno's AI extension of that song.

[the AI-generated extension plays]

Speaker 2: Okay, now let's upload my Poisonify-encoded track... and here is Suno's AI-generated extension.

[the AI-generated extension plays]

Speaker 2: I would describe this as music from an airport spa that somebody downloaded off of Napster in nineteen ninety-nine.

Speaker 1: The entire video is definitely worth checking out, and we'll include a link to it in the show notes. But it's really interesting to hear how confused Suno gets when the track is encoded with Poisonify.

Speaker 2: In some of those demonstrations, a lot of people were like, "That is so crazy." And it's like, no, really, what it's doing is its own safety mechanism, because it knows that it was confused.

Speaker 1: This is pretty fascinating. Poisonify doesn't make the AI think the drums are flutes; it just confuses it so much, with the noise that it adds to the spectrogram, that the AI doesn't know what to do, falls back, and randomly chooses something that it knows to be music.

Speaker 2: So if you've ever used, like, generative AI with images, you'll notice something that happens quite often. You'll say, "I want a dog in a canoe eating a banana, headed towards a sunset," and the image you get might be, like, a dog on a jet ski, not eating anything, in a lake, with no sunset. And those are really just failsafes. Like, it tried to make a sunset, it didn't have enough confidence (it literally is called the confidence rating), and so it just said, okay, let's just make what we normally make in the background. And so it's sort of the same thing with music.
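The "confidence rating" Ben mentions can be loosely modeled as maximum softmax probability, which is one common proxy; real generators implement their fallbacks very differently, if they have an explicit one at all. The shape of the behavior, though, is roughly this:

```python
# Loose model of the fallback Ben describes: when no class is confident enough,
# retreat to a generic default. "Confidence" as max softmax probability is an
# assumption for illustration, not how Suno or Udio necessarily work.
import torch

def classify_with_fallback(logits, labels, fallback="generic background", min_conf=0.6):
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return fallback if conf.item() < min_conf else labels[idx.item()]

# Poisoned input -> flat, uncertain logits -> the safe, generic choice wins.
print(classify_with_fallback(torch.tensor([0.20, 0.30, 0.25]),
                             ["piano", "violin", "drums"]))
```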
Speaker 1: And there's another program that takes it a step beyond Poisonify. Instead of masking the instruments, it masks the music itself.

Speaker 2: HarmonyCloak? That's, like, above my pay grade. I don't really understand how that model works. I'm just really glad that they're working on it. But yeah, I mean, they obfuscate melody and harmony, which is pretty crazy.

Speaker 1: I talked to the developer of HarmonyCloak about how exactly they do this, and they even helped us test it out on the Kill Switch theme song. That's after the break.

[break]

Speaker 1: At the same time that Ben was working on Poisonify, researchers at the University of Tennessee, Knoxville were working on another way to poison pill music: a program they call HarmonyCloak.

Speaker 4: Humans and machines interpret data in different ways, so there's a perceptual gap between humans and machines.

Speaker 1: Jian Liu is an assistant professor at the University of Tennessee, Knoxville, and the lead developer of HarmonyCloak. He's also really into music himself.

Speaker 4: I love music. Actually, I also play music; I play bass guitar.

Speaker 1: HarmonyCloak is similar to Poisonify in that it adds imperceptible noise to the file. But unlike Poisonify, HarmonyCloak doesn't just work on the level of the instruments. It completely confuses the AI, so it can't learn from the music at all.

Speaker 4: So what we are doing right now is to use perturbation. We inject imperceptible perturbations into the music samples to trick the model into believing that it has already learned this before. So there's no new knowledge, no new information embedded in these music samples, so it couldn't learn anything from this piece of work.

Speaker 1: So the AI thinks there's no new information and essentially ignores everything that's in the file. This means that AI models can't train on music with HarmonyCloak applied.
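HarmonyCloak's actual optimization is specified in the paper; conceptually it echoes the "unlearnable examples" line of research, where the injected noise is optimized to minimize training error, so the sample looks as if there is nothing left to learn. A toy sketch, with the model and loop as illustrative stand-ins:

```python
# Toy "nothing new to learn" noise (error-MINIMIZING, unlike an attack that
# maximizes error). A rough conceptual analogue of what the host summarizes,
# not the HarmonyCloak algorithm itself; see the paper for the real method.
import torch
import torch.nn.functional as F

def unlearnable_noise(x, y, model, eps=0.02, steps=30):
    """Find a small delta so the model's loss on (x + delta, y) is near zero:
    the sample then contributes almost no training gradient."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x + delta), y).backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)   # stay below perceptibility
    return delta.detach()
```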
Speaker 1: You're talking about introducing noise, or what you call (you know, the technical term) perturbations, into the music, which you say are imperceptible. Are these actually imperceptible? Right, if you're adding noise, if you're adding extra data into the music, can I, as a listener, hear that?

Speaker 4: The perturbation we injected should have a minimal impact on the perceptual quality of the music, because no one wants to add noises to their artwork. So we conducted a very comprehensive user study. We presented both the original one and the perturbed one to musicians, and we asked them to tell the difference, and our study shows that they can't tell the difference between these two. I think in terms of the musical quality, there's no big difference.

Speaker 1: So, actually, the noise itself is audible?

Speaker 4: If you listen to the noises only, like you separate the perturbation from the music samples, you can hear something; it's audible. But if you combine these two, the noises will be hidden under the music samples, because we leveraged the psychoacoustic phenomenon. So when we listen to the music samples and the perturbation together, the noise will become imperceptible.
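The psychoacoustic trick can be shown crudely in code: shape the added noise under the music's own spectrum, so loud content masks it in every time-frequency bin. Real systems compute proper masking thresholds, much as MP3 encoders do; the flat 30 dB offset below is an arbitrary assumption for illustration.

```python
# Hide noise under the music's own spectrum: where the track is loud, it masks
# the noise; where it is silent, the noise stays near zero. A toy illustration
# of masking, not HarmonyCloak's or Music Shield's actual shaping.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load(librosa.example("trumpet"), sr=None)

stft = librosa.stft(y, n_fft=2048, hop_length=512)
raw = np.random.randn(*stft.shape) + 1j * np.random.randn(*stft.shape)
unit = raw / (np.abs(raw) + 1e-12)                 # random phase, unit magnitude

shaped = 10 ** (-30 / 20) * np.abs(stft) * unit    # ~30 dB under the music, per bin

y_out = librosa.istft(stft + shaped, hop_length=512, length=len(y))
sf.write("masked_noise.wav", y_out, sr)            # to most ears: the original
```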
Speaker 1: I was curious about this whole imperceptibility thing, so Jian said he would not only help us test it, but also use an updated process they're calling Music Shield. So I'm going to play a snippet of the Kill Switch theme song with and without Music Shield applied, and you see if you can tell the difference. Here's sample one.

[music]

Speaker 1: Okay, and here is sample two.

[music]

Speaker 1: So, you tell me: could you hear the difference? It's very slight, if anything. Oh, and by the way, if you were wondering, the one that was run through Music Shield was the first sample. And here's something else: Music Shield actually goes one step further than the HarmonyCloak process that we were talking about. It not only stops AI models from training on the music, but it can also prevent music generators like Suno or Udio from being able to edit or remix tracks. So let's give it a shot, and let's put this thing into Suno. If we upload our original, untreated theme into Suno and tell it to remix it, here's what we get.

[music]

Speaker 1: Which, for an AI generator, is not bad. That's in the ballpark of what the original song sounds like. And this is what happens when we upload the same theme after it's been Music Shielded.

[music]

Speaker 1: Okay. Yeah, this is different. It's a lot more soothing than the song we gave it. It kind of feels like a corporate video, maybe for investors at a defense contractor. There's maybe a guy in a suit on the screen, and he's telling you about how their business is really all about family. Clearly, it works: Suno got so confused that it just spit out some generic corporate music. So, I've read the paper that you published recently, entitled "HarmonyCloak: Making Music Unlearnable for Generative AI." One of the things that I found really interesting is the language that you use. Ostensibly, you're talking about music, a very broad, very easy-to-understand thing. But as I'm reading through your paper, I'm realizing I'm reading a security paper. Section three point one is entitled "Threat Model." And then as I read further down, I'm seeing... you know, I'm just gonna read from this a little bit. Yeah, you're laughing, but this is amazing to me.
Speaker 1: I mean: "The attacker, e.g., AI companies or model owners, might scrape music data from the Internet or music streaming platforms to train their music generative AI models, potentially leading to copyright infringements and harming musicians." This part right here, I love this: "We assume the attacker possesses substantial advantages and capabilities, including unrestricted access to the training data set and model parameters, facilitating comprehensive data engineering expansions, and the ability to perform adaptive attack strategies." And it goes on. But I mean, this is fascinating, because usually, if you read a security paper, you're thinking of the defender as, you know, somebody with some resources. It could be a bank, it could be a tech company, it could be a governmental agency. And the attacker is somebody with considerably less resources. This is the reverse of that. The attacker is somebody with a lot of resources, probably a large tech company, and the defender is just, you know, some kid with a bass guitar.

Speaker 4: Yeah, exactly. And actually, in the paper we also discussed the possible attacks, because the big tech company may also leverage additional strategies to relearn or process your protected music, to learn something from it. So one very straightforward way is to use noise cancellation techniques to remove any perturbations from the music samples. So that's maybe one strategy they leverage. But the idea here is that, yes, you can leverage whatever way to remove noises, and you can reduce the effectiveness of our framework, but on the other side, the quality of the music will be dropped as well, because when you remove noises, certain music features will be removed as well.
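To make that tradeoff concrete, here is a toy spectral-gating denoiser: the only way it can scrub a quiet, hidden perturbation is by also deleting quiet musical detail (reverb tails, breaths, room tone). A real attacker's tools would be stronger, but they face the same collateral damage Liu describes.

```python
# Toy spectral gate: zero every quiet time-frequency bin. It removes hidden
# low-level perturbations AND low-level music alike, which is Liu's point.
import numpy as np
import librosa

def spectral_gate(y, gate_db=-40.0):
    stft = librosa.stft(y, n_fft=2048, hop_length=512)
    mag = np.abs(stft)
    stft[mag < 10 ** (gate_db / 20) * mag.max()] = 0.0
    return librosa.istft(stft, hop_length=512, length=len(y))

y, sr = librosa.load(librosa.example("trumpet"), sr=None)
cleaned = spectral_gate(y)
# How much of the track itself was thrown away along with any "noise":
loss_db = 10 * np.log10(np.sum((y - cleaned) ** 2) / np.sum(y ** 2) + 1e-12)
print(f"energy of discarded content: {loss_db:.1f} dB relative to the original")
```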
Speaker 1: Yeah, so I think I see what you're saying here. So your framework, HarmonyCloak: the entire purpose is to make the song unusable for the tech company. And if the tech company then tries to do something to mitigate that, to remove the noise, the perturbations that you've introduced into the music, that reduces the quality. And do they really want to be putting bad-quality data into the data set? No. So you've accomplished your goal, which is, again, to make it unusable for them.

Speaker 4: Yeah.

Speaker 1: Ben Jordan has a similar philosophy with his Poisonify project. He knows it's going to be a battle with AI companies, but that's kind of the point.

Speaker 2: A lot of people have said, "Well, you know, with what happened with Nightshade and images, they're just going to do something like that."

Speaker 1: To clarify what Ben's talking about here: pretty soon after Nightshade came out (that's the poison pilling tool for images mentioned earlier), people started saying that they'd figured out a way to bypass it by blurring or sharpening out the noise. The team that developed Nightshade disputes this, but their other popular tool, called Glaze, was also briefly bypassed using some image upscaling techniques. It's basically a back-and-forth war here. And because with audio, the AI is still processing a spectrogram image, in theory an AI company could also use similar techniques to bypass Poisonify and HarmonyCloak.

Speaker 2: They might. There are a couple of considerations, though. So, like, when we talked about that audio-into-a-spectrogram-and-back-into-audio thing: think about how a snare drum works in a spectrogram. Okay, it kind of starts immediately, and then it fades out a little bit, and that gives it, like, the right sound. Yeah? If you were to blur that, now it sounds really bad. You have this...

[Ben imitates the sound]

Speaker 2: And so different things need to be precise.
Speaker 2: And not only that. So, they use, like, the blurring and the AI sharpening, and if they were to do that with spectrograms, it's still a lot of extra compute and expense. And really, the goal is just to pressure them to work with musicians. Like, if they actually want to make money off of this, and they want to continue selling subscriptions to generate this stuff, then, you know, it's just to pressure them to make getting the wave file directly from a musician cheaper and easier than doing it the way they've been doing it, without consent.

Speaker 1: Uh huh. So part of this is, you're not necessarily thinking that this is an undefeatable attack against, you know, AI scraping. It kind of sounds like you're sort of hoping that it becomes obsolete. Because we know there's going to be an arms race. We know companies are going to figure out a way to defeat your poison pill attack. Let's just be real: they've got more resources than you, they've got more engineers than you do. They're going to figure it out, sure. But it kind of sounds like you'd just rather them decide, "you know what, this ain't worth it."

Speaker 2: Yeah. I mean, you know, I do hear, like, the arms race analogy all the time, and it's like, war is almost always a net loss. And if they have anybody smart among whoever's funding them, they will understand...

Speaker 1: ...that you can make it so annoying, yeah, to scrape people's music that they're just going to, you know, quote unquote, do the right thing.

Speaker 2: Yeah. Ultimately, you'd want it to be so omnipresent that AI music sites actually have to say: okay, well, we need to just talk to artists now. We need to just start training on stuff that we know doesn't have this, because we're wasting too much money on compute for things that are just degrading the model quality.

Speaker 1: So if you're a musician, you're probably thinking: when can I start using this stuff to protect my music?
Speaker 1: Or, if you're a music fan, you might want to know when this stuff drops, so your favorite artists can stop getting their music scraped and stolen. Well, I've got bad news for you, but also some good news. That's after the break.

[break]

Speaker 1: So I know that there are definitely going to be some musicians who will hear this and say, "This sounds amazing. I want this now." Is this something that's available right now?

Speaker 2: It's funny, because after I released that video, I probably got a hundred emails of people just linking me to, like, a Google Drive of their songs, and I'm just like, okay, this is not how it works, unfortunately. So, I'm not an ML developer; I can't write Python code. I believe it took about two weeks for ten songs, or something like that, on my machine, with two brand-new, state-of-the-art, you know, big video cards.

Speaker 1: Two weeks as in on-and-off encoding? No? Nonstop? Two weeks of nonstop encoding to do your album. Okay, yeah, that's not accessible. I will not be asking you to handle my album for me, then. Never mind.

Speaker 2: Yeah. And it's also like, even if I could, it's still like: okay, well, if it's this inefficient, using this much power... you know, what we don't want is to set the planet on fire just to protect our music from a couple of startups.

Speaker 1: So Ben Jordan's program might not be available to the public anytime soon. That's the bad news. But here's some good news: Professor Jian Liu and his team do have some near-future plans to make their software more widely available. For a musician, in the future, what would protecting your music with something like HarmonyCloak look like? Is it downloading an app? Is it uploading it to a site and re-downloading it? What would they be doing?

Speaker 4: There are many, many ways to use this technology to protect their music.
First of all, we are thinking to 449 00:28:12,280 --> 00:28:18,399 Speaker 4: integrate these technologies with other platforms, for example, Apple Music Spodify, 450 00:28:19,040 --> 00:28:22,119 Speaker 4: so in that case, once they upload their music to 451 00:28:22,160 --> 00:28:27,240 Speaker 4: their platform, they can automatically protect their music. We'll also 452 00:28:27,480 --> 00:28:30,000 Speaker 4: create a web set, so on our web set they 453 00:28:30,080 --> 00:28:34,879 Speaker 4: can upload their music then download the perturbrization version from it, 454 00:28:35,160 --> 00:28:37,640 Speaker 4: so musicians people can use this very easily. 455 00:28:37,920 --> 00:28:39,880 Speaker 1: Do you have a timeline for when this might be 456 00:28:39,880 --> 00:28:40,920 Speaker 1: available for the public. 457 00:28:41,160 --> 00:28:44,600 Speaker 4: In July, we plan to launch a test program which 458 00:28:44,640 --> 00:28:47,560 Speaker 4: will involve around two hundred musicians so that we can 459 00:28:47,600 --> 00:28:53,480 Speaker 4: folder fun team fold improved this system before large scale deployment, 460 00:28:53,840 --> 00:28:59,200 Speaker 4: and if everything goes smoothly, I think integration of this 461 00:28:59,400 --> 00:29:02,640 Speaker 4: technology in at the Plasma will be very quick. Hopefully 462 00:29:02,720 --> 00:29:06,800 Speaker 4: this can be integrated in August ord September this year. 463 00:29:09,600 --> 00:29:11,920 Speaker 1: Despite the fact that he's working on programs that are 464 00:29:11,960 --> 00:29:16,880 Speaker 1: actively fighting against AI, jin, Lu is not universally anti AI, 465 00:29:17,320 --> 00:29:20,120 Speaker 1: and neither is Ben Jordan. They both think that AI 466 00:29:20,200 --> 00:29:22,080 Speaker 1: can be a useful tool. 467 00:29:22,280 --> 00:29:26,360 Speaker 4: I think AI machine learning itself doesn't have any problems. 468 00:29:26,800 --> 00:29:30,280 Speaker 4: The problem is how this big tech company trend their models. 469 00:29:30,600 --> 00:29:34,240 Speaker 4: And also from the musician's perspective, because we talk to 470 00:29:34,680 --> 00:29:37,880 Speaker 4: many many musicians, actually some of them use AI. They 471 00:29:37,880 --> 00:29:41,120 Speaker 4: feel this AI model is pretty useful. But if these 472 00:29:41,160 --> 00:29:44,400 Speaker 4: company wants to use their music examples for training models, 473 00:29:44,920 --> 00:29:49,400 Speaker 4: they need to get explicit permission. Also they need to 474 00:29:49,480 --> 00:29:52,440 Speaker 4: offer compensation to musicians. 475 00:29:54,880 --> 00:29:56,680 Speaker 1: How do you feel like, say the next six months 476 00:29:56,800 --> 00:29:58,920 Speaker 1: year plays out for music and AI. 
Speaker 2: One thing that is probably good news for anybody who's worried about AI music taking over anything is that psychoacoustics are really, really complicated. Telling a computer to hear something without any sort of image analysis, to just not go that spectral-conversion route and just hear something the way a human hears? You may as well just ask it to become self-aware, because that sounds easier to me. Just because of what's happening, you know: our hearing is by far the most sensitive sense that we have, and when you think about what happens, from picking up pressure waves, to little hairs in our ear picking this up and interpreting them, in conjunction with our brain, into sounds, it's kind of mysterious and crazy. And so to just tell an AI, "Hey, listen to this sonic pressure and figure out how to make it again"? That's a much bigger ask than I think it sounds to your average investor or something, who thinks that AI music is going to eventually, I don't know, I guess, replace musicians or something.

Speaker 1: Yeah.

Speaker 2: Ideally, what I would really like to see as people get more used to AI is two things. I would like people to use it locally. So, for example, Imogen Heap: she sent me a bunch, probably over an hour, of her singing, sometimes in really weird ways, and then we sort of worked together and I created a voice model, and then she sang through the voice model. And that was all local; it wasn't happening through any sort of service. I really liked that idea. And I like the idea of artists being able to sell their voice, or sell their music style, or their instruments, or something like that, and put it in a marketplace. And really, the only technology that would need to exist in any sort of centralized way would just be somebody to watermark it or something. I really like that.
Speaker 2: And the other thing is: right now, we're in this place in generative AI where the ideas are just huge and kind of nonsensical. Like, you know, "What if AI replaced music?" It's like: nope, it's not gonna do that. But what if it replaced samplers? You know, like violin sample instruments or something. That's actually somewhere where AI can do a really, really good job, to make writing music more fun and more accurate, I guess, and, you know, things like that. And so once we sort of realize that not every single person is going to adopt AI music and stop listening to humans, then maybe we can invest money into making practical solutions.

Speaker 1: And that is it for this particular discussion about AI and music. And I say "this particular discussion" because this is not the last time we're going to be talking about AI and music. This is a really big topic, and all of us at Kill Switch are pretty into music, so we're absolutely going to be getting back into this again. And, you know, if you've got any music-related stuff that you're curious about, let us know. Before I get out of here, though, I gotta do some shout-outs. First, big shout-out to Ben Jordan. If you found his Poisonify concept interesting, he has a whole YouTube video on it, and the link for that is in the show notes. He's also started a company called TopSet Labs that's developing AI voice models trained on artists who have given their explicit consent, and a lot of them are making more money in those royalties than they do on Spotify. So if you're a musician, or you're just curious about how that works, you might want to check that out too. Also, a big shout-out to our other guest, Professor Jian Liu, as well as Syed Irfan Ali Meerza, from the University of Tennessee Knoxville, for letting us test out HarmonyCloak and Music Shield. And if you want to check out that paper we were referencing, there's a link to that also in the show notes.
Speaker 1: Thank you so much again for listening to Kill Switch, and again, let us know what you think and if there's something you want us to cover. We're easy to find: you can hit us up at killswitch at kaleidoscope dot NYC, or you can check us out on Instagram at killswitchpod. Or I'm dexdigi (that's d-e-x-d-i-g-i) on Instagram or Bluesky. And wherever you're listening to us, make sure to leave us a review, because it helps other people find the show, and that helps us keep doing our thing. Kill Switch is hosted by me, Dexter Thomas. It's produced by Shina Ozaki, Darluk Potts, and Kate Osborne. Our theme song is by me and Kyle Murdoch, and Kyle also mixed the show. From Kaleidoscope, our executive producers are Oz Woloshyn, Mangesh Hattikudur, and Kate Osborne. From iHeart, our executive producers are Katrina Norvell and Nikki Ettore. Catch you on the next one.