Speaker 1: You mentioned that you weren't really releasing music. Can you tell me about that decision?

Speaker 2: I discovered that if you typed in, across Suno and Udio, "make music like The Flashbulb," it would just sound like crappy versions of my music.

Speaker 1: Ben Jordan is a musician and YouTuber, and he releases music under the name The Flashbulb. You might not have heard his music yet, but if you've tried an AI music generator like Suno or Udio, you might have used his music. AI music generators work kind of like an audio version of ChatGPT: you can type in something like "make me a techno song with upbeat vocals and pianos," and it does it, based on a massive library of music that it's scraped. And when Ben tried messing around with one of these, he realized that his music had been scraped into that library too, without his consent.

Speaker 2: It's one of those things where I feel like a lot of people could type in their name and it might guess something similar to a song that they made. But this was undeniable. This was just like, oh, this is literally just everything that you would expect in a song of mine, even the weird things, except for the "me" part.

Speaker 1: So Ben came up with a solution: a program that adds imperceptible noise to a music track, confusing AI models and preventing them from replicating the track. This is a technique called poison pilling.

Speaker 2: Poison pilling started with images. There's one called Nightshade, and what it did is it essentially just generated some stuff in the images that was mostly invisible to humans, and then the AI would see it as something else, or it would confuse it.

Speaker 1: A couple of years ago, as LLMs were really taking off, a group of researchers at the University of Chicago developed Nightshade and Glaze. These are programs that take an image and make tiny changes to it. These changes are basically imperceptible to the human eye, but they confuse an AI model. The thinking was that if artists applied Nightshade to their images, and those images were then scraped to train AI models, it would not only prevent the models from learning anything from the individual artist's work, but would also, quote, "poison" the data sets and make those models less reliable.
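A quick aside for the technically curious: Nightshade and Glaze use far more sophisticated, targeted optimization than anything shown here, and real poisoning is optimized against the training process itself. But the bare mechanism, a tiny bounded perturbation computed with a surrogate model, can be sketched in a few lines. Everything below (the surrogate network, the epsilon budget) is a hypothetical stand-in, not the Chicago team's method.

```python
# Illustrative sketch only, not Nightshade: a tiny, bounded image change that
# misleads a model while staying faint to human eyes. The surrogate classifier
# is a hypothetical stand-in for a real vision model.
import torch
import torch.nn as nn

surrogate = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)

def poison(image: torch.Tensor, label: int, eps: float = 4 / 255) -> torch.Tensor:
    """One FGSM-style step: nudge every pixel by at most eps in the direction
    that increases the surrogate's loss on the true label."""
    image = image.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(
        surrogate(image.unsqueeze(0)), torch.tensor([label])
    )
    loss.backward()
    # eps = 4/255 per channel is essentially invisible at a glance
    return (image + eps * image.grad.sign()).clamp(0, 1).detach()

poisoned = poison(torch.rand(3, 64, 64), label=3)
```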
Speaker 2: And so I was like, okay, well, how possible is this with music? Because I know that adversarial noise attacks are possible on things like your Google Home or Alexa or Siri. And it turns out that it is totally possible.

[theme music]

Speaker 1: From Kaleidoscope and iHeart Podcasts, this is Kill Switch. I'm Dexter Thomas.

Speaker 1: Sorry, when was the first time that you really started feeling AI impacting your music, personally?

Speaker 2: The beginning would be almost in a positive, exploratory way. In, like, twenty sixteen, Google released Magenta, and on one of my albums I used it to sort of generate this weird morphing sound between three different instruments. It was unlike anything I had really heard before; it was this new type of synthesis. So of course I jumped all over it, and I was fascinated with it. And then, you know, moving on from there, it's just the current landscape that we're in that makes it so bad. So, for example, with Spotify refusing to pay out on songs with less than a thousand streams: you have, like, a year to get a thousand streams, and if you don't get that, they don't pay you. So you already have these low royalties being paid out by the digital streaming platforms, and you have way too many artists already, you know, for the system to actually work in a way where people would be making a living. And now you have people who aren't musicians who are just using these services to generate as many songs as they possibly can within their monthly subscription.

Speaker 1: AI-generated music is starting to creep into music platforms like Spotify and even YouTube.
Now, you might have come across it at this point; maybe you recognized it, and maybe you didn't. Aside from being really annoying for people who actually care about music, for musicians this is taking away attention, and thus money, because a lot of artists' income depends on the number of streams they get. And it also doesn't help that the CEO of Suno, which is one of the most popular AI music companies right now, doesn't seem to really appreciate the music creation process.

Speaker 3: It's not really enjoyable to make music now. It takes a lot of time, it takes a lot of practice. You need to get really good at an instrument or really good at a piece of production software. I think the majority of people don't enjoy the majority of the time they spend making music.

Speaker 2: That's, like, one of the most absurd things I've ever heard in my life. I mean, the way I hear that is as a CEO trying to justify the existence of a company in a practical business sense.

Speaker 1: So before he went the poison pill route, Ben's first idea was to make something that would detect whether music was generated by AI. That way, a platform like Spotify could use it to just reject any AI-generated music that someone tried to upload.

Speaker 2: Basically, when you put something on Spotify, or really anywhere where you listen to music, obviously file size and bandwidth are giant considerations, and so you have to compress it all. And within that, you use techniques like the inverse discrete cosine transform and a bunch of smart-sounding things, and you can detect a discrete cosine transform. And so the thing is that Suno and Udio, they allegedly went on YouTube and Spotify and they just scraped and scraped and scraped, and learned and learned and learned, and so it's quite easy to detect it.

Speaker 1: I see what you're saying. So you were able to basically detect: okay, this was downloaded from Spotify, because it's basically a sound signature. Even if a person doesn't hear it, computer-wise, you can tell.

Speaker 2: Yep.
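Ben doesn't spell out his detector in this conversation, but one cheap fingerprint in the same spirit is the spectral shelf that lossy codecs leave behind: MP3- and AAC-style compression discards most energy above roughly 16 kHz at common bitrates, and that absence tends to survive re-encoding. A crude sketch, with the cutoff, threshold, and input file as assumptions rather than anything from Ben's actual tool:

```python
# Crude compression-fingerprint check: near-silence above ~16 kHz suggests the
# audio passed through a lossy codec at some point. This is a toy proxy, not
# Ben's detector, and the cutoff/threshold values are assumptions.
import numpy as np
from scipy.io import wavfile

def looks_lossy(path, cutoff_hz=16000.0, threshold=1e-4):
    sr, audio = wavfile.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)            # mix to mono
    audio = audio.astype(np.float64)
    audio /= np.abs(audio).max() + 1e-12      # normalize
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    ratio = power[freqs >= cutoff_hz].sum() / (power.sum() + 1e-12)
    return ratio < threshold                  # tiny high-band energy: codec shelf

print(looks_lossy("track.wav"))               # hypothetical input file
```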
Speaker 2: And so, after I announced that, I got a lot of people saying, "Yeah, but now they're just going to use raw wave files, they're just going to use the masters." And it's like: good. Then they have to negotiate with the artist. That opens a conversation.

Speaker 1: The masters we're talking about here are the original, highest-quality tracks, which would be owned by the artist or the label. But in order to use those master files, you need to get them directly from the person who owns them, and that would mean you'd probably need to pay them. The goal isn't necessarily to stop AI, just to stop AI that the artist isn't getting paid for. And Ben thought this project could dissuade AI companies from scraping data, because platforms could use this tool to detect, and then reject, that AI music. But Spotify hasn't implemented this, and as far as we know, it hasn't stopped AI companies from continuing to scrape music.

Speaker 2: I mean, tell Sam Altman to pay for all the data that he's training everything with, then. So I guess that's what led to the next step, right? It's like, okay, well, how do we prevent it from being trained?

Speaker 1: Enter the poison pill. Ben knew that you could do this with images, but how would this work in audio? It turns out the process is actually pretty similar.

Speaker 2: It's actually not all that different from a technical standpoint, because the majority of AI music sites, how they're really working is, it's all based on the original U-Net model that was built for microscopic imaging. That's sort of what changed this whole generative AI thing and made it so much easier to train things than it used to be. But if you've ever seen an audio spectrogram: with a violin, you would see the note, or a line, slowly get thicker and thicker as the violin got louder, whereas a guitar or a piano would be an instant start, and a snare drum would be an instant start and then maybe a little bit of fade-out on the end, depending on how it's mixed. You know, there are programs, for example, where you can listen to audio that you draw, and so it's basically doing that. It's just reading the spectrogram of the audio, learning from that, then re-encoding another spectrogram and converting that back into audio. So it's kind of funny, because that's a little bit of a flawed way of generating audio to begin with. You can usually hear it if something's been converted to a spectrogram and back, and you can hear that in almost all AI music. That's kind of why it sounds a little glitchy or squeaky or... it's hard to describe.

Speaker 1: I didn't realize that. The way that these models are interpreting music is, they're really interpreting it as images.
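Here is a minimal, concrete version of the round trip Ben is describing, using librosa's Griffin-Lim inversion as a stand-in for whatever vocoder a given generator actually uses. Converting audio to a magnitude spectrogram throws away phase, and re-drawing audio from that image is one source of the glitchy, watery texture he mentions.

```python
# Audio -> magnitude spectrogram ("the image") -> audio again. Griffin-Lim here
# is a stand-in: real generators use learned vocoders, but all of them face the
# same image-to-audio conversion step Ben describes.
import librosa
import soundfile as sf

y, sr = librosa.load(librosa.example("trumpet"), sr=None)    # bundled demo clip

magnitude = abs(librosa.stft(y, n_fft=2048, hop_length=512))
y_roundtrip = librosa.griffinlim(magnitude, hop_length=512)  # phase is gone; estimate it

sf.write("roundtrip.wav", y_roundtrip, sr)  # same notes, audibly degraded texture
```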
Speaker 1: Using this technique, Ben created a process he calls Poisonify. When you run a track through Poisonify, it'll add noise that's imperceptible to us but visible in the audio spectrogram. This confuses the AI training on it, to the point where it can't identify instruments.

Speaker 2: So Poisonify is essentially preventing what Magenta initially did, where it learns, primarily, to identify instruments and identify style and things like that. It cloaks the track so that the model just thinks it's hearing something else. You could have targeted attacks, where you can say, "I want my piano to sound like a harmonica" or something, or you could have untargeted attacks, where it'll just kind of go with whatever's easiest. And when you use them in a particular way, you can successfully make these instruments and these styles unidentifiable.
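Poisonify itself isn't public, so the following only sketches the targeted-versus-untargeted distinction Ben describes, run against a hypothetical instrument classifier that takes spectrograms. The model, shapes, and budgets are illustrative stand-ins.

```python
# Sketch of targeted vs. untargeted adversarial noise on a spectrogram, in the
# spirit of (but not equal to) Poisonify. `classifier` is a hypothetical
# instrument classifier over batched spectrograms of shape (1, bins, frames).
import torch
import torch.nn.functional as F

def perturb(spec, classifier, eps=0.01, steps=40, target_class=None):
    delta = torch.zeros_like(spec, requires_grad=True)
    current = classifier(spec).argmax(dim=-1)          # what it hears right now
    for _ in range(steps):
        logits = classifier(spec + delta)
        if target_class is not None:
            # targeted: "make my piano read as a harmonica" -> minimize loss toward it
            loss, sign = F.cross_entropy(logits, torch.tensor([target_class])), -1.0
        else:
            # untargeted: just push away from the current prediction
            loss, sign = F.cross_entropy(logits, current), 1.0
        loss.backward()
        with torch.no_grad():
            delta += sign * (eps / steps) * delta.grad.sign()
            delta.clamp_(-eps, eps)                    # keep the noise tiny
            delta.grad.zero_()
    return (spec + delta).detach()
```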
Speaker 2: So Suno or Udio would get confused, and then, when it came time to draw a new spectrogram to convert into audio, they would probably draw some of the wrong ones.

Speaker 1: So this means that you could put, say, an EDM track that's treated with Poisonify into Suno, for example, and if you ask Suno to generate something similar or extend the song, it'll spit out something totally unrelated, like acoustic guitar music. Here's a clip from Ben testing that out with his own music, in his YouTube video about this process. So, here we go.

Speaker 2: We can upload my original song here... and now here is Suno's AI extension of that song.

[the AI-generated extension plays]

Speaker 2: Okay, now let's upload my Poisonify-encoded track... and here is Suno's AI-generated extension.

[the AI-generated extension plays]

Speaker 2: I would describe this as music from an airport spa that somebody downloaded off of Napster in nineteen ninety-nine.

Speaker 1: The entire video is definitely worth checking out, and we'll include a link to it in the show notes. But it's really interesting to hear how confused Suno gets when the track is encoded with Poisonify.

Speaker 2: In some of those demonstrations, a lot of people were like, "That is so crazy." And it's like, no, really, what it's doing is its own safety mechanism, because it knows that it was confused.

Speaker 1: This is pretty fascinating. Poisonify doesn't make the AI think the drums are flutes; it just confuses it so much, with the noise that it adds to the spectrogram, that the AI doesn't know what to do, falls back, and randomly chooses something that it knows to be music.

Speaker 2: So if you've ever used, like, generative AI with images, you'll notice something that happens quite often. You'll say, "I want a dog in a canoe eating a banana, headed towards a sunset," and the image you get might be, like, a dog on a jet ski, not eating anything, in a lake, with no sunset. And those are really just failsafes. Like, it tried to make a sunset, it didn't have enough confidence (it literally is called the confidence rating), and so it just said, okay, let's just make what we normally make in the background. And so it's sort of the same thing with music.
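The "confidence rating" Ben mentions can be loosely modeled as maximum softmax probability, which is one common proxy; real generators implement their fallbacks very differently, if they have an explicit one at all. The shape of the behavior, though, is roughly this:

```python
# Loose model of the fallback Ben describes: when no class is confident enough,
# retreat to a generic default. "Confidence" as max softmax probability is an
# assumption for illustration, not how Suno or Udio necessarily work.
import torch

def classify_with_fallback(logits, labels, fallback="generic background", min_conf=0.6):
    probs = torch.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    return fallback if conf.item() < min_conf else labels[idx.item()]

# Poisoned input -> flat, uncertain logits -> the safe, generic choice wins.
print(classify_with_fallback(torch.tensor([0.20, 0.30, 0.25]),
                             ["piano", "violin", "drums"]))
```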
Speaker 1: And there's another program that takes it a step beyond Poisonify. Instead of masking the instruments, it masks the music itself.

Speaker 2: HarmonyCloak? That's, like, above my pay grade. I don't really understand how that model works. I'm just really glad that they're working on it. But yeah, I mean, they obfuscate melody and harmony, which is pretty crazy.

Speaker 1: I talked to the developer of HarmonyCloak about how exactly they do this, and they even helped us test it out on the Kill Switch theme song. That's after the break.

[break]

Speaker 1: At the same time that Ben was working on Poisonify, researchers at the University of Tennessee, Knoxville were working on another way to poison pill music: a program they call HarmonyCloak.

Speaker 4: Humans and machines interpret data in different ways, so there's a perceptual gap between humans and machines.

Speaker 1: Jian Liu is an assistant professor at the University of Tennessee, Knoxville, and the lead developer of HarmonyCloak. He's also really into music himself.

Speaker 4: I love music. Actually, I also play music; I play bass guitar.

Speaker 1: HarmonyCloak is similar to Poisonify in that it adds imperceptible noise to the file. But unlike Poisonify, HarmonyCloak doesn't just work on the level of the instruments. It completely confuses the AI, so it can't learn from the music at all.

Speaker 4: So what we are doing right now is to use perturbation. We inject imperceptible perturbations into the music samples to trick the model into believing that it has already learned this before. So there's no new knowledge, no new information embedded in these music samples, so it couldn't learn anything from this piece of work.

Speaker 1: So the AI thinks there's no new information and essentially ignores everything that's in the file. This means that AI models can't train on music with HarmonyCloak applied.
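HarmonyCloak's actual optimization is specified in the paper; conceptually it echoes the "unlearnable examples" line of research, where the injected noise is optimized to minimize training error, so the sample looks as if there is nothing left to learn. A toy sketch, with the model and loop as illustrative stand-ins:

```python
# Toy "nothing new to learn" noise (error-MINIMIZING, unlike an attack that
# maximizes error). A rough conceptual analogue of what the host summarizes,
# not the HarmonyCloak algorithm itself; see the paper for the real method.
import torch
import torch.nn.functional as F

def unlearnable_noise(x, y, model, eps=0.02, steps=30):
    """Find a small delta so the model's loss on (x + delta, y) is near zero:
    the sample then contributes almost no training gradient."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x + delta), y).backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)   # stay below perceptibility
    return delta.detach()
```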
Speaker 1: You're talking about introducing noise, or what you call (you know, the technical term) perturbations, into the music, which you say are imperceptible. Are these actually imperceptible? Right, if you're adding noise, if you're adding extra data into the music, can I, as a listener, hear that?

Speaker 4: The perturbation we injected should have a minimal impact on the perceptual quality of the music, because no one wants to add noises to their artwork. So we conducted a very comprehensive user study. We presented both the original one and the perturbed one to musicians, and we asked them to tell the difference, and our study shows that they can't tell the difference between these two. I think in terms of the musical quality, there's no big difference.

Speaker 1: So, actually, the noise itself is audible?

Speaker 4: If you listen to the noises only, like you separate the perturbation from the music samples, you can hear something; it's audible. But if you combine these two, the noises will be hidden under the music samples, because we leveraged the psychoacoustic phenomenon. So when we listen to the music samples and the perturbation together, the noise will become imperceptible.
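The psychoacoustic trick can be shown crudely in code: shape the added noise under the music's own spectrum, so loud content masks it in every time-frequency bin. Real systems compute proper masking thresholds, much as MP3 encoders do; the flat 30 dB offset below is an arbitrary assumption for illustration.

```python
# Hide noise under the music's own spectrum: where the track is loud, it masks
# the noise; where it is silent, the noise stays near zero. A toy illustration
# of masking, not HarmonyCloak's or Music Shield's actual shaping.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load(librosa.example("trumpet"), sr=None)

stft = librosa.stft(y, n_fft=2048, hop_length=512)
raw = np.random.randn(*stft.shape) + 1j * np.random.randn(*stft.shape)
unit = raw / (np.abs(raw) + 1e-12)                 # random phase, unit magnitude

shaped = 10 ** (-30 / 20) * np.abs(stft) * unit    # ~30 dB under the music, per bin

y_out = librosa.istft(stft + shaped, hop_length=512, length=len(y))
sf.write("masked_noise.wav", y_out, sr)            # to most ears: the original
```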
Speaker 1: I was curious about this whole imperceptibility thing, so Jian said he would not only help us test it, but also use an updated process they're calling Music Shield. So I'm going to play a snippet of the Kill Switch theme song with and without Music Shield applied, and you see if you can tell the difference. Here's sample one.

[music]

Speaker 1: Okay, and here is sample two.

[music]

Speaker 1: So, you tell me: could you hear the difference? It's very slight, if anything. Oh, and by the way, if you were wondering, the one that was run through Music Shield was the first sample. And here's something else: Music Shield actually goes one step further than the HarmonyCloak process that we were talking about. It not only stops AI models from training on the music, but it can also prevent music generators like Suno or Udio from being able to edit or remix tracks. So let's give it a shot, and let's put this thing into Suno. If we upload our original, untreated theme into Suno and tell it to remix it, here's what we get.

[music]

Speaker 1: Which, for an AI generator, is not bad. That's in the ballpark of what the original song sounds like. And this is what happens when we upload the same theme after it's been Music Shielded.

[music]

Speaker 1: Okay. Yeah, this is different. It's a lot more soothing than the song we gave it. It kind of feels like a corporate video, maybe for investors at a defense contractor. There's maybe a guy in a suit on the screen, and he's telling you about how their business is really all about family. Clearly, it works: Suno got so confused that it just spit out some generic corporate music. So, I've read the paper that you published recently, entitled "HarmonyCloak: Making Music Unlearnable for Generative AI." One of the things that I found really interesting is the language that you use. Ostensibly, you're talking about music, a very broad, very easy-to-understand thing. But as I'm reading through your paper, I'm realizing I'm reading a security paper. Section three point one is entitled "Threat Model." And then as I read further down, I'm seeing... you know, I'm just gonna read from this a little bit. Yeah, you're laughing, but this is amazing to me.
Speaker 1: I mean: "The attacker, e.g., AI companies or model owners, might scrape music data from the Internet or music streaming platforms to train their music generative AI models, potentially leading to copyright infringements and harming musicians." This part right here, I love this: "We assume the attacker possesses substantial advantages and capabilities, including unrestricted access to the training data set and model parameters, facilitating comprehensive data engineering expansions, and the ability to perform adaptive attack strategies." And it goes on. But I mean, this is fascinating, because usually, if you read a security paper, you're thinking of the defender as, you know, somebody with some resources. It could be a bank, it could be a tech company, it could be a governmental agency. And the attacker is somebody with considerably less resources. This is the reverse of that. The attacker is somebody with a lot of resources, probably a large tech company, and the defender is just, you know, some kid with a bass guitar.

Speaker 4: Yeah, exactly. And actually, in the paper we also discussed the possible attacks, because the big tech company may also leverage additional strategies to relearn or process your protected music, to learn something from it. So one very straightforward way is to use noise cancellation techniques to remove any perturbations from the music samples. So that's maybe one strategy they leverage. But the idea here is that, yes, you can leverage whatever way to remove noises, and you can reduce the effectiveness of our framework, but on the other side, the quality of the music will be dropped as well, because when you remove noises, certain music features will be removed as well.
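To make that tradeoff concrete, here is a toy spectral-gating denoiser: the only way it can scrub a quiet, hidden perturbation is by also deleting quiet musical detail (reverb tails, breaths, room tone). A real attacker's tools would be stronger, but they face the same collateral damage Liu describes.

```python
# Toy spectral gate: zero every quiet time-frequency bin. It removes hidden
# low-level perturbations AND low-level music alike, which is Liu's point.
import numpy as np
import librosa

def spectral_gate(y, gate_db=-40.0):
    stft = librosa.stft(y, n_fft=2048, hop_length=512)
    mag = np.abs(stft)
    stft[mag < 10 ** (gate_db / 20) * mag.max()] = 0.0
    return librosa.istft(stft, hop_length=512, length=len(y))

y, sr = librosa.load(librosa.example("trumpet"), sr=None)
cleaned = spectral_gate(y)
# How much of the track itself was thrown away along with any "noise":
loss_db = 10 * np.log10(np.sum((y - cleaned) ** 2) / np.sum(y ** 2) + 1e-12)
print(f"energy of discarded content: {loss_db:.1f} dB relative to the original")
```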
Speaker 1: Yeah, so I think I see what you're saying here. So your framework, HarmonyCloak: the entire purpose is to make the song unusable for the tech company. And if the tech company then tries to do something to mitigate that, to remove the noise, the perturbations that you've introduced into the music, that reduces the quality. And do they really want to be putting bad-quality data into the data set? No. So you've accomplished your goal, which is, again, to make it unusable for them.

Speaker 4: Yeah.

Speaker 1: Ben Jordan has a similar philosophy with his Poisonify project. He knows it's going to be a battle with AI companies, but that's kind of the point.

Speaker 2: A lot of people have said, "Well, you know, with what happened with Nightshade and images, they're just going to do something like that."

Speaker 1: To clarify what Ben's talking about here: pretty soon after Nightshade came out (that's the poison pilling tool for images mentioned earlier), people started saying that they'd figured out a way to bypass it by blurring or sharpening out the noise. The team that developed Nightshade disputes this, but their other popular tool, called Glaze, was also briefly bypassed using some image upscaling techniques. It's basically a back-and-forth war here. And because with audio, the AI is still processing a spectrogram image, in theory an AI company could also use similar techniques to bypass Poisonify and HarmonyCloak.

Speaker 2: They might. There are a couple of considerations, though. So, like, when we talked about that audio-into-a-spectrogram-and-back-into-audio thing: think about how a snare drum works in a spectrogram. Okay, it kind of starts immediately, and then it fades out a little bit, and that gives it, like, the right sound. Yeah? If you were to blur that, now it sounds really bad. You have this...

[Ben imitates the sound]

Speaker 2: And so different things need to be precise.
Speaker 2: And not only that. So, they use, like, the blurring and the AI sharpening, and if they were to do that with spectrograms, it's still a lot of extra compute and expense. And really, the goal is just to pressure them to work with musicians. Like, if they actually want to make money off of this, and they want to continue selling subscriptions to generate this stuff, then, you know, it's just to pressure them to make getting the wave file directly from a musician cheaper and easier than doing it the way they've been doing it, without consent.

Speaker 1: Uh huh. So part of this is, you're not necessarily thinking that this is an undefeatable attack against, you know, AI scraping. It kind of sounds like you're sort of hoping that it becomes obsolete. Because we know there's going to be an arms race. We know companies are going to figure out a way to defeat your poison pill attack. Let's just be real: they've got more resources than you, they've got more engineers than you do. They're going to figure it out, sure. But it kind of sounds like you'd just rather them decide, "you know what, this ain't worth it."

Speaker 2: Yeah. I mean, you know, I do hear, like, the arms race analogy all the time, and it's like, war is almost always a net loss. And if they have anybody smart among whoever's funding them, they will understand...

Speaker 1: ...that you can make it so annoying, yeah, to scrape people's music that they're just going to, you know, quote unquote, do the right thing.

Speaker 2: Yeah. Ultimately, you'd want it to be so omnipresent that AI music sites actually have to say: okay, well, we need to just talk to artists now. We need to just start training on stuff that we know doesn't have this, because we're wasting too much money on compute for things that are just degrading the model quality.

Speaker 1: So if you're a musician, you're probably thinking: when can I start using this stuff to protect my music?
Speaker 1: Or, if you're a music fan, you might want to know when this stuff drops, so your favorite artists can stop getting their music scraped and stolen. Well, I've got bad news for you, but also some good news. That's after the break.

[break]

Speaker 1: So I know that there are definitely going to be some musicians who will hear this and say, "This sounds amazing. I want this now." Is this something that's available right now?

Speaker 2: It's funny, because after I released that video, I probably got a hundred emails of people just linking me to, like, a Google Drive of their songs, and I'm just like, okay, this is not how it works, unfortunately. So, I'm not an ML developer; I can't write Python code. I believe it took about two weeks for ten songs, or something like that, on my machine, with two brand-new, state-of-the-art, you know, big video cards.

Speaker 1: Two weeks as in on-and-off encoding? No? Nonstop? Two weeks of nonstop encoding to do your album. Okay, yeah, that's not accessible. I will not be asking you to handle my album for me, then. Never mind.

Speaker 2: Yeah. And it's also like, even if I could, it's still like: okay, well, if it's this inefficient, using this much power... you know, what we don't want is to set the planet on fire just to protect our music from a couple of startups.

Speaker 1: So Ben Jordan's program might not be available to the public anytime soon. That's the bad news. But here's some good news: Professor Jian Liu and his team do have some near-future plans to make their software more widely available. For a musician, in the future, what would protecting your music with something like HarmonyCloak look like? Is it downloading an app? Is it uploading it to a site and re-downloading it? What would they be doing?

Speaker 4: There are many, many ways to use this technology to protect their music.
First of all, we are thinking to 449 00:28:12,280 --> 00:28:18,399 Speaker 4: integrate these technologies with other platforms, for example, Apple Music Spodify, 450 00:28:19,040 --> 00:28:22,119 Speaker 4: so in that case, once they upload their music to 451 00:28:22,160 --> 00:28:27,240 Speaker 4: their platform, they can automatically protect their music. We'll also 452 00:28:27,480 --> 00:28:30,000 Speaker 4: create a web set, so on our web set they 453 00:28:30,080 --> 00:28:34,879 Speaker 4: can upload their music then download the perturbrization version from it, 454 00:28:35,160 --> 00:28:37,640 Speaker 4: so musicians people can use this very easily. 455 00:28:37,920 --> 00:28:39,880 Speaker 1: Do you have a timeline for when this might be 456 00:28:39,880 --> 00:28:40,920 Speaker 1: available for the public. 457 00:28:41,160 --> 00:28:44,600 Speaker 4: In July, we plan to launch a test program which 458 00:28:44,640 --> 00:28:47,560 Speaker 4: will involve around two hundred musicians so that we can 459 00:28:47,600 --> 00:28:53,480 Speaker 4: folder fun team fold improved this system before large scale deployment, 460 00:28:53,840 --> 00:28:59,200 Speaker 4: and if everything goes smoothly, I think integration of this 461 00:28:59,400 --> 00:29:02,640 Speaker 4: technology in at the Plasma will be very quick. Hopefully 462 00:29:02,720 --> 00:29:06,800 Speaker 4: this can be integrated in August ord September this year. 463 00:29:09,600 --> 00:29:11,920 Speaker 1: Despite the fact that he's working on programs that are 464 00:29:11,960 --> 00:29:16,880 Speaker 1: actively fighting against AI, jin, Lu is not universally anti AI, 465 00:29:17,320 --> 00:29:20,120 Speaker 1: and neither is Ben Jordan. They both think that AI 466 00:29:20,200 --> 00:29:22,080 Speaker 1: can be a useful tool. 467 00:29:22,280 --> 00:29:26,360 Speaker 4: I think AI machine learning itself doesn't have any problems. 468 00:29:26,800 --> 00:29:30,280 Speaker 4: The problem is how this big tech company trend their models. 469 00:29:30,600 --> 00:29:34,240 Speaker 4: And also from the musician's perspective, because we talk to 470 00:29:34,680 --> 00:29:37,880 Speaker 4: many many musicians, actually some of them use AI. They 471 00:29:37,880 --> 00:29:41,120 Speaker 4: feel this AI model is pretty useful. But if these 472 00:29:41,160 --> 00:29:44,400 Speaker 4: company wants to use their music examples for training models, 473 00:29:44,920 --> 00:29:49,400 Speaker 4: they need to get explicit permission. Also they need to 474 00:29:49,480 --> 00:29:52,440 Speaker 4: offer compensation to musicians. 475 00:29:54,880 --> 00:29:56,680 Speaker 1: How do you feel like, say the next six months 476 00:29:56,800 --> 00:29:58,920 Speaker 1: year plays out for music and AI. 
Speaker 2: One thing that is probably good news for anybody who's worried about AI music taking over anything is that psychoacoustics are really, really complicated. Telling a computer to hear something without any sort of image analysis, to just not go that spectral-conversion route and just hear something the way a human hears? You may as well just ask it to become self-aware, because that sounds easier to me. Just because of what's happening, you know: our hearing is by far the most sensitive sense that we have, and when you think about what happens, from picking up pressure waves, to little hairs in our ear picking this up and interpreting them, in conjunction with our brain, into sounds, it's kind of mysterious and crazy. And so to just tell an AI, "Hey, listen to this sonic pressure and figure out how to make it again"? That's a much bigger ask than I think it sounds to your average investor or something, who thinks that AI music is going to eventually, I don't know, I guess, replace musicians or something.

Speaker 1: Yeah.

Speaker 2: Ideally, what I would really like to see as people get more used to AI is two things. I would like people to use it locally. So, for example, Imogen Heap: she sent me a bunch, probably over an hour, of her singing, sometimes in really weird ways, and then we sort of worked together and I created a voice model, and then she sang through the voice model. And that was all local; it wasn't happening through any sort of service. I really liked that idea. And I like the idea of artists being able to sell their voice, or sell their music style, or their instruments, or something like that, and put it in a marketplace. And really, the only technology that would need to exist in any sort of centralized way would just be somebody to watermark it or something. I really like that.
Speaker 2: And the other thing is: right now, we're in this place in generative AI where the ideas are just huge and kind of nonsensical. Like, you know, "What if AI replaced music?" It's like: nope, it's not gonna do that. But what if it replaced samplers? You know, like violin sample instruments or something. That's actually somewhere where AI can do a really, really good job, to make writing music more fun and more accurate, I guess, and, you know, things like that. And so once we sort of realize that not every single person is going to adopt AI music and stop listening to humans, then maybe we can invest money into making practical solutions.

Speaker 1: And that is it for this particular discussion about AI and music. And I say "this particular discussion" because this is not the last time we're going to be talking about AI and music. This is a really big topic, and all of us at Kill Switch are pretty into music, so we're absolutely going to be getting back into this again. And, you know, if you've got any music-related stuff that you're curious about, let us know. Before I get out of here, though, I gotta do some shout-outs. First, big shout-out to Ben Jordan. If you found his Poisonify concept interesting, he has a whole YouTube video on it, and the link for that is in the show notes. He's also started a company called TopSet Labs that's developing AI voice models trained on artists who have given their explicit consent, and a lot of them are making more money in those royalties than they do on Spotify. So if you're a musician, or you're just curious about how that works, you might want to check that out too. Also, a big shout-out to our other guest, Professor Jian Liu, as well as Syed Irfan Ali Meerza, from the University of Tennessee Knoxville, for letting us test out HarmonyCloak and Music Shield. And if you want to check out that paper we were referencing, there's a link to that also in the show notes.
Speaker 1: Thank you so much again for listening to Kill Switch, and again, let us know what you think and if there's something you want us to cover. We're easy to find: you can hit us up at killswitch at kaleidoscope dot NYC, or you can check us out on Instagram at killswitchpod. Or I'm dexdigi (that's d-e-x-d-i-g-i) on Instagram or Bluesky. And wherever you're listening to us, make sure to leave us a review, because it helps other people find the show, and that helps us keep doing our thing. Kill Switch is hosted by me, Dexter Thomas. It's produced by Shina Ozaki, Darluk Potts, and Kate Osborne. Our theme song is by me and Kyle Murdoch, and Kyle also mixed the show. From Kaleidoscope, our executive producers are Oz Woloshyn, Mangesh Hattikudur, and Kate Osborne. From iHeart, our executive producers are Katrina Norvell and Nikki Ettore. Catch you on the next one.