Speaker 1: Welcome to TechStuff, a production from iHeartRadio. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio. And how the tech are you? In April twenty twenty-three, the lawyers for Tesla CEO Elon Musk argued that submitted recordings of their client from twenty sixteen might have been deep fakes. The ongoing case is an emotionally charged one. In twenty eighteen, a man named Walter Huang died in a car accident. The Tesla he was in was a Tesla Model X, and it was engaged in Autopilot mode at the time of the crash. His family contends that the Tesla's safety systems failed and the vehicle steered itself into a concrete median, and the family's lawyer submitted a recording of Elon Musk as evidence that Huang was led to believe his vehicle had greater capabilities than it actually possessed. In this recording, Elon Musk said of Tesla vehicles, quote, "A Model S and Model X at this point can drive autonomously with greater safety than a person right now," end quote. In response, Musk's lawyers said the recording could be faked. Now, not to waffle about this, but if we're speaking solely on a technical level, the recording could be faked. And by that I mean there are technologies that are sophisticated enough to create a fake recording. But just because something could be faked doesn't mean it actually was faked. And the judge in the Tesla case, Evette Pennypacker, who has an amazing name, said that this argument is a truly dangerous one. The judge said that it implies, quote, "that because Mr. Musk is famous and might be more of a target for deep fakes, his public statements are immune," end quote. In other words, if you're notable enough or notorious enough, you have a carte blanche excuse for anything that you are recorded as saying, because maybe someone just targeted you and created a fake version to discredit you. In twenty eighteen, Danielle Citron and Robert Chesney wrote a paper in which they predicted this sort of situation.
Speaker 1: They dubbed it the liar's dividend: when there is a proliferation of technology that can create misinformation or outright disinformation, the liars out there reap the benefits, because what is the truth anyway? When you can't trust the evidence, everything falls apart. This is just one of the many challenges deep fake technology presents. There are potentially harmless or perhaps even beneficial uses of this technology, but it doesn't take much imagination to come up with ways to cause harm. Let's talk a second about the entertainment industry. With deep fake technology, it becomes possible to create videos and audio recordings that simulate celebrities, which potentially allows a director to cast a film with people who otherwise would be very much unavailable. Using sufficiently sophisticated deep fakes, you could create a movie that combines a cast of modern and classic film stars. Maybe you want the Marx Brothers running around with Will Ferrell. Maybe you want Lon Chaney Jr. to show up in your modern werewolf movie. Or maybe you're doing something slightly less extreme; maybe you're using the technology to generate a younger version of your current star, a la Harrison Ford in the upcoming Indiana Jones and the Dial of Destiny film. So it doesn't have to be immoral or sinister, but it does bring into question concepts like the right to personality, or the right to identity, or the right to publicity. Presumably filmmakers wouldn't want to move forward on any project with a computer-generated simulation of a real film star without permission from that person or their family. But it's possible to do it, and depending on the movie, maybe they do go ahead without securing permission first. Maybe it's an edgy parody film, and the buzz around their decision to do this could end up being a boost to marketing. People would say, "How dare they do this," and then go buy tickets to see the fallout of it.
Speaker 1: For actors, there's a real concern that this technology could rob them of work; that if they turned down a role, the filmmaker could just get a computer-generated version of them in there; or that they could, you know, appear in projects that they don't actually agree with; and, perhaps most importantly for many actors out there, that this could all happen without compensation for the original actor. I know that it could be tough to feel sympathetic toward big Hollywood stars, but keep in mind the vast majority of working actors out there are not raking in huge movie deals. They're just as worried about AI biting into their work as the rest of us are. Then there's the world of audio performance. Earlier this year, a TikTok user with the handle ghostwriter977 wrote and produced a song called Heart on My Sleeve. But ghostwriter977 didn't provide the vocals for this track. Instead, they used AI-generated deep fake vocal simulations of the artists Drake and The Weeknd. The songwriter then posted the release on multiple platforms, and it quickly went viral. Universal Music Group sprang into action right away and claimed copyright infringement. And I am no legal expert, but in my mind that's a weak argument. After all, the song itself was an original. It was not a cover. It had not been stolen from someone's discography. You cannot copyright the sound of a voice. Universal Music Group doesn't own the vocal quality of Drake or The Weeknd, and I'm sure those artists would be concerned to learn otherwise. And even if the agreement between the label and the artists did go all Ursula from The Little Mermaid and claim ownership of the voices themselves, there's not really a legal foundation to use that as a deterrent against deep fakes. Universal Music Group did argue that the deep fake voices used tons of recorded material to train themselves to sound like those artists. That is most certainly the case.
Speaker 1: We'll dive into deep fake techniques a bit later in this episode, but it often boils down to machine learning and using a lot of training material to educate a model about what it is you want it to do. The more material you can submit in training, the better. And Universal Music Group said, quote, "the training of generative AI using our artists' music, which represents both a breach of our agreements and a violation of copyright law," end quote, before going on to suggest that allowing Heart on My Sleeve to exist is akin to powering up Skynet so that the Terminators will become real. I'm exaggerating only a little bit. And again, I am not a copyright expert, but it's hard for me to imagine how training an AI model on music is in itself a violation of copyright law. After all, every musician, every artist, heck, every person who has been around other people has been influenced by the work of other people. Sometimes you can actually hear the influences in music. You might hear an artist play and say, oh, that reminds me of Johnny Cash or something like that. The history of art is one in which succeeding generations iterate on the works of those who came before them. Sometimes they make drastic departures from the generations that came before them, but even that is in response to the influence of the earlier art. So if you make the argument that training AI on specific works is wrong, how do you differentiate that from someone who gets their start playing song covers, or maybe writing their own stuff but with musical influences from identifiable artists? Because art is not created in a vacuum. Obviously, using AI is different. It can lead to the creation of a near-perfect simulation of the original artist. But the method of training the AI isn't really that different from a budding musician voraciously devouring the entire discography of their favorite artists before emulating those artists in their own work.
Speaker 1: It is a sticky wicket, no question about it, and we're in the early stages of figuring out how to handle it, which is particularly unfortunate since the technology is already here. But how did we get here? Well, an exhaustive history of deep fake technology would require a full series of episodes about the history of artificial intelligence and machine learning in general, and computer vision in particular, as well as text-to-speech and lots of other related technologies. But for our purposes, we'll simply acknowledge that countless computer scientists and programmers have spent endless hours advancing computer technology with the goal of finding ways to make machines, quote unquote, understand data. This is easier said than done, so let's take images as an example, as that will factor heavily in our discussion today. We humans can glance at a photo and we can immediately identify what is an object versus just a background. So if you have a red mug placed in front of a white cinder block wall, we can see what is a mug and what is a wall. But we have to teach computers how to do that, and when you're talking about technologies that generate moving images, it becomes even more complicated.
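To make that concrete, here is a deliberately naive sketch of what the computer actually works with in that red-mug-on-a-white-wall example: just a grid of numbers. Everything in it, the synthetic image and the color thresholds, is invented for illustration; modern vision systems learn features rather than relying on hand-picked rules like this.

```python
import numpy as np

# A synthetic 100x100 RGB "photo": white cinder block wall, reddish mug.
image = np.full((100, 100, 3), 255, dtype=np.uint8)  # white background
image[40:90, 30:60] = (200, 30, 30)                  # red rectangle as the "mug"

r = image[:, :, 0].astype(int)
g = image[:, :, 1].astype(int)
b = image[:, :, 2].astype(int)

# Call a pixel "mug" if it is strongly red and weakly green/blue.
# These thresholds are arbitrary and fragile; shift the lighting and they fail.
mask = (r > 150) & (g < 100) & (b < 100)

print(f"mug pixels: {mask.sum()}, wall pixels: {(~mask).sum()}")
```

A person sees "mug" and "wall" instantly; the computer only gets numbers, which is exactly why decades of research went into teaching it to find objects at all.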
Speaker 1: So, for lack of a clear beginning, I am somewhat arbitrarily going to start in nineteen ninety-seven. Now, a couple of things happened that year that would be important for us to talk about, and one was not quite deep fake technology, but it did illustrate some potential ethical issues we had to think about. And that was a commercial that aired during a big old American football game. You know, the one that happens every year. You know, the one I can't call by name for, you know, legal reasons. Anyway, one famous feature of this big old American football game is that brands will shell out huge amounts of money to air commercials during it. And one brand to do that in nineteen ninety-seven was the Dirt Devil vacuum cleaner company. Now, those of you across the pond would call it a hoover, not a vacuum cleaner, but Hoover is a different brand altogether, so stop confusing me. In the commercial, famous actor and dancer Fred Astaire is shown dancing with Dirt Devil vacuum cleaners. But here's the thing: Fred Astaire had died a decade earlier. The footage was taken from his films, with Dirt Devil inserting the imagery of its products into the footage to make it seem as if Astaire had actually shot commercials this way and really danced with vacuum cleaners. So in this case, the footage of Astaire was legitimate. It was the appearance of the vacuum cleaners that had been inserted into it. But the use of footage of performers who have passed away prompted a debate about the ethics of that practice, and people began to speculate about what might happen once technology reached a point where a computer simulation of a person would be indistinguishable from the real thing. Meanwhile, also in nineteen ninety-seven, a group of computer scientists published an important work. The scientists were Christoph, or Chris, Bregler, Michele Covell, and Malcolm Slaney. The paper's title is Video Rewrite: Driving Visual Speech with Audio. This work built on top of a lot of other previous work. For example, face interpretation was already a discipline in computer science; it traces its history all the way back to the nineteen sixties. Ditto for technology that could generate speech from text; that, too, dates back to the nineteen sixties. Computer animation had been around for a while by nineteen ninety-seven, so creating a 3D model of lips, one that you could subsequently animate, was also already a thing. But what these researchers did was bring all these elements together. It was a convergence of technologies that resulted in a new application, one which would allow for computer-generated synthetic video of real people.
Speaker 1: The team created the Video Rewrite software, and they also showed what it was capable of doing in some very, very short video clips. The results are primitive by today's standards, but nonetheless impressive. In one two-second clip, President JFK appears to say, "I never met Forrest Gump." It's a cheeky reference to the nineteen ninety-four film, which included a segment in which the titular character Forrest Gump appears to meet JFK and then informs him that he needs to rush off to the restroom. Video Rewrite served as a foundation for technologies that we could refer to as deep fake tech. So just a few years later, in two thousand and one, Christopher J. Taylor, Gareth J. Edwards, and Timothy F. Cootes, whose middle initial being F and not J actually upsets Jonathan because of the lack of consistency, published a paper that was titled Active Appearance Models. The abstract for this paper reads, in part, quote, "We describe a new method of matching statistical models of appearance to images," end quote. Now, in plain English, this paper describes a method in which computer vision relies on statistical models to more accurately identify elements within an image. So let's consider facial recognition technology. As I mentioned earlier, computers do not inherently understand images. If presented with a picture of a face, a computer cannot naturally determine what the various features of that face are. Only through proper programming and machine learning can you start to do this and train a computer to recognize features like a nose, a mouth, eyebrows, eyes, et cetera. And by training machines on millions of faces, you can reach a point where the machine can examine a new face, one that has never before been submitted to the machine, and attempt to identify those features. This is a necessary step with a lot of deep fake technology.
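For a feel of what "statistical models of appearance" means, here is a minimal sketch of the shape half of that idea: run PCA over many example landmark layouts, so any new face can be summarized as the average layout plus a few learned modes of variation. The landmark positions and training data here are synthetic placeholders; a real active appearance model also models texture and fits itself to an image iteratively.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 toy "faces", each with 5 landmarks (x, y): eyes, nose, mouth corners.
mean_shape = np.array([[30, 30], [70, 30], [50, 50], [35, 75], [65, 75]], float)
shapes = mean_shape + rng.normal(scale=3.0, size=(200, 5, 2))
X = shapes.reshape(200, -1)            # flatten each face to a 10-D shape vector

# PCA: learn the principal ways the landmarks move together across faces.
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
modes = Vt[:3]                         # keep the 3 strongest variation modes

# Any face is now summarized by 3 numbers instead of 10 raw coordinates.
new_face = X[0]
params = modes @ (new_face - mu)
reconstructed = mu + params @ modes
print("reconstruction error:", np.abs(reconstructed - new_face).max())
```

The payoff is that a face becomes a handful of parameters the computer can search over, rather than a pile of pixels.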
Speaker 1: See, to call all deep fakes computer-generated is a little misleading. Often what is happening is a computer is replacing an existing person or face in a video with someone else's features. In order to do that, you first have to be able to map and identify the original person that was in the video; you need to be able to match the synthesized face with the movements of the original face. To do that, the computer first has to encode the original face, essentially to break it down into lots of smaller shapes. Then it has to be able to match the synthesized face to the original one with a similar encoded approach, and then decode that into the synthesized face that replaces the original one and then follows the various motions of the original face. So you're replacing one person with another through the use of a computer, and as part of that, the computer has to break down the original person into points of data that the computer can handle.
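Here is a stripped-down sketch of that encode-and-decode idea, in the shared-encoder, two-decoder form popularized by open-source face-swap tools: one encoder learns a compact code for pose and expression, each identity gets its own decoder, and the swap is encoding person A's frame and decoding it with person B's decoder. Every size and the one training step on random tensors are placeholders meant to show the structure, not a working deepfake.

```python
import torch
import torch.nn as nn

LATENT = 128

def make_encoder():
    return nn.Sequential(
        nn.Flatten(),                     # 3x64x64 face crop -> flat vector
        nn.Linear(3 * 64 * 64, 512), nn.ReLU(),
        nn.Linear(512, LATENT),           # shared "pose/expression" code
    )

def make_decoder():
    return nn.Sequential(
        nn.Linear(LATENT, 512), nn.ReLU(),
        nn.Linear(512, 3 * 64 * 64), nn.Sigmoid(),
        nn.Unflatten(1, (3, 64, 64)),     # back to an image
    )

encoder = make_encoder()
decoder_a, decoder_b = make_decoder(), make_decoder()

# Training (sketched): each decoder reconstructs its own person from the
# shared code, which keeps the code itself identity-agnostic.
opt = torch.optim.Adam(
    list(encoder.parameters())
    + list(decoder_a.parameters())
    + list(decoder_b.parameters()), lr=1e-4)
faces_a = torch.rand(8, 3, 64, 64)        # stand-ins for real face crops
faces_b = torch.rand(8, 3, 64, 64)
loss = nn.functional.mse_loss(decoder_a(encoder(faces_a)), faces_a) \
     + nn.functional.mse_loss(decoder_b(encoder(faces_b)), faces_b)
opt.zero_grad(); loss.backward(); opt.step()

# The swap: person A's pose and expression, rendered with person B's face.
swapped = decoder_b(encoder(faces_a))
print(swapped.shape)                      # torch.Size([8, 3, 64, 64])
```

The design choice worth noticing is the asymmetry: the encoder is shared so the latent code describes the performance, while the decoders carry the identities.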
Speaker 1: So with this technology, I could stand facing a camera and deliver a speech and then, using software designed to follow the steps I just laid out, replace my image with that of someone else. If I also used a program designed to create a vocal impersonation of that someone else, well, I could create a video where some celebrity says things that they would never say. Like maybe I could create a video of Keanu Reeves saying, "TechStuff is my favorite podcast. Jonathan is such a cool host. I wish I could hang out with him." For the record, Mr. Reeves, I would never actually do that. I'm just saying I could do it. Of course, creating a video image of Keanu Reeves would just be one part of the equation. Another would be replicating his voice. Now, I could try and do my own impersonation, but this would so clearly be fake that I would never achieve my goal of trying to make it appear as though Keanu Reeves knows who I am and wants to hang out with me. I can't even say "whoa" the way he does. To achieve my dreams, I would need a voice synthesis program that I could train on Keanu's voice and then produce a computer-generated impersonation. The history of voice synthesis is crazy long. I mean, if we really wanted to dive into it, we could go all the way back to the late seventeen hundreds. But we won't, because I can't keep you here that long. Text-to-speech technology brings us a bit closer to modern day, but then we're still talking about the nineteen sixties or thereabouts, as I mentioned earlier in this episode. To get to a point where computers are capable of producing an imitation of a specific person's voice, then we're getting up to like the last decade or so. Researchers built tools that train on how a specific person produces different sounds, phonemes, if we want to think of it in terms of language and the sounds of language. After that training, we have applications that can take text, interpret that text as a series of sounds, pull upon the computer's knowledge of how a particular person makes those specific sounds, and then, voila, we have ourselves a copy.
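As a toy illustration of that "look up how this speaker makes each sound, then string the sounds together" idea, the sketch below fakes a voice with one sine tone per phoneme standing in for a recorded snippet of the target speaker. The phoneme labels, pitches, and durations are all invented for the example, and the output has exactly the flat, inflection-free quality the early systems were known for.

```python
import numpy as np

SR = 16000  # sample rate in Hz

def tone(freq, dur=0.12):
    # Stand-in for a recorded unit of the target speaker's voice.
    t = np.linspace(0, dur, int(SR * dur), endpoint=False)
    return 0.3 * np.sin(2 * np.pi * freq * t)

# A fake "voice model": one audio unit per phoneme. A real concatenative
# system stores many recorded units per phoneme and picks the best fit.
voice = {"HH": tone(200), "EH": tone(320), "L": tone(250), "OW": tone(280)}

def synthesize(phonemes):
    # Text would first be converted to phonemes; here we start from them.
    return np.concatenate([voice[p] for p in phonemes])

audio = synthesize(["HH", "EH", "L", "OW"])   # "hello," crudely
print(len(audio) / SR, "seconds of synthetic speech")
```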
Speaker 1: Now, early versions of this technology were understandably a bit limited. You would end up with speech that on a surface level sounded like the person in question, the synthesized person, but it would typically come across as flat, or it would use incorrect inflection to emphasize a point. So think of that kind of robotic sound you would get with early personal assistants. Right, like if you were using a GPS system, which, I realize, is a repetition, like "ATM machine." But let's say you're using a GPS and it has a voice associated with it. Older ones were very robotic, and they could also say things that were hilariously wrong. I'll never forget the time I was riding in a car and the GPS told us to turn right on "Oak Doctor" instead of Oak Drive. But over time the models improved and things started to sound a bit more natural. So those early ones, not so good. You would not mistake them for a real person; it would sound like a robot making an impersonation of that person. But models would grow in sophistication, and training sessions would include examples where the target's expressions would be associated with specific emotions, like anger or happiness or sadness. You can actually use a voice synthesizer yourself and train it, and as part of that, you're typically told to read out sentences with different emotional weight to them. So, using a bit of appropriate text, then maybe some metadata to indicate what emotion should be used to read out that text, it then becomes possible to craft vocal performances that were, and are, difficult to distinguish from the real thing. We're going to take a quick break to thank our sponsor, and then I'll be back to talk more about the history and impact of deep fake technology. Back to our history of video deep fakes. We left off at two thousand and one, and for nearly two decades computer scientists continued to work on systems that would push forward the capabilities of synthesized video content. By the time we get up to twenty seventeen, a pair of papers explained that the advancements in consumer computers had reached a point where it was actually possible to achieve synthesized video using off-the-shelf computer systems, and that would be a huge game changer. No longer would you need access to incredibly powerful systems with specialized software. Now you could potentially create or access an application on an off-the-shelf computer to do the same thing. So the tools to generate computer-synthesized video were now within the grasp of the average computer user. With cloud-based services that could augment these efforts, it became possible for a creative person to make videos that appear to show people doing and saying things that they never actually did. And again, there are multiple uses for such technology.
Speaker 1: Not all of them are sinister, but it doesn't take much imagination to come up with scenarios where things get grim. And indeed, many early uses of this tech, once it became accessible, were bad. One big one was using face-swapping technology to make it appear as though someone, famous or otherwise, was appearing in an adult video. And I think it goes without saying that this is a total violation of the victim. It robs them of agency, and they may end up suffering consequences despite not being remotely responsible for the content. So imagine facing judgment for something that not only you did not do, but you had no way of preventing. Honestly, it's impossible for me to communicate how devastating this can be. There are several accounts online written by people who have been the victim of this sort of activity, and they are worth your time. They are harrowing to read, but it is important; their words will far more effectively explain how traumatizing this experience can be. And just as a reminder, the rise of social networks means that we've all been sharing a lot of images of ourselves, videos of ourselves. There's a lot of content out there that could be used to train various machine models. So it's something to keep in mind: even if you aren't concerned right now, there's nothing to say that you couldn't become a victim tomorrow. Deep fakes also pose a risk to organizations, not just individuals. So imagine for a moment that you see you have a voicemail at work, and you pull it up and you listen to the voicemail, and it sounds like your boss, and your boss is telling you that you need to transfer company funds from the company account into a different one. And perhaps they say that it's in order for you to pay off some third-party vendor for a project that you're not really familiar with. But then maybe it turns out that voicemail wasn't from your boss after all. Maybe it was the result of spear phishing.
Speaker 1: Maybe a nefarious thief has identified you as a possible key to stealing money from your organization and has used tech to impersonate your boss and direct you toward facilitating a crime. You unknowingly have become an accomplice. There's actually been a case where this sort of thing was alleged to have happened. Now, I have to say alleged, because there were questions about whether or not it really was a case of a synthesized voice, or if maybe this was more of a straightforward embezzlement issue and the deep fake defense, aka the liar's dividend, came into play. Deep fakes have come a long way in a few short years. However, they are not perfect. There can be telltale signs that a video is fake, though they can sometimes be too subtle for the human eye to detect. Sometimes there's a dead giveaway. You're watching a video and you think, this person is blinking too frequently or not frequently enough, or maybe their eyes don't look quite right, or the movements you're seeing don't line up; a person is turning their head one way while their eyes are shifting another in a way that just doesn't seem natural. There are those sorts of things that people can pick up on, and there are some that are far more subtle, and deep fake detection tools are growing in importance as a result of this. There are tools that are trained to spot signs of fakery, sometimes ones that are far too subtle for us to notice. So it may be things like inconsistencies in lighting and the quality of reflections within the frame. Things like that may end up being an indication that a video was manufactured artificially rather than an actual recording, and they're becoming more and more important for people and for organizations.
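Here is a toy version of that blink cue. The eye aspect ratio, a measure from Soukupova and Cech's blink-detection work, compares eye height to eye width using six landmarks and dips sharply when the eye closes. The landmark track below is synthetic; a real checker would pull landmarks from a face tracker frame by frame and flag videos whose blink rate looks unnatural.

```python
import numpy as np

def eye_aspect_ratio(eye):
    # eye: six (x, y) landmarks around one eye, in outline order.
    v1 = np.linalg.norm(eye[1] - eye[5])  # vertical gap, first pair
    v2 = np.linalg.norm(eye[2] - eye[4])  # vertical gap, second pair
    h = np.linalg.norm(eye[0] - eye[3])   # horizontal eye width
    return (v1 + v2) / (2.0 * h)

def count_blinks(ear_series, closed=0.2):
    # A blink is the ratio dropping below the threshold, then recovering.
    below = ear_series < closed
    return int(np.sum(below[1:] & ~below[:-1]))

# Synthetic 10-second track at 30 fps: eyes open, with two brief blinks.
open_eye = np.array([[0, 0], [2, 1.2], [4, 1.2], [6, 0], [4, -1.2], [2, -1.2]], float)
closed_eye = open_eye * [1.0, 0.15]       # squash the eye shut vertically
frames = [closed_eye if 50 <= f < 55 or 200 <= f < 205 else open_eye
          for f in range(300)]
ears = np.array([eye_aspect_ratio(e) for e in frames])

blinks = count_blinks(ears)
print(f"{blinks} blinks in 10 s, about {blinks * 6} per minute")
# People blink roughly 15 to 20 times a minute; early deepfakes often blinked
# far less, because training photos rarely capture closed eyes.
```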
Speaker 1: So in addition to those tools, organization leaders should really prepare employees for the possibility of encountering deep fakes. Critical thinking is a big part of uncovering deception, as is preparation. Heck, depending on the organization, you might go so far as to set up a phrase or question as an authentication process at the top of an official phone call or video meeting, so that the person on the other end of the line can verify that things are legit. I know it sounds like you're going a bit far, but as this technology gets more sophisticated, as people deploy it in ways that are potentially harmful, you have to start to think about these things. What we do not want to do is to enter into an era where we can no longer reliably determine the real from the fake. But there is no putting the cat back in the bag, or the genie in the bottle, or Baby in the corner. The technology isn't going away. It will not disappear. It will continue to evolve and to improve, and so it falls upon us to educate ourselves as best we can in preparation for encountering it, and to think about how we can address the flagrant misuses of the technology to attempt to dissuade people from using it in that way. Because, again, the victimization element of this can be really severe and really traumatizing and incredibly disruptive to a person's life. We should not forget that either. So in conclusion, I will say that this technology is truly impressive, and again, it can have some really incredible uses. I don't want to paint it as just being a bad thing. It is not good or bad. It is how we use it that determines whether the end result is a positive one or a negative one. But only by learning about it can we prepare for what is to come. So I hope that you found this episode informative, that you have a deeper appreciation for what this technology does and what it is capable of, and I will speak to you again really soon. TechStuff is an iHeartRadio production. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.