WEBVTT - The Origin and Impact of Deepfake Technology 0:00:04.440 --> 0:00:12.240 Welcome to TechStuff, a production from iHeartRadio. Hey there 0:00:12.320 --> 0:00:16.040 and welcome to TechStuff. I'm your host, Jonathan Strickland. 0:00:16.120 --> 0:00:19.000 I'm an executive producer with iHeartRadio. And how the tech 0:00:19.040 --> 0:00:23.639 are you? In April twenty twenty three, the lawyers for 0:00:23.760 --> 0:00:28.560 Tesla CEO Elon Musk argued that submitted recordings of their 0:00:28.640 --> 0:00:33.279 client from twenty sixteen might have been deep fakes. So 0:00:33.360 --> 0:00:37.680 the ongoing case is an emotionally charged one. In twenty eighteen, 0:00:37.800 --> 0:00:41.800 a man named Walter Huang died in a car accident. 0:00:41.960 --> 0:00:45.000 The Tesla he was in was a Tesla Model X 0:00:45.440 --> 0:00:48.360 and it was engaged in Autopilot mode at the time 0:00:48.360 --> 0:00:52.360 of the crash. His family contends that the Tesla's safety 0:00:52.400 --> 0:00:56.760 systems failed and the vehicle steered itself into a concrete median, 0:00:57.440 --> 0:01:01.440 and the family's lawyer submitted a recording of Elon Musk 0:01:01.520 --> 0:01:04.800 as evidence that Huang was led to believe his vehicle 0:01:04.920 --> 0:01:09.520 had greater capabilities than it actually possessed. So, in this recording, 0:01:09.680 --> 0:01:14.240 Elon Musk said of Tesla vehicles, quote, a Model S 0:01:14.480 --> 0:01:18.120 and Model X at this point can drive autonomously with 0:01:18.240 --> 0:01:23.039 greater safety than a person right now, end quote. In response, 0:01:23.360 --> 0:01:27.280 Musk's lawyers said the recording could be faked. Now, not 0:01:27.400 --> 0:01:30.840 to waffle about this, but if we're speaking solely on 0:01:31.160 --> 0:01:35.759 a technical level, the recording could be faked. 
And by 0:01:35.760 --> 0:01:39.479 that I mean there are technologies that are sophisticated enough 0:01:39.520 --> 0:01:43.560 to create a fake recording. But just because something could 0:01:43.959 --> 0:01:47.640 be faked doesn't mean it actually was faked. And the 0:01:47.720 --> 0:01:52.760 judge in the Tesla case, Evette Pennypacker, who has an amazing name, 0:01:53.200 --> 0:01:57.080 said that this argument is a truly dangerous one. The 0:01:57.160 --> 0:02:00.920 judge said that it implies, quote, that because mister Musk 0:02:01.080 --> 0:02:03.840 is famous and might be more of a target for 0:02:03.960 --> 0:02:08.640 deep fakes, his public statements are immune, end quote. In other words, 0:02:09.000 --> 0:02:13.320 if you're notable enough or notorious enough, you have a 0:02:13.639 --> 0:02:17.800 carte blanche excuse for anything that you are recorded as saying, 0:02:17.960 --> 0:02:21.640 because maybe someone just targeted you and created a fake 0:02:21.760 --> 0:02:26.640 version to discredit you. In twenty eighteen, Danielle Citron and 0:02:26.800 --> 0:02:30.440 Robert Chesney wrote a paper in which they predicted this 0:02:30.560 --> 0:02:35.320 sort of situation. They dubbed it the liar's dividend: that 0:02:35.840 --> 0:02:39.560 when there is a proliferation of technology that can create 0:02:39.639 --> 0:02:45.480 misinformation or outright disinformation, the liars out there reap the benefits, 0:02:45.600 --> 0:02:48.600 because what is the truth anyway? When you can't trust 0:02:48.639 --> 0:02:53.359 the evidence, everything falls apart. This is just one of 0:02:53.400 --> 0:02:58.000 the many challenges deep fake technology presents. There are potentially 0:02:58.520 --> 0:03:03.360 harmless or perhaps even beneficial uses of this technology, but 0:03:03.560 --> 0:03:07.000 it doesn't take much imagination to come up with ways 0:03:07.000 --> 0:03:11.359 to cause harm. Let's talk a second about the entertainment industry. 
0:03:12.200 --> 0:03:16.000 With deep fake technology, it becomes possible to create videos 0:03:16.040 --> 0:03:21.000 and audio recordings that simulate celebrities, which potentially allows a 0:03:21.040 --> 0:03:24.359 director to cast a film with people who otherwise would 0:03:24.400 --> 0:03:29.320 be very much unavailable. Using sufficiently sophisticated deep fakes, you 0:03:29.400 --> 0:03:32.200 could create a movie that combines a cast of modern 0:03:32.280 --> 0:03:36.040 and classic film stars. Maybe you want the Marx Brothers 0:03:36.120 --> 0:03:39.560 running around with Will Ferrell. Maybe you want Lon Chaney 0:03:39.640 --> 0:03:43.040 Junior to show up in your modern werewolf movie. Or 0:03:43.360 --> 0:03:48.560 maybe you're doing something slightly less extreme. Maybe you're using 0:03:48.600 --> 0:03:51.840 the technology to generate a younger version of your current 0:03:51.920 --> 0:03:55.800 star, à la Harrison Ford in the upcoming Indiana Jones and 0:03:55.840 --> 0:03:58.920 the Dial of Destiny film. So it doesn't have to 0:03:58.960 --> 0:04:03.240 be more or sinister, but it does bring into question 0:04:03.400 --> 0:04:06.480 concepts like the right to personality or the right to 0:04:06.560 --> 0:04:11.600 identity or the right to publicity. Presumably filmmakers wouldn't want 0:04:11.640 --> 0:04:14.840 to move forward on any project with a computer generated 0:04:14.880 --> 0:04:18.680 simulation of a real film star without the permission from 0:04:18.839 --> 0:04:22.600 that person or their family. But it's possible to do it, 0:04:23.080 --> 0:04:26.080 and depending on the movie, maybe they do go ahead 0:04:26.240 --> 0:04:31.719 without securing permission first. Maybe it's an edgy parody film, 0:04:31.920 --> 0:04:34.640 and the buzz around their decision to do this could 0:04:34.760 --> 0:04:37.960 end up being a boost to marketing. 
People would say, 0:04:37.960 --> 0:04:40.359 how dare they do this, and then go buy tickets 0:04:40.400 --> 0:04:43.880 to see the fallout of it. For actors, there's a 0:04:43.920 --> 0:04:46.719 real concern that this technology could rob them of work, 0:04:47.040 --> 0:04:49.400 that if they turned down a role, the filmmaker could 0:04:49.440 --> 0:04:52.000 just get a computer generated version of them in there, 0:04:52.520 --> 0:04:55.479 or that they could appear to, you know, appear in 0:04:55.520 --> 0:04:59.120 projects that they don't actually agree with, and perhaps most 0:04:59.200 --> 0:05:02.920 importantly for many actors out there, that this could all 0:05:02.960 --> 0:05:07.400 happen without compensation for the original actor. I know that 0:05:07.480 --> 0:05:10.760 it could be tough to feel sympathetic toward big Hollywood stars, 0:05:10.760 --> 0:05:14.360 but keep in mind the vast majority of working actors 0:05:14.400 --> 0:05:17.760 out there are not raking in the huge movie deals. 0:05:18.200 --> 0:05:21.359 They're just as worried about AI biting into their work 0:05:21.640 --> 0:05:24.599 as the rest of us are. Then there's the world 0:05:24.839 --> 0:05:28.839 of audio performance. Earlier this year, a TikTok user with 0:05:28.920 --> 0:05:33.520 the handle Ghostwriter977 wrote and produced 0:05:33.520 --> 0:05:37.359 a song called Heart on My Sleeve. But Ghostwriter977 0:05:37.560 --> 0:05:41.240 didn't provide the vocals for this track. Instead, 0:05:41.520 --> 0:05:46.720 they used AI generated deep fake vocal simulations of artists 0:05:46.880 --> 0:05:51.400 Drake and The Weeknd. The songwriter then posted the release 0:05:51.440 --> 0:05:56.080 on multiple platforms and it quickly went viral. Universal Music 0:05:56.120 --> 0:05:59.919 Group sprang into action right away and claimed copyright infringement. 
0:06:00.600 --> 0:06:04.320 And I am no legal expert, but in my mind 0:06:04.720 --> 0:06:08.680 that's a weak argument. After all, the song itself was 0:06:08.800 --> 0:06:11.599 an original. It was not a cover. It had not 0:06:11.680 --> 0:06:17.480 been stolen from someone's discography. You cannot copyright the sound 0:06:17.560 --> 0:06:21.680 of a voice. Universal Music Group doesn't own the vocal 0:06:21.760 --> 0:06:25.400 quality of Drake or The Weeknd, and I'm sure those 0:06:25.480 --> 0:06:29.240 artists would be concerned to learn otherwise. And even if 0:06:29.279 --> 0:06:32.640 the agreement between the label and the artists did go 0:06:32.760 --> 0:06:36.719 all Ursula from The Little Mermaid and claim ownership of 0:06:36.760 --> 0:06:40.520 the voices themselves, there's not really a legal foundation to 0:06:40.640 --> 0:06:45.320 use that as a deterrent against deep fakes. Universal Music 0:06:45.320 --> 0:06:48.800 Group did argue that the deep fake voices used tons 0:06:48.880 --> 0:06:53.560 of recorded material to train itself to sound like those artists. 0:06:54.240 --> 0:06:57.640 That is most certainly the case. We'll dive into deep 0:06:57.720 --> 0:07:01.200 fake techniques a bit later in this episode, but it 0:07:01.279 --> 0:07:04.520 often boils down to machine learning and using a lot 0:07:04.560 --> 0:07:08.480 of training material to educate a model about what it 0:07:08.560 --> 0:07:11.080 is you want it to do. 
The more material you 0:07:11.120 --> 0:07:15.720 can submit in training, the better, and Universal Music Group 0:07:15.760 --> 0:07:20.280 said, quote, the training of generative AI using our artists' music, 0:07:20.480 --> 0:07:22.880 which represents both a breach of our agreements and a 0:07:22.960 --> 0:07:26.920 violation of copyright law, end quote, before going on to 0:07:26.960 --> 0:07:29.200 suggest that allowing Heart on My Sleeve to exist is 0:07:29.200 --> 0:07:31.960 akin to powering up Skynet so that the Terminators will 0:07:31.960 --> 0:07:36.520 become real. I'm exaggerating only a little bit, and again 0:07:37.320 --> 0:07:39.880 I am not a copyright expert, but it's hard for 0:07:39.920 --> 0:07:43.760 me to imagine how training an AI model on music 0:07:44.280 --> 0:07:47.880 is in itself a violation of copyright law. After all, 0:07:48.520 --> 0:07:54.080 every musician, every artist, heck, every person who has been 0:07:54.120 --> 0:07:58.000 around other people has been influenced by the work of 0:07:58.040 --> 0:08:03.560 other people. Sometimes you can actually hear the influences in music. 0:08:04.480 --> 0:08:06.720 You might hear an artist play and say, oh, that 0:08:06.800 --> 0:08:10.040 reminds me of Johnny Cash or something like that. The 0:08:10.080 --> 0:08:14.640 history of art is one in which succeeding generations iterate 0:08:15.000 --> 0:08:18.080 on the works of those who came before them. Sometimes 0:08:18.360 --> 0:08:22.520 they make drastic departures from the generations that came before them, 0:08:22.520 --> 0:08:26.560 but even that is in response to the influence of 0:08:26.640 --> 0:08:31.880 the earlier art. 
So, if you make the argument that 0:08:32.040 --> 0:08:36.080 training AI on specific works is wrong, how do you 0:08:36.200 --> 0:08:40.680 differentiate that from someone who gets their start playing song 0:08:40.760 --> 0:08:44.360 covers or maybe writing their own stuff, but with musical 0:08:44.400 --> 0:08:49.520 influences from identifiable artists? Because art is not created in 0:08:49.559 --> 0:08:53.800 a vacuum. Obviously, using AI is different. It can lead 0:08:53.800 --> 0:08:56.760 to the creation of a near perfect simulation of the 0:08:56.800 --> 0:09:02.000 original artist. But the method of training the AI isn't 0:09:02.040 --> 0:09:06.000 really that different from a budding musician voraciously devouring the 0:09:06.160 --> 0:09:10.400 entire discography of their favorite artists before emulating those artists 0:09:10.440 --> 0:09:14.440 in their own work. It is a sticky wicket, no 0:09:14.600 --> 0:09:17.840 question about it, and we're in the early stages of 0:09:17.880 --> 0:09:21.560 figuring out how to handle it, which is particularly unfortunate 0:09:21.880 --> 0:09:26.640 since the technology is already here. But how did we 0:09:26.760 --> 0:09:32.280 get here? Well, an exhaustive history of deep fake technology 0:09:32.280 --> 0:09:36.120 would require a full series of episodes about the history 0:09:36.120 --> 0:09:40.240 of artificial intelligence and machine learning in general and computer 0:09:40.440 --> 0:09:44.600 vision in particular, as well as text to speech and 0:09:44.679 --> 0:09:48.240 lots of other related technologies. But for our purposes, we'll 0:09:48.280 --> 0:09:53.000 simply acknowledge that countless computer scientists and programmers had spent 0:09:53.280 --> 0:09:57.040 endless hours advancing computer technology with the goal of finding 0:09:57.040 --> 0:10:02.040 ways to make machines quote unquote understand data. 
This 0:10:02.120 --> 0:10:05.559 is easier said than done, so let's take images as 0:10:05.600 --> 0:10:08.920 an example, as that will factor heavily in our discussion today. 0:10:09.280 --> 0:10:12.280 We humans can glance at a photo and we can 0:10:12.320 --> 0:10:16.840 immediately identify what is an object versus just a background. 0:10:16.960 --> 0:10:19.760 So if you have a red mug placed in front 0:10:19.800 --> 0:10:23.320 of a white cinder block wall, we can see what 0:10:23.520 --> 0:10:25.400 is a mug and what is a wall. But we 0:10:25.480 --> 0:10:28.720 have to teach computers how to do that, and when 0:10:28.720 --> 0:10:33.400 you're talking about technologies that generate moving images, it becomes 0:10:33.640 --> 0:10:39.120 even more complicated. So, for lack of a clear beginning, 0:10:39.640 --> 0:10:44.880 I am somewhat arbitrarily going to start in nineteen ninety seven. Now, 0:10:44.920 --> 0:10:47.760 a couple of things happened that year that would be 0:10:47.800 --> 0:10:51.520 important for us to talk about, and one was not 0:10:51.800 --> 0:10:56.199 quite deep fake technology, but it did illustrate some potential 0:10:57.240 --> 0:11:00.120 ethical issues we had to think about. And that was 0:11:00.000 --> 0:11:04.679 a commercial that aired during a big old American football game. 0:11:05.120 --> 0:11:08.720 You know, one that happens every year. You know, the 0:11:08.760 --> 0:11:11.600 one I can't, I can't call it by name for, 0:11:11.679 --> 0:11:15.959 you know, legal reasons. Anyway, one famous feature of this 0:11:16.120 --> 0:11:19.480 big old American football game is that brands will shell 0:11:19.520 --> 0:11:22.839 out huge amounts of money to air commercials during it. 0:11:23.200 --> 0:11:26.439 And one brand to do that in nineteen ninety seven 0:11:27.120 --> 0:11:31.120 was the Dirt Devil vacuum cleaner company. 
Now, those of 0:11:31.120 --> 0:11:33.120 you across the pond would call it a hoover, not 0:11:33.120 --> 0:11:35.720 a vacuum cleaner, but a Hoover is a different brand altogether, 0:11:35.760 --> 0:11:39.720 so stop confusing me. In the commercial, famous actor and 0:11:39.840 --> 0:11:44.400 dancer Fred Astaire is shown dancing with Dirt Devil vacuum cleaners. 0:11:44.880 --> 0:11:48.400 But here's the thing. Fred Astaire had died a decade earlier. 0:11:49.000 --> 0:11:52.760 The footage was taken from his films, with Dirt Devil 0:11:52.880 --> 0:11:56.280 inserting the imagery of its products into the footage to 0:11:56.320 --> 0:11:59.160 make it seem as if Astaire had actually shot commercials 0:11:59.200 --> 0:12:02.400 this way and really danced with vacuum cleaners. So in 0:12:02.440 --> 0:12:05.000 this case, the footage of Astaire was legitimate. It 0:12:05.120 --> 0:12:07.199 was the appearance of the vacuum cleaners that had been 0:12:07.240 --> 0:12:10.680 inserted into it. But the use of footage of performers 0:12:10.679 --> 0:12:14.360 who have passed away prompted a debate about the ethics 0:12:14.400 --> 0:12:18.320 of that practice, and people began to speculate about what 0:12:18.559 --> 0:12:21.600 might happen once technology reached a point where a computer 0:12:21.679 --> 0:12:27.239 simulation of a person would be indistinguishable from the real thing. Meanwhile, 0:12:27.400 --> 0:12:30.920 also in nineteen ninety seven, a group of computer scientists 0:12:31.000 --> 0:12:36.880 published an important work. The scientists were Christoph, or Chris, Bregler, 0:12:37.720 --> 0:12:43.040 Michele Covell, and Malcolm Slaney. The paper's title is Video 0:12:43.160 --> 0:12:47.960 Rewrite: Driving Visual Speech with Audio. This work built on 0:12:48.040 --> 0:12:51.600 top of a lot of other previous work. For example, 0:12:52.040 --> 0:12:56.480 face interpretation was already a discipline in computer science. 
It 0:12:56.559 --> 0:12:59.400 traces its history all the way back to the nineteen sixties. 0:13:00.040 --> 0:13:02.800 Ditto for technology that could generate speech from text. That 0:13:02.920 --> 0:13:06.760 too dates back to the nineteen sixties. Computer animation had 0:13:06.800 --> 0:13:09.720 been around for a while by nineteen ninety seven, so 0:13:09.920 --> 0:13:13.040 creating a three D model of lips, one that you 0:13:13.040 --> 0:13:17.520 could subsequently animate, that was also a thing already. But 0:13:17.600 --> 0:13:20.800 what these researchers did was they brought all these elements together. 0:13:21.360 --> 0:13:25.320 It was a convergence of technologies that resulted in a 0:13:25.400 --> 0:13:30.000 new application, one which would allow for computer generated synthetic 0:13:30.120 --> 0:13:35.559 video of real people. The team created the Video Rewrite software, 0:13:35.960 --> 0:13:38.360 and they also showed what it was capable of doing 0:13:38.400 --> 0:13:42.000 in some very very short video clips. The results are 0:13:42.040 --> 0:13:46.199 primitive by today's standards, but nonetheless impressive. In one two 0:13:46.320 --> 0:13:51.120 second clip, President JFK appears to say, I never met 0:13:51.160 --> 0:13:54.319 Forrest Gump. It's a cheeky reference to the nineteen ninety 0:13:54.320 --> 0:13:57.640 four film, which included a segment in which the titular 0:13:57.760 --> 0:14:02.360 character Forrest Gump appears to meet JFK and then informs 0:14:02.440 --> 0:14:05.080 him that he needs to rush off to the restroom. 0:14:05.720 --> 0:14:10.800 Video Rewrite served as a foundation for technologies that we 0:14:11.040 --> 0:14:14.559 could refer to as deep fake tech. So just a 0:14:14.600 --> 0:14:18.040 few years later, in two thousand and one, Christopher J. Taylor, 0:14:18.320 --> 0:14:22.520 Gareth J. Edwards, and Timothy (my middle initial is F 0:14:22.640 --> 0:14:25.080 and not J, which actually upsets Jonathan
because of a 0:14:25.120 --> 0:14:29.240 lack of consistency) Cootes published a paper that was titled 0:14:29.560 --> 0:14:33.480 Active Appearance Models. The abstract for this paper reads, in 0:14:33.560 --> 0:14:37.880 part, quote, we describe a new method of matching statistical 0:14:37.920 --> 0:14:43.640 models of appearance to images, end quote. Now, in plain English, 0:14:43.840 --> 0:14:47.240 this paper describes a method in which computer vision relies 0:14:47.320 --> 0:14:52.280 on statistical models to more accurately identify elements within the image. 0:14:52.440 --> 0:14:56.680 So let's consider facial recognition technology. As I mentioned earlier, 0:14:57.160 --> 0:15:01.280 computers do not inherently understand images. If presented with a 0:15:01.280 --> 0:15:04.920 picture of a face, a computer cannot naturally determine what 0:15:05.040 --> 0:15:08.600 the various features of that face are. Only through proper 0:15:08.640 --> 0:15:11.360 programming and machine learning can you start to do this 0:15:11.760 --> 0:15:15.720 and train a computer to recognize features like a nose, 0:15:16.240 --> 0:15:20.640 a mouth, eyebrows, eyes, et cetera. And by training machines 0:15:20.680 --> 0:15:23.760 on millions of faces, you can reach a point where 0:15:23.800 --> 0:15:26.400 the machine can examine a new face, one that has 0:15:26.520 --> 0:15:29.920 never before been submitted to the machine, and attempt to 0:15:30.000 --> 0:15:34.520 identify those features. This is a necessary step with a 0:15:34.560 --> 0:15:37.840 lot of deep fake technology. See, to call all deep 0:15:37.880 --> 0:15:42.440 fakes computer generated is a little misleading. Often what is 0:15:42.520 --> 0:15:46.920 happening is a computer is replacing an existing person or 0:15:47.040 --> 0:15:51.600 face in a video with someone else's features. 
In order 0:15:51.640 --> 0:15:53.520 to do that, you first have to be able to 0:15:53.680 --> 0:15:57.840 map and identify the original person that was in the video. 0:15:58.200 --> 0:16:01.920 You need to be able to match the synthesized face 0:16:02.360 --> 0:16:06.200 with the movements of the original face. To do that, 0:16:06.240 --> 0:16:09.920 the computer first has to encode the original face, essentially 0:16:10.240 --> 0:16:13.760 to break it down into lots of smaller shapes. Then 0:16:13.800 --> 0:16:16.600 it has to be able to match the synthesized face 0:16:17.000 --> 0:16:20.920 to the original one with a similar encoded approach, and 0:16:20.920 --> 0:16:24.640 then decode that into the synthesized face that replaces the 0:16:24.680 --> 0:16:28.080 original one and then follows the various motions of the 0:16:28.080 --> 0:16:32.880 original face. So you're replacing one person with another through 0:16:32.880 --> 0:16:35.000 the use of a computer, and as part of that, 0:16:35.040 --> 0:16:38.000 the computer has to break down the original person into 0:16:38.080 --> 0:16:41.280 points of data that the computer can handle. So with 0:16:41.400 --> 0:16:45.760 this technology, I could stand facing a camera and deliver 0:16:45.840 --> 0:16:48.800 a speech and then, using software designed to follow the 0:16:48.800 --> 0:16:52.880 steps I just laid out, replace my image with that 0:16:53.000 --> 0:16:56.080 of someone else. If I also used a program designed 0:16:56.120 --> 0:17:00.480 to create a vocal impersonation of that someone else, well, 0:17:00.560 --> 0:17:03.320 I could create a video where some celebrity says things 0:17:03.320 --> 0:17:06.920 that they would never say. Like maybe I could create 0:17:06.960 --> 0:17:10.040 a video of Keanu Reeves saying TechStuff is my 0:17:10.160 --> 0:17:13.720 favorite podcast. Jonathan is such a cool host. I wish 0:17:13.840 --> 0:17:17.240 I could hang out with him. 
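The encode, match, and decode steps described above can be illustrated with a toy sketch in Python. This is not how real deep fake software works (such tools generally rely on trained machine learning models); it only mirrors the idea of reducing the original face to a few numbers and rebuilding a different face from them. Every name here (`Pose`, `encode`, `decode`, `swap`) is hypothetical, and a "face" is assumed to be just a list of 2D landmark points.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Face = List[Point]  # assumption: a face is a list of 2D landmark points


@dataclass
class Pose:
    """Hypothetical encoding: where the face is in the frame and how big it is."""
    cx: float
    cy: float
    scale: float


def centroid(points: Face) -> Point:
    """Average position of all landmark points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def spread(points: Face) -> float:
    """Root-mean-square distance of the points from their centroid."""
    cx, cy = centroid(points)
    total = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return (total / len(points)) ** 0.5


def encode(original: Face) -> Pose:
    """Break the original face down into a few numbers (position and size)."""
    cx, cy = centroid(original)
    return Pose(cx, cy, spread(original))


def decode(pose: Pose, replacement: Face) -> Face:
    """Rebuild the replacement face at the original face's position and size.

    Assumes the replacement's points are not all identical (nonzero spread).
    """
    rx, ry = centroid(replacement)
    k = pose.scale / spread(replacement)
    return [(pose.cx + k * (x - rx), pose.cy + k * (y - ry)) for x, y in replacement]


def swap(frames: List[Face], replacement: Face) -> List[Face]:
    """Make the replacement face follow the original face frame by frame."""
    return [decode(encode(frame), replacement) for frame in frames]
```

Calling `swap(frames, replacement)` returns one repositioned, rescaled copy of the replacement landmarks per frame. A real system would encode far more than position and size (expression, lighting, head rotation), which is where the machine learning comes in.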
For the record, mister Reeves, 0:17:17.320 --> 0:17:20.080 I would never actually do that. I'm just saying I 0:17:20.200 --> 0:17:24.639 could do it. Of course, creating a video image of 0:17:24.720 --> 0:17:27.640 Keanu Reeves would just be one part of the equation. 0:17:27.920 --> 0:17:31.120 Another would be replicating his voice. Now, I could try 0:17:31.160 --> 0:17:34.600 and do my own impersonation, but this would so clearly 0:17:34.640 --> 0:17:37.399 be fake that I would never achieve my goal of 0:17:37.440 --> 0:17:39.600 trying to make it appear as though Keanu Reeves knows 0:17:39.600 --> 0:17:41.240 who I am and wants to hang out with me. 0:17:41.920 --> 0:17:44.600 I can't even say whoa the way he does. To 0:17:44.720 --> 0:17:48.800 achieve my dreams, I would need a voice synthesis program 0:17:48.960 --> 0:17:52.600 that I could train on Keanu's voice and then produce 0:17:52.640 --> 0:17:57.959 a computer generated impersonation. The history of voice synthesis is 0:17:58.200 --> 0:18:01.080 crazy long. I mean, if we really, we really wanted 0:18:01.119 --> 0:18:02.840 to dive into it, we could go all the way 0:18:02.840 --> 0:18:07.560 back to the late seventeen hundreds. But we won't because 0:18:07.560 --> 0:18:11.119 I can't keep you here that long. Text to speech 0:18:11.200 --> 0:18:14.560 technologies bring us a bit closer to modern day, but 0:18:14.960 --> 0:18:17.840 then we're still talking about the nineteen sixties or thereabouts, 0:18:17.840 --> 0:18:20.600 as I mentioned earlier in this episode. To get to 0:18:20.640 --> 0:18:23.720 a point where computers are capable of producing an imitation 0:18:23.840 --> 0:18:27.480 of a specific person's voice, then we're getting up to 0:18:27.560 --> 0:18:31.199 like the last decade or so. Researchers built tools that, 0:18:31.320 --> 0:18:36.440 after training on how a specific person produces different sounds, phonemes. 
0:18:36.880 --> 0:18:38.520 If we want to think of it in terms of 0:18:38.600 --> 0:18:41.680 language and the sounds of language, well, then we have 0:18:42.080 --> 0:18:46.040 applications that can take text, interpret that text as a 0:18:46.119 --> 0:18:49.520 series of sounds, pull upon the computer knowledge of how 0:18:49.600 --> 0:18:55.200 a particular person makes those specific sounds, and then voila, 0:18:55.640 --> 0:18:58.679 we have ourselves a copy. Now, early versions of this 0:18:58.760 --> 0:19:02.960 technology were understandably a bit limited. You would end up 0:19:03.040 --> 0:19:06.719 with speech that on a surface level sounded like the 0:19:06.760 --> 0:19:10.720 person in question, the synthesized person, but it would typically 0:19:10.720 --> 0:19:15.080 come across as flat or using incorrect inflection to emphasize 0:19:15.119 --> 0:19:17.960 a point. So think of that kind of robotic sound 0:19:18.040 --> 0:19:21.320 you would get with early personal assistants, right, like if 0:19:21.359 --> 0:19:25.280 you were using a GPS system, which I realize I 0:19:25.440 --> 0:19:29.120 just used a repetition there, like ATM machine. But let's 0:19:29.119 --> 0:19:33.040 say you're using a GPS and it has a voice 0:19:33.080 --> 0:19:37.959 associated with it. Older ones were very robotic, and they 0:19:37.960 --> 0:19:40.920 could also say things that were hilariously wrong. I'll never 0:19:40.960 --> 0:19:43.640 forget the time I was riding in a car and 0:19:43.720 --> 0:19:47.440 the GPS told us to turn right on Oak Doctor 0:19:47.960 --> 0:19:52.879 instead of Oak Drive. But over time the models improved 0:19:53.000 --> 0:19:55.800 and things started to sound a bit more natural. So 0:19:57.040 --> 0:20:01.199 those early ones, not so good. You would not mistake them for 0:20:01.280 --> 0:20:03.679 being a real person. 
It would sound like a robot 0:20:03.880 --> 0:20:06.720 making an impersonation of that person. But models would 0:20:06.720 --> 0:20:11.160 grow in sophistication, and training sessions would include examples where 0:20:11.359 --> 0:20:15.400 the target's expressions would be associated with specific emotions like anger, 0:20:15.640 --> 0:20:20.240 or happiness or sadness. You can actually use a voice 0:20:20.280 --> 0:20:24.520 synthesizer yourself and train it. And as part of that, 0:20:24.600 --> 0:20:28.480 you're typically told to read out sentences with different emotional 0:20:28.520 --> 0:20:31.679 weight to them. So using a bit of appropriate text, 0:20:32.320 --> 0:20:35.199 then maybe some metadata to indicate what emotion should be 0:20:35.320 --> 0:20:38.680 used to read out that text, it then becomes possible 0:20:38.720 --> 0:20:43.439 to craft vocal performances that were and are difficult to 0:20:43.520 --> 0:20:47.119 distinguish from the real thing. We're going to take a 0:20:47.200 --> 0:20:49.800 quick break to thank our sponsor, and then I'll be 0:20:49.880 --> 0:20:53.439 back to talk more about the history and impact of 0:20:53.520 --> 0:21:08.080 deep fake technology. Back to our history of video deep fakes. 0:21:08.160 --> 0:21:11.199 We left off at two thousand and one, and for 0:21:11.280 --> 0:21:15.520 nearly two decades computer scientists continued to work on systems 0:21:15.920 --> 0:21:20.600 that would push forward the capabilities of synthesized video content. 0:21:21.280 --> 0:21:23.800 By the time we get up to twenty seventeen, a 0:21:23.920 --> 0:21:28.320 pair of papers explained that the advancements in consumer computers 0:21:28.640 --> 0:21:31.280 had reached a point where it was actually possible to 0:21:31.400 --> 0:21:36.159 achieve synthesized video using off the shelf computer systems, and 0:21:36.240 --> 0:21:39.320 that would be a huge game changer. 
No longer would 0:21:39.359 --> 0:21:45.280 you need access to incredibly powerful systems with specialized software. 0:21:45.720 --> 0:21:50.320 Now you could potentially create or access an application on 0:21:50.400 --> 0:21:52.560 an off the shelf computer to do the same thing. 0:21:53.240 --> 0:21:56.960 So the tools to generate computer synthesized video now were 0:21:57.040 --> 0:22:00.720 within the grasp of the average computer user. With cloud 0:22:00.720 --> 0:22:04.120 based services that could augment these efforts, it became possible 0:22:04.160 --> 0:22:06.480 for a creative person to make videos that appear to 0:22:06.520 --> 0:22:10.480 show people doing and saying things that they never actually did. 0:22:11.119 --> 0:22:14.800 And again, there are multiple uses for such technology. Not 0:22:14.960 --> 0:22:18.600 all of them are sinister, but it doesn't take much 0:22:18.640 --> 0:22:22.320 imagination to come up with scenarios where things get grim. 0:22:22.560 --> 0:22:25.879 And indeed, many early uses of this tech, once it 0:22:25.880 --> 0:22:30.320 became accessible, were bad. One big one was using face 0:22:30.400 --> 0:22:34.520 swapping technology to make it appear as though someone, famous 0:22:34.680 --> 0:22:39.520 or otherwise, was appearing in an adult video. And I 0:22:39.520 --> 0:22:41.920 think it goes without saying that this is a total 0:22:42.080 --> 0:22:46.159 violation of the victim. It robs them of agency and 0:22:46.320 --> 0:22:50.080 they may end up suffering consequences despite not being remotely 0:22:50.160 --> 0:22:55.480 responsible for the content. So imagine facing judgment for something 0:22:55.520 --> 0:22:58.680 that not only you did not do, but you had 0:22:58.720 --> 0:23:03.240 no way of preventing. Honestly, it's impossible for me to 0:23:03.280 --> 0:23:07.560 communicate how devastating this can be. 
There are several accounts 0:23:07.600 --> 0:23:10.600 online written by people who have been the victim of 0:23:10.640 --> 0:23:13.520 this sort of activity, and they are worth your time. 0:23:13.880 --> 0:23:17.240 They are harrowing to read, but it is important. Their 0:23:17.280 --> 0:23:20.800 words will far more effectively explain how traumatizing this experience 0:23:20.840 --> 0:23:24.879 can be. And just as a reminder, the rise of 0:23:24.960 --> 0:23:28.840 social networks means that we've all been sharing a lot 0:23:28.960 --> 0:23:32.560 of images of ourselves, videos of ourselves. There's a lot 0:23:32.560 --> 0:23:35.879 of content out there that could be used to train 0:23:36.440 --> 0:23:40.399 various machine models. So it's something to keep in mind 0:23:40.800 --> 0:23:44.280 that even if you aren't concerned right now, there's nothing 0:23:44.320 --> 0:23:48.160 to say that you couldn't become a victim tomorrow. Deep 0:23:48.200 --> 0:23:52.800 fakes also pose a risk to organizations. It's not just individuals. 0:23:53.240 --> 0:23:56.119 So imagine for a moment that you see you have 0:23:56.160 --> 0:23:58.919 a voicemail at work, and you pull it up, and 0:23:58.960 --> 0:24:01.520 you listen to the voicemail, and it sounds like your boss, 0:24:01.960 --> 0:24:03.800 and your boss is telling you that you need to 0:24:03.840 --> 0:24:09.000 transfer company funds from the company account into a different one. 0:24:09.440 --> 0:24:11.879 And perhaps they say that it's in order for you 0:24:11.920 --> 0:24:14.760 to pay off some third party vendor for a project 0:24:14.760 --> 0:24:18.320 that you're not really familiar with. But then maybe it 0:24:18.359 --> 0:24:21.160 turns out that voicemail wasn't from your boss after all. 0:24:21.720 --> 0:24:25.679 Maybe it was the result of spear phishing. 
Maybe a nefarious 0:24:25.680 --> 0:24:29.840 thief has identified you as a possible key to stealing 0:24:29.920 --> 0:24:34.440 money from your organization and has used tech to impersonate 0:24:34.480 --> 0:24:39.119 your boss and direct you toward facilitating a crime. You 0:24:39.240 --> 0:24:43.560 unknowingly have become an accomplice. There's actually been a case 0:24:43.640 --> 0:24:46.400 where this sort of thing was alleged to have happened. 0:24:46.440 --> 0:24:49.960 Now I have to say alleged, because there were questions 0:24:50.000 --> 0:24:53.159 about whether or not it really was a case of 0:24:53.280 --> 0:24:57.560 a synthesized voice, or if maybe this was more of 0:24:57.600 --> 0:25:02.240 a straightforward embezzlement issue, and the deep fake defense, aka 0:25:02.400 --> 0:25:06.760 the liar's dividend, came into play. Deep fakes have come 0:25:07.520 --> 0:25:11.840 a long way in a few short years. However, they 0:25:11.840 --> 0:25:15.720 are not perfect. There can be telltale signs that a 0:25:15.880 --> 0:25:20.120 video is fake, though they can sometimes be too subtle 0:25:20.240 --> 0:25:23.800 for the human eye to detect. Sometimes there's a dead giveaway. 0:25:24.400 --> 0:25:26.560 You're watching a video and you think this person is 0:25:26.560 --> 0:25:31.560 blinking too frequently or not frequently enough, or maybe their 0:25:31.560 --> 0:25:36.719 eyes don't look quite right, or the movements you're seeing 0:25:36.960 --> 0:25:39.280 don't line up, that a person is turning their head 0:25:39.320 --> 0:25:41.760 one way, their eyes are shifting another in a way 0:25:41.760 --> 0:25:45.000 that just doesn't seem natural. There are those sorts of 0:25:45.040 --> 0:25:47.000 things that people can pick up on. There's some that 0:25:47.040 --> 0:25:50.280 are far more subtle, and deep fake detection tools are 0:25:50.359 --> 0:25:52.920 growing in importance as a result of this. 
There are 0:25:52.960 --> 0:25:57.639 tools that are trained to spot signs of fakery, sometimes 0:25:57.760 --> 0:26:00.240 ones that are far too subtle for us to notice. 0:26:00.400 --> 0:26:04.560 So it may be things like inconsistencies in lighting and 0:26:04.640 --> 0:26:08.960 the quality of reflections within the frame. Things like that 0:26:09.720 --> 0:26:12.720 may end up being an indication that a video was 0:26:12.800 --> 0:26:19.399 manufactured artificially rather than an actual recording, and they're becoming 0:26:19.440 --> 0:26:23.879