WEBVTT - The Origin and Impact of Deepfake Technology

0:00:04.440 --> 0:00:12.240
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio. Hey there

0:00:12.320 --> 0:00:16.040
<v Speaker 1>and welcome to TechStuff. I'm your host, Jonathan Strickland.

0:00:16.120 --> 0:00:19.000
<v Speaker 1>I'm an executive producer with iHeartRadio. And how the tech

0:00:19.040 --> 0:00:23.639
<v Speaker 1>are you? In April twenty twenty three, the lawyers for

0:00:23.760 --> 0:00:28.560
<v Speaker 1>Tesla CEO Elon Musk argued that submitted recordings of their

0:00:28.640 --> 0:00:33.279
<v Speaker 1>client from twenty sixteen might have been deep fakes. So

0:00:33.360 --> 0:00:37.680
<v Speaker 1>the ongoing case is an emotionally charged one. In twenty eighteen,

0:00:37.800 --> 0:00:41.800
<v Speaker 1>a man named Walter Huang died in a car accident.

0:00:41.960 --> 0:00:45.000
<v Speaker 1>The Tesla he was in was a Tesla Model X

0:00:45.440 --> 0:00:48.360
<v Speaker 1>and it was engaged in autopilot mode at the time

0:00:48.360 --> 0:00:52.360
<v Speaker 1>of the crash. His family contends that the Tesla's safety

0:00:52.400 --> 0:00:56.760
<v Speaker 1>systems failed and the vehicle steered itself into a concrete median,

0:00:57.440 --> 0:01:01.440
<v Speaker 1>and the family's lawyer submitted a recording of Elon Musk

0:01:01.520 --> 0:01:04.800
<v Speaker 1>as evidence that Huang was led to believe his vehicle

0:01:04.920 --> 0:01:09.520
<v Speaker 1>had greater capabilities than it actually possessed. So, in this recording,

0:01:09.680 --> 0:01:14.240
<v Speaker 1>Elon Musk said of Tesla vehicles, quote a Model S

0:01:14.480 --> 0:01:18.120
<v Speaker 1>and Model X at this point can drive autonomously with

0:01:18.240 --> 0:01:23.039
<v Speaker 1>greater safety than a person right now, end quote. In response,

0:01:23.360 --> 0:01:27.280
<v Speaker 1>Musk's lawyers said the recording could be faked. Now, not

0:01:27.400 --> 0:01:30.840
<v Speaker 1>to waffle about this, but if we're speaking solely on

0:01:31.160 --> 0:01:35.759
<v Speaker 1>a technical level, the recording could be faked. And by

0:01:35.760 --> 0:01:39.479
<v Speaker 1>that I mean there are technologies that are sophisticated enough

0:01:39.520 --> 0:01:43.560
<v Speaker 1>to create a fake recording. But just because something could

0:01:43.959 --> 0:01:47.640
<v Speaker 1>be faked doesn't mean it actually was faked. And the

0:01:47.720 --> 0:01:52.760
<v Speaker 1>judge in the Tesla case, Evette Pennypacker, who has an amazing name,

0:01:53.200 --> 0:01:57.080
<v Speaker 1>said that this argument is a truly dangerous one. The

0:01:57.160 --> 0:02:00.920
<v Speaker 1>judge said that it implies quote that because mister Musk

0:02:01.080 --> 0:02:03.840
<v Speaker 1>is famous and might be more of a target for

0:02:03.960 --> 0:02:08.640
<v Speaker 1>deep fakes, his public statements are immune, end quote. In other words,

0:02:09.000 --> 0:02:13.320
<v Speaker 1>if you're notable enough or notorious enough, you have a

0:02:13.639 --> 0:02:17.800
<v Speaker 1>carte blanche excuse for anything that you are recorded as saying,

0:02:17.960 --> 0:02:21.640
<v Speaker 1>because maybe someone just targeted you and created a fake

0:02:21.760 --> 0:02:26.640
<v Speaker 1>version to discredit you. In twenty eighteen, Danielle Citron and

0:02:26.800 --> 0:02:30.440
<v Speaker 1>Robert Chesney wrote a paper in which they predicted this

0:02:30.560 --> 0:02:35.320
<v Speaker 1>sort of situation. They dubbed it the liar's dividend. That

0:02:35.840 --> 0:02:39.560
<v Speaker 1>when there is a proliferation of technology that can create

0:02:39.639 --> 0:02:45.480
<v Speaker 1>misinformation or outright disinformation, the liars out there reap the benefits,

0:02:45.600 --> 0:02:48.600
<v Speaker 1>because what is the truth anyway? When you can't trust

0:02:48.639 --> 0:02:53.359
<v Speaker 1>the evidence, everything falls apart. This is just one of

0:02:53.400 --> 0:02:58.000
<v Speaker 1>the many challenges deep fake technology presents. There are potentially

0:02:58.520 --> 0:03:03.360
<v Speaker 1>harmless or perhaps even beneficial uses of this technology, but

0:03:03.560 --> 0:03:07.000
<v Speaker 1>it doesn't take much imagination to come up with ways

0:03:07.000 --> 0:03:11.359
<v Speaker 1>to cause harm. Let's talk a second about the entertainment industry.

0:03:12.200 --> 0:03:16.000
<v Speaker 1>With deep fake technology, it becomes possible to create videos

0:03:16.040 --> 0:03:21.000
<v Speaker 1>and audio recordings that simulate celebrities, which potentially allows a

0:03:21.040 --> 0:03:24.359
<v Speaker 1>director to cast a film with people who otherwise would

0:03:24.400 --> 0:03:29.320
<v Speaker 1>be very much unavailable. Using sufficiently sophisticated deep fakes, you

0:03:29.400 --> 0:03:32.200
<v Speaker 1>could create a movie that combines a cast of modern

0:03:32.280 --> 0:03:36.040
<v Speaker 1>and classic film stars. Maybe you want the Marx Brothers

0:03:36.120 --> 0:03:39.560
<v Speaker 1>running around with Will Ferrell. Maybe you want Lon Chaney

0:03:39.640 --> 0:03:43.040
<v Speaker 1>Junior to show up in your modern werewolf movie. Or

0:03:43.360 --> 0:03:48.560
<v Speaker 1>maybe you're doing something slightly less extreme, maybe you're using

0:03:48.600 --> 0:03:51.840
<v Speaker 1>the technology to generate a younger version of your current

0:03:51.920 --> 0:03:55.800
<v Speaker 1>star, à la Harrison Ford in the upcoming Indiana Jones and

0:03:55.840 --> 0:03:58.920
<v Speaker 1>the Dial of Destiny film. So it doesn't have to

0:03:58.960 --> 0:04:03.240
<v Speaker 1>be morbid or sinister, but it does bring into question

0:04:03.400 --> 0:04:06.480
<v Speaker 1>concepts like the right to personality or the right to

0:04:06.560 --> 0:04:11.600
<v Speaker 1>identity or the right to publicity. Presumably filmmakers wouldn't want

0:04:11.640 --> 0:04:14.840
<v Speaker 1>to move forward on any project with a computer generated

0:04:14.880 --> 0:04:18.680
<v Speaker 1>simulation of a real film star without the permission from

0:04:18.839 --> 0:04:22.600
<v Speaker 1>that person or their family. But it's possible to do it,

0:04:23.080 --> 0:04:26.080
<v Speaker 1>and depending on the movie, maybe they do go ahead

0:04:26.240 --> 0:04:31.719
<v Speaker 1>without securing permission first. Maybe it's an edgy parody film,

0:04:31.920 --> 0:04:34.640
<v Speaker 1>and the buzz around their decision to do this could

0:04:34.760 --> 0:04:37.960
<v Speaker 1>end up being a boost to marketing. People would say

0:04:37.960 --> 0:04:40.359
<v Speaker 1>how dare they do this, and then go buy tickets

0:04:40.400 --> 0:04:43.880
<v Speaker 1>to see the fallout of it. For actors, there's a

0:04:43.920 --> 0:04:46.719
<v Speaker 1>real concern that this technology could rob them of work,

0:04:47.040 --> 0:04:49.400
<v Speaker 1>that if they turned down a role, the filmmaker could

0:04:49.440 --> 0:04:52.000
<v Speaker 1>just get a computer generated version of them in there,

0:04:52.520 --> 0:04:55.479
<v Speaker 1>or that they could appear to you know, appear in

0:04:55.520 --> 0:04:59.120
<v Speaker 1>projects that they don't actually agree with, and perhaps most

0:04:59.200 --> 0:05:02.920
<v Speaker 1>importantly for many actors out there, that this could all

0:05:02.960 --> 0:05:07.400
<v Speaker 1>happen without compensation for the original actor. I know that

0:05:07.480 --> 0:05:10.760
<v Speaker 1>it could be tough to feel sympathetic toward big Hollywood stars,

0:05:10.760 --> 0:05:14.360
<v Speaker 1>but keep in mind the vast majority of working actors

0:05:14.400 --> 0:05:17.760
<v Speaker 1>out there are not raking in the huge movie deals.

0:05:18.200 --> 0:05:21.359
<v Speaker 1>They're just as worried about AI biting into their work

0:05:21.640 --> 0:05:24.599
<v Speaker 1>as the rest of us are. Then there's the world

0:05:24.839 --> 0:05:28.839
<v Speaker 1>of audio performance. Earlier this year, a TikTok user with

0:05:28.920 --> 0:05:33.520
<v Speaker 1>the handle Ghostwriter nine seven seven wrote and produced

0:05:33.520 --> 0:05:37.359
<v Speaker 1>a song called Heart on My Sleeve. But Ghostwriter

0:05:37.560 --> 0:05:41.240
<v Speaker 1>nine seven seven didn't provide the vocals for this track. Instead,

0:05:41.520 --> 0:05:46.720
<v Speaker 1>they used AI generated deep fake vocal simulations of artists

0:05:46.880 --> 0:05:51.400
<v Speaker 1>Drake and The Weeknd. The songwriter then posted the release

0:05:51.440 --> 0:05:56.080
<v Speaker 1>on multiple platforms and it quickly went viral. Universal Music

0:05:56.120 --> 0:05:59.919
<v Speaker 1>Group sprang into action right away and claimed copyright infringement.

0:06:00.600 --> 0:06:04.320
<v Speaker 1>And I am no legal expert, but in my mind

0:06:04.720 --> 0:06:08.680
<v Speaker 1>that's a weak argument. After all, the song itself was

0:06:08.800 --> 0:06:11.599
<v Speaker 1>an original, it was not a cover. It had not

0:06:11.680 --> 0:06:17.480
<v Speaker 1>been stolen from someone's discography. You cannot copyright the sound

0:06:17.560 --> 0:06:21.680
<v Speaker 1>of a voice. Universal Music Group doesn't own the vocal

0:06:21.760 --> 0:06:25.400
<v Speaker 1>quality of Drake or The Weeknd, and I'm sure those

0:06:25.480 --> 0:06:29.240
<v Speaker 1>artists would be concerned to learn otherwise. And even if

0:06:29.279 --> 0:06:32.640
<v Speaker 1>the agreement between the label and the artists did go

0:06:32.760 --> 0:06:36.719
<v Speaker 1>all Ursula from the Little Mermaid and claim ownership of

0:06:36.760 --> 0:06:40.520
<v Speaker 1>the voices themselves, there's not really a legal foundation to

0:06:40.640 --> 0:06:45.320
<v Speaker 1>use that as a deterrent against deep fakes. Universal Music

0:06:45.320 --> 0:06:48.800
<v Speaker 1>Group did argue that the deep fake voices used tons

0:06:48.880 --> 0:06:53.560
<v Speaker 1>of recorded material to train itself to sound like those artists.

0:06:54.240 --> 0:06:57.640
<v Speaker 1>That is most certainly the case. We'll dive into deep

0:06:57.720 --> 0:07:01.200
<v Speaker 1>fake techniques a bit later in this episode, but it

0:07:01.279 --> 0:07:04.520
<v Speaker 1>often boils down to machine learning and using a lot

0:07:04.560 --> 0:07:08.480
<v Speaker 1>of training material to educate a model about what it

0:07:08.560 --> 0:07:11.080
<v Speaker 1>is you want it to do. The more material you

0:07:11.120 --> 0:07:15.720
<v Speaker 1>can submit in training, the better, and Universal Music Group

0:07:15.760 --> 0:07:20.280
<v Speaker 1>said, quote, the training of generative AI using our artists' music,

0:07:20.480 --> 0:07:22.880
<v Speaker 1>which represents both a breach of our agreements and a

0:07:22.960 --> 0:07:26.920
<v Speaker 1>violation of copyright law end quote, before going on to

0:07:26.960 --> 0:07:29.200
<v Speaker 1>suggest that allowing Heart on My Sleeve to exist is

0:07:29.200 --> 0:07:31.960
<v Speaker 1>akin to powering up Skynet so that the Terminators will

0:07:31.960 --> 0:07:36.520
<v Speaker 1>become real. I'm exaggerating only a little bit, and again

0:07:37.320 --> 0:07:39.880
<v Speaker 1>I am not a copyright expert, but it's hard for

0:07:39.920 --> 0:07:43.760
<v Speaker 1>me to imagine how training an AI model on music

0:07:44.280 --> 0:07:47.880
<v Speaker 1>is in itself a violation of copyright law. After all,

0:07:48.520 --> 0:07:54.080
<v Speaker 1>every musician, every artist, heck, every person who has been

0:07:54.120 --> 0:07:58.000
<v Speaker 1>around other people has been influenced by the work of

0:07:58.040 --> 0:08:03.560
<v Speaker 1>other people. Sometimes you can actually hear the influences in music.

0:08:04.480 --> 0:08:06.720
<v Speaker 1>You might hear an artist play and say, oh, that

0:08:06.800 --> 0:08:10.040
<v Speaker 1>reminds me of Johnny Cash or something like that. The

0:08:10.080 --> 0:08:14.640
<v Speaker 1>history of art is one in which succeeding generations iterate

0:08:15.000 --> 0:08:18.080
<v Speaker 1>on the works of those who came before them. Sometimes

0:08:18.360 --> 0:08:22.520
<v Speaker 1>they make drastic departures from the generations that came before them,

0:08:22.520 --> 0:08:26.560
<v Speaker 1>but even that is in response to the influence of

0:08:26.640 --> 0:08:31.880
<v Speaker 1>the earlier art. So, if you make the argument that

0:08:32.040 --> 0:08:36.080
<v Speaker 1>training AI on specific works is wrong, how do you

0:08:36.200 --> 0:08:40.680
<v Speaker 1>differentiate that from someone who gets their start playing song

0:08:40.760 --> 0:08:44.360
<v Speaker 1>covers or maybe writing their own stuff, but with musical

0:08:44.400 --> 0:08:49.520
<v Speaker 1>influences from identifiable artists? Because art is not created in

0:08:49.559 --> 0:08:53.800
<v Speaker 1>a vacuum. Obviously, using AI is different. It can lead

0:08:53.800 --> 0:08:56.760
<v Speaker 1>to the creation of a near perfect simulation of the

0:08:56.800 --> 0:09:02.000
<v Speaker 1>original artist. But the method of training the AI isn't

0:09:02.040 --> 0:09:06.000
<v Speaker 1>really that different from a budding musician voraciously devouring the

0:09:06.160 --> 0:09:10.400
<v Speaker 1>entire discography of their favorite artists before emulating those artists

0:09:10.440 --> 0:09:14.440
<v Speaker 1>in their own work. It is a sticky wicket, no

0:09:14.600 --> 0:09:17.840
<v Speaker 1>question about it, and we're in the early stages of

0:09:17.880 --> 0:09:21.560
<v Speaker 1>figuring out how to handle it, which is particularly unfortunate

0:09:21.880 --> 0:09:26.640
<v Speaker 1>since the technology is already here. But how did we

0:09:26.760 --> 0:09:32.280
<v Speaker 1>get here? Well, an exhaustive history of deep fake technology

0:09:32.280 --> 0:09:36.120
<v Speaker 1>would require a full series of episodes about the history

0:09:36.120 --> 0:09:40.240
<v Speaker 1>of artificial intelligence and machine learning in general and computer

0:09:40.440 --> 0:09:44.600
<v Speaker 1>vision in particular, as well as text to speech and

0:09:44.679 --> 0:09:48.240
<v Speaker 1>lots of other related technologies. But for our purposes, we'll

0:09:48.280 --> 0:09:53.000
<v Speaker 1>simply acknowledge that countless computer scientists and programmers had spent

0:09:53.280 --> 0:09:57.040
<v Speaker 1>endless hours advancing computer technology with the goal of finding

0:09:57.040 --> 0:10:02.040
<v Speaker 1>ways to make machines quote unquote understand data. This

0:10:02.120 --> 0:10:05.559
<v Speaker 1>is easier said than done, so let's take images as

0:10:05.600 --> 0:10:08.920
<v Speaker 1>an example, as that will factor heavily in our discussion today.

0:10:09.280 --> 0:10:12.280
<v Speaker 1>We humans can glance at a photo and we can

0:10:12.320 --> 0:10:16.840
<v Speaker 1>immediately identify what is an object versus just a background.

0:10:16.960 --> 0:10:19.760
<v Speaker 1>So if you have a red mug placed in front

0:10:19.800 --> 0:10:23.320
<v Speaker 1>of a white cinder block wall, we can see what

0:10:23.520 --> 0:10:25.400
<v Speaker 1>is a mug and what is a wall. But we

0:10:25.480 --> 0:10:28.720
<v Speaker 1>have to teach computers how to do that, and when

0:10:28.720 --> 0:10:33.400
<v Speaker 1>you're talking about technologies that generate moving images, it becomes

0:10:33.640 --> 0:10:39.120
<v Speaker 1>even more complicated. So, for lack of a clear beginning,

0:10:39.640 --> 0:10:44.880
<v Speaker 1>I am somewhat arbitrarily going to start in nineteen ninety seven. Now,

0:10:44.920 --> 0:10:47.760
<v Speaker 1>a couple of things happened that year that would be

0:10:47.800 --> 0:10:51.520
<v Speaker 1>important for us to talk about, and one was not

0:10:51.800 --> 0:10:56.199
<v Speaker 1>quite deep fake technology, but it did illustrate some potential

0:10:57.240 --> 0:11:00.120
<v Speaker 1>ethical issues we had to think about. And that was

0:11:00.000 --> 0:11:04.679
<v Speaker 1>a commercial that aired during a big old American football game.

0:11:05.120 --> 0:11:08.720
<v Speaker 1>You know, the one that happens every year. You know, the

0:11:08.760 --> 0:11:11.600
<v Speaker 1>one I can't I can't call it by name for

0:11:11.679 --> 0:11:15.959
<v Speaker 1>you know, legal reasons. Anyway, one famous feature of this

0:11:16.120 --> 0:11:19.480
<v Speaker 1>big old American football game is that brands will shell

0:11:19.520 --> 0:11:22.839
<v Speaker 1>out huge amounts of money to air commercials during it.

0:11:23.200 --> 0:11:26.439
<v Speaker 1>And one brand to do that in nineteen ninety seven

0:11:27.120 --> 0:11:31.120
<v Speaker 1>was the Dirt Devil vacuum cleaner company. Now, those of

0:11:31.120 --> 0:11:33.120
<v Speaker 1>you across the pond would call it a Hoover, not

0:11:33.120 --> 0:11:35.720
<v Speaker 1>a vacuum cleaner, but a Hoover is a different brand altogether,

0:11:35.760 --> 0:11:39.720
<v Speaker 1>so stop confusing me. In the commercial, famous actor and

0:11:39.840 --> 0:11:44.400
<v Speaker 1>dancer Fred Astaire is shown dancing with Dirt Devil vacuum cleaners.

0:11:44.880 --> 0:11:48.400
<v Speaker 1>But here's the thing. Fred Astaire had died a decade earlier.

0:11:49.000 --> 0:11:52.760
<v Speaker 1>The footage was taken from his films, with Dirt Devil

0:11:52.880 --> 0:11:56.280
<v Speaker 1>inserting the imagery of its products into the footage to

0:11:56.320 --> 0:11:59.160
<v Speaker 1>make it seem as if Astaire had actually shot commercials

0:11:59.200 --> 0:12:02.400
<v Speaker 1>this way and really danced with vacuum cleaners. So in

0:12:02.440 --> 0:12:05.000
<v Speaker 1>this case, the footage of Astaire was legitimate. It

0:12:05.120 --> 0:12:07.199
<v Speaker 1>was the appearance of the vacuum cleaners that had been

0:12:07.240 --> 0:12:10.680
<v Speaker 1>inserted into it. But the use of footage of performers

0:12:10.679 --> 0:12:14.360
<v Speaker 1>who have passed away prompted a debate about the ethics

0:12:14.400 --> 0:12:18.320
<v Speaker 1>of that practice, and people began to speculate about what

0:12:18.559 --> 0:12:21.600
<v Speaker 1>might happen once technology reached a point where a computer

0:12:21.679 --> 0:12:27.239
<v Speaker 1>simulation of a person would be indistinguishable from the real thing. Meanwhile,

0:12:27.400 --> 0:12:30.920
<v Speaker 1>also in nineteen ninety seven, a group of computer scientists

0:12:31.000 --> 0:12:36.880
<v Speaker 1>published an important work. The scientists were Christoph, or Chris, Bregler,

0:12:37.720 --> 0:12:43.040
<v Speaker 1>Michele Covell, and Malcolm Slaney. The paper's title is Video

0:12:43.160 --> 0:12:47.960
<v Speaker 1>Rewrite: Driving Visual Speech with Audio. This work built on

0:12:48.040 --> 0:12:51.600
<v Speaker 1>top of a lot of other previous work. For example,

0:12:52.040 --> 0:12:56.480
<v Speaker 1>face interpretation was already a discipline in computer science. It

0:12:56.559 --> 0:12:59.400
<v Speaker 1>traces its history all the way back to the nineteen sixties.

0:13:00.040 --> 0:13:02.800
<v Speaker 1>Ditto for technology that could generate speech from text. That

0:13:02.920 --> 0:13:06.760
<v Speaker 1>too dates back to the nineteen sixties. Computer animation had

0:13:06.800 --> 0:13:09.720
<v Speaker 1>been around for a while by nineteen ninety seven, so

0:13:09.920 --> 0:13:13.040
<v Speaker 1>creating a three D model of lips, one that you

0:13:13.040 --> 0:13:17.520
<v Speaker 1>could subsequently animate, that was also a thing already. But

0:13:17.600 --> 0:13:20.800
<v Speaker 1>what these researchers did was they brought all these elements together.

0:13:21.360 --> 0:13:25.320
<v Speaker 1>It was a convergence of technologies that resulted in a

0:13:25.400 --> 0:13:30.000
<v Speaker 1>new application, one which would allow for computer generated synthetic

0:13:30.120 --> 0:13:35.559
<v Speaker 1>video of real people. The team created the Video Rewrite software,

0:13:35.960 --> 0:13:38.360
<v Speaker 1>and they also showed what it was capable of doing

0:13:38.400 --> 0:13:42.000
<v Speaker 1>in some very very short video clips. The results are

0:13:42.040 --> 0:13:46.199
<v Speaker 1>primitive by today's standards, but nonetheless impressive. In one two

0:13:46.320 --> 0:13:51.120
<v Speaker 1>second clip, President JFK appears to say I never met

0:13:51.160 --> 0:13:54.319
<v Speaker 1>Forrest Gump. It's a cheeky reference to the nineteen ninety

0:13:54.320 --> 0:13:57.640
<v Speaker 1>four film, which included a segment in which the titular

0:13:57.760 --> 0:14:02.360
<v Speaker 1>character Forrest Gump appears to meet JFK and then informs

0:14:02.440 --> 0:14:05.080
<v Speaker 1>him that he needs to rush off to the restroom.

0:14:05.720 --> 0:14:10.800
<v Speaker 1>Video Rewrite served as a foundation for technologies that we

0:14:11.040 --> 0:14:14.559
<v Speaker 1>could refer to as deep fake tech. So just a

0:14:14.600 --> 0:14:18.040
<v Speaker 1>few years later, in two thousand and one, Christopher J. Taylor,

0:14:18.320 --> 0:14:22.520
<v Speaker 1>Gareth J. Edwards, and Timothy, my middle initial is F

0:14:22.640 --> 0:14:25.080
<v Speaker 1>and not J, which actually upsets Jonathan because of a

0:14:25.120 --> 0:14:29.240
<v Speaker 1>lack of consistency, Cootes published a paper that was titled

0:14:29.560 --> 0:14:33.480
<v Speaker 1>Active Appearance Models. The abstract for this paper reads, in

0:14:33.560 --> 0:14:37.880
<v Speaker 1>part quote, we describe a new method of matching statistical

0:14:37.920 --> 0:14:43.640
<v Speaker 1>models of appearance to images, end quote. Now, in plain English,

0:14:43.840 --> 0:14:47.240
<v Speaker 1>this paper describes a method in which computer vision relies

0:14:47.320 --> 0:14:52.280
<v Speaker 1>on statistical models to more accurately identify elements within the image.
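
NOTE Editor's sketch, not part of the spoken episode. It illustrates what "matching a statistical
model of appearance to an image" can mean in the simplest possible terms: learn the average
position and spread of a few facial landmarks, then score how face-like a new guess is.
Everything here is an assumption for illustration: the landmark coordinates are made up, and
fit_shape_model and plausibility are hypothetical helper names. The actual paper builds far
richer models of shape and texture and fits them iteratively; this toy does neither.
```python
# Toy statistical shape model (assumption: illustrative only, not the paper's algorithm).
from statistics import mean, pstdev
# Hypothetical training data: (x, y) landmarks for left eye, right eye, nose tip, mouth.
TRAINING_FACES = [
    [(30, 40), (70, 40), (50, 60), (50, 80)],
    [(32, 42), (68, 41), (50, 62), (51, 82)],
    [(29, 39), (71, 40), (49, 59), (50, 79)],
    [(31, 41), (69, 39), (51, 61), (49, 81)],
]
def fit_shape_model(faces):
    """Return a (mean, standard deviation) pair for every landmark coordinate."""
    flat = [[c for point in face for c in point] for face in faces]
    columns = list(zip(*flat))
    # Fall back to 1.0 when a coordinate never varies, to avoid dividing by zero later.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]
def plausibility(model, face):
    """Largest deviation from the mean shape, in standard deviations (lower is more face-like)."""
    flat = [c for point in face for c in point]
    return max(abs(value - mu) / sigma for value, (mu, sigma) in zip(flat, model))
model = fit_shape_model(TRAINING_FACES)
good_guess = [(30, 40), (70, 40), (50, 61), (50, 80)]
bad_guess = [(30, 40), (70, 40), (50, 20), (50, 80)]  # nose placed above the eyes
print(plausibility(model, good_guess) < plausibility(model, bad_guess))
```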

0:14:52.440 --> 0:14:56.680
<v Speaker 1>So let's consider facial recognition technology. As I mentioned earlier,

0:14:57.160 --> 0:15:01.280
<v Speaker 1>computers do not inherently understand images. If presented with a

0:15:01.280 --> 0:15:04.920
<v Speaker 1>picture of a face, a computer cannot naturally determine what

0:15:05.040 --> 0:15:08.600
<v Speaker 1>the various features of that face are. Only through proper

0:15:08.640 --> 0:15:11.360
<v Speaker 1>programming and machine learning can you start to do this

0:15:11.760 --> 0:15:15.720
<v Speaker 1>and train a computer to recognize features like a nose,

0:15:16.240 --> 0:15:20.640
<v Speaker 1>a mouth, eyebrows, eyes, et cetera. And by training machines

0:15:20.680 --> 0:15:23.760
<v Speaker 1>on millions of faces, you can reach a point where

0:15:23.800 --> 0:15:26.400
<v Speaker 1>the machine can examine a new face, one that has

0:15:26.520 --> 0:15:29.920
<v Speaker 1>never before been submitted to the machine, and attempt to

0:15:30.000 --> 0:15:34.520
<v Speaker 1>identify those features. This is a necessary step with a

0:15:34.560 --> 0:15:37.840
<v Speaker 1>lot of deep fake technology. See, to call all deep

0:15:37.880 --> 0:15:42.440
<v Speaker 1>fakes computer generated is a little misleading. Often what is

0:15:42.520 --> 0:15:46.920
<v Speaker 1>happening is a computer is replacing an existing person or

0:15:47.040 --> 0:15:51.600
<v Speaker 1>face in a video with someone else's features. In order

0:15:51.640 --> 0:15:53.520
<v Speaker 1>to do that, you first have to be able to

0:15:53.680 --> 0:15:57.840
<v Speaker 1>map and identify the original person that was in the video.

0:15:58.200 --> 0:16:01.920
<v Speaker 1>You need to be able to match the synthesized face

0:16:02.360 --> 0:16:06.200
<v Speaker 1>with the movements of the original face. To do that,

0:16:06.240 --> 0:16:09.920
<v Speaker 1>the computer first has to encode the original face, essentially

0:16:10.240 --> 0:16:13.760
<v Speaker 1>to break it down into lots of smaller shapes. Then

0:16:13.800 --> 0:16:16.600
<v Speaker 1>it has to be able to match the synthesized face

0:16:17.000 --> 0:16:20.920
<v Speaker 1>to the original one with a similar encoded approach, and

0:16:20.920 --> 0:16:24.640
<v Speaker 1>then decode that into the synthesized face that replaces the

0:16:24.680 --> 0:16:28.080
<v Speaker 1>original one and then follows the various motions of the

0:16:28.080 --> 0:16:32.880
<v Speaker 1>original face. So you're replacing one person with another through

0:16:32.880 --> 0:16:35.000
<v Speaker 1>the use of a computer, and as part of that,

0:16:35.040 --> 0:16:38.000
<v Speaker 1>the computer has to break down the original person into

0:16:38.080 --> 0:16:41.280
<v Speaker 1>points of data that the computer can handle. So with

0:16:41.400 --> 0:16:45.760
<v Speaker 1>this technology, I could stand facing a camera and deliver

0:16:45.840 --> 0:16:48.800
<v Speaker 1>a speech and then, using software designed to follow the

0:16:48.800 --> 0:16:52.880
<v Speaker 1>steps I just laid out, replace my image with that

0:16:53.000 --> 0:16:56.080
<v Speaker 1>of someone else. If I also used a program designed

0:16:56.120 --> 0:17:00.480
<v Speaker 1>to create a vocal impersonation of that someone else, well

0:17:00.560 --> 0:17:03.320
<v Speaker 1>I could create a video where some celebrity says things

0:17:03.320 --> 0:17:06.920
<v Speaker 1>that they would never say. Like, maybe I could create

0:17:06.960 --> 0:17:10.040
<v Speaker 1>a video of Keanu Reeves saying tech Stuff is my

0:17:10.160 --> 0:17:13.720
<v Speaker 1>favorite podcast. Jonathan is such a cool host. I wish

0:17:13.840 --> 0:17:17.240
<v Speaker 1>I could hang out with him. For the record, mister Reeves,

0:17:17.320 --> 0:17:20.080
<v Speaker 1>I would never actually do that. I'm just saying I

0:17:20.200 --> 0:17:24.639
<v Speaker 1>could do it. Of course, creating a video image of

0:17:24.720 --> 0:17:27.640
<v Speaker 1>Keanu Reeves would just be one part of the equation.
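
NOTE Editor's sketch, not part of the spoken episode. The encode, match and decode steps
described a few cues earlier can be illustrated with a toy that works on a handful of
landmark points instead of pixels. This is a stand-in under stated assumptions, not how
production deep fake tools are built: those typically train neural networks on many images.
The NEUTRAL table, the landmark names and the face_width, encode and decode helpers are all
hypothetical and exist only for this example.
```python
# Toy landmark-based face swap (assumption: a simplified stand-in for a learned encoder/decoder).
# Hypothetical neutral (resting) landmarks for two identities, as (x, y) points.
NEUTRAL = {
    "host": {"mouth_left": (40, 80), "mouth_right": (60, 80), "brow": (50, 30)},
    "celebrity": {"mouth_left": (35, 90), "mouth_right": (65, 90), "brow": (50, 25)},
}
def face_width(identity):
    """Mouth width of the neutral face, used to scale motion between faces of different sizes."""
    left, right = NEUTRAL[identity]["mouth_left"], NEUTRAL[identity]["mouth_right"]
    return right[0] - left[0]
def encode(identity, frame):
    """Break a frame down into size-independent offsets from that person's neutral face."""
    scale = face_width(identity)
    return {name: ((x - NEUTRAL[identity][name][0]) / scale, (y - NEUTRAL[identity][name][1]) / scale)
            for name, (x, y) in frame.items()}
def decode(identity, code):
    """Rebuild landmarks for another identity by applying the encoded motion to their neutral face."""
    scale = face_width(identity)
    return {name: (NEUTRAL[identity][name][0] + dx * scale, NEUTRAL[identity][name][1] + dy * scale)
            for name, (dx, dy) in code.items()}
# One video frame of the host smiling: mouth corners move out and up, brow lifts.
host_frame = {"mouth_left": (38, 78), "mouth_right": (62, 78), "brow": (50, 28)}
swapped = decode("celebrity", encode("host", host_frame))
print(swapped)
```
Running encode and decode once per frame is what makes the replacement face follow the
original performance; the voice would still have to be handled separately.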

0:17:27.920 --> 0:17:31.120
<v Speaker 1>Another would be replicating his voice. Now, I could try

0:17:31.160 --> 0:17:34.600
<v Speaker 1>and do my own impersonation, but this would so clearly

0:17:34.640 --> 0:17:37.399
<v Speaker 1>be fake that I would never achieve my goal of

0:17:37.440 --> 0:17:39.600
<v Speaker 1>trying to make it appear as though Keanu Reeves knows

0:17:39.600 --> 0:17:41.240
<v Speaker 1>who I am and wants to hang out with me.

0:17:41.920 --> 0:17:44.600
<v Speaker 1>I can't even say WHOA the way he does. To

0:17:44.720 --> 0:17:48.800
<v Speaker 1>achieve my dreams, I would need a voice synthesis program

0:17:48.960 --> 0:17:52.600
<v Speaker 1>that I could train on Keanu's voice and then produce

0:17:52.640 --> 0:17:57.959
<v Speaker 1>a computer generated impersonation. The history of voice synthesis is

0:17:58.200 --> 0:18:01.080
<v Speaker 1>crazy long. I mean, if we really, we really wanted

0:18:01.119 --> 0:18:02.840
<v Speaker 1>to dive into it, we could go all the way

0:18:02.840 --> 0:18:07.560
<v Speaker 1>back to the late seventeen hundreds. But we won't because

0:18:07.560 --> 0:18:11.119
<v Speaker 1>I can't keep you here that long. Text to speech

0:18:11.200 --> 0:18:14.560
<v Speaker 1>technology brings us a bit closer to modern day, but

0:18:14.960 --> 0:18:17.840
<v Speaker 1>then we're still talking about the nineteen sixties or thereabouts,

0:18:17.840 --> 0:18:20.600
<v Speaker 1>as I mentioned earlier in this episode. To get to

0:18:20.640 --> 0:18:23.720
<v Speaker 1>a point where computers are capable of producing an imitation

0:18:23.840 --> 0:18:27.480
<v Speaker 1>of a specific person's voice, then we're getting up to

0:18:27.560 --> 0:18:31.199
<v Speaker 1>like the last decade or so. Researchers built tools that,

0:18:31.320 --> 0:18:36.440
<v Speaker 1>after training on how a specific person produces different sounds, phonemes,

0:18:36.880 --> 0:18:38.520
<v Speaker 1>if we want to think of it in terms of

0:18:38.600 --> 0:18:41.680
<v Speaker 1>language and the sounds of language, well, then we have

0:18:42.080 --> 0:18:46.040
<v Speaker 1>applications that can take text, interpret that text as a

0:18:46.119 --> 0:18:49.520
<v Speaker 1>series of sounds, pull upon the computer knowledge of how

0:18:49.600 --> 0:18:55.200
<v Speaker 1>a particular person makes those specific sounds, and then voila,

0:18:55.640 --> 0:18:58.679
<v Speaker 1>we have ourselves a copy. Now, early versions of this

0:18:58.760 --> 0:19:02.960
<v Speaker 1>technology were understandably a bit limited. You would end up

0:19:03.040 --> 0:19:06.719
<v Speaker 1>with speech that on a surface level sounded like the

0:19:06.760 --> 0:19:10.720
<v Speaker 1>person in question, the synthesized person, but it would typically

0:19:10.720 --> 0:19:15.080
<v Speaker 1>come across as flat or using incorrect inflection to emphasize

0:19:15.119 --> 0:19:17.960
<v Speaker 1>a point. So think of that kind of robotic sound

0:19:18.040 --> 0:19:21.320
<v Speaker 1>you would get with early personal assistants, right, like if

0:19:21.359 --> 0:19:25.280
<v Speaker 1>you were using a GPS system, which I realized I

0:19:25.440 --> 0:19:29.120
<v Speaker 1>just used a repetition there, like ATM machine. But let's

0:19:29.119 --> 0:19:33.040
<v Speaker 1>say you're using a GPS and it has a voice

0:19:33.080 --> 0:19:37.959
<v Speaker 1>associated with it. Older ones were very robotic, and they

0:19:37.960 --> 0:19:40.920
<v Speaker 1>could also say things that were hilariously wrong. I'll never

0:19:40.960 --> 0:19:43.640
<v Speaker 1>forget the time I was riding in a car and

0:19:43.720 --> 0:19:47.440
<v Speaker 1>the GPS told us to turn right on Oak Doctor

0:19:47.960 --> 0:19:52.879
<v Speaker 1>instead of Oak Drive. But over time the models improved

0:19:53.000 --> 0:19:55.800
<v Speaker 1>and things started to sound a bit more natural. So

0:19:57.040 --> 0:20:01.199
<v Speaker 1>those early ones, not so good. You would not mistake them for

0:20:01.280 --> 0:20:03.679
<v Speaker 1>being a real person. It would sound like a robot

0:20:03.880 --> 0:20:06.720
<v Speaker 1>making an impersonation of that person. But models would

0:20:06.720 --> 0:20:11.160
<v Speaker 1>grow in sophistication, and training sessions would include examples where

0:20:11.359 --> 0:20:15.400
<v Speaker 1>the target's expressions would be associated with specific emotions like anger,

0:20:15.640 --> 0:20:20.240
<v Speaker 1>or happiness or sadness. You can actually use a voice

0:20:20.280 --> 0:20:24.520
<v Speaker 1>synthesizer yourself and train it. And as part of that,

0:20:24.600 --> 0:20:28.480
<v Speaker 1>you're typically told to read out sentences with different emotional

0:20:28.520 --> 0:20:31.679
<v Speaker 1>weight to them, So using a bit of appropriate text,

0:20:32.320 --> 0:20:35.199
<v Speaker 1>then maybe some metadata to indicate what emotion should be

0:20:35.320 --> 0:20:38.680
<v Speaker 1>used to read out that text. It then becomes possible

0:20:38.720 --> 0:20:43.439
<v Speaker 1>to craft vocal performances that were and are difficult to

0:20:43.520 --> 0:20:47.119
<v Speaker 1>distinguish from the real thing. We're going to take a

0:20:47.200 --> 0:20:49.800
<v Speaker 1>quick break to thank our sponsor, and then I'll be

0:20:49.880 --> 0:20:53.439
<v Speaker 1>back to talk more about the history and impact of

0:20:53.520 --> 0:21:08.080
<v Speaker 1>deep fake technology. Back to our history of video deep fakes.

0:21:08.160 --> 0:21:11.199
<v Speaker 1>We left off at two thousand and one, and for

0:21:11.280 --> 0:21:15.520
<v Speaker 1>nearly two decades computer scientists continued to work on systems

0:21:15.920 --> 0:21:20.600
<v Speaker 1>that would push forward the capabilities of synthesized video content.

0:21:21.280 --> 0:21:23.800
<v Speaker 1>By the time we get up to twenty seventeen, a

0:21:23.920 --> 0:21:28.320
<v Speaker 1>pair of papers explained that the advancements in consumer computers

0:21:28.640 --> 0:21:31.280
<v Speaker 1>had reached a point where it was actually possible to

0:21:31.400 --> 0:21:36.159
<v Speaker 1>achieve synthesized video using off the shelf computer systems, and

0:21:36.240 --> 0:21:39.320
<v Speaker 1>that would be a huge game changer. No longer would

0:21:39.359 --> 0:21:45.280
<v Speaker 1>you need access to incredibly powerful systems with specialized software.

0:21:45.720 --> 0:21:50.320
<v Speaker 1>Now you could potentially create or access an application on

0:21:50.400 --> 0:21:52.560
<v Speaker 1>an off the shelf computer to do the same thing.

0:21:53.240 --> 0:21:56.960
<v Speaker 1>So the tools to generate computer synthesized video now were

0:21:57.040 --> 0:22:00.720
<v Speaker 1>within the grasp of the average computer user. With cloud

0:22:00.720 --> 0:22:04.120
<v Speaker 1>based services that could augment these efforts, it became possible

0:22:04.160 --> 0:22:06.480
<v Speaker 1>for a creative person to make videos that appear to

0:22:06.520 --> 0:22:10.480
<v Speaker 1>show people doing and saying things that they never actually did.

0:22:11.119 --> 0:22:14.800
<v Speaker 1>And again, there are multiple uses for such technology. Not

0:22:14.960 --> 0:22:18.600
<v Speaker 1>all of them are sinister, but it doesn't take much

0:22:18.640 --> 0:22:22.320
<v Speaker 1>imagination to come up with scenarios where things get grim,

0:22:22.560 --> 0:22:25.879
<v Speaker 1>And indeed, many early uses of this tech once it

0:22:25.880 --> 0:22:30.320
<v Speaker 1>became accessible, were bad. One big one was using face

0:22:30.400 --> 0:22:34.520
<v Speaker 1>swapping technology to make it appear as though someone famous

0:22:34.680 --> 0:22:39.520
<v Speaker 1>or otherwise was appearing in an adult video. And I

0:22:39.520 --> 0:22:41.920
<v Speaker 1>think it goes without saying that this is a total

0:22:42.080 --> 0:22:46.159
<v Speaker 1>violation of the victim. It robs them of agency and

0:22:46.320 --> 0:22:50.080
<v Speaker 1>they may end up suffering consequences despite not being remotely

0:22:50.160 --> 0:22:55.480
<v Speaker 1>responsible for the content. So imagine facing judgment for something

0:22:55.520 --> 0:22:58.680
<v Speaker 1>that not only you did not do, but you had

0:22:58.720 --> 0:23:03.240
<v Speaker 1>no way of preventing. Honestly, it's impossible for me to

0:23:03.280 --> 0:23:07.560
<v Speaker 1>communicate how devastating this can be. There are several accounts

0:23:07.600 --> 0:23:10.600
<v Speaker 1>online written by people who have been the victim of

0:23:10.640 --> 0:23:13.520
<v Speaker 1>this sort of activity, and they are worth your time.

0:23:13.880 --> 0:23:17.240
<v Speaker 1>They are harrowing to read, but it is important. Their

0:23:17.280 --> 0:23:20.800
<v Speaker 1>words will far more effectively explain how traumatizing this experience

0:23:20.840 --> 0:23:24.879
<v Speaker 1>can be. And just as a reminder, the rise of

0:23:24.960 --> 0:23:28.840
<v Speaker 1>social networks means that we've all been sharing a lot

0:23:28.960 --> 0:23:32.560
<v Speaker 1>of images of ourselves, videos of ourselves. There's a lot

0:23:32.560 --> 0:23:35.879
<v Speaker 1>of content out there that could be used to train

0:23:36.440 --> 0:23:40.399
<v Speaker 1>various machine models. So it's something to keep in mind

0:23:40.800 --> 0:23:44.280
<v Speaker 1>that even if you aren't concerned right now, there's nothing

0:23:44.320 --> 0:23:48.160
<v Speaker 1>to say that you couldn't become a victim tomorrow. Deep

0:23:48.200 --> 0:23:52.800
<v Speaker 1>fakes also pose a risk to organizations. It's not just individuals.

0:23:53.240 --> 0:23:56.119
<v Speaker 1>So imagine for a moment that you see you have

0:23:56.160 --> 0:23:58.919
<v Speaker 1>a voicemail at work, and you pull it up, and

0:23:58.960 --> 0:24:01.520
<v Speaker 1>you listen to the voicemail, and it sounds like your boss,

0:24:01.960 --> 0:24:03.800
<v Speaker 1>and your boss is telling you that you need to

0:24:03.840 --> 0:24:09.000
<v Speaker 1>transfer company funds from the company account into a different one.

0:24:09.440 --> 0:24:11.879
<v Speaker 1>And perhaps they say that it's in order for you

0:24:11.920 --> 0:24:14.760
<v Speaker 1>to pay off some third party vendor for a project

0:24:14.760 --> 0:24:18.320
<v Speaker 1>that you're not really familiar with. But then maybe it

0:24:18.359 --> 0:24:21.160
<v Speaker 1>turns out that voicemail wasn't from your boss after all.

0:24:21.720 --> 0:24:25.679
<v Speaker 1>Maybe it was the result of spear phishing. Maybe a nefarious

0:24:25.680 --> 0:24:29.840
<v Speaker 1>thief has identified you as a possible key to stealing

0:24:29.920 --> 0:24:34.440
<v Speaker 1>money from your organization and has used tech to impersonate

0:24:34.480 --> 0:24:39.119
<v Speaker 1>your boss and direct you toward facilitating a crime. You

0:24:39.240 --> 0:24:43.560
<v Speaker 1>unknowingly have become an accomplice. There's actually been a case

0:24:43.640 --> 0:24:46.400
<v Speaker 1>where this sort of thing was alleged to have happened.

0:24:46.440 --> 0:24:49.960
<v Speaker 1>Now I have to say alleged, because there were questions

0:24:50.000 --> 0:24:53.159
<v Speaker 1>about whether or not it really was a case of

0:24:53.280 --> 0:24:57.560
<v Speaker 1>a synthesized voice, or if maybe this was more of

0:24:57.600 --> 0:25:02.240
<v Speaker 1>a straightforward embezzlement issue, and the deep fake defense aka

0:25:02.400 --> 0:25:06.760
<v Speaker 1>the liar's dividend came into play. Deep fakes have come

0:25:07.520 --> 0:25:11.840
<v Speaker 1>a long way in a few short years. However, they

0:25:11.840 --> 0:25:15.720
<v Speaker 1>are not perfect. There can be telltale signs that a

0:25:15.880 --> 0:25:20.120
<v Speaker 1>video is fake, though they can sometimes be too subtle

0:25:20.240 --> 0:25:23.800
<v Speaker 1>for the human eye to detect. Sometimes there's a dead giveaway.

0:25:24.400 --> 0:25:26.560
<v Speaker 1>You're watching a video and you think this person is

0:25:26.560 --> 0:25:31.560
<v Speaker 1>blinking too frequently or not frequently enough, or maybe their

0:25:31.560 --> 0:25:36.719
<v Speaker 1>eyes don't look quite right, or the movements you're seeing

0:25:36.960 --> 0:25:39.280
<v Speaker 1>don't line up, like a person is turning their head

0:25:39.320 --> 0:25:41.760
<v Speaker 1>one way, their eyes are shifting another in a way

0:25:41.760 --> 0:25:45.000
<v Speaker 1>that just doesn't seem natural. There are those sorts of

0:25:45.040 --> 0:25:47.000
<v Speaker 1>things that people can pick up on. There are some that

0:25:47.040 --> 0:25:50.280
<v Speaker 1>are far more subtle, and deep fake detection tools are

0:25:50.359 --> 0:25:52.920
<v Speaker 1>growing in importance as a result of this. There are

0:25:52.960 --> 0:25:57.639
<v Speaker 1>tools that are trained to spot signs of fakery, sometimes

0:25:57.760 --> 0:26:00.240
<v Speaker 1>ones that are far too subtle for us to notice.

0:26:00.400 --> 0:26:04.560
<v Speaker 1>So it may be things like inconsistencies in lighting and

0:26:04.640 --> 0:26:08.960
<v Speaker 1>the quality of reflections within the frame. Things like that

0:26:09.720 --> 0:26:12.720
<v Speaker 1>may end up being an indication that a video was

0:26:12.800 --> 0:26:19.399
<v Speaker 1>manufactured artificially rather than an actual recording, and they're becoming

0:26:19.440 --> 0:26:23.879
<v Speaker 1>more and more important for people and for organizations. So

0:26:23.920 --> 0:26:27.639
<v Speaker 1>in addition to those tools, organization leaders should really prepare

0:26:27.680 --> 0:26:32.560
<v Speaker 1>employees for the possibility of encountering deep fakes. Critical thinking

0:26:32.720 --> 0:26:38.120
<v Speaker 1>is a big part of uncovering deception, as is preparation. Heck,

0:26:38.200 --> 0:26:40.760
<v Speaker 1>depending on the organization, you might go so far as

0:26:40.840 --> 0:26:45.040
<v Speaker 1>to set up a phrase or question as an authentication

0:26:45.200 --> 0:26:48.359
<v Speaker 1>process at the top of an official phone call or

0:26:48.800 --> 0:26:51.320
<v Speaker 1>video meeting, so that the person on the other end

0:26:51.320 --> 0:26:55.080
<v Speaker 1>of the line can verify that things are legit. I

0:26:55.119 --> 0:26:57.320
<v Speaker 1>know it sounds like you're going a far way, but

0:26:57.440 --> 0:27:01.119
<v Speaker 1>as this technology gets more sophisticated, as people deploy it

0:27:01.800 --> 0:27:05.960
<v Speaker 1>in ways that are potentially harmful, you have to start

0:27:06.000 --> 0:27:09.080
<v Speaker 1>to think about these things. What we do not want

0:27:09.119 --> 0:27:11.879
<v Speaker 1>to do is to enter into an era where we

0:27:11.960 --> 0:27:15.600
<v Speaker 1>can no longer reliably determine the real from the fake.

0:27:16.240 --> 0:27:18.600
<v Speaker 1>But there is no putting the cat back in the bag,

0:27:19.080 --> 0:27:21.919
<v Speaker 1>or the genie in the bottle or baby in the corner.

0:27:22.400 --> 0:27:26.679
<v Speaker 1>The technology isn't going away. It will not disappear. It

0:27:26.680 --> 0:27:30.480
<v Speaker 1>will continue to evolve and to improve, and so it

0:27:30.520 --> 0:27:34.359
<v Speaker 1>falls upon us to educate ourselves as best we can

0:27:34.840 --> 0:27:40.440
<v Speaker 1>in preparation for encountering it, and to think about how

0:27:40.480 --> 0:27:45.919
<v Speaker 1>we can address the flagrant misuses of the technology to

0:27:46.640 --> 0:27:49.439
<v Speaker 1>attempt to dissuade people from using it in that way,

0:27:49.680 --> 0:27:53.600
<v Speaker 1>because again, the victimization element of this can be really

0:27:53.640 --> 0:27:58.080
<v Speaker 1>severe and really traumatizing and incredibly disruptive to a person's life.

0:27:58.680 --> 0:28:03.560
<v Speaker 1>We should not forget that either. So in conclusion, I

0:28:03.600 --> 0:28:07.320
<v Speaker 1>will say that this technology is truly impressive and again

0:28:07.400 --> 0:28:10.720
<v Speaker 1>it can have some really incredible uses. I don't want

0:28:10.760 --> 0:28:13.560
<v Speaker 1>to paint it as just being a bad thing. It

0:28:13.640 --> 0:28:16.200
<v Speaker 1>is not good or bad. It is how we use

0:28:16.240 --> 0:28:19.480
<v Speaker 1>it that determines whether or not the end result is

0:28:19.520 --> 0:28:24.359
<v Speaker 1>a positive one or a negative one. But only by

0:28:24.440 --> 0:28:27.960
<v Speaker 1>learning about it can we prepare for what is to come.

0:28:28.520 --> 0:28:32.040
<v Speaker 1>So I hope that you found this episode informative, that

0:28:32.080 --> 0:28:35.480
<v Speaker 1>you have a deeper appreciation for what this technology does

0:28:35.680 --> 0:28:38.360
<v Speaker 1>and what it is capable of, and I will speak

0:28:38.360 --> 0:28:49.000
<v Speaker 1>to you again really soon. Tech Stuff is an iHeartRadio production.

0:28:49.280 --> 0:28:54.320
<v Speaker 1>For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts,

0:28:54.440 --> 0:28:56.440
<v Speaker 1>or wherever you listen to your favorite shows.