WEBVTT - Rerun: Deep Learning and Deepfakes

0:00:04.400 --> 0:00:07.760
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio.

0:00:11.800 --> 0:00:14.200
<v Speaker 1>Hey there, and welcome to TechStuff. I'm your host,

0:00:14.320 --> 0:00:16.840
<v Speaker 1>Jonathan Strickland. I'm an executive producer with iHeartRadio,

0:00:16.880 --> 0:00:20.720
<v Speaker 1>and how the tech are you? I am currently on

0:00:21.320 --> 0:00:26.280
<v Speaker 1>vacation celebrating my anniversary, and I didn't want to leave

0:00:26.360 --> 0:00:29.560
<v Speaker 1>you without an episode, so the episode we're going to

0:00:29.600 --> 0:00:32.640
<v Speaker 1>play for you was recorded and published on September seven,

0:00:32.760 --> 0:00:36.200
<v Speaker 1>twenty twenty. It is called Deep Learning and Deepfakes,

0:00:36.200 --> 0:00:41.400
<v Speaker 1>and recent developments in the deep fakes field include researchers

0:00:41.440 --> 0:00:46.400
<v Speaker 1>creating tools that can detect tells in artificial voices, for example.

0:00:47.040 --> 0:00:49.280
<v Speaker 1>But really, when you think about that, it's just a

0:00:49.360 --> 0:00:52.560
<v Speaker 1>seesaw-like pattern, where we'll see deep fake technology

0:00:52.600 --> 0:00:56.040
<v Speaker 1>improve over time, and then our ability to detect deep

0:00:56.080 --> 0:00:59.200
<v Speaker 1>fakes will improve, and this will keep going until one

0:00:59.200 --> 0:01:03.600
<v Speaker 1>side or the other has the edge permanently. Now we

0:01:03.800 --> 0:01:06.640
<v Speaker 1>kind of talk about that in this episode. In fact,

0:01:07.120 --> 0:01:10.160
<v Speaker 1>also deep fakes are very much in the spotlight literally

0:01:10.600 --> 0:01:13.679
<v Speaker 1>on the popular TV series America's Got Talent. A team

0:01:13.680 --> 0:01:16.360
<v Speaker 1>from the startup Metaphysic made it all the way to

0:01:16.360 --> 0:01:19.400
<v Speaker 1>the final round of the competition by creating deep fake

0:01:19.560 --> 0:01:22.319
<v Speaker 1>copies of the famous judges on the show all in

0:01:22.400 --> 0:01:28.640
<v Speaker 1>real time. It's equal parts entertaining and terrifying. Okay, maybe

0:01:29.520 --> 0:01:38.360
<v Speaker 1>entertaining terrifying. Anyway, enjoy this episode, Deep Learning and Deepfakes. Now,

0:01:38.360 --> 0:01:41.039
<v Speaker 1>before I get into today's episode, I want to give

0:01:41.160 --> 0:01:45.120
<v Speaker 1>a little listener warning here. The topic at hand involves

0:01:45.200 --> 0:01:48.680
<v Speaker 1>some adult content, including the use of technology to do

0:01:48.800 --> 0:01:55.120
<v Speaker 1>stuff that can be unethical, illegal, hurtful, and just plain awful. Now,

0:01:55.160 --> 0:01:57.960
<v Speaker 1>I think this is an important topic, but I wanted

0:01:58.000 --> 0:01:59.840
<v Speaker 1>to give a bit of a heads up at this

0:02:00.000 --> 0:02:02.440
<v Speaker 1>part of the episode, just in case any of you

0:02:02.440 --> 0:02:05.720
<v Speaker 1>guys are listening to a podcast on like a family

0:02:05.960 --> 0:02:08.799
<v Speaker 1>road trip or something. I think this is an important

0:02:08.840 --> 0:02:12.200
<v Speaker 1>topic and I think everyone should know about it and

0:02:12.240 --> 0:02:14.320
<v Speaker 1>think about it. But I also respect that for some

0:02:14.360 --> 0:02:17.800
<v Speaker 1>people this subject might get a bit taboo. So let's

0:02:17.840 --> 0:02:23.160
<v Speaker 1>go on with the episode. Back in nineteen ninety three, a movie

0:02:23.320 --> 0:02:28.160
<v Speaker 1>called Rising Sun, directed by Philip Kaufman, based on a

0:02:28.320 --> 0:02:32.079
<v Speaker 1>Michael Crichton novel and starring Wesley Snipes and Sean Connery

0:02:32.240 --> 0:02:35.639
<v Speaker 1>came out in theaters. Now, I didn't see it in theaters,

0:02:36.320 --> 0:02:38.640
<v Speaker 1>but I did catch it when it came on you know,

0:02:39.200 --> 0:02:43.040
<v Speaker 1>HBO or Cinemax or something later on. The movie included

0:02:43.080 --> 0:02:46.280
<v Speaker 1>a sequence that I found to be totally unbelievable. And

0:02:46.320 --> 0:02:50.000
<v Speaker 1>I'm not talking about buying into Sean Connery being an

0:02:50.040 --> 0:02:54.600
<v Speaker 1>expert on Japanese culture and business practices. Actually, side note,

0:02:54.760 --> 0:02:59.000
<v Speaker 1>Sean Connery has an interesting history of playing unlikely characters,

0:02:59.040 --> 0:03:01.760
<v Speaker 1>such as in Highlander, where he played an immortal

0:03:01.919 --> 0:03:05.720
<v Speaker 1>who is supposedly Egyptian, then who lived in feudal Japan

0:03:06.240 --> 0:03:09.120
<v Speaker 1>and ended up in Spain where he became known as Ramirez,

0:03:09.520 --> 0:03:12.080
<v Speaker 1>and all the while he's talking to a Scottish Highlander

0:03:12.240 --> 0:03:15.519
<v Speaker 1>who's played by a Belgian actor. But I'm getting way

0:03:15.520 --> 0:03:19.320
<v Speaker 1>off track here. Besides, I've heard Crichton actually wrote the

0:03:19.440 --> 0:03:22.280
<v Speaker 1>character while thinking of Connery, So you know, what the

0:03:22.280 --> 0:03:25.600
<v Speaker 1>heck do I know? In the film, Snipes and Connery

0:03:25.720 --> 0:03:29.519
<v Speaker 1>are investigators, and they're looking into a homicide that happened

0:03:29.560 --> 0:03:34.080
<v Speaker 1>at a Japanese business but on American soil. The security

0:03:34.080 --> 0:03:38.360
<v Speaker 1>system in the building captured video of the homicide and

0:03:38.400 --> 0:03:40.800
<v Speaker 1>the identity of the killer appears to be a pretty

0:03:40.840 --> 0:03:44.200
<v Speaker 1>open and shut case. But that's not how it all

0:03:44.240 --> 0:03:47.800
<v Speaker 1>turns out. The investigators talked to a security expert played

0:03:47.840 --> 0:03:51.720
<v Speaker 1>by Tia Carrere, and she demonstrates in real time how

0:03:51.840 --> 0:03:56.440
<v Speaker 1>video footage can be altered. She records a short video

0:03:56.800 --> 0:04:00.240
<v Speaker 1>of Connery and Snipes, loads that onto a computer,

0:04:00.560 --> 0:04:04.520
<v Speaker 1>freezes a frame of the video, and essentially performs a

0:04:04.600 --> 0:04:07.800
<v Speaker 1>cut and paste job swapping the heads of our two

0:04:07.880 --> 0:04:11.440
<v Speaker 1>lead characters. Then she resumes the video and the head

0:04:11.480 --> 0:04:16.800
<v Speaker 1>swap remains in place, and that head swap stuff is possible.

0:04:17.040 --> 0:04:19.120
<v Speaker 1>I mean, clearly it has to be possible, because you

0:04:19.160 --> 0:04:22.240
<v Speaker 1>actually do see that effect in the film itself. But

0:04:22.480 --> 0:04:25.040
<v Speaker 1>it takes a bit more than a quick cut and

0:04:25.120 --> 0:04:27.720
<v Speaker 1>paste job. But we'll leave off of that for now.

0:04:28.279 --> 0:04:31.920
<v Speaker 1>The whole point of that sequence, apart from showing off

0:04:32.000 --> 0:04:36.640
<v Speaker 1>some cinema magic, is to demonstrate to the investigators that video,

0:04:36.800 --> 0:04:41.080
<v Speaker 1>like photographs, can be altered. The expert has detected a

0:04:41.080 --> 0:04:44.520
<v Speaker 1>blue halo around the face of the supposed murderer in

0:04:44.600 --> 0:04:48.000
<v Speaker 1>the footage, indicating that some sort of trickery has happened.

0:04:48.480 --> 0:04:51.719
<v Speaker 1>She also reveals that she cannot magically restore the video

0:04:51.760 --> 0:04:54.839
<v Speaker 1>to its previous unaltered state, which I think was actually

0:04:54.880 --> 0:04:57.680
<v Speaker 1>a nice change of pace for a movie. By the way,

0:04:58.240 --> 0:05:01.480
<v Speaker 1>I think this movie is really, you know, not good,

0:05:01.960 --> 0:05:07.040
<v Speaker 1>like not worth your time, but that's my opinion anyway.

0:05:07.080 --> 0:05:10.400
<v Speaker 1>For years, this kind of video sorcery was pretty much

0:05:10.560 --> 0:05:14.600
<v Speaker 1>limited to the film and TV industries. It usually required

0:05:14.640 --> 0:05:18.360
<v Speaker 1>a lot of pre planning beforehand, so it wasn't as

0:05:18.400 --> 0:05:21.560
<v Speaker 1>simple as just taking footage that was already shot and

0:05:21.680 --> 0:05:24.559
<v Speaker 1>changing it in post on a whim with a couple

0:05:24.600 --> 0:05:26.800
<v Speaker 1>of clicks of a button. If it were, we would

0:05:26.800 --> 0:05:30.640
<v Speaker 1>see a lot fewer mistakes left in movies and television

0:05:30.760 --> 0:05:33.240
<v Speaker 1>because you could catch it later and just fix it.

0:05:33.640 --> 0:05:37.080
<v Speaker 1>But the tricks were possible, they were just difficult to

0:05:37.120 --> 0:05:40.159
<v Speaker 1>pull off. It just wasn't something you or I would

0:05:40.200 --> 0:05:43.960
<v Speaker 1>ever encounter in our day to day lives. But today

0:05:44.400 --> 0:05:47.239
<v Speaker 1>we live in a different world, a world that has

0:05:47.279 --> 0:05:52.240
<v Speaker 1>examples of synthetic media, commonly referred to as deep fakes.

0:05:52.880 --> 0:05:56.640
<v Speaker 1>These are videos that have been altered or generated so

0:05:56.760 --> 0:05:59.360
<v Speaker 1>that the subject of the video is doing something that

0:05:59.400 --> 0:06:03.640
<v Speaker 1>they probably really would or could never do. They've brought

0:06:03.640 --> 0:06:07.279
<v Speaker 1>into question whether or not video evidence is even reliable,

0:06:07.600 --> 0:06:10.919
<v Speaker 1>much as the film Rising Sun was talking about. We

0:06:11.000 --> 0:06:16.560
<v Speaker 1>already know that eyewitness testimony is terribly unreliable. Our perception

0:06:16.720 --> 0:06:20.080
<v Speaker 1>and memories play tricks on us, and we can quote

0:06:20.160 --> 0:06:24.160
<v Speaker 1>unquote remember stuff that just didn't happen the way things

0:06:24.279 --> 0:06:28.640
<v Speaker 1>actually unfolded in reality. But now we're looking at video

0:06:28.720 --> 0:06:33.320
<v Speaker 1>evidence in potentially the same light. I mean, it's scary.

0:06:33.400 --> 0:06:37.680
<v Speaker 1>So today we're going to learn about synthetic media, how

0:06:37.960 --> 0:06:42.080
<v Speaker 1>it can be generated, the implications that follow with that

0:06:42.240 --> 0:06:44.680
<v Speaker 1>sort of reality, and ways that people are trying to

0:06:44.720 --> 0:06:50.240
<v Speaker 1>counteract a potentially dangerous threat. You know, fun stuff. Now, first,

0:06:50.360 --> 0:06:54.719
<v Speaker 1>the term synthetic media has a particular meaning. It refers

0:06:54.760 --> 0:06:59.560
<v Speaker 1>to art created through some sort of automated process, so

0:06:59.680 --> 0:07:04.000
<v Speaker 1>it's a largely hands off approach to creating the final

0:07:04.720 --> 0:07:08.560
<v Speaker 1>art piece. Now, under that definition, the example of Rising

0:07:08.600 --> 0:07:12.040
<v Speaker 1>Sun would not apply here because we see in the

0:07:12.080 --> 0:07:14.840
<v Speaker 1>film and presumably this happens in the book as well,

0:07:14.880 --> 0:07:18.360
<v Speaker 1>but I haven't read the book that a human being

0:07:18.600 --> 0:07:22.400
<v Speaker 1>actually changes that. People have used tools to alter the

0:07:22.480 --> 0:07:26.240
<v Speaker 1>video footage. This would be more like using Photoshop to

0:07:26.240 --> 0:07:29.280
<v Speaker 1>touch up a still image, with the computer system presumably

0:07:29.360 --> 0:07:32.520
<v Speaker 1>doing some of the work in the background to keep

0:07:32.560 --> 0:07:35.200
<v Speaker 1>things matched up. Either that or you would need to

0:07:35.240 --> 0:07:38.920
<v Speaker 1>alter each image in the footage frame by frame, or

0:07:39.000 --> 0:07:42.960
<v Speaker 1>use some sort of matte approach. To learn more about mattes,

0:07:43.360 --> 0:07:45.480
<v Speaker 1>you can listen to my episode about how blue and

0:07:45.560 --> 0:07:50.000
<v Speaker 1>green screens work. Synthetic media as a general practice has

0:07:50.200 --> 0:07:54.320
<v Speaker 1>been around for centuries. Artists have set up various contraptions

0:07:54.360 --> 0:07:58.360
<v Speaker 1>to create works with little or no human guidance. In

0:07:58.400 --> 0:08:01.559
<v Speaker 1>the twentieth century we started to see a movement called

0:08:01.760 --> 0:08:05.160
<v Speaker 1>generative art take form. This type of art is all

0:08:05.200 --> 0:08:08.960
<v Speaker 1>about creating a system that then creates or generates the

0:08:09.040 --> 0:08:12.600
<v Speaker 1>finished art piece. That would mean that the finished work,

0:08:12.720 --> 0:08:16.800
<v Speaker 1>such as a painting, wouldn't reflect the feelings or thoughts

0:08:16.800 --> 0:08:20.240
<v Speaker 1>of the artists who created the system. In fact, it

0:08:20.320 --> 0:08:23.560
<v Speaker 1>starts to raise the question what is the art? Is

0:08:23.600 --> 0:08:26.400
<v Speaker 1>it the painting that came about due to a machine

0:08:26.480 --> 0:08:30.480
<v Speaker 1>following a program of some sort, or is the art

0:08:30.720 --> 0:08:35.079
<v Speaker 1>the program itself? Is the art the process by which

0:08:35.120 --> 0:08:37.800
<v Speaker 1>the painting was made? Now, I'm not here to answer

0:08:37.840 --> 0:08:41.400
<v Speaker 1>that question. I just think it is an interesting question

0:08:41.440 --> 0:08:46.480
<v Speaker 1>to ask. Sometimes people ask much less polite questions, such

0:08:46.520 --> 0:08:50.600
<v Speaker 1>as is it art at all? Some art critics went

0:08:50.640 --> 0:08:53.640
<v Speaker 1>out of their way to dismiss generative art in the

0:08:53.679 --> 0:08:58.280
<v Speaker 1>early days. They found it insulting, but hey, that's kind

0:08:58.320 --> 0:09:02.160
<v Speaker 1>of the history of art in general. Each new movement

0:09:02.200 --> 0:09:07.199
<v Speaker 1>in art inevitably finds both supporters and critics as it emerges.

0:09:07.640 --> 0:09:11.480
<v Speaker 1>If anything, you might argue that such a response legitimizes

0:09:11.720 --> 0:09:14.679
<v Speaker 1>the movement in you know, a weird way. If people

0:09:14.720 --> 0:09:18.880
<v Speaker 1>hate it, it must be something. In two thousand eighteen,

0:09:19.000 --> 0:09:23.800
<v Speaker 1>an artist collective called Obvious, located out of Paris, France,

0:09:24.200 --> 0:09:27.760
<v Speaker 1>submitted portrait style paintings that were created not by

0:09:27.880 --> 0:09:32.719
<v Speaker 1>an actual human painter, but by an artificially intelligent system.

0:09:32.760 --> 0:09:37.280
<v Speaker 1>Now they looked a lot like typical eighteenth century style portraits.

0:09:37.920 --> 0:09:41.400
<v Speaker 1>There was no attempt to pass off the portrait as

0:09:41.440 --> 0:09:44.360
<v Speaker 1>if it were actually made by a human artist. In fact,

0:09:44.800 --> 0:09:47.959
<v Speaker 1>the appeal of the piece was largely due to it

0:09:48.080 --> 0:09:52.840
<v Speaker 1>being synthetically generated. It went to auction at Christie's and

0:09:52.960 --> 0:09:59.000
<v Speaker 1>the AI created painting fetched more than four hundred thousand dollars.

0:09:59.120 --> 0:10:02.240
<v Speaker 1>And the way the group trained their AI is relevant

0:10:02.280 --> 0:10:06.720
<v Speaker 1>to our discussion about deep fakes. The collective relied on

0:10:06.800 --> 0:10:11.560
<v Speaker 1>a type of machine learning called generative adversarial networks or

0:10:11.800 --> 0:10:16.320
<v Speaker 1>GAN, which in turn depends on deep learning.

0:10:16.400 --> 0:10:18.079
<v Speaker 1>So it looks like we've got a few things we're

0:10:18.080 --> 0:10:20.760
<v Speaker 1>gonna have to define here. Now, I'm going to keep

0:10:20.840 --> 0:10:24.840
<v Speaker 1>things fairly high level, because, as it turns out, there

0:10:24.880 --> 0:10:28.240
<v Speaker 1>are a few different ways to create machine learning models,

0:10:28.600 --> 0:10:31.160
<v Speaker 1>and to go through all of them in exhaustive detail

0:10:31.280 --> 0:10:34.760
<v Speaker 1>would represent a university level course in machine learning. I

0:10:34.800 --> 0:10:38.280
<v Speaker 1>have neither the time for that nor the expertise. I

0:10:38.320 --> 0:10:41.920
<v Speaker 1>would do a terrible job, so we'll go with a

0:10:42.040 --> 0:10:47.560
<v Speaker 1>high level perspective here. First. A generative adversarial network uses

0:10:47.679 --> 0:10:51.800
<v Speaker 1>two systems. You have a generator and you have a discriminator.

0:10:52.280 --> 0:10:55.600
<v Speaker 1>Both of these systems are a type of neural network.

0:10:56.000 --> 0:10:59.600
<v Speaker 1>A neural network is a computing model that is inspired

0:10:59.640 --> 0:11:03.960
<v Speaker 1>by the way our brains work. Our brains contain billions

0:11:03.960 --> 0:11:08.319
<v Speaker 1>of neurons, and these neurons work together, communicating through electrical

0:11:08.360 --> 0:11:13.080
<v Speaker 1>and chemical signals, controlling and coordinating pretty much everything in

0:11:13.120 --> 0:11:18.440
<v Speaker 1>our bodies. With computers, the neurons are nodes. The job

0:11:18.559 --> 0:11:21.720
<v Speaker 1>of a node is, you know, supposed to be kind

0:11:21.720 --> 0:11:24.400
<v Speaker 1>of like a neuron cell in the brain. It's to

0:11:24.520 --> 0:11:29.200
<v Speaker 1>take in multiple weighted input values and then generate a

0:11:29.320 --> 0:11:34.160
<v Speaker 1>single output value. Now, the word weighted w E I

0:11:34.320 --> 0:11:37.920
<v Speaker 1>G H T E D weighted is really important here

0:11:37.960 --> 0:11:42.160
<v Speaker 1>because the larger an input's weight, the more that input

0:11:42.280 --> 0:11:45.679
<v Speaker 1>will have an effect on whatever the output is. So

0:11:45.720 --> 0:11:48.720
<v Speaker 1>it kind of comes down to which inputs are the

0:11:48.800 --> 0:11:52.440
<v Speaker 1>most important for that node's particular function. Now, if I

0:11:52.480 --> 0:11:55.720
<v Speaker 1>were to make an analogy, I would say, your boss

0:11:55.840 --> 0:11:59.560
<v Speaker 1>hands you three tasks to do. One of those tasks

0:11:59.600 --> 0:12:03.840
<v Speaker 1>has the label extremely important, and the second task has

0:12:03.920 --> 0:12:08.120
<v Speaker 1>the label critically important, and the third task has a

0:12:08.200 --> 0:12:10.200
<v Speaker 1>label saying you should have finished that one before it

0:12:10.280 --> 0:12:13.240
<v Speaker 1>was handed to you. Okay, so that's just some sort

0:12:13.280 --> 0:12:15.719
<v Speaker 1>of snarky office humor that I need to get off

0:12:15.720 --> 0:12:20.200
<v Speaker 1>my chest. But more seriously, imagine a node accepting three inputs.

0:12:20.200 --> 0:12:24.679
<v Speaker 1>In this example, input one has a fifty percent weight, input

0:12:24.760 --> 0:12:28.320
<v Speaker 1>two has a forty percent weight, and input three has a ten

0:12:28.440 --> 0:12:32.040
<v Speaker 1>percent weight. That adds up to one hundred percent, and that would tell

0:12:32.080 --> 0:12:35.520
<v Speaker 1>you that the output that node generates will be most

0:12:35.679 --> 0:12:39.880
<v Speaker 1>affected by input one, followed by input two, and then

0:12:39.880 --> 0:12:43.439
<v Speaker 1>input three would have a smaller effect on whatever the

0:12:43.480 --> 0:12:48.560
<v Speaker 1>output is. Each node applies a nonlinear transformation on the

0:12:48.600 --> 0:12:53.720
<v Speaker 1>input values, again affected by each inputs weight value, and

0:12:53.800 --> 0:12:58.920
<v Speaker 1>that generates the output value. The details of that really

0:12:58.920 --> 0:13:02.360
<v Speaker 1>are not important for our episode. It involves performing changes

0:13:02.360 --> 0:13:06.040
<v Speaker 1>on variables that in turn change the correlation between variables,

0:13:06.040 --> 0:13:08.560
<v Speaker 1>and it gets a bit mathy, and we would get

0:13:08.600 --> 0:13:11.840
<v Speaker 1>lost in the weeds pretty quickly. The important thing to

0:13:11.880 --> 0:13:15.520
<v Speaker 1>remember is that a node within a neural network takes

0:13:15.640 --> 0:13:20.520
<v Speaker 1>in a weighted sum of inputs, then performs a process

0:13:20.559 --> 0:13:25.480
<v Speaker 1>on those inputs before passing the result on as an output.
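
NOTE
Editor's sketch, not part of the episode audio: a minimal Python illustration of the single node described above, which takes a weighted sum of its inputs and then applies a nonlinear step. The sigmoid function, the bias term, and the weights 0.5, 0.4 and 0.1 are assumptions chosen for illustration; the episode does not name a specific function.
```python
import math
def node_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, followed by a nonlinear squashing step."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid is one common nonlinearity (an assumption, not from the episode).
    return 1.0 / (1.0 + math.exp(-weighted_sum))
# Illustrative values only: three inputs with weights 0.5, 0.4 and 0.1.
weights = [0.5, 0.4, 0.1]
baseline = node_output([1.0, 1.0, 1.0], weights)
bump_first = node_output([2.0, 1.0, 1.0], weights)
bump_third = node_output([1.0, 1.0, 2.0], weights)
print(baseline, bump_first, bump_third)
```
Raising input one moves the output further than raising input three by the same amount, because input one carries the larger weight.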

0:13:25.920 --> 0:13:30.319
<v Speaker 1>Then some other node a layer down will accept that output,

0:13:30.640 --> 0:13:33.079
<v Speaker 1>along with outputs from a couple of other nodes one

0:13:33.160 --> 0:13:36.840
<v Speaker 1>layer up, and then we'll perform an operation based on

0:13:36.920 --> 0:13:40.079
<v Speaker 1>those weighted inputs and pass that on to the next layer,

0:13:40.160 --> 0:13:43.480
<v Speaker 1>and so on. So these nodes are in layers, like

0:13:43.600 --> 0:13:47.240
<v Speaker 1>you know, a cake. One layer of nodes processes some inputs,

0:13:47.280 --> 0:13:50.280
<v Speaker 1>they send it on to the next layer of nodes, and

0:13:50.320 --> 0:13:52.240
<v Speaker 1>then that one sends it on to the next one, and the

0:13:52.280 --> 0:13:56.400
<v Speaker 1>next one and so on. This isn't a new idea.
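
NOTE
Editor's sketch, not part of the episode audio: one possible way to picture the layer cake in Python. Each layer's outputs become the next layer's inputs. The network shape (three inputs, two nodes, one output node), the weight values, and the sigmoid function are all hypothetical.
```python
import math
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))
def layer_forward(inputs, weight_rows):
    """Each row of weights belongs to one node in this layer."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, row))) for row in weight_rows]
def network_forward(inputs, layers):
    """Feed the outputs of each layer into the next one, top to bottom."""
    values = inputs
    for weight_rows in layers:
        values = layer_forward(values, weight_rows)
    return values
# Hypothetical network: 3 inputs, then a layer of 2 nodes, then 1 output node.
layers = [
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],
    [[1.0, -1.0]],
]
output = network_forward([1.0, 0.0, 1.0], layers)
print(output)
```
Adding depth in this sketch only means appending more weight tables to the layers list.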

0:13:56.800 --> 0:14:02.160
<v Speaker 1>Computer scientists began theorizing and experimenting with neural network approaches

0:14:02.640 --> 0:14:06.000
<v Speaker 1>as far back as the nineteen fifties with the perceptron,

0:14:06.320 --> 0:14:09.680
<v Speaker 1>which was a hypothetical system that was described by Frank

0:14:09.760 --> 0:14:13.559
<v Speaker 1>Rosenblatt of Cornell University. But it wasn't until the last

0:14:13.640 --> 0:14:17.160
<v Speaker 1>decade that computing power and our ability to handle a

0:14:17.200 --> 0:14:20.520
<v Speaker 1>lot of data reached a point where these sort of

0:14:20.600 --> 0:14:24.480
<v Speaker 1>learning models could really take off. The goal of this

0:14:24.680 --> 0:14:28.440
<v Speaker 1>system is to train it to perform a particular task

0:14:28.920 --> 0:14:33.120
<v Speaker 1>within a certain level of precision. The weights I mentioned

0:14:33.160 --> 0:14:35.880
<v Speaker 1>are adjustable, so you can think of it as teaching

0:14:35.880 --> 0:14:39.760
<v Speaker 1>a system which bits are the most important in order

0:14:39.760 --> 0:14:42.520
<v Speaker 1>to do whatever it is the system is supposed to

0:14:42.520 --> 0:14:45.680
<v Speaker 1>do in order to achieve your task. These are the

0:14:45.680 --> 0:14:49.200
<v Speaker 1>bits that are the most important and therefore should matter

0:14:49.240 --> 0:14:52.040
<v Speaker 1>the most when you weigh a decision. This is a

0:14:52.040 --> 0:14:54.840
<v Speaker 1>bit easier if we talk about a similar system with

0:14:55.080 --> 0:14:59.120
<v Speaker 1>the version of IBM's Watson that played on Jeopardy.

0:14:59.360 --> 0:15:03.080
<v Speaker 1>That system famously was not connected to the Internet. It

0:15:03.200 --> 0:15:06.400
<v Speaker 1>had to rely on all the information that was stored

0:15:06.480 --> 0:15:11.920
<v Speaker 1>within itself. When the system encountered a clue in Jeopardy,

0:15:11.960 --> 0:15:14.760
<v Speaker 1>it would analyze the clue, and then it would reference

0:15:14.800 --> 0:15:17.920
<v Speaker 1>its database to look for possible answers to whatever that

0:15:18.000 --> 0:15:21.920
<v Speaker 1>clue was. The system would weigh those possible answers and

0:15:21.960 --> 0:15:25.240
<v Speaker 1>attempt to determine which, if any, were the most likely

0:15:25.320 --> 0:15:29.040
<v Speaker 1>to be correct. If the certainty was over a certain threshold,

0:15:29.400 --> 0:15:33.320
<v Speaker 1>such as sure, the system would buzz in with its answer.

0:15:33.680 --> 0:15:37.360
<v Speaker 1>If no response rose above that threshold, the system would

0:15:37.400 --> 0:15:40.080
<v Speaker 1>not buzz in. So you could say that Watson was

0:15:40.120 --> 0:15:43.320
<v Speaker 1>playing the game with a best guess sort of approach.
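
NOTE
Editor's sketch, not part of the episode audio: the buzz-in rule described above, written as a small Python function. This is not how IBM's Watson was actually implemented. The candidate answers, the confidence scores, and the 0.8 threshold are hypothetical values for illustration.
```python
def choose_response(candidates, threshold):
    """Return the most confident answer, or None to stay silent."""
    if not candidates:
        return None
    best_answer, best_confidence = max(candidates.items(), key=lambda item: item[1])
    # Only buzz in when the best guess clears the confidence threshold.
    return best_answer if best_confidence > threshold else None
# Hypothetical confidence scores and threshold.
THRESHOLD = 0.8
confident = choose_response({"Toronto": 0.14, "Chicago": 0.91}, THRESHOLD)
unsure = choose_response({"Toronto": 0.30, "Chicago": 0.45}, THRESHOLD)
print(confident)  # Chicago
print(unsure)  # None
```
A higher threshold makes the system buzz in less often but with fewer wrong answers; a lower one does the opposite.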

0:15:43.840 --> 0:15:48.880
<v Speaker 1>Neural networks do essentially that sort of processing. With this

0:15:48.920 --> 0:15:52.440
<v Speaker 1>particular type of approach, we know what we want the

0:15:52.520 --> 0:15:55.480
<v Speaker 1>outcome to be, so we can judge whether or not

0:15:55.560 --> 0:15:59.760
<v Speaker 1>the system was successful. After each attempt, we can adjust

0:16:00.120 --> 0:16:03.800
<v Speaker 1>the weights on the inputs between nodes to refine the decision

0:16:03.840 --> 0:16:07.760
<v Speaker 1>making process to get more accurate results. If the system

0:16:07.840 --> 0:16:11.080
<v Speaker 1>succeeds in its task, we can increase the weights that

0:16:11.160 --> 0:16:15.240
<v Speaker 1>contributed to the system picking the correct answer and thus

0:16:15.480 --> 0:16:21.800
<v Speaker 1>decrease the weights of the inputs that did not contribute to the successful response.

0:16:22.280 --> 0:16:25.880
<v Speaker 1>If the system done messed up and gave the wrong answer,

0:16:26.440 --> 0:16:28.880
<v Speaker 1>then we do the opposite. We look at the inputs

0:16:28.920 --> 0:16:32.880
<v Speaker 1>that contributed to the wrong answer, we diminish their weights,

0:16:33.200 --> 0:16:35.560
<v Speaker 1>and we increase the weights of the other inputs, and

0:16:35.560 --> 0:16:40.120
<v Speaker 1>then we run the test again a lot. I'll explain

0:16:40.320 --> 0:16:42.760
<v Speaker 1>a bit more about this process when we come back,

0:16:42.800 --> 0:16:54.200
<v Speaker 1>but first let's take a quick break. Early in the

0:16:54.320 --> 0:16:58.680
<v Speaker 1>history of neural networks, computer scientists were hitting some pretty

0:16:58.760 --> 0:17:02.240
<v Speaker 1>hard stops due to the limitations of computing power at

0:17:02.280 --> 0:17:06.040
<v Speaker 1>the time. Early networks were only a couple of layers deep,

0:17:06.119 --> 0:17:08.800
<v Speaker 1>which really meant they weren't terribly powerful, and they could

0:17:08.800 --> 0:17:12.560
<v Speaker 1>only tackle rudimentary tasks like figuring out whether or not

0:17:12.600 --> 0:17:16.679
<v Speaker 1>a square is drawn on a piece of paper that

0:17:17.200 --> 0:17:23.320
<v Speaker 1>isn't terribly sophisticated. In nineteen eighty six, David Rumelhart, Geoffrey Hinton, and

0:17:23.480 --> 0:17:28.520
<v Speaker 1>Ronald Williams published a paper titled Learning Representations by Back

0:17:28.640 --> 0:17:34.159
<v Speaker 1>Propagating Errors. This was a big breakthrough with deep learning.

0:17:34.760 --> 0:17:36.960
<v Speaker 1>This all has to do with a deep learning system

0:17:37.000 --> 0:17:40.200
<v Speaker 1>improving its ability to complete a specific task. And basically

0:17:40.240 --> 0:17:43.679
<v Speaker 1>the algorithm's job is to go from the output layer,

0:17:43.920 --> 0:17:46.800
<v Speaker 1>you know, where the system has made a decision, and

0:17:46.840 --> 0:17:50.480
<v Speaker 1>then work backward through the neural network, adjusting the weights

0:17:50.520 --> 0:17:55.960
<v Speaker 1>that led to an incorrect decision. So let's say it's

0:17:56.040 --> 0:17:59.520
<v Speaker 1>a system that is looking to figure out whether or

0:17:59.560 --> 0:18:02.760
<v Speaker 1>not a cat is in a photograph, and it says,

0:18:02.960 --> 0:18:05.320
<v Speaker 1>there's a cat in this picture, and you look at

0:18:05.320 --> 0:18:08.159
<v Speaker 1>the picture and there is no cat there. Then you

0:18:08.160 --> 0:18:12.439
<v Speaker 1>would look at the inputs one level back just before

0:18:12.480 --> 0:18:15.080
<v Speaker 1>the system said here's a picture of a cat, and

0:18:15.119 --> 0:18:17.520
<v Speaker 1>you'd say, all right, which of these inputs led the

0:18:17.520 --> 0:18:20.760
<v Speaker 1>system to believe this was a picture of a cat?

0:18:21.160 --> 0:18:23.639
<v Speaker 1>And then you would adjust those Then you would go

0:18:23.840 --> 0:18:27.720
<v Speaker 1>back one layer up, So you're working your way up

0:18:27.920 --> 0:18:31.919
<v Speaker 1>the model and say which inputs here led to it

0:18:32.119 --> 0:18:36.240
<v Speaker 1>giving the outputs that led to the mistake, and you

0:18:36.320 --> 0:18:39.640
<v Speaker 1>do this all the way up until you get up

0:18:39.640 --> 0:18:42.800
<v Speaker 1>to the input level at the top of the computer model.
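
NOTE
Editor's sketch, not part of the episode audio: a toy Python version of the backward walk described above, for a network with one hidden layer. It starts from the error at the output, assigns blame one layer back, and nudges the weights. The squared-error loss, sigmoid nodes, learning rate, starting weights, and the single training example are all assumptions for illustration.
```python
import math
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))
def forward(x, w_hidden, w_out):
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x, row))) for row in w_hidden]
    output = sigmoid(sum(h * w for h, w in zip(hidden, w_out)))
    return hidden, output
def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One update: start from the output error and work back toward the inputs."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output node (squared-error loss, sigmoid derivative).
    delta_out = (output - target) * output * (1.0 - output)
    # Pass the blame back to each hidden node through the weight that fed the output.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    # Nudge every weight against its share of the error.
    new_w_out = [w - lr * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [[w - lr * d * xi for w, xi in zip(row, x)] for row, d in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out
# Hypothetical case: the network leans toward "cat", but the true label is 0 (no cat).
x, target = [1.0, 0.5], 0.0
w_hidden, w_out = [[0.4, -0.2], [0.3, 0.8]], [0.7, 0.6]
_, before = forward(x, w_hidden, w_out)
for _ in range(200):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)
print(before, after)
```
After repeated updates the output for this example drifts toward the correct label, which is the run-the-test-again loop in miniature.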

0:18:42.840 --> 0:18:46.000
<v Speaker 1>You are back propagating, and then you run the test

0:18:46.040 --> 0:18:50.760
<v Speaker 1>again to see if you've got improvement. It's exhaustive, but

0:18:50.840 --> 0:18:56.080
<v Speaker 1>it's also drastically improved neural network performance, much faster than

0:18:56.160 --> 0:18:59.920
<v Speaker 1>just throwing more brute force at it. The algorithm is

0:19:00.080 --> 0:19:02.439
<v Speaker 1>essentially checking to see if a small change in

0:19:02.520 --> 0:19:06.520
<v Speaker 1>each input value received by a layer of nodes would

0:19:06.560 --> 0:19:08.679
<v Speaker 1>have led to a more accurate result. So it's all

0:19:08.720 --> 0:19:11.960
<v Speaker 1>about going from that output working your way backward. In

0:19:12.040 --> 0:19:15.520
<v Speaker 1>two thousand twelve, Alex Krizhevsky published a paper that gave

0:19:15.600 --> 0:19:19.320
<v Speaker 1>us the next big breakthrough. He argued that a really

0:19:19.520 --> 0:19:23.080
<v Speaker 1>deep neural network with a lot of layers could give

0:19:23.200 --> 0:19:26.359
<v Speaker 1>really great results if you paired it with enough data

0:19:26.440 --> 0:19:29.800
<v Speaker 1>to train the system. So you needed to throw lots

0:19:29.840 --> 0:19:33.680
<v Speaker 1>of data at these models, and it needed to be

0:19:33.760 --> 0:19:37.760
<v Speaker 1>an enormous amount of data. However, once trained, the system

0:19:37.840 --> 0:19:40.880
<v Speaker 1>would produce lower error rates. So yeah, it would take

0:19:40.880 --> 0:19:43.640
<v Speaker 1>a long time, but you would get better results. Now,

0:19:43.680 --> 0:19:46.439
<v Speaker 1>at the time, a good error rate for such a

0:19:46.480 --> 0:19:51.480
<v Speaker 1>system was twenty five percent. That means one out of four conclusions the

0:19:51.560 --> 0:19:54.480
<v Speaker 1>system would come to would be wrong. If you ran

0:19:54.560 --> 0:19:58.400
<v Speaker 1>it across a long enough number of decisions, you would

0:19:58.400 --> 0:20:02.240
<v Speaker 1>find that one out of every four wasn't right. The

0:20:02.320 --> 0:20:05.959
<v Speaker 1>system that Alex's team worked on produced results that had

0:20:06.000 --> 0:20:09.399
<v Speaker 1>an error rate of sixteen percent, so much lower. And

0:20:09.440 --> 0:20:13.879
<v Speaker 1>then in just five years, with more improvements to this process,

0:20:14.280 --> 0:20:18.080
<v Speaker 1>the classification error rate had dropped down to two point

0:20:18.320 --> 0:20:22.800
<v Speaker 1>three percent for deep learning systems. So from twenty five to two

0:20:22.880 --> 0:20:27.560
<v Speaker 1>point three percent, it was really powerful stuff. Okay, so

0:20:27.720 --> 0:20:31.879
<v Speaker 1>you've got your artificial neural network. You've got your layers

0:20:31.960 --> 0:20:35.760
<v Speaker 1>and layers of nodes. You've adjusted the weights of the

0:20:35.800 --> 0:20:39.719
<v Speaker 1>inputs into each node to see if your system can identify,

0:20:40.119 --> 0:20:44.960
<v Speaker 1>you know, pictures of cats, and you start feeding images

0:20:45.040 --> 0:20:48.879
<v Speaker 1>to this system, lots of them. This is the domain

0:20:49.080 --> 0:20:51.360
<v Speaker 1>that you are feeding to your system. The more images

0:20:51.400 --> 0:20:53.520
<v Speaker 1>you can feed to it, the better. And you want

0:20:53.520 --> 0:20:55.840
<v Speaker 1>a wide variety of images of all sorts of stuff,

0:20:56.240 --> 0:20:58.800
<v Speaker 1>not just of different types of cats, but stuff that

0:20:58.920 --> 0:21:03.400
<v Speaker 1>most certainly is not a cat, like dogs, or cars

0:21:03.520 --> 0:21:06.760
<v Speaker 1>or chartered public accountants. You name it, and you look

0:21:06.840 --> 0:21:10.520
<v Speaker 1>to see which images the system identifies correctly and which

0:21:10.560 --> 0:21:14.040
<v Speaker 1>ones it screws up, both images it says have cats in

0:21:14.080 --> 0:21:17.880
<v Speaker 1>them that actually don't have cats in them, or images

0:21:17.920 --> 0:21:20.760
<v Speaker 1>the system has identified as saying there is no cat here,

0:21:20.960 --> 0:21:23.880
<v Speaker 1>but there is a cat there. This guides you into

0:21:23.920 --> 0:21:27.520
<v Speaker 1>adjusting the weights again and again, and you start over

0:21:27.560 --> 0:21:29.440
<v Speaker 1>and you do it again, and that's your basic deep

0:21:29.520 --> 0:21:33.000
<v Speaker 1>learning system, and it gets better over time as you

0:21:33.080 --> 0:21:36.399
<v Speaker 1>train it. It learns. Now, let's transition over to the

0:21:36.440 --> 0:21:40.439
<v Speaker 1>adversarial systems I mentioned earlier, because they take this and

0:21:40.480 --> 0:21:45.560
<v Speaker 1>twist it a little bit. So you've got two artificial

0:21:45.720 --> 0:21:49.520
<v Speaker 1>neural networks and they are using this general approach to

0:21:49.720 --> 0:21:53.400
<v Speaker 1>deep learning, and you're setting them up so that they

0:21:53.440 --> 0:21:58.000
<v Speaker 1>feed into each other. One network, the generator, has the

0:21:58.040 --> 0:22:01.919
<v Speaker 1>task to learn how to do something such as create

0:22:01.960 --> 0:22:05.919
<v Speaker 1>an eighteenth century style portrait based off lots and lots

0:22:06.000 --> 0:22:09.600
<v Speaker 1>of examples of the real thing: the domain, the problem

0:22:09.960 --> 0:22:14.760
<v Speaker 1>domain. The second network, the discriminator, has a different job.

0:22:15.359 --> 0:22:18.800
<v Speaker 1>It has to tell the difference between authentic portraits that

0:22:19.040 --> 0:22:23.960
<v Speaker 1>came from the problem domain and computer generated portraits that

0:22:24.040 --> 0:22:27.919
<v Speaker 1>came from the generator itself. So essentially, the discriminator is

0:22:28.000 --> 0:22:31.199
<v Speaker 1>like the model I mentioned earlier that was identifying pictures

0:22:31.200 --> 0:22:33.320
<v Speaker 1>of cats. It's doing the same sort of thing, except

0:22:33.359 --> 0:22:36.600
<v Speaker 1>instead of saying cat or no cat, it's saying real

0:22:36.760 --> 0:22:40.600
<v Speaker 1>portrait or computer generated portrait. So there are essentially two

0:22:40.600 --> 0:22:44.359
<v Speaker 1>outcomes the discriminator could reach, and that's whether or not

0:22:44.440 --> 0:22:48.119
<v Speaker 1>an image is computer generated or it isn't. So do you

0:22:48.119 --> 0:22:51.680
<v Speaker 1>see where this is going? You train up both models.

0:22:52.119 --> 0:22:54.879
<v Speaker 1>You have the generator attempt to make its own version

0:22:54.960 --> 0:22:58.400
<v Speaker 1>of something such as that eighteenth century portrait. It does

0:22:58.440 --> 0:23:01.119
<v Speaker 1>so. It designs the portrait based on what the

0:23:01.160 --> 0:23:05.720
<v Speaker 1>model believes are the key elements of a portrait, so

0:23:05.920 --> 0:23:10.679
<v Speaker 1>things like colors, shapes, the ratio of size, like you know,

0:23:10.720 --> 0:23:13.720
<v Speaker 1>how large should the head be in relation to the body.

0:23:13.760 --> 0:23:17.960
<v Speaker 1>All of these factors and many more come into play.

0:23:18.119 --> 0:23:22.399
<v Speaker 1>The generator creates its own idea of what a portrait

0:23:22.520 --> 0:23:25.159
<v Speaker 1>is supposed to look like, and chances are the early

0:23:25.240 --> 0:23:29.879
<v Speaker 1>rounds of this will not be terribly convincing. The results

0:23:30.040 --> 0:23:33.280
<v Speaker 1>are then fed to the discriminator, which tries to suss

0:23:33.320 --> 0:23:36.359
<v Speaker 1>out which of the images fed to it are computer

0:23:36.480 --> 0:23:40.360
<v Speaker 1>generated and which ones aren't. After that round, both models

0:23:40.600 --> 0:23:45.480
<v Speaker 1>are tweaked. The generator adjusts input weights to get closer

0:23:45.560 --> 0:23:49.159
<v Speaker 1>to the genuine article, and the discriminator adjusts weights to

0:23:49.320 --> 0:23:53.320
<v Speaker 1>reduce false positives or to catch computer generated images. And

0:23:53.359 --> 0:23:57.560
<v Speaker 1>then you go again and again and again and again,

0:23:57.840 --> 0:24:01.479
<v Speaker 1>and they both get better over time. So, assuming everything

0:24:01.560 --> 0:24:04.840
<v Speaker 1>is working properly, over time, the adjustment of input weights

0:24:04.880 --> 0:24:08.320
<v Speaker 1>will lead to more convincing results, and given enough time

0:24:08.520 --> 0:24:11.480
<v Speaker 1>and enough repetition, you'll end up with a computer generated

0:24:11.520 --> 0:24:13.879
<v Speaker 1>painting that you can auction off for nearly half a

0:24:13.960 --> 0:24:18.479
<v Speaker 1>million dollars. Though keep in mind that huge price relates

0:24:18.520 --> 0:24:21.720
<v Speaker 1>back to the novelty of it being an early AI

0:24:21.760 --> 0:24:25.399
<v Speaker 1>generated painting. It would be shocking to me if we

0:24:25.480 --> 0:24:29.400
<v Speaker 1>saw that actually become a trend. Also, the painting, while interesting,

0:24:29.880 --> 0:24:32.760
<v Speaker 1>isn't exactly so astounding as to make you think there's

0:24:32.800 --> 0:24:35.399
<v Speaker 1>no way a machine did that. You'd look at it

0:24:35.400 --> 0:24:38.160
<v Speaker 1>and go, yeah, I can imagine a machine did that one.

0:24:38.840 --> 0:24:43.160
<v Speaker 1>A group of computer scientists first described the generative adversarial

0:24:43.200 --> 0:24:46.040
<v Speaker 1>network architecture in a paper in two thousand and fourteen,

0:24:46.640 --> 0:24:49.840
<v Speaker 1>and like other neural networks, these models require a lot

0:24:49.880 --> 0:24:52.480
<v Speaker 1>of data. The more the better. In fact, smaller data

0:24:52.480 --> 0:24:56.159
<v Speaker 1>sets mean the models have to make some pretty big assumptions,

0:24:56.720 --> 0:25:00.440
<v Speaker 1>and you tend to get pretty lousy results. More data,

0:25:00.600 --> 0:25:03.879
<v Speaker 1>as in more examples, teaches the models more about the

0:25:03.920 --> 0:25:07.119
<v Speaker 1>parameters of the domain, whatever it is they are trying

0:25:07.160 --> 0:25:10.560
<v Speaker 1>to generate. It refines the approach. So if you have

0:25:10.600 --> 0:25:13.280
<v Speaker 1>a sophisticated enough pair of models and you have enough

0:25:13.400 --> 0:25:16.280
<v Speaker 1>data to fill up a domain, you can generate some

0:25:16.440 --> 0:25:20.520
<v Speaker 1>convincing material, and that includes video. And this brings us

0:25:20.560 --> 0:25:26.240
<v Speaker 1>around to deep fakes. And in addition to generative adversarial networks,

0:25:26.280 --> 0:25:31.400
<v Speaker 1>a couple of other things really converged to create the

0:25:31.480 --> 0:25:35.040
<v Speaker 1>techniques and trends and technology that would allow for deep

0:25:35.040 --> 0:25:42.040
<v Speaker 1>fakes proper. In nineteen ninety seven, Malcolm Slaney, Michele Covell, and Christoph Bregler

0:25:42.520 --> 0:25:46.680
<v Speaker 1>wrote some software that they called the Video Rewrite Program.

0:25:46.680 --> 0:25:50.959
<v Speaker 1>The software would analyze faces and then create or synthesize

0:25:51.240 --> 0:25:55.920
<v Speaker 1>lip animation which could be matched to pre recorded audio.

0:25:56.080 --> 0:25:59.480
<v Speaker 1>So you could take some film footage of a person

0:25:59.720 --> 0:26:03.439
<v Speaker 1>and reanimate their lips so that they could appear

0:26:03.480 --> 0:26:06.000
<v Speaker 1>to say all sorts of things, which in some ways

0:26:06.119 --> 0:26:09.439
<v Speaker 1>set the stage for deep fakes. In this case, it was

0:26:09.480 --> 0:26:12.840
<v Speaker 1>really just focusing on the lips and the general area

0:26:12.920 --> 0:26:16.560
<v Speaker 1>around the lips, so you weren't changing the rest of

0:26:16.600 --> 0:26:19.560
<v Speaker 1>the expression of the face, and you would have to,

0:26:20.160 --> 0:26:23.520
<v Speaker 1>you know, keep your recording to be about the same

0:26:23.600 --> 0:26:25.960
<v Speaker 1>length as whatever the film clip was, or you would

0:26:25.960 --> 0:26:28.080
<v Speaker 1>have to loop the film clip over and over,

0:26:28.080 --> 0:26:30.320
<v Speaker 1>which would make it, you know, far more obvious that

0:26:30.440 --> 0:26:35.000
<v Speaker 1>this was a fake. In addition, motion tracking technology was

0:26:35.040 --> 0:26:37.720
<v Speaker 1>advancing over time too, and this also became an important

0:26:37.760 --> 0:26:41.080
<v Speaker 1>tool in computer animation. This tool would also be used

0:26:41.400 --> 0:26:45.800
<v Speaker 1>by deep fake algorithms to create facial expressions, manipulating the

0:26:45.840 --> 0:26:48.760
<v Speaker 1>digital image just as it would if it were a

0:26:48.840 --> 0:26:53.199
<v Speaker 1>video game character or a Pixar animated character. Typically, you

0:26:53.280 --> 0:26:56.439
<v Speaker 1>need to start with some existing video in order to

0:26:56.480 --> 0:27:00.720
<v Speaker 1>manipulate it. You're not actually computer generating the animation, like,

0:27:00.760 --> 0:27:05.720
<v Speaker 1>you're not creating a computer generated version of whomever it

0:27:05.840 --> 0:27:11.119
<v Speaker 1>is you're doing the fake of. You're using existing imagery

0:27:11.200 --> 0:27:13.880
<v Speaker 1>in order to do that and then manipulating that existing imagery,

0:27:14.000 --> 0:27:17.719
<v Speaker 1>So it's a little different from computer animation. In two

0:27:17.760 --> 0:27:21.440
<v Speaker 1>thousand and sixteen, students and faculty at the Technical University

0:27:21.440 --> 0:27:25.600
<v Speaker 1>of Munich created the face to face project that would

0:27:25.600 --> 0:27:30.040
<v Speaker 1>be face, the numeral two, and then face, and this

0:27:30.119 --> 0:27:33.120
<v Speaker 1>was particularly jaw dropping to me at the time when

0:27:33.119 --> 0:27:37.440
<v Speaker 1>I first saw these videos back in twenty sixteen. I was floored.

0:27:37.920 --> 0:27:41.480
<v Speaker 1>They created a system that had a target actor. This

0:27:41.520 --> 0:27:44.120
<v Speaker 1>would be the video of the person that you want

0:27:44.160 --> 0:27:47.440
<v Speaker 1>to manipulate. In the example they used, it was former

0:27:47.520 --> 0:27:52.240
<v Speaker 1>US President George W. Bush. Their process also had a

0:27:52.320 --> 0:27:56.880
<v Speaker 1>source actor. This was the source of the expressions and

0:27:56.920 --> 0:28:00.400
<v Speaker 1>facial movements you would see in the target. So kind

0:28:00.440 --> 0:28:03.679
<v Speaker 1>of like a digital puppeteer in a way, but the

0:28:03.680 --> 0:28:05.720
<v Speaker 1>way they did it was really cool. They had a

0:28:05.760 --> 0:28:09.840
<v Speaker 1>camera trained on the source actor and it would track

0:28:09.960 --> 0:28:13.840
<v Speaker 1>specific points of movement on the source actor's face, and

0:28:13.880 --> 0:28:17.400
<v Speaker 1>then the system would manipulate the same points of movement

0:28:17.600 --> 0:28:21.280
<v Speaker 1>on the target actor's face in the video. So if

0:28:21.320 --> 0:28:25.520
<v Speaker 1>the source actor smiled, then the target smiled, so the

0:28:25.560 --> 0:28:27.960
<v Speaker 1>source actor would smile, and then you would see George W.

0:28:28.160 --> 0:28:31.080
<v Speaker 1>Bush in the video smile in real time. It was

0:28:31.440 --> 0:28:37.040
<v Speaker 1>really strange. They used this looping video of George W.

0:28:37.160 --> 0:28:40.960
<v Speaker 1>Bush wearing a neutral expression. They had to start with

0:28:41.600 --> 0:28:45.360
<v Speaker 1>that as their sort of zero point, and I

0:28:45.400 --> 0:28:48.240
<v Speaker 1>gotta tell you, it really does look like the former

0:28:48.280 --> 0:28:50.400
<v Speaker 1>president George W. Bush is having a bit of a

0:28:50.480 --> 0:28:54.440
<v Speaker 1>freak out on a looping video because he keeps on

0:28:54.600 --> 0:28:59.160
<v Speaker 1>opening his mouth, closing his mouth, grimacing, raising his eyebrows.

0:28:59.440 --> 0:29:02.040
<v Speaker 1>You need to watch this video. It is still available

0:29:02.080 --> 0:29:06.600
<v Speaker 1>online to check out. In twenty seventeen, students and faculty over

0:29:06.600 --> 0:29:10.800
<v Speaker 1>at the University of Washington created the Synthesizing Obama project,

0:29:11.080 --> 0:29:13.960
<v Speaker 1>in which they trained a computer model to generate a

0:29:14.040 --> 0:29:18.280
<v Speaker 1>synthetic video of former US President Barack Obama, and they

0:29:18.400 --> 0:29:21.800
<v Speaker 1>made it lip sync to a pre recorded audio clip

0:29:22.000 --> 0:29:26.640
<v Speaker 1>from one of Obama's addresses to the nation. They actually

0:29:26.680 --> 0:29:30.440
<v Speaker 1>had the original video of that address for comparison, so

0:29:30.640 --> 0:29:33.920
<v Speaker 1>they could look back at that and see how their

0:29:33.960 --> 0:29:37.680
<v Speaker 1>generated one compared to the real thing. And their approach

0:29:37.920 --> 0:29:41.840
<v Speaker 1>used a model that analyzed hundreds of hours of video

0:29:41.880 --> 0:29:46.840
<v Speaker 1>footage of Obama speaking, and it mapped specific mouth shapes

0:29:47.000 --> 0:29:51.680
<v Speaker 1>to specific sounds. It would also include some of Obama's mannerisms,

0:29:51.720 --> 0:29:53.719
<v Speaker 1>such as how he moves his head when he talks

0:29:53.840 --> 0:29:57.520
<v Speaker 1>or uses facial expressions to emphasize words. And watching the

0:29:57.640 --> 0:30:01.600
<v Speaker 1>video with the real one next to the

0:30:01.640 --> 0:30:05.840
<v Speaker 1>generated one is pretty strange. You can tell the generated

0:30:05.880 --> 0:30:09.960
<v Speaker 1>one isn't quite right. It's not matching the audio exactly,

0:30:10.240 --> 0:30:14.720
<v Speaker 1>at least not on the early versions, but it's fairly

0:30:14.800 --> 0:30:17.720
<v Speaker 1>close and it might even pass casual inspection for a

0:30:17.760 --> 0:30:20.280
<v Speaker 1>lot of people who weren't, like, you know, actually paying attention.

0:30:20.920 --> 0:30:26.120
<v Speaker 1>Authors Maras and Alexandrou defined deep fakes as, quote, the

0:30:26.160 --> 0:30:31.480
<v Speaker 1>product of artificial intelligence applications that merge, combine, replace, and

0:30:31.600 --> 0:30:35.719
<v Speaker 1>superimpose images and video clips to create fake videos that

0:30:35.760 --> 0:30:41.280
<v Speaker 1>appear authentic, end quote. They first emerged in twenty seventeen, and

0:30:41.360 --> 0:30:45.040
<v Speaker 1>so this is a pretty darn young application of technology.

0:30:45.680 --> 0:30:48.880
<v Speaker 1>One thing that is worrisome is that once someone has

0:30:48.920 --> 0:30:52.640
<v Speaker 1>access to the tools, it's not that difficult to create

0:30:52.720 --> 0:30:55.760
<v Speaker 1>a deep fake video. You pretty much just need a

0:30:55.800 --> 0:30:59.560
<v Speaker 1>decent computer, the tools, a bit of know how on

0:30:59.640 --> 0:31:02.840
<v Speaker 1>how to do it, and some time. You also need

0:31:03.000 --> 0:31:06.720
<v Speaker 1>some reference material, as in like videos and images of

0:31:06.760 --> 0:31:10.560
<v Speaker 1>the person that you are replicating, and like the machine

0:31:10.640 --> 0:31:13.960
<v Speaker 1>learning systems I've mentioned, the more reference material you have,

0:31:14.200 --> 0:31:17.480
<v Speaker 1>the better. That's why the deep fakes you encounter these

0:31:17.560 --> 0:31:21.560
<v Speaker 1>days tend to be of notable famous people like celebrities

0:31:21.560 --> 0:31:25.560
<v Speaker 1>and politicians, mainly. There's no shortage of reference material for

0:31:25.600 --> 0:31:28.960
<v Speaker 1>those types of individuals, and so they are easier to

0:31:29.000 --> 0:31:32.360
<v Speaker 1>replicate with deep fakes than someone who maintains a much

0:31:32.560 --> 0:31:35.520
<v Speaker 1>lower profile. Not to say that that will always be

0:31:35.600 --> 0:31:38.160
<v Speaker 1>the case, or that there aren't systems out there that

0:31:38.240 --> 0:31:43.680
<v Speaker 1>can accept smaller amounts of reference material. It's just harder

0:31:43.720 --> 0:31:50.200
<v Speaker 1>to make a convincing version with fewer samples. But in

0:31:50.320 --> 0:31:53.760
<v Speaker 1>order to make a convincing fake, the system really has

0:31:53.800 --> 0:31:57.920
<v Speaker 1>to learn how a person moves. All those facial expressions matter.

0:31:58.160 --> 0:32:01.200
<v Speaker 1>It also has to learn how a person sounds. We'll

0:32:01.240 --> 0:32:07.240
<v Speaker 1>get into sound a little bit later. But mannerisms, inflection, accent, emphasis, cadence,

0:32:07.360 --> 0:32:09.920
<v Speaker 1>quirks and tics, all of these things have to be

0:32:09.960 --> 0:32:13.760
<v Speaker 1>analyzed and replicated to make a convincing fake, and it

0:32:13.800 --> 0:32:16.120
<v Speaker 1>has to be done just right or else it comes

0:32:16.120 --> 0:32:20.960
<v Speaker 1>off as creepy or unrealistic. Think about how impressionists will

0:32:21.000 --> 0:32:24.600
<v Speaker 1>take a celebrity's manner of speech and then heighten some

0:32:24.720 --> 0:32:28.200
<v Speaker 1>of it for comedic effect. You'll hear it all the time

0:32:28.240 --> 0:32:31.240
<v Speaker 1>with folks who do impressions of people like Jack Nicholson

0:32:31.440 --> 0:32:35.400
<v Speaker 1>or Christopher Walken or Barbra Streisand, people who have a

0:32:35.520 --> 0:32:40.040
<v Speaker 1>very particular way of speaking. Impressionists will take those as

0:32:40.160 --> 0:32:43.680
<v Speaker 1>markers and they really punch in on them. Well, a

0:32:43.760 --> 0:32:46.520
<v Speaker 1>deep fake can't really do that too much, or else

0:32:46.560 --> 0:32:48.840
<v Speaker 1>it won't come across as genuine. It'll feel like you're

0:32:48.840 --> 0:32:54.040
<v Speaker 1>watching a famous person impersonating themselves, which is weird. Now.

0:32:54.040 --> 0:32:56.520
<v Speaker 1>The earliest mention of deep fakes I can find dates

0:32:56.560 --> 0:32:59.880
<v Speaker 1>to a two thousand seventeen Reddit forum in which a

0:33:00.080 --> 0:33:03.960
<v Speaker 1>user shared deep faked videos that appeared to show female

0:33:03.960 --> 0:33:09.000
<v Speaker 1>celebrities in sexual situations. Heads and faces had been replaced,

0:33:09.240 --> 0:33:13.000
<v Speaker 1>and the actors in pornographic movies had their heads or

0:33:13.040 --> 0:33:17.200
<v Speaker 1>faces swapped out for these various celebrities. Now the fakes

0:33:17.360 --> 0:33:22.680
<v Speaker 1>can look fairly convincing, extremely convincing in some cases, which

0:33:22.880 --> 0:33:25.760
<v Speaker 1>can lead to some people assuming that the videos are

0:33:25.760 --> 0:33:29.160
<v Speaker 1>genuine and that the folks that they saw in the

0:33:29.240 --> 0:33:32.160
<v Speaker 1>videos are really the ones who were in it. And

0:33:32.320 --> 0:33:35.680
<v Speaker 1>obviously that's a real problem, right? I mean, with this

0:33:35.760 --> 0:33:40.080
<v Speaker 1>technology, if given enough reference data to feed a system, someone could

0:33:40.160 --> 0:33:43.040
<v Speaker 1>fabricate a video that appears to put a person in

0:33:43.080 --> 0:33:47.040
<v Speaker 1>a compromising position, whether it's a sexual act or making

0:33:47.120 --> 0:33:50.720
<v Speaker 1>damaging statements or committing a crime or whatever. And there

0:33:50.720 --> 0:33:52.800
<v Speaker 1>are tools right now that allow you to do pretty

0:33:52.880 --> 0:33:55.720
<v Speaker 1>much what the face to face tool was doing back

0:33:55.720 --> 0:33:59.320
<v Speaker 1>in two thousand sixteen. A program called Avatarify,

0:34:00.160 --> 0:34:04.280
<v Speaker 1>just not that easy to say. Anyway, it can run

0:34:04.280 --> 0:34:08.400
<v Speaker 1>on top of live streaming conference services like Zoom and Skype,

0:34:08.719 --> 0:34:12.160
<v Speaker 1>and you can swap out your face for a celebrity's face.

0:34:12.719 --> 0:34:17.920
<v Speaker 1>Your facial expressions map to the computer manipulated celebrity face.

0:34:18.880 --> 0:34:21.680
<v Speaker 1>It just looks at you through your webcam, and then

0:34:21.719 --> 0:34:25.160
<v Speaker 1>if you smile, the celebrity image smiles, et cetera. It's

0:34:25.239 --> 0:34:27.879
<v Speaker 1>like that old face to face program. It does need

0:34:28.000 --> 0:34:32.279
<v Speaker 1>a pretty beefy PC to manage doing all this because

0:34:32.280 --> 0:34:35.680
<v Speaker 1>you're also running that live streaming service underneath it. It's

0:34:35.719 --> 0:34:39.640
<v Speaker 1>also not exactly user friendly. You need some programming experience

0:34:39.640 --> 0:34:43.719
<v Speaker 1>to really get it to work. But it is widely accessible,

0:34:44.120 --> 0:34:48.160
<v Speaker 1>as the source code is open source and it's

0:34:48.200 --> 0:34:51.840
<v Speaker 1>on GitHub, so anyone can get it. Samantha Cole,

0:34:52.120 --> 0:34:55.240
<v Speaker 1>who writes for Vice, has covered the topic of deep

0:34:55.280 --> 0:34:58.760
<v Speaker 1>fakes pretty extensively and the potential harm they can cause,

0:34:59.160 --> 0:35:01.680
<v Speaker 1>and I recommend you check out her work if you're

0:35:01.719 --> 0:35:05.560
<v Speaker 1>interested in learning more about that. Do be warned that

0:35:05.640 --> 0:35:09.640
<v Speaker 1>Cole covers some pretty adult themed topics and I think

0:35:09.640 --> 0:35:13.480
<v Speaker 1>she does great work and very important work. But as

0:35:13.520 --> 0:35:15.480
<v Speaker 1>a guy who grew up in the Deep South, it's

0:35:15.520 --> 0:35:17.840
<v Speaker 1>also the kind of stuff that occasionally makes me clutch

0:35:17.920 --> 0:35:20.400
<v Speaker 1>my pearls, But that's more of a statement about me

0:35:20.800 --> 0:35:24.880
<v Speaker 1>than her work. She does great work. I think most

0:35:24.920 --> 0:35:28.040
<v Speaker 1>of us can imagine plenty of scenarios in which this

0:35:28.080 --> 0:35:31.440
<v Speaker 1>sort of technology could cause mischief on a good day

0:35:31.520 --> 0:35:35.680
<v Speaker 1>and catastrophe on a bad day, whether it's spreading misinformation,

0:35:35.960 --> 0:35:41.040
<v Speaker 1>creating fear, uncertainty, and doubt, FUD, or by making people

0:35:41.160 --> 0:35:44.560
<v Speaker 1>seem to say things they never actually said, or contributing

0:35:44.600 --> 0:35:47.359
<v Speaker 1>to an ugly subculture in which people try to make

0:35:47.400 --> 0:35:51.480
<v Speaker 1>their more base fantasies a reality by putting one person's

0:35:51.520 --> 0:35:54.160
<v Speaker 1>head on another person's body. You know, it's not great.

0:35:54.760 --> 0:35:57.840
<v Speaker 1>There are legitimate uses of the technology too, of course,

0:35:58.120 --> 0:36:01.200
<v Speaker 1>you know, tech itself is rarely good or bad. It's

0:36:01.320 --> 0:36:04.640
<v Speaker 1>all in how we use it. But this particular technology

0:36:04.680 --> 0:36:07.719
<v Speaker 1>has a lot of potentially harmful uses, and Samantha Cole

0:36:07.760 --> 0:36:10.799
<v Speaker 1>has done a great job explaining them. When we come back,

0:36:11.000 --> 0:36:13.799
<v Speaker 1>I'll talk a bit more about the war against deep

0:36:13.880 --> 0:36:16.200
<v Speaker 1>fakes and how people are trying to prepare for a

0:36:16.239 --> 0:36:20.840
<v Speaker 1>world that is increasingly filled with media we can't really trust.

0:36:21.360 --> 0:36:33.240
<v Speaker 1>But first let's take a quick break. Before the break,

0:36:33.680 --> 0:36:37.680
<v Speaker 1>I mentioned Samantha Cole, who has written extensively about deep fakes,

0:36:37.719 --> 0:36:40.480
<v Speaker 1>and one point she makes that I think is important

0:36:40.520 --> 0:36:44.880
<v Speaker 1>for us to note is that the vast majority of

0:36:44.960 --> 0:36:49.600
<v Speaker 1>instances of deep fake videos haven't been some manufactured video

0:36:49.640 --> 0:36:53.960
<v Speaker 1>of a political leader saying inflammatory things. That continues to

0:36:53.960 --> 0:36:57.480
<v Speaker 1>be a big concern. There's a genuine fear that someone

0:36:57.560 --> 0:37:01.040
<v Speaker 1>is going to manufacture a video in which a politician

0:37:01.080 --> 0:37:04.359
<v Speaker 1>appears to say or do something truly terrible in an

0:37:04.360 --> 0:37:08.560
<v Speaker 1>effort to either discredit the politician or perhaps instigate a

0:37:08.680 --> 0:37:13.600
<v Speaker 1>conflict with some other group. There are literal doomsday scenarios

0:37:13.600 --> 0:37:18.440
<v Speaker 1>in which such a video would prompt a massive military response,

0:37:18.719 --> 0:37:21.320
<v Speaker 1>though that does seem like it might be a little

0:37:21.440 --> 0:37:24.239
<v Speaker 1>far fetched, though heck, I don't know, considering the world

0:37:24.239 --> 0:37:26.040
<v Speaker 1>we live in, maybe it's not that big of a

0:37:26.080 --> 0:37:30.640
<v Speaker 1>stretch. Anyway, Cole's point is that so far, that has

0:37:30.800 --> 0:37:34.239
<v Speaker 1>not happened. She points out that the most frequent use

0:37:34.400 --> 0:37:37.160
<v Speaker 1>for the tech either tends to be people goofing around

0:37:37.320 --> 0:37:41.040
<v Speaker 1>or disturbingly using it to in her words, quote, take

0:37:41.080 --> 0:37:45.240
<v Speaker 1>ownership of women's bodies in non consensual porn end quote.

0:37:45.560 --> 0:37:48.759
<v Speaker 1>Cole argues that the reason we haven't really seen deep

0:37:48.760 --> 0:37:52.000
<v Speaker 1>fakes used much outside of these realms, apart from a

0:37:52.040 --> 0:37:56.040
<v Speaker 1>few advertising campaigns, is that people are pretty good at

0:37:56.120 --> 0:37:59.879
<v Speaker 1>spotting deep fakes. They aren't quite at a level where

0:38:00.000 --> 0:38:03.040
<v Speaker 1>they can easily pass for the real thing. There's still

0:38:03.080 --> 0:38:06.399
<v Speaker 1>something slightly off about them. They tend to butt up

0:38:06.440 --> 0:38:09.440
<v Speaker 1>against the uncanny valley. Now, for those of you not

0:38:09.600 --> 0:38:13.520
<v Speaker 1>familiar with that term, the uncanny valley describes the feeling

0:38:13.719 --> 0:38:17.000
<v Speaker 1>we humans get when we encounter a robot or a

0:38:17.040 --> 0:38:23.640
<v Speaker 1>computer generated figure that closely resembles a human or human behavior,

0:38:24.239 --> 0:38:27.760
<v Speaker 1>but you can still tell it's not actually a person,

0:38:28.040 --> 0:38:30.200
<v Speaker 1>and it's not a good feeling. It tends to be

0:38:30.239 --> 0:38:34.120
<v Speaker 1>described as repulsive and disturbing, or at the very best,

0:38:34.640 --> 0:38:39.879
<v Speaker 1>off putting. See also the animated film Polar Express. There's

0:38:39.920 --> 0:38:43.399
<v Speaker 1>a reason that when that film came out, people kind

0:38:43.440 --> 0:38:47.440
<v Speaker 1>of reacted negatively to the animation, and it's also a

0:38:47.480 --> 0:38:51.200
<v Speaker 1>reason why Pixar tends to prefer to go with stylized

0:38:51.280 --> 0:38:54.560
<v Speaker 1>human characters who are different enough from the way real

0:38:54.680 --> 0:38:58.040
<v Speaker 1>humans look to kind of bypass uncanny valley. We just

0:38:58.120 --> 0:39:00.680
<v Speaker 1>think of that as a cartoon, not something that's trying to

0:39:00.760 --> 0:39:04.280
<v Speaker 1>pass itself off as being human. But while there hasn't

0:39:04.320 --> 0:39:06.800
<v Speaker 1>really been a flood of fake videos hitting the Internet

0:39:06.920 --> 0:39:11.200
<v Speaker 1>with the intent to discredit politicians or infuriate specific people

0:39:11.320 --> 0:39:14.720
<v Speaker 1>or whatever, there remains a general sense that this is coming.

0:39:15.239 --> 0:39:18.480
<v Speaker 1>It's just not here now. The sense I get is

0:39:18.480 --> 0:39:21.840
<v Speaker 1>that people feel it's an inevitability, and there are already

0:39:21.880 --> 0:39:24.480
<v Speaker 1>folks working on tools that will help us sort out

0:39:24.480 --> 0:39:29.000
<v Speaker 1>the real stuff from the fakes. Take Microsoft, for example.

0:39:29.520 --> 0:39:34.240
<v Speaker 1>Their R and D division, fittingly called Microsoft Research, developed

0:39:34.239 --> 0:39:38.600
<v Speaker 1>a tool they called the Video Authenticator. This tool analyzes

0:39:38.760 --> 0:39:42.960
<v Speaker 1>video samples and looks for signs of deep fakery. In

0:39:43.000 --> 0:39:45.800
<v Speaker 1>a blog post written by Tom Burt and Eric Horvitz,

0:39:45.840 --> 0:39:50.520
<v Speaker 1>two Microsoft executives, they say, quote, it works by detecting

0:39:50.560 --> 0:39:54.160
<v Speaker 1>the blending boundary of the deep fake and subtle fading

0:39:54.280 --> 0:39:57.120
<v Speaker 1>or gray scale elements that might not be detectable by

0:39:57.120 --> 0:40:01.279
<v Speaker 1>the human eye. End quote. Now I'm no expert, but

0:40:01.480 --> 0:40:05.560
<v Speaker 1>to me, it sounds like the video Authenticator is working

0:40:05.560 --> 0:40:09.720
<v Speaker 1>in a way that's not too dissimilar to a discriminator

0:40:10.040 --> 0:40:14.240
<v Speaker 1>in a generative adversarial network. I mean, the whole purpose

0:40:14.560 --> 0:40:18.000
<v Speaker 1>of the discriminator is to discriminate or to tell the

0:40:18.040 --> 0:40:23.319
<v Speaker 1>difference between genuine, unaltered videos and computer generated ones. So

0:40:23.520 --> 0:40:27.200
<v Speaker 1>the video authenticator is looking for telltale signs that a

0:40:27.320 --> 0:40:32.720
<v Speaker 1>video was not produced through traditional means but was computer generated. However,

0:40:32.760 --> 0:40:36.040
<v Speaker 1>that's the very thing that the generators in GAN

0:40:36.239 --> 0:40:39.080
<v Speaker 1>systems are looking out for. So when a generator

0:40:39.120 --> 0:40:43.760
<v Speaker 1>receives feedback that a video it generated did not slip

0:40:43.800 --> 0:40:47.960
<v Speaker 1>past the discriminator, it then tweaks those input weights and

0:40:48.080 --> 0:40:51.800
<v Speaker 1>starts to shift its approach in order to bypass whatever

0:40:51.840 --> 0:40:54.840
<v Speaker 1>it was that gave away its last attempt, and it

0:40:54.920 --> 0:40:59.440
<v Speaker 1>does this again and again. So the video authenticator might

0:40:59.480 --> 0:41:02.719
<v Speaker 1>work well for a given amount of time, but I

0:41:02.719 --> 0:41:05.319
<v Speaker 1>would suspect that in the long run, the deep fake

0:41:05.400 --> 0:41:10.440
<v Speaker 1>systems will become sophisticated enough to fool the authenticator. Of course,

0:41:10.960 --> 0:41:14.960
<v Speaker 1>Microsoft will continue to tweak the authenticator as well, and

0:41:15.040 --> 0:41:17.919
<v Speaker 1>it will become something of a seesaw battle as one

0:41:18.000 --> 0:41:22.040
<v Speaker 1>side outperforms the other temporarily, and then the balance will shift.
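
NOTE
Editor's sketch, not from the episode: the generator-versus-discriminator feedback described above can be illustrated with a toy one-dimensional GAN in plain Python. The linear generator, the logistic discriminator, the learning rate, and the "real" data distribution are all illustrative assumptions; none of this reflects how Video Authenticator or any production deep fake system is actually built.
```python
import math
import random
#
def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))
#
# Discriminator D(x) = sigmoid(w*x + c): its estimate that sample x is real.
def d_score(d, x):
    w, c = d
    return sigmoid(w * x + c)
#
# Generator G(z) = a*z + b: turns random noise z into a fake sample.
def generate(g, z):
    a, b = g
    return a * z + b
#
def discriminator_step(d, reals, fakes, lr=0.05):
    """One gradient step that pushes D toward scoring reals near 1 and fakes near 0."""
    w, c = d
    gw = gc = 0.0
    for x in reals:  # gradient of -log D(x)
        p = d_score(d, x)
        gw += -(1 - p) * x
        gc += -(1 - p)
    for x in fakes:  # gradient of -log(1 - D(x))
        p = d_score(d, x)
        gw += p * x
        gc += p
    n = len(reals) + len(fakes)
    return (w - lr * gw / n, c - lr * gc / n)
#
def generator_step(g, d, noise, lr=0.05):
    """One gradient step that tweaks G's weights so D scores its fakes closer to 1."""
    a, b = g
    w, _ = d
    ga = gb = 0.0
    for z in noise:  # gradient of -log D(G(z))
        p = d_score(d, generate(g, z))
        ga += -(1 - p) * w * z
        gb += -(1 - p) * w
    n = len(noise)
    return (a - lr * ga / n, b - lr * gb / n)
#
def train(rounds=2000, seed=0):
    """Alternate the two steps: the seesaw between detector and faker."""
    rng = random.Random(seed)
    d, g = (0.1, 0.0), (1.0, 0.0)
    for _ in range(rounds):
        reals = [rng.gauss(4.0, 1.0) for _ in range(32)]  # assumed "genuine" data
        noise = [rng.gauss(0.0, 1.0) for _ in range(32)]
        fakes = [generate(g, z) for z in noise]
        d = discriminator_step(d, reals, fakes)
        g = generator_step(g, d, noise)
    return d, g
#
if __name__ == "__main__":
    print(train())
```
Each call to generator_step plays the role of the feedback in the episode: the generator only learns which direction makes its output look less fake to the current discriminator, and the discriminator is then updated against the improved fakes.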

0:41:22.440 --> 0:41:24.719
<v Speaker 1>Though there may come a time where either the deep

0:41:24.760 --> 0:41:27.680
<v Speaker 1>fakes are too good and they don't set off any

0:41:27.719 --> 0:41:34.239
<v Speaker 1>alarms from the discriminator, or the discriminator gets so sensitive

0:41:34.640 --> 0:41:37.759
<v Speaker 1>that it starts to flag real videos and hits a

0:41:37.840 --> 0:41:41.640
<v Speaker 1>lot of false positives and calls them generated videos instead.
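
NOTE
Editor's sketch, not from the episode: the trade-off described here, where a detector either misses good fakes or starts flagging real videos, can be shown with a few made-up numbers. The scores and thresholds below are hypothetical "fakeness" values invented for illustration; they are not measurements from any real detector.
```python
def error_rates(real_scores, fake_scores, threshold):
    """Flag a video as fake when its score is at or above the threshold."""
    false_positives = sum(s >= threshold for s in real_scores) / len(real_scores)
    missed_fakes = sum(s < threshold for s in fake_scores) / len(fake_scores)
    return false_positives, missed_fakes
#
# Hypothetical scores in [0, 1]; higher means the detector finds the video more suspicious.
real = [0.05, 0.10, 0.20, 0.30, 0.40]
crude_fakes = [0.70, 0.80, 0.85, 0.90, 0.95]
good_fakes = [0.15, 0.25, 0.35, 0.45, 0.55]
#
print(error_rates(real, crude_fakes, 0.5))  # (0.0, 0.0): crude fakes are easy to separate
print(error_rates(real, good_fakes, 0.5))   # (0.0, 0.8): better fakes slip past the same threshold
print(error_rates(real, good_fakes, 0.2))   # (0.6, 0.2): a more sensitive threshold flags real videos
```
Once the scores of real and generated videos overlap this much, no single threshold keeps both error rates low, which is the point at which such a tool stops being useful.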

0:41:42.040 --> 0:41:44.719
<v Speaker 1>Either way, you reach a point where a tool like

0:41:44.760 --> 0:41:47.600
<v Speaker 1>this no longer really serves a useful purpose, and the

0:41:47.680 --> 0:41:51.239
<v Speaker 1>video authenticator will be obsolete. Now, this is something we

0:41:51.280 --> 0:41:54.680
<v Speaker 1>see in artificial intelligence all the time. If you remember

0:41:54.719 --> 0:41:57.760
<v Speaker 1>the good old days of CAPTCHA, you know, the proving

0:41:57.840 --> 0:42:00.399
<v Speaker 1>you're not a robot stuff. The stuff we were

0:42:00.400 --> 0:42:03.759
<v Speaker 1>told to do was typically type in a series of

0:42:03.920 --> 0:42:06.680
<v Speaker 1>letters and numbers, and it wasn't that hard when it

0:42:06.760 --> 0:42:10.320
<v Speaker 1>first started, at least not at first. That's because the

0:42:10.560 --> 0:42:14.600
<v Speaker 1>text recognition algorithms of the time weren't very good. They

0:42:14.640 --> 0:42:19.480
<v Speaker 1>couldn't decipher mildly deformed text because the shapes of the

0:42:19.520 --> 0:42:22.920
<v Speaker 1>text fell too far outside the parameters of what the

0:42:22.960 --> 0:42:26.759
<v Speaker 1>system could recognize as a legitimate letter or number. You

0:42:26.800 --> 0:42:30.120
<v Speaker 1>make the number a little you know, deformed, and then

0:42:30.160 --> 0:42:32.279
<v Speaker 1>suddenly the system's like, well, that doesn't look like a

0:42:32.360 --> 0:42:34.920
<v Speaker 1>three to me, because it's not in the shape of

0:42:34.920 --> 0:42:39.319
<v Speaker 1>a three. But over time, people developed better text recognition

0:42:39.400 --> 0:42:42.600
<v Speaker 1>programs that could recognize these shapes even if they weren't

0:42:42.600 --> 0:42:46.480
<v Speaker 1>in a standard three orientation, and those systems began to

0:42:46.520 --> 0:42:51.560
<v Speaker 1>defeat those simple early CAPTCHAs. That required CAPTCHA designers to

0:42:51.640 --> 0:42:55.359
<v Speaker 1>make tougher versions, and eventually the machines got good enough

0:42:55.400 --> 0:42:58.920
<v Speaker 1>that they could match or even outperform humans, and at

0:42:58.960 --> 0:43:01.920
<v Speaker 1>that point those text-based CAPTCHAs proved to be more

0:43:02.000 --> 0:43:05.680
<v Speaker 1>challenging for people than for machines, which meant if you

0:43:05.800 --> 0:43:08.440
<v Speaker 1>use them, you defeated the whole purpose in the first place.

0:43:08.600 --> 0:43:11.640
<v Speaker 1>So while this escalation proved to be a challenge for security,

0:43:12.280 --> 0:43:15.680
<v Speaker 1>it was a boon for artificial intelligence. And while I

0:43:15.719 --> 0:43:19.680
<v Speaker 1>focused almost exclusively on the imagery of video here, the

0:43:19.760 --> 0:43:22.400
<v Speaker 1>same sort of stuff is going on with generated speech,

0:43:22.560 --> 0:43:28.040
<v Speaker 1>including generated speech that imitates specific voices. Like deep fake videos,

0:43:28.280 --> 0:43:31.080
<v Speaker 1>this approach works best if you have a really big

0:43:31.160 --> 0:43:35.600
<v Speaker 1>data set of recorded audio, so people like movie and

0:43:35.680 --> 0:43:41.640
<v Speaker 1>TV stars, news reporters, politicians, and um, you know, podcasters,

0:43:42.400 --> 0:43:45.480
<v Speaker 1>we're great targets for this stuff. There might be hundreds

0:43:45.560 --> 0:43:48.880
<v Speaker 1>or you know, in my case, thousands of hours of

0:43:48.920 --> 0:43:52.680
<v Speaker 1>recorded material to work from. Training a model to use

0:43:52.760 --> 0:43:59.040
<v Speaker 1>the frequencies, timbre, intonation, pronunciation, pauses, and other mannerisms of

0:43:59.040 --> 0:44:02.560
<v Speaker 1>speech can result in a system that can generate vocals

0:44:02.640 --> 0:44:06.680
<v Speaker 1>that sound like the target, sometimes to a fairly convincing degree,

0:44:07.360 --> 0:44:10.160
<v Speaker 1>and for a while, to peek behind the curtain here,

0:44:10.760 --> 0:44:12.880
<v Speaker 1>we at tech Stuff were working with a company that

0:44:12.960 --> 0:44:15.080
<v Speaker 1>I'm not going to name, but they were going to

0:44:15.120 --> 0:44:17.680
<v Speaker 1>do something like this as an experiment. I was going

0:44:17.719 --> 0:44:20.200
<v Speaker 1>to do a whole episode on it, and I had

0:44:20.280 --> 0:44:25.640
<v Speaker 1>planned on crafting a segment of that episode only through text.

0:44:25.800 --> 0:44:28.520
<v Speaker 1>I was not going to actually record it myself and

0:44:28.520 --> 0:44:32.240
<v Speaker 1>then use a system that was trained on my voice

0:44:32.680 --> 0:44:37.320
<v Speaker 1>to replicate my voice and deliver that segment on its own.

0:44:37.680 --> 0:44:40.080
<v Speaker 1>I was curious if it could nail not just the

0:44:40.120 --> 0:44:44.239
<v Speaker 1>audio quality of my voice, which, let's be honest, is amazing.

0:44:44.920 --> 0:44:48.560
<v Speaker 1>That's sarcasm. I can't stand listening to myself, but it

0:44:48.600 --> 0:44:53.000
<v Speaker 1>would also have to replicate how I actually make certain sounds,

0:44:53.080 --> 0:44:55.160
<v Speaker 1>Like would it get the bit of the southern accent

0:44:55.440 --> 0:44:59.200
<v Speaker 1>that's in my voice, or the way I emphasize certain words.

0:44:59.480 --> 0:45:01.960
<v Speaker 1>Would it pause for effect at all? Or would it

0:45:02.040 --> 0:45:05.759
<v Speaker 1>just robotically say one word after the next and only

0:45:05.840 --> 0:45:09.400
<v Speaker 1>pause when there was some helpful punctuation that told it

0:45:09.480 --> 0:45:12.880
<v Speaker 1>to do so? Would it indicate a question by raising

0:45:12.920 --> 0:45:16.040
<v Speaker 1>the pitch at the end of its sentence? Sadly, we

0:45:16.560 --> 0:45:20.600
<v Speaker 1>never got far with that particular project, so I don't

0:45:20.600 --> 0:45:22.440
<v Speaker 1>have any answers for you. I don't know how it

0:45:22.480 --> 0:45:25.040
<v Speaker 1>would have turned out, But clearly one of the things

0:45:25.080 --> 0:45:27.799
<v Speaker 1>I thought of was that it's a bit of a

0:45:27.840 --> 0:45:30.360
<v Speaker 1>red flag. If you can train a computer to sound

0:45:30.400 --> 0:45:33.839
<v Speaker 1>exactly like a specific person, that means you could make

0:45:33.920 --> 0:45:38.279
<v Speaker 1>that person say anything you like, and obviously, like deep

0:45:38.320 --> 0:45:41.839
<v Speaker 1>fake videos, that could have some pretty devastating consequences if

0:45:41.840 --> 0:45:47.120
<v Speaker 1>it were at all, you know, believable or seemed realistic. Now,

0:45:47.160 --> 0:45:50.120
<v Speaker 1>the company we were working with was working hard to

0:45:50.120 --> 0:45:52.360
<v Speaker 1>make sure that the only person to have access to

0:45:52.600 --> 0:45:55.520
<v Speaker 1>a specific voice would be the owner of that voice,

0:45:55.640 --> 0:45:59.600
<v Speaker 1>or presumably the company employing that person, though that does

0:45:59.640 --> 0:46:02.239
<v Speaker 1>bring up a whole bunch of other potential problems, like

0:46:02.280 --> 0:46:06.560
<v Speaker 1>can you imagine eliminating voice actors from a job because

0:46:06.600 --> 0:46:08.400
<v Speaker 1>you've got enough of their voice and you can just

0:46:08.560 --> 0:46:11.960
<v Speaker 1>replicate it. That wouldn't be great, But even so, it

0:46:12.080 --> 0:46:14.920
<v Speaker 1>was something I felt was both fascinating from a technology

0:46:14.960 --> 0:46:19.160
<v Speaker 1>standpoint and potentially problematic when it comes to an application

0:46:19.440 --> 0:46:22.880
<v Speaker 1>of that technology. One other thing I should mention is

0:46:22.960 --> 0:46:26.239
<v Speaker 1>that the Internet at large has been pretty active in

0:46:26.400 --> 0:46:29.799
<v Speaker 1>fighting deep fakes, not necessarily in detecting them, but removing

0:46:29.840 --> 0:46:33.560
<v Speaker 1>the platforms from which they were being shared, Reddit being

0:46:33.600 --> 0:46:36.160
<v Speaker 1>a big one, the subreddit that was dedicated to deep

0:46:36.160 --> 0:46:39.640
<v Speaker 1>fakes had been shut down. So there have been

0:46:39.719 --> 0:46:41.600
<v Speaker 1>some of those moves as well. Now this is not

0:46:41.960 --> 0:46:46.080
<v Speaker 1>directly against the technology, it's more against the proliferation of

0:46:46.120 --> 0:46:51.120
<v Speaker 1>the uh the output of that technology. As for detecting

0:46:51.160 --> 0:46:53.919
<v Speaker 1>deep fakes, it's interesting to me that people are even

0:46:54.000 --> 0:46:57.319
<v Speaker 1>developing tools to detect them, because to me, the best

0:46:57.360 --> 0:47:00.839
<v Speaker 1>tool so far seems to be human perception. It's not

0:47:01.080 --> 0:47:06.160
<v Speaker 1>that the images aren't really convincing, or that we can

0:47:06.200 --> 0:47:09.799
<v Speaker 1>suddenly detect these, you know, blending lines like the video

0:47:09.840 --> 0:47:13.719
<v Speaker 1>Authenticator tool. It's rather that it's just not hard for

0:47:13.800 --> 0:47:16.640
<v Speaker 1>us to spot a deep fake. Stuff just doesn't quite

0:47:16.960 --> 0:47:21.400
<v Speaker 1>look right in the way that people behave in these videos.

0:47:21.400 --> 0:47:25.960
<v Speaker 1>The vocals and animation often don't quite match. The expressions

0:47:26.320 --> 0:47:31.200
<v Speaker 1>aren't really natural, the progression of mannerisms feels synthetic and

0:47:31.280 --> 0:47:36.120
<v Speaker 1>not genuine. It just, it looks off. It's that uncanny

0:47:36.200 --> 0:47:39.760
<v Speaker 1>valley thing, and so just paying attention and thinking critically

0:47:39.760 --> 0:47:41.880
<v Speaker 1>can really help you suss out the fakes from the

0:47:41.920 --> 0:47:45.200
<v Speaker 1>real thing. Even if we reach a point where machines

0:47:45.320 --> 0:47:49.080
<v Speaker 1>can create a convincing enough fake to pass for reality,

0:47:49.360 --> 0:47:53.120
<v Speaker 1>we can still apply critical thinking, and we always should. Heck,

0:47:53.440 --> 0:47:55.960
<v Speaker 1>we should be applying critical thinking even when there's no

0:47:56.080 --> 0:47:59.399
<v Speaker 1>doubt as to the validity of the video, because there

0:47:59.400 --> 0:48:03.960
<v Speaker 1>may be enough to doubt the content of the video itself.

0:48:04.360 --> 0:48:07.600
<v Speaker 1>If I listen to a genuine scam artist in a

0:48:07.680 --> 0:48:12.200
<v Speaker 1>genuine video, that doesn't make the scam more legitimate. We

0:48:12.239 --> 0:48:15.080
<v Speaker 1>always need to use critical thinking. What I think is

0:48:15.120 --> 0:48:18.600
<v Speaker 1>most important is that we acknowledge the very real fact

0:48:18.880 --> 0:48:23.880
<v Speaker 1>that there are numerous organizations, agencies, governments, and other groups

0:48:23.920 --> 0:48:29.520
<v Speaker 1>that are actively attempting to spread misinformation and disinformation. There

0:48:29.560 --> 0:48:34.799
<v Speaker 1>are entire intelligence agencies dedicated to this endeavor, and then

0:48:35.200 --> 0:48:38.440
<v Speaker 1>there are more independent groups that are doing it for

0:48:38.520 --> 0:48:41.960
<v Speaker 1>one reason or another, typically either to advance a particular

0:48:42.160 --> 0:48:45.879
<v Speaker 1>political agenda or just to make as much money as

0:48:46.000 --> 0:48:50.080
<v Speaker 1>quickly as possible. This is beyond doubt or question. There

0:48:50.120 --> 0:48:54.279
<v Speaker 1>are numerous misinformation campaigns that are actively going on out

0:48:54.320 --> 0:48:57.560
<v Speaker 1>there in the real world right now. Most of them

0:48:57.840 --> 0:49:01.920
<v Speaker 1>are not depending on deep fakes, because one, deep fakes

0:49:01.960 --> 0:49:05.200
<v Speaker 1>aren't really good enough to fool most people right now,

0:49:05.640 --> 0:49:08.840
<v Speaker 1>and two, they don't need the deep fakes in the

0:49:08.880 --> 0:49:11.640
<v Speaker 1>first place. There are other methods that are simpler that

0:49:11.760 --> 0:49:15.600
<v Speaker 1>don't need nearly the processing power that work just fine.

0:49:15.880 --> 0:49:18.440
<v Speaker 1>Why would you go through the trouble of synthesizing a

0:49:18.560 --> 0:49:21.080
<v Speaker 1>video if you can get a better response with a

0:49:21.120 --> 0:49:25.160
<v Speaker 1>blog post filled with lies or half truths. It's just

0:49:25.280 --> 0:49:28.759
<v Speaker 1>not a great return on investment. So bottom line, be

0:49:28.960 --> 0:49:33.799
<v Speaker 1>vigilant out there, particularly on social media. Be aware that

0:49:33.840 --> 0:49:36.520
<v Speaker 1>there are plenty of people who will not hesitate to

0:49:36.640 --> 0:49:40.000
<v Speaker 1>mislead others in order to get what they want. Use

0:49:40.000 --> 0:49:45.279
<v Speaker 1>a critical eye to evaluate the information you encounter. Ask questions,

0:49:45.719 --> 0:49:50.440
<v Speaker 1>check sources, look for corroborating reports. It's a lot of work,

0:49:50.480 --> 0:49:53.359
<v Speaker 1>but trust me, it's way better that we do our

0:49:53.400 --> 0:49:56.400
<v Speaker 1>best to make sure the stuff we're depending on is

0:49:56.480 --> 0:50:00.600
<v Speaker 1>actually dependable. It'll turn out better for us in the long run.

0:50:00.880 --> 0:50:04.319
<v Speaker 1>Well that wraps up this episode of tech stuff, which yeah,

0:50:04.600 --> 0:50:07.640
<v Speaker 1>I used as a backdoor to argue about critical thinking again.

0:50:07.719 --> 0:50:12.040
<v Speaker 1>Sue me. Don't, don't really sue me. But I think

0:50:12.040 --> 0:50:16.360
<v Speaker 1>that that's another instance where it's a really clear example

0:50:16.400 --> 0:50:18.520
<v Speaker 1>where we have to use that kind of stuff. So

0:50:18.680 --> 0:50:22.680
<v Speaker 1>I'm gonna keep on stressing it. And you guys are awesome.

0:50:22.960 --> 0:50:25.840
<v Speaker 1>I believe in you. I think that when we start

0:50:25.920 --> 0:50:29.400
<v Speaker 1>using these tools at our disposal that everybody can develop

0:50:29.840 --> 0:50:33.919
<v Speaker 1>just with some practice that things will be better. We'll

0:50:33.960 --> 0:50:37.720
<v Speaker 1>be able to suss out the nonsense from the real stuff,

0:50:38.400 --> 0:50:40.960
<v Speaker 1>and we're all better off in the long run if

0:50:41.000 --> 0:50:43.719
<v Speaker 1>we can do that. If you guys have suggestions for

0:50:43.840 --> 0:50:46.600
<v Speaker 1>future topics I should cover in episodes of tech Stuff,

0:50:46.719 --> 0:50:50.360
<v Speaker 1>let me know via Twitter. The handle is TechStuff

0:50:50.719 --> 0:50:55.000
<v Speaker 1>HSW, and I'll talk to you again really soon.

0:51:01.200 --> 0:51:04.239
<v Speaker 1>Tech Stuff is an I Heart Radio production. For more

0:51:04.320 --> 0:51:07.720
<v Speaker 1>podcasts from I Heart Radio, visit the i Heart Radio app,

0:51:07.840 --> 0:51:11.000
<v Speaker 1>Apple Podcasts, or wherever you listen to your favorite shows.