WEBVTT - Deep Learning and Deepfakes

0:00:04.400 --> 0:00:07.800
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio.

0:00:12.400 --> 0:00:15.200
<v Speaker 1>Hey there, and welcome to TechStuff. I'm your host,

0:00:15.360 --> 0:00:18.400
<v Speaker 1>Jonathan Strickland. I'm an executive producer with iHeartRadio

0:00:18.440 --> 0:00:21.720
<v Speaker 1>and I love all things tech. Now, before I get

0:00:21.760 --> 0:00:25.000
<v Speaker 1>into today's episode, I want to give a little listener

0:00:25.239 --> 0:00:29.280
<v Speaker 1>warning here. The topic at hand involves some adult content,

0:00:29.760 --> 0:00:33.040
<v Speaker 1>including the use of technology to do stuff that can

0:00:33.120 --> 0:00:37.960
<v Speaker 1>be unethical, illegal, hurtful, and just plain awful. Now, I

0:00:38.000 --> 0:00:40.800
<v Speaker 1>think this is an important topic, but I wanted to

0:00:40.840 --> 0:00:42.800
<v Speaker 1>give a bit of a heads up at the start

0:00:42.840 --> 0:00:45.440
<v Speaker 1>of the episode, just in case any of you guys

0:00:45.479 --> 0:00:48.880
<v Speaker 1>are listening to a podcast on like a family road

0:00:48.920 --> 0:00:51.880
<v Speaker 1>trip or something. I think this is an important topic

0:00:52.320 --> 0:00:55.120
<v Speaker 1>and I think everyone should know about it and think

0:00:55.160 --> 0:00:57.360
<v Speaker 1>about it. But I also respect that for some people

0:00:57.360 --> 0:01:00.680
<v Speaker 1>this subject might get a bit taboo. So let's go

0:01:00.880 --> 0:01:06.360
<v Speaker 1>on with the episode. Back in nineteen ninety three, a movie called

0:01:06.720 --> 0:01:11.319
<v Speaker 1>Rising Sun, directed by Philip Kaufman, based on a Michael

0:01:11.360 --> 0:01:15.119
<v Speaker 1>Crichton novel and starring Wesley Snipes and Sean Connery came

0:01:15.120 --> 0:01:18.360
<v Speaker 1>out in theaters. Now, I didn't see it in theaters,

0:01:19.040 --> 0:01:21.360
<v Speaker 1>but I did catch it when it came on you know,

0:01:21.920 --> 0:01:25.760
<v Speaker 1>HBO or Cinemax or something later on. The movie included

0:01:25.760 --> 0:01:28.959
<v Speaker 1>a sequence that I found to be totally unbelievable. And

0:01:29.000 --> 0:01:32.720
<v Speaker 1>I'm not talking about buying into Sean Connery being an

0:01:32.720 --> 0:01:37.319
<v Speaker 1>expert on Japanese culture and business practices. Actually, side note,

0:01:37.480 --> 0:01:41.720
<v Speaker 1>Sean Connery has an interesting history of playing unlikely characters,

0:01:41.760 --> 0:01:44.759
<v Speaker 1>such as in Highlander, where he played an immortal who

0:01:44.840 --> 0:01:49.080
<v Speaker 1>was supposedly Egyptian, then who lived in feudal Japan and

0:01:49.200 --> 0:01:51.840
<v Speaker 1>ended up in Spain where he became known as Ramirez.

0:01:52.200 --> 0:01:54.760
<v Speaker 1>And all the while he's talking to a Scottish Highlander

0:01:54.960 --> 0:01:58.200
<v Speaker 1>who's played by a Belgian actor. But I'm getting way

0:01:58.240 --> 0:02:02.040
<v Speaker 1>off track here. Besides, I've heard Crichton actually wrote the

0:02:02.160 --> 0:02:05.000
<v Speaker 1>character while thinking of Connery, So you know, what the

0:02:05.000 --> 0:02:08.320
<v Speaker 1>heck do I know? In the film, Snipes and Connery

0:02:08.440 --> 0:02:12.240
<v Speaker 1>are investigators, and they're looking into a homicide that happened

0:02:12.280 --> 0:02:16.760
<v Speaker 1>at a Japanese business but on American soil. The security

0:02:16.800 --> 0:02:21.080
<v Speaker 1>system in the building captured video of the homicide and

0:02:21.120 --> 0:02:23.480
<v Speaker 1>the identity of the killer appears to be a pretty

0:02:23.560 --> 0:02:26.880
<v Speaker 1>open and shut case. But that's not how it all

0:02:26.960 --> 0:02:30.520
<v Speaker 1>turns out. The investigators talked to a security expert played

0:02:30.520 --> 0:02:34.440
<v Speaker 1>by Tia Carrere, and she demonstrates in real time how

0:02:34.560 --> 0:02:39.160
<v Speaker 1>video footage can be altered. She records a short video

0:02:39.520 --> 0:02:43.800
<v Speaker 1>of Connery and Snipes, loads that onto a computer, freezes

0:02:43.960 --> 0:02:47.600
<v Speaker 1>a frame of the video, and essentially performs a cut

0:02:47.639 --> 0:02:51.440
<v Speaker 1>and paste job swapping the heads of our two lead characters.

0:02:51.880 --> 0:02:55.000
<v Speaker 1>Then she resumes the video and the head swap remains

0:02:55.000 --> 0:02:59.960
<v Speaker 1>in place, and that head swap stuff is possible. I mean,

0:03:00.040 --> 0:03:02.440
<v Speaker 1>clearly it has to be possible, because you actually do

0:03:02.600 --> 0:03:05.680
<v Speaker 1>see that effect in the film itself. But it takes

0:03:05.800 --> 0:03:08.680
<v Speaker 1>a bit more than a quick cut and paste job.

0:03:08.760 --> 0:03:11.320
<v Speaker 1>But we'll leave off of that for now. The whole

0:03:11.360 --> 0:03:15.720
<v Speaker 1>point of that sequence, apart from showing off some cinema magic,

0:03:16.240 --> 0:03:20.560
<v Speaker 1>is to demonstrate to the investigators that video, like photographs,

0:03:20.880 --> 0:03:24.560
<v Speaker 1>can be altered. The expert has detected a blue halo

0:03:24.720 --> 0:03:27.919
<v Speaker 1>around the face of the supposed murderer in the footage,

0:03:28.240 --> 0:03:31.680
<v Speaker 1>indicating that some sort of trickery has happened. She also

0:03:31.760 --> 0:03:34.800
<v Speaker 1>reveals that she cannot magically restore the video to its

0:03:34.800 --> 0:03:37.920
<v Speaker 1>previous unaltered state, which I think was actually a nice

0:03:38.000 --> 0:03:41.040
<v Speaker 1>change of pace for a movie. By the way, I

0:03:41.080 --> 0:03:44.760
<v Speaker 1>think this movie is really, you know, not good, like

0:03:45.480 --> 0:03:50.360
<v Speaker 1>not worth your time, but that's my opinion anyway. For years,

0:03:50.680 --> 0:03:54.040
<v Speaker 1>this kind of video sorcery was pretty much limited to

0:03:54.160 --> 0:03:57.760
<v Speaker 1>the film and TV industries. It usually required a lot

0:03:57.800 --> 0:04:01.720
<v Speaker 1>of pre planning beforehand, so it wasn't as simple as

0:04:01.760 --> 0:04:04.920
<v Speaker 1>just taking footage that was already shot and changing it

0:04:04.960 --> 0:04:07.720
<v Speaker 1>in post on a whim with a couple of clicks

0:04:07.760 --> 0:04:09.800
<v Speaker 1>of a button. If it were, we would see a

0:04:09.800 --> 0:04:13.920
<v Speaker 1>lot fewer mistakes left in movies and television because you

0:04:13.960 --> 0:04:16.599
<v Speaker 1>could catch it later and just fix it. But the

0:04:16.640 --> 0:04:20.359
<v Speaker 1>tricks were possible, they were just difficult to pull off.

0:04:20.880 --> 0:04:23.840
<v Speaker 1>It just wasn't something you or I would ever encounter

0:04:23.960 --> 0:04:27.560
<v Speaker 1>in our day to day lives. But today we live

0:04:27.680 --> 0:04:30.880
<v Speaker 1>in a different world, a world that has examples of

0:04:31.000 --> 0:04:35.960
<v Speaker 1>synthetic media, commonly referred to as deepfakes. These are

0:04:36.080 --> 0:04:39.839
<v Speaker 1>videos that have been altered or generated so that the

0:04:39.920 --> 0:04:42.839
<v Speaker 1>subject of the video is doing something that they probably

0:04:42.960 --> 0:04:47.200
<v Speaker 1>would or could never do. They've brought into question whether

0:04:47.320 --> 0:04:50.800
<v Speaker 1>or not video evidence is even reliable, much as the

0:04:50.839 --> 0:04:54.440
<v Speaker 1>film Rising Sun was talking about. We already know that

0:04:54.520 --> 0:05:00.559
<v Speaker 1>eyewitness testimony is terribly unreliable. Our perception and memory play

0:05:00.600 --> 0:05:04.560
<v Speaker 1>tricks on us, and we can quote unquote remember stuff

0:05:04.600 --> 0:05:09.560
<v Speaker 1>that just didn't happen the way things actually unfolded in reality.

0:05:09.600 --> 0:05:13.359
<v Speaker 1>But now we're looking at video evidence in potentially the

0:05:13.440 --> 0:05:17.600
<v Speaker 1>same light. I mean, it's scary. So today we're going

0:05:17.680 --> 0:05:21.760
<v Speaker 1>to learn about synthetic media, how it can be generated,

0:05:22.080 --> 0:05:26.000
<v Speaker 1>the implications that follow with that sort of reality, and

0:05:26.120 --> 0:05:29.559
<v Speaker 1>ways that people are trying to counteract a potentially dangerous threat,

0:05:30.040 --> 0:05:34.800
<v Speaker 1>you know, fun stuff. Now, first, the term synthetic media

0:05:35.120 --> 0:05:39.400
<v Speaker 1>has a particular meaning. It refers to art created through

0:05:39.520 --> 0:05:43.760
<v Speaker 1>some sort of automated process, so it's a largely hands

0:05:43.839 --> 0:05:49.000
<v Speaker 1>off approach to creating the final art piece. Now, under

0:05:49.040 --> 0:05:52.880
<v Speaker 1>that definition, the example of Rising Sun would not apply

0:05:53.080 --> 0:05:56.400
<v Speaker 1>here because we see in the film and presumably this

0:05:56.480 --> 0:05:58.599
<v Speaker 1>happens in the book as well, but I haven't read

0:05:58.680 --> 0:06:03.200
<v Speaker 1>the book that a human being actually changes that. People

0:06:03.279 --> 0:06:06.880
<v Speaker 1>have used tools to alter the video footage. This would

0:06:06.880 --> 0:06:10.280
<v Speaker 1>be more like using Photoshop to touch up a still image,

0:06:10.279 --> 0:06:14.039
<v Speaker 1>with the computer system presumably doing some of the work

0:06:14.040 --> 0:06:16.800
<v Speaker 1>in the background to keep things matched up. Either that

0:06:16.960 --> 0:06:19.640
<v Speaker 1>or you would need to alter each image in the

0:06:19.640 --> 0:06:23.760
<v Speaker 1>footage frame by frame, or use some sort of matte approach.

0:06:24.360 --> 0:06:26.880
<v Speaker 1>To learn more about mattes, you can listen to my

0:06:26.920 --> 0:06:30.760
<v Speaker 1>episode about how blue and green screens work. Synthetic media

0:06:31.040 --> 0:06:35.200
<v Speaker 1>as a general practice has been around for centuries. Artists

0:06:35.200 --> 0:06:38.640
<v Speaker 1>have set up various contraptions to create works with little

0:06:38.880 --> 0:06:43.039
<v Speaker 1>or no human guidance. In the twentieth century we started

0:06:43.120 --> 0:06:46.960
<v Speaker 1>to see a movement called generative art take form. This

0:06:47.000 --> 0:06:49.560
<v Speaker 1>type of art is all about creating a system that

0:06:49.680 --> 0:06:53.880
<v Speaker 1>then creates or generates the finished art piece. That would

0:06:53.920 --> 0:06:57.080
<v Speaker 1>mean that the finished work, such as a painting, wouldn't

0:06:57.400 --> 0:07:00.400
<v Speaker 1>reflect the feelings or thoughts of the artist who

0:07:00.440 --> 0:07:03.919
<v Speaker 1>created the system. In fact, it starts to raise the

0:07:04.000 --> 0:07:07.120
<v Speaker 1>question what is the art? Is it the painting that

0:07:07.200 --> 0:07:11.000
<v Speaker 1>came about due to a machine following a program of

0:07:11.080 --> 0:07:15.400
<v Speaker 1>some sort, or is the art the program itself? Is

0:07:15.440 --> 0:07:19.000
<v Speaker 1>the art the process by which the painting was made?

0:07:19.320 --> 0:07:22.000
<v Speaker 1>Now I'm not here to answer that question. I just

0:07:22.320 --> 0:07:26.640
<v Speaker 1>think it is an interesting question to ask. Sometimes people

0:07:26.680 --> 0:07:30.600
<v Speaker 1>ask much less polite questions, such as is it art

0:07:30.640 --> 0:07:34.280
<v Speaker 1>at all? Some art critics went out of their way

0:07:34.320 --> 0:07:37.520
<v Speaker 1>to dismiss generative art in the early days. They found

0:07:37.520 --> 0:07:42.000
<v Speaker 1>it insulting, but hey, that's kind of the history of

0:07:42.200 --> 0:07:46.560
<v Speaker 1>art in general. Each new movement in art inevitably finds

0:07:46.600 --> 0:07:51.080
<v Speaker 1>both supporters and critics as it emerges. If anything, you

0:07:51.200 --> 0:07:55.360
<v Speaker 1>might argue that such a response legitimizes the movement in

0:07:55.560 --> 0:07:58.640
<v Speaker 1>you know, a weird way. If people hate it, it

0:07:58.720 --> 0:08:02.720
<v Speaker 1>must be something. In two thousand eighteen, an artist collective

0:08:03.040 --> 0:08:07.920
<v Speaker 1>called Obvious, based in Paris, France, submitted portrait

0:08:08.000 --> 0:08:11.920
<v Speaker 1>style paintings that were created not by an actual human painter,

0:08:12.440 --> 0:08:16.440
<v Speaker 1>but by an artificially intelligent system. Now they looked a

0:08:16.480 --> 0:08:21.720
<v Speaker 1>lot like typical eighteenth century style portraits. There was no

0:08:21.800 --> 0:08:24.640
<v Speaker 1>attempt to pass off the portrait as if it were

0:08:24.720 --> 0:08:28.120
<v Speaker 1>actually made by a human artist. In fact, the appeal

0:08:28.320 --> 0:08:32.760
<v Speaker 1>of the piece was largely due to it being synthetically generated.

0:08:33.200 --> 0:08:36.720
<v Speaker 1>It went to auction at Christie's and the AI created

0:08:36.800 --> 0:08:42.000
<v Speaker 1>painting fetched more than four hundred thousand dollars. And the

0:08:42.040 --> 0:08:45.280
<v Speaker 1>way the group trained their AI is relevant to our

0:08:45.320 --> 0:08:49.960
<v Speaker 1>discussion about deepfakes. The collective relied on a type

0:08:49.960 --> 0:08:55.560
<v Speaker 1>of machine learning called generative adversarial networks, or GANs,

0:08:56.080 --> 0:08:59.319
<v Speaker 1>which in turn depend on deep learning. So it

0:08:59.360 --> 0:09:00.959
<v Speaker 1>looks like we've got a few things we're going to

0:09:01.080 --> 0:09:03.840
<v Speaker 1>have to define here. Now, I'm going to keep things

0:09:04.160 --> 0:09:07.719
<v Speaker 1>fairly high level, because as it turns out there are

0:09:07.760 --> 0:09:11.439
<v Speaker 1>a few different ways to create machine learning models, and

0:09:11.520 --> 0:09:14.280
<v Speaker 1>to go through all of them in exhaustive detail would

0:09:14.280 --> 0:09:17.600
<v Speaker 1>represent a university level course in machine learning. I have

0:09:17.760 --> 0:09:21.240
<v Speaker 1>neither the time for that nor the expertise. I would

0:09:21.320 --> 0:09:24.960
<v Speaker 1>do a terrible job, So we'll go with a high

0:09:25.040 --> 0:09:31.480
<v Speaker 1>level perspective here. First, a generative adversarial network uses two systems.

0:09:31.520 --> 0:09:35.280
<v Speaker 1>You have a generator and you have a discriminator. Both

0:09:35.360 --> 0:09:38.760
<v Speaker 1>of these systems are a type of neural network. A

0:09:38.840 --> 0:09:42.480
<v Speaker 1>neural network is a computing model that is inspired by

0:09:42.480 --> 0:09:47.520
<v Speaker 1>the way our brains work. Our brains contain billions of neurons,

0:09:47.760 --> 0:09:52.200
<v Speaker 1>and these neurons work together, communicating through electrical and chemical signals,

0:09:52.440 --> 0:09:57.680
<v Speaker 1>controlling and coordinating pretty much everything in our bodies. With computers,

0:09:58.040 --> 0:10:02.720
<v Speaker 1>the neurons are nodes. The job of a node is,

0:10:03.120 --> 0:10:05.400
<v Speaker 1>you know, supposed to be kind of like a neuron

0:10:05.640 --> 0:10:08.960
<v Speaker 1>cell in the brain. It's to take in multiple weighted

0:10:09.080 --> 0:10:14.360
<v Speaker 1>input values and then generate a single output value. Now,

0:10:14.400 --> 0:10:18.000
<v Speaker 1>the word weighted, W E I G H T E

0:10:18.080 --> 0:10:21.840
<v Speaker 1>D, is really important here because the larger an

0:10:21.960 --> 0:10:26.120
<v Speaker 1>input's weight, the more that input will have an effect

0:10:26.360 --> 0:10:29.000
<v Speaker 1>on whatever the output is. So it kind of comes

0:10:29.040 --> 0:10:32.679
<v Speaker 1>down to which inputs are the most important for that

0:10:32.800 --> 0:10:36.760
<v Speaker 1>node's particular function. Now, if I were to make an analogy,

0:10:36.840 --> 0:10:40.560
<v Speaker 1>I would say, your boss hands you three tasks to do.

0:10:41.240 --> 0:10:45.360
<v Speaker 1>One of those tasks has the label extremely important, and

0:10:45.440 --> 0:10:49.320
<v Speaker 1>the second task has the label critically important, and the

0:10:49.400 --> 0:10:52.240
<v Speaker 1>third task has a label saying you should have finished

0:10:52.280 --> 0:10:55.040
<v Speaker 1>that one before it was handed to you. Okay, so

0:10:55.080 --> 0:10:57.800
<v Speaker 1>that's just some sort of snarky office humor that I

0:10:57.840 --> 0:11:00.520
<v Speaker 1>need to get off my chest. But more seriously, imagine

0:11:00.559 --> 0:11:05.000
<v Speaker 1>a node accepting three inputs. In this example, input one

0:11:05.280 --> 0:11:09.680
<v Speaker 1>has a fifty percent weight, input two has a forty percent weight, and

0:11:09.720 --> 0:11:12.360
<v Speaker 1>input three has a ten percent weight. That adds up

0:11:12.400 --> 0:11:16.200
<v Speaker 1>to one hundred percent, and that would tell you that the output that

0:11:16.280 --> 0:11:21.160
<v Speaker 1>node generates will be most affected by input one, followed

0:11:21.200 --> 0:11:24.199
<v Speaker 1>by input two, and then input three would have a

0:11:24.280 --> 0:11:29.120
<v Speaker 1>smaller effect on whatever the output is. Each node applies

0:11:29.200 --> 0:11:34.080
<v Speaker 1>a nonlinear transformation on the input values, again affected by

0:11:34.240 --> 0:11:39.000
<v Speaker 1>each input's weight value, and that generates the output value.

0:11:39.480 --> 0:11:43.520
<v Speaker 1>The details of that really are not important for our episode,

0:11:43.520 --> 0:11:46.920
<v Speaker 1>and involves performing changes on variables that in turn change

0:11:46.960 --> 0:11:50.360
<v Speaker 1>the correlation between variables, and it gets a bit mathy,

0:11:50.559 --> 0:11:53.360
<v Speaker 1>and we would get lost in the weeds pretty quickly.
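
NOTE
A minimal sketch of the single node described above, not anything taken from the episode itself.
It assumes a sigmoid as the nonlinear transformation, since the episode does not name one.
The 0.4 weight for input two is an assumption chosen so the three weights total one.
The names node_output, WEIGHTS, and effects are hypothetical.
```python
import math
def node_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, squashed by a sigmoid (the sigmoid is an assumption)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))
# Illustrative weights: fifty, forty, and ten percent.
WEIGHTS = [0.5, 0.4, 0.1]
baseline = node_output([1.0, 1.0, 1.0], WEIGHTS)
# Nudge one input at a time by the same amount and record how far the output moves.
effects = []
for i in range(3):
    nudged = [1.0, 1.0, 1.0]
    nudged[i] += 1.0
    effects.append(node_output(nudged, WEIGHTS) - baseline)
```
Nudging each input by the same amount moves the output most for input one and least for input three, which is the ordering the weights suggest.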

0:11:53.679 --> 0:11:56.480
<v Speaker 1>The important thing to remember is that a node within

0:11:56.520 --> 0:12:01.280
<v Speaker 1>a neural network takes in a weighted sum of inputs, then

0:12:01.320 --> 0:12:06.680
<v Speaker 1>performs a process on those inputs before passing the result

0:12:06.800 --> 0:12:10.520
<v Speaker 1>on as an output. Then some other node a layer

0:12:10.640 --> 0:12:14.400
<v Speaker 1>down will accept that output, along with outputs from a

0:12:14.440 --> 0:12:17.600
<v Speaker 1>couple of other nodes one layer up, and then will

0:12:17.640 --> 0:12:21.400
<v Speaker 1>perform an operation based on those weighted inputs and pass

0:12:21.480 --> 0:12:23.840
<v Speaker 1>that on to the next layer, and so on. So

0:12:23.920 --> 0:12:27.000
<v Speaker 1>these nodes are in layers, like you know a cake.
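
NOTE
A rough sketch of nodes stacked in layers, where each layer's outputs become the next layer's inputs.
The weights, the layer sizes, and the sigmoid are all assumptions made for illustration.
The names sigmoid, layer_forward, network_forward, and LAYERS are hypothetical.
```python
import math
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))
def layer_forward(inputs, weight_rows):
    """Each row holds one node's weights; every node sees all outputs from the layer above."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, row))) for row in weight_rows]
def network_forward(inputs, layers):
    """Pass values through the layers in order, top to bottom."""
    values = inputs
    for weight_rows in layers:
        values = layer_forward(values, weight_rows)
    return values
# Hypothetical weights: 3 inputs feed a layer of 2 nodes, which feeds a layer of 1 node.
LAYERS = [
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],
    [[1.0, -1.0]],
]
result = network_forward([1.0, 0.0, 1.0], LAYERS)
```
The single value in result is the network's output for that one set of inputs; changing any weight in LAYERS changes it.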

0:12:27.600 --> 0:12:30.520
<v Speaker 1>One layer of nodes processes some inputs, they send it

0:12:30.559 --> 0:12:33.440
<v Speaker 1>on to the next layer of nodes, and then that

0:12:33.480 --> 0:12:35.320
<v Speaker 1>one sends it on to the next one, and the next one

0:12:35.360 --> 0:12:40.880
<v Speaker 1>and so on. This isn't a new idea. Computer scientists

0:12:41.040 --> 0:12:45.679
<v Speaker 1>began theorizing and experimenting with neural network approaches as far

0:12:45.760 --> 0:12:49.360
<v Speaker 1>back as the nineteen fifties with the perceptron, which was

0:12:49.400 --> 0:12:53.280
<v Speaker 1>a hypothetical system that was described by Frank Rosenblatt of

0:12:53.320 --> 0:12:57.160
<v Speaker 1>Cornell University. But it wasn't until the last decade that

0:12:57.280 --> 0:13:00.400
<v Speaker 1>computing power and our ability to handle a lot of

0:13:00.520 --> 0:13:04.040
<v Speaker 1>data reached a point where these sorts of learning models

0:13:04.040 --> 0:13:08.280
<v Speaker 1>could really take off. The goal of this system is

0:13:08.320 --> 0:13:12.080
<v Speaker 1>to train it to perform a particular task within a

0:13:12.120 --> 0:13:16.880
<v Speaker 1>certain level of precision. The weights I mentioned are adjustable,

0:13:17.040 --> 0:13:19.360
<v Speaker 1>so you can think of it as teaching a system

0:13:19.480 --> 0:13:22.840
<v Speaker 1>which bits are the most important in order to do

0:13:23.040 --> 0:13:25.760
<v Speaker 1>whatever it is the system is supposed to do in

0:13:25.840 --> 0:13:28.880
<v Speaker 1>order to achieve your task, These are the bits that

0:13:28.920 --> 0:13:32.320
<v Speaker 1>are the most important and therefore should matter the most

0:13:32.320 --> 0:13:35.240
<v Speaker 1>when you weigh a decision. This is a bit easier

0:13:35.280 --> 0:13:38.319
<v Speaker 1>if we talk about a similar system with the version

0:13:38.360 --> 0:13:42.679
<v Speaker 1>of IBM's Watson that played on Jeopardy. That system

0:13:42.800 --> 0:13:46.280
<v Speaker 1>famously was not connected to the Internet. It had to

0:13:46.320 --> 0:13:50.319
<v Speaker 1>rely on all the information that was stored within itself.

0:13:50.960 --> 0:13:55.000
<v Speaker 1>When the system encountered a clue in Jeopardy, it would

0:13:55.000 --> 0:13:57.959
<v Speaker 1>analyze the clue, and then it would reference its data

0:13:57.960 --> 0:14:01.320
<v Speaker 1>base to look for possible answers to whatever that clue was.

0:14:01.800 --> 0:14:05.160
<v Speaker 1>The system would weigh those possible answers and attempt to

0:14:05.160 --> 0:14:08.760
<v Speaker 1>determine which, if any, were the most likely to be correct.

0:14:09.200 --> 0:14:13.920
<v Speaker 1>If the certainty was over a certain threshold, such as sure,

0:14:14.200 --> 0:14:16.720
<v Speaker 1>the system would buzz in with its answer. If no

0:14:16.880 --> 0:14:20.920
<v Speaker 1>response rose above that threshold, the system would not buzz in,

0:14:21.280 --> 0:14:23.480
<v Speaker 1>So you could say that Watson was playing the game

0:14:23.520 --> 0:14:27.680
<v Speaker 1>with a best guess sort of approach. Neural networks do

0:14:28.240 --> 0:14:33.000
<v Speaker 1>essentially that sort of processing. With this particular type of approach,

0:14:33.400 --> 0:14:36.640
<v Speaker 1>we know what we want the outcome to be, so

0:14:36.840 --> 0:14:39.880
<v Speaker 1>we can judge whether or not the system was successful.

0:14:40.200 --> 0:14:43.760
<v Speaker 1>After each attempt, we can adjust the weight on the

0:14:43.800 --> 0:14:47.760
<v Speaker 1>input between nodes to refine the decision making process to

0:14:47.840 --> 0:14:51.880
<v Speaker 1>get more accurate results. If the system succeeds in its task,

0:14:52.360 --> 0:14:55.720
<v Speaker 1>we can increase the weights that contributed to the system

0:14:55.760 --> 0:15:00.240
<v Speaker 1>picking the correct answer and thus decrease the weights on the inputs

0:15:00.320 --> 0:15:05.280
<v Speaker 1>that did not contribute to the successful response. If the

0:15:05.280 --> 0:15:09.320
<v Speaker 1>system done messed up and gave the wrong answer, then

0:15:09.360 --> 0:15:11.720
<v Speaker 1>we do the opposite. We look at the inputs that

0:15:11.760 --> 0:15:16.000
<v Speaker 1>contributed to the wrong answer, we diminish their weights, and

0:15:16.080 --> 0:15:18.440
<v Speaker 1>we increase the weights of the other inputs, and then

0:15:18.440 --> 0:15:23.120
<v Speaker 1>we run the test again a lot. I'll explain a

0:15:23.160 --> 0:15:25.600
<v Speaker 1>bit more about this process when we come back, but

0:15:25.680 --> 0:15:36.400
<v Speaker 1>first let's take a quick break. Early in the history

0:15:36.520 --> 0:15:40.760
<v Speaker 1>of neural networks, computer scientists were hitting some pretty hard

0:15:40.880 --> 0:15:44.400
<v Speaker 1>stops due to the limitations of computing power at the time.

0:15:44.720 --> 0:15:48.080
<v Speaker 1>Early networks were only a couple of layers deep, which

0:15:48.080 --> 0:15:50.720
<v Speaker 1>really meant they weren't terribly powerful, and they could only

0:15:50.760 --> 0:15:54.400
<v Speaker 1>tackle rudimentary tasks like figuring out whether or not a

0:15:54.520 --> 0:15:59.160
<v Speaker 1>square is drawn on a piece of paper that isn't

0:15:59.240 --> 0:16:05.560
<v Speaker 1>terribly sophisticated. In nineteen eighty six, David Rumelhart, Geoffrey Hinton, and Ronald

0:16:05.600 --> 0:16:12.120
<v Speaker 1>Williams published a paper titled Learning Representations by Back-Propagating Errors.

0:16:12.160 --> 0:16:16.840
<v Speaker 1>This was a big breakthrough with deep learning. This all

0:16:16.880 --> 0:16:19.360
<v Speaker 1>has to do with a deep learning system improving its

0:16:19.360 --> 0:16:22.760
<v Speaker 1>ability to complete a specific task. And basically the algorithm's

0:16:22.840 --> 0:16:25.840
<v Speaker 1>job is to go from the output layer, you know,

0:16:25.960 --> 0:16:29.000
<v Speaker 1>where the system has made a decision, and then work

0:16:29.160 --> 0:16:32.680
<v Speaker 1>backward through the neural network, adjusting the weights that led

0:16:32.720 --> 0:16:38.480
<v Speaker 1>to an incorrect decision. So let's say it's a system

0:16:38.520 --> 0:16:41.680
<v Speaker 1>that is looking to figure out whether or not a

0:16:41.720 --> 0:16:45.000
<v Speaker 1>cat is in a photograph and it says, there's a

0:16:45.040 --> 0:16:47.400
<v Speaker 1>cat in this picture, and you look at the picture

0:16:47.400 --> 0:16:50.440
<v Speaker 1>and there is no cat there. Then you would look

0:16:50.560 --> 0:16:54.720
<v Speaker 1>at the inputs one level back just before the system

0:16:54.800 --> 0:16:57.160
<v Speaker 1>said here's a picture of a cat, and you'd say,

0:16:57.200 --> 0:16:59.720
<v Speaker 1>all right, which of these inputs led the system to

0:17:00.120 --> 0:17:03.200
<v Speaker 1>believe this was a picture of a cat? And then

0:17:03.280 --> 0:17:06.200
<v Speaker 1>you would adjust those. Then you would go back one

0:17:06.320 --> 0:17:10.159
<v Speaker 1>layer up, so you're working your way up the model

0:17:10.520 --> 0:17:14.240
<v Speaker 1>and say which inputs here led to it giving the

0:17:14.280 --> 0:17:18.400
<v Speaker 1>outputs that led to the mistake, and you do this

0:17:18.640 --> 0:17:21.760
<v Speaker 1>all the way up until you get up to the

0:17:21.800 --> 0:17:24.639
<v Speaker 1>input level at the top of the computer model. You

0:17:24.680 --> 0:17:28.040
<v Speaker 1>are back propagating, and then you run the test again

0:17:28.160 --> 0:17:32.720
<v Speaker 1>to see if you've got improvement. It's exhaustive, but it's

0:17:32.800 --> 0:17:38.000
<v Speaker 1>also drastically improved neural network performance, much faster than just

0:17:38.520 --> 0:17:42.080
<v Speaker 1>throwing more brute force at it. The algorithm essentially is

0:17:42.160 --> 0:17:44.920
<v Speaker 1>checking to see if a small change in each input

0:17:45.040 --> 0:17:48.640
<v Speaker 1>value received by a layer of nodes would have led

0:17:48.680 --> 0:17:51.200
<v Speaker 1>to a more accurate result. So it's all about going

0:17:51.240 --> 0:17:54.679
<v Speaker 1>from that output working your way backward. In two thousand twelve,

0:17:54.720 --> 0:17:57.920
<v Speaker 1>Alex Krizhevsky published a paper that gave us the next

0:17:58.000 --> 0:18:02.480
<v Speaker 1>big breakthrough. He argued that a really deep neural network

0:18:02.760 --> 0:18:06.040
<v Speaker 1>with a lot of layers could give really great results

0:18:06.200 --> 0:18:09.960
<v Speaker 1>if you paired it with enough data to train the system.

0:18:10.000 --> 0:18:13.600
<v Speaker 1>So you needed to throw lots of data at these models,

0:18:14.320 --> 0:18:17.720
<v Speaker 1>and it needed to be an enormous amount of data. However,

0:18:17.880 --> 0:18:22.120
<v Speaker 1>once trained, the system would produce lower error rates. So yeah,

0:18:22.160 --> 0:18:24.040
<v Speaker 1>it would take a long time, but you would get

0:18:24.080 --> 0:18:27.560
<v Speaker 1>better results. Now, at the time, a good error rate

0:18:27.720 --> 0:18:31.840
<v Speaker 1>for such a system was twenty five percent. That means one out of

0:18:31.920 --> 0:18:35.159
<v Speaker 1>four conclusions the system would come to would be wrong.

0:18:35.600 --> 0:18:39.800
<v Speaker 1>If you ran it across a long enough number of decisions,

0:18:39.800 --> 0:18:43.120
<v Speaker 1>you would find that one out of every four wasn't right.

0:18:43.880 --> 0:18:47.520
<v Speaker 1>The system that Alex's team worked on produced results that

0:18:47.560 --> 0:18:50.880
<v Speaker 1>had an error rate of sixteen percent, so much lower.

0:18:51.040 --> 0:18:54.720
<v Speaker 1>And then in just five years, with more improvements to

0:18:54.800 --> 0:18:58.919
<v Speaker 1>this process, the classification error rate had dropped down to

0:18:59.080 --> 0:19:02.760
<v Speaker 1>two point three percent for deep learning systems. So from twenty five percent

0:19:04.160 --> 0:19:09.080
<v Speaker 1>to two point three percent. It was really powerful stuff. Okay,

0:19:09.119 --> 0:19:12.960
<v Speaker 1>so you've got your artificial neural network. You've got your

0:19:13.080 --> 0:19:17.359
<v Speaker 1>layers and layers of nodes. You've adjusted the weights of

0:19:17.400 --> 0:19:20.439
<v Speaker 1>the inputs into each node to see if your system

0:19:20.520 --> 0:19:25.240
<v Speaker 1>can identify, you know, pictures of cats, and you start

0:19:25.320 --> 0:19:29.479
<v Speaker 1>feeding images to this system, lots of them. This is

0:19:29.520 --> 0:19:32.439
<v Speaker 1>the domain that you are feeding to your system. The

0:19:32.480 --> 0:19:34.919
<v Speaker 1>more images you can feed to it, the better. And

0:19:34.960 --> 0:19:37.120
<v Speaker 1>you want a wide variety of images of all sorts

0:19:37.160 --> 0:19:39.879
<v Speaker 1>of stuff, not just of different types of cats, but

0:19:40.000 --> 0:19:43.760
<v Speaker 1>stuff that most certainly is not a cat, like dogs,

0:19:43.880 --> 0:19:48.000
<v Speaker 1>or cars or chartered public accountants, you name it, and

0:19:48.080 --> 0:19:51.400
<v Speaker 1>you look to see which images the system identifies correctly

0:19:51.800 --> 0:19:55.080
<v Speaker 1>and which ones it screws up: both the images it says have

0:19:55.320 --> 0:19:58.400
<v Speaker 1>cats in them that actually don't have cats in them,

0:19:58.840 --> 0:20:01.639
<v Speaker 1>and the images where the system says there is

0:20:01.680 --> 0:20:04.719
<v Speaker 1>no cat here, but there is a cat there. This

0:20:04.800 --> 0:20:08.480
<v Speaker 1>guides you into adjusting the weights again and again, and

0:20:08.560 --> 0:20:10.360
<v Speaker 1>you start over and you do it again, and that's

0:20:10.400 --> 0:20:13.920
<v Speaker 1>your basic deep learning system, and it gets better over

0:20:14.000 --> 0:20:17.560
<v Speaker 1>time as you train it. It learns. Now, let's transition

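NOTE Editor's sketch, not from the episode: the feed, check, and adjust loop described above, shrunk to a single node. The three features, the example data, and the function names predict and train are all hypothetical, picked only to make the idea concrete. A real deep learning system stacks many layers of nodes and learns from pixels, not from hand-picked features.
```python
# Toy "cat or no cat" trainer. Features and data are made up for illustration.
# Each example: (pointy_ears, meows, has_wheels) and a label, 1 = cat, 0 = not a cat.
EXAMPLES = [
    ((1.0, 1.0, 0.0), 1),  # cat
    ((0.0, 1.0, 0.0), 1),  # cat with folded ears
    ((1.0, 0.0, 0.0), 0),  # dog
    ((0.0, 0.0, 1.0), 0),  # car
    ((0.0, 0.0, 0.0), 0),  # chartered public accountant
]
def predict(weights, bias, features):
    # Weighted sum of the inputs into the node, then a yes/no decision.
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0
def train(examples, rounds=20, rate=0.5):
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(rounds):
        false_positives = false_negatives = 0
        for features, label in examples:
            error = label - predict(weights, bias, features)
            if error < 0:
                false_positives += 1  # said "cat" where there is no cat
            elif error > 0:
                false_negatives += 1  # said "no cat" where there is a cat
            # Nudge each weight in the direction that would have reduced the error.
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
        if false_positives == 0 and false_negatives == 0:
            break  # a full pass with no mistakes: stop adjusting
    return weights, bias
```
Calling train(EXAMPLES) returns a set of weights and a bias that predict can then use on each example. The two counters mirror the two kinds of mistakes the episode describes.
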
0:20:17.600 --> 0:20:21.480
<v Speaker 1>over to the adversarial systems I mentioned earlier, because they

0:20:21.520 --> 0:20:25.160
<v Speaker 1>take this and twist it a little bit. So you've

0:20:25.200 --> 0:20:30.040
<v Speaker 1>got two artificial neural networks and they are using this

0:20:30.160 --> 0:20:33.840
<v Speaker 1>general approach to deep learning, and you're setting them up

0:20:34.160 --> 0:20:38.800
<v Speaker 1>so that they feed into each other. One network, the generator,

0:20:39.320 --> 0:20:42.880
<v Speaker 1>has the task to learn how to do something such

0:20:42.920 --> 0:20:47.000
<v Speaker 1>as create an eighteenth century style portrait based off lots

0:20:47.080 --> 0:20:50.400
<v Speaker 1>and lots of examples of the real thing. The domain

0:20:50.560 --> 0:20:55.800
<v Speaker 1>the problem domain. The second network, the discriminator, has a

0:20:55.840 --> 0:20:59.639
<v Speaker 1>different job. It has to tell the difference between authentic

0:20:59.720 --> 0:21:03.840
<v Speaker 1>portraits that came from the problem domain and computer

0:21:04.080 --> 0:21:08.120
<v Speaker 1>generated portraits that came from the generator itself. So essentially

0:21:08.200 --> 0:21:11.600
<v Speaker 1>the discriminator is like the model I mentioned earlier that

0:21:11.680 --> 0:21:14.480
<v Speaker 1>was identifying pictures of cats. It's doing the same sort

0:21:14.480 --> 0:21:17.000
<v Speaker 1>of thing, except instead of saying cat or no cat,

0:21:17.080 --> 0:21:21.359
<v Speaker 1>it's saying real portrait or a computer generated portrait. So

0:21:21.400 --> 0:21:25.120
<v Speaker 1>there are essentially two outcomes the discriminator could reach, and

0:21:25.240 --> 0:21:28.679
<v Speaker 1>that's whether an image was computer generated or it wasn't.

0:21:29.520 --> 0:21:31.720
<v Speaker 1>So do you see where this is going? You train

0:21:31.880 --> 0:21:35.680
<v Speaker 1>up both models. You have the generator attempt to make

0:21:35.720 --> 0:21:39.080
<v Speaker 1>its own version of something such as that eighteenth century portrait.

0:21:39.680 --> 0:21:42.680
<v Speaker 1>It does so. It designs the portrait based on what

0:21:42.760 --> 0:21:46.440
<v Speaker 1>the model believes are the key elements of a portrait,

0:21:47.160 --> 0:21:51.960
<v Speaker 1>So things like colors, shapes, the ratio of size, like

0:21:52.200 --> 0:21:54.480
<v Speaker 1>you know, how large should the head be in relation

0:21:54.520 --> 0:21:58.080
<v Speaker 1>to the body. All of these factors and many more

0:21:58.440 --> 0:22:03.200
<v Speaker 1>come into play. The generator creates its own idea of

0:22:03.240 --> 0:22:06.080
<v Speaker 1>what a portrait is supposed to look like, and chances

0:22:06.080 --> 0:22:09.800
<v Speaker 1>are the early rounds of this will not be terribly convincing.

0:22:10.560 --> 0:22:14.560
<v Speaker 1>The results are then fed to the discriminator, which tries

0:22:14.600 --> 0:22:17.520
<v Speaker 1>to suss out which of the images fed to it

0:22:17.560 --> 0:22:20.879
<v Speaker 1>are computer generated and which ones aren't. After that round,

0:22:21.280 --> 0:22:26.320
<v Speaker 1>both models are tweaked. The generator adjusts input weights to

0:22:26.359 --> 0:22:29.880
<v Speaker 1>get closer to the genuine article, and the discriminator adjusts

0:22:29.960 --> 0:22:34.720
<v Speaker 1>weights to reduce false positives or to catch computer generated images.

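NOTE Editor's sketch, not from the episode: the generator and discriminator rounds described above, reduced to one number each. Here the "real thing" is just the value 4.0, the generator is a single adjustable number, and the discriminator is a one-input logistic scorer. The names train_gan, theta, w, and c are hypothetical. Real generative adversarial networks use deep networks on images, and toy setups like this one can circle around the target instead of settling on it.
```python
import math
def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))
def train_gan(real_value=4.0, rounds=200, rate=0.1):
    theta = 0.0      # generator: the single number it currently produces
    w, c = 0.0, 0.0  # discriminator: score = sigmoid(w * x + c), near 1 means "real"
    history = []
    for _ in range(rounds):
        fake = theta
        d_real = sigmoid(w * real_value + c)
        d_fake = sigmoid(w * fake + c)
        # Discriminator step: score the real sample higher and the fake lower.
        w += rate * ((1.0 - d_real) * real_value - d_fake * fake)
        c += rate * ((1.0 - d_real) - d_fake)
        # Generator step: shift the output toward whatever the discriminator now scores as real.
        d_fake = sigmoid(w * theta + c)
        theta += rate * (1.0 - d_fake) * w
        history.append(theta)
    return theta, (w, c), history
```
Each pass through the loop is one round: the discriminator is tweaked first, then the generator. The history list records the generator's output after every round so the back-and-forth can be inspected.
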
0:22:34.960 --> 0:22:39.280
<v Speaker 1>And then you go again and again and again and again,

0:22:39.560 --> 0:22:43.199
<v Speaker 1>and they both get better over time. So, assuming everything

0:22:43.280 --> 0:22:46.560
<v Speaker 1>is working properly, over time, the adjustment of input weights

0:22:46.600 --> 0:22:50.040
<v Speaker 1>will lead to more convincing results, and given enough time

0:22:50.240 --> 0:22:53.200
<v Speaker 1>and enough repetition, you'll end up with a computer generated

0:22:53.240 --> 0:22:55.639
<v Speaker 1>painting that you can auction off for nearly half a

0:22:55.680 --> 0:22:59.920
<v Speaker 1>million dollars. Though keep in mind that huge price

0:23:00.040 --> 0:23:02.760
<v Speaker 1>dates back to the novelty of it being an early

0:23:02.960 --> 0:23:06.960
<v Speaker 1>AI generated painting. It would be shocking to me if

0:23:07.000 --> 0:23:10.320
<v Speaker 1>we saw that actually become a trend. Also, the painting,

0:23:10.359 --> 0:23:13.800
<v Speaker 1>while interesting, isn't exactly so astounding as to make you

0:23:13.840 --> 0:23:16.840
<v Speaker 1>think there's no way a machine did that. You'd look

0:23:16.880 --> 0:23:19.240
<v Speaker 1>at them and go, yeah, I can imagine a machine

0:23:19.240 --> 0:23:23.400
<v Speaker 1>did that one. A group of computer scientists first described

0:23:23.520 --> 0:23:26.879
<v Speaker 1>the generative adversarial network architecture in a paper in two

0:23:26.920 --> 0:23:30.560
<v Speaker 1>thousand and fourteen, and like other neural networks, these models

0:23:30.600 --> 0:23:33.399
<v Speaker 1>require a lot of data. The more the better. In fact,

0:23:33.480 --> 0:23:36.040
<v Speaker 1>smaller data sets mean the models have to make some

0:23:36.119 --> 0:23:40.960
<v Speaker 1>pretty big assumptions, and you tend to get pretty lousy results.

0:23:41.440 --> 0:23:45.160
<v Speaker 1>More data, as in more examples, teaches the models more

0:23:45.200 --> 0:23:48.320
<v Speaker 1>about the parameters of the domain, whatever it is they

0:23:48.320 --> 0:23:52.080
<v Speaker 1>are trying to generate. It refines the approach. So if

0:23:52.119 --> 0:23:54.600
<v Speaker 1>you have a sophisticated enough pair of models and you

0:23:54.640 --> 0:23:57.280
<v Speaker 1>have enough data to fill up a domain, you can

0:23:57.359 --> 0:24:01.439
<v Speaker 1>generate some convincing material. And that includes video, and

0:24:01.560 --> 0:24:05.080
<v Speaker 1>this brings us around to deep fakes. And in addition

0:24:05.200 --> 0:24:09.679
<v Speaker 1>to generative adversarial networks, a couple of other things really

0:24:10.200 --> 0:24:15.520
<v Speaker 1>converged to create the techniques and trends and technology that

0:24:15.560 --> 0:24:22.160
<v Speaker 1>would allow for deep fakes proper. In Malcolm Slaney, Michelle Covell,

0:24:22.440 --> 0:24:26.360
<v Speaker 1>and Christoph Bregler wrote some software that they called the

0:24:26.440 --> 0:24:30.960
<v Speaker 1>Video Rewrite Program. The software would analyze faces and then

0:24:31.040 --> 0:24:35.920
<v Speaker 1>create or synthesize lip animation which could be matched to

0:24:36.320 --> 0:24:39.800
<v Speaker 1>pre recorded audio. So you could take some film footage

0:24:40.280 --> 0:24:44.040
<v Speaker 1>of a person and then reanimate their lips so that

0:24:44.080 --> 0:24:47.000
<v Speaker 1>they could appear to say all sorts of things, which

0:24:47.000 --> 0:24:50.840
<v Speaker 1>in some ways set the stage for deep fakes. In this case,

0:24:50.920 --> 0:24:53.840
<v Speaker 1>it was really just focusing on the lips and the

0:24:53.880 --> 0:24:57.879
<v Speaker 1>general area around the lips, so you weren't changing the

0:24:57.960 --> 0:25:00.920
<v Speaker 1>rest of the expression of the face, and you would

0:25:00.960 --> 0:25:04.800
<v Speaker 1>have to, you know, keep your recording to be about

0:25:04.880 --> 0:25:07.359
<v Speaker 1>the same length as whatever the film clip was, or

0:25:07.400 --> 0:25:09.719
<v Speaker 1>you would have to loop the film clip over and over,

0:25:09.800 --> 0:25:12.040
<v Speaker 1>which would make it, you know, far more obvious that

0:25:12.160 --> 0:25:16.720
<v Speaker 1>this was a fake. In addition, motion tracking technology was

0:25:16.760 --> 0:25:19.440
<v Speaker 1>advancing over time too, and this also became an important

0:25:19.480 --> 0:25:22.800
<v Speaker 1>tool in computer animation. This tool would also be used

0:25:23.119 --> 0:25:27.479
<v Speaker 1>by deep fake algorithms to create facial expressions, manipulating the

0:25:27.560 --> 0:25:30.479
<v Speaker 1>digital image just as it would if it were a

0:25:30.560 --> 0:25:34.959
<v Speaker 1>video game character or a Pixar animated character. Typically, you

0:25:35.000 --> 0:25:38.159
<v Speaker 1>need to start with some existing video in order to

0:25:38.200 --> 0:25:42.439
<v Speaker 1>manipulate it. You're not actually computer generating the animation, like,

0:25:42.480 --> 0:25:47.439
<v Speaker 1>you're not creating a computer generated version of whomever it

0:25:47.560 --> 0:25:51.320
<v Speaker 1>is you're doing the fake of. You're using existing

0:25:51.760 --> 0:25:54.639
<v Speaker 1>imagery in order to do that and then manipulating that

0:25:54.720 --> 0:25:58.760
<v Speaker 1>existing imagery, so it's a little different from computer animation.

0:25:59.200 --> 0:26:02.000
<v Speaker 1>In two thousand sixteen, students and faculty at the

0:26:02.040 --> 0:26:06.720
<v Speaker 1>Technical University of Munich created the Face2Face project,

0:26:07.000 --> 0:26:10.600
<v Speaker 1>that would be face, the numeral two, and then face,

0:26:11.359 --> 0:26:14.640
<v Speaker 1>and this was particularly jaw dropping to me at the time.

0:26:14.640 --> 0:26:18.320
<v Speaker 1>When I first saw these videos back in ten, I

0:26:18.400 --> 0:26:22.600
<v Speaker 1>was floored. They created a system that had a target actor.

0:26:23.040 --> 0:26:25.600
<v Speaker 1>This would be the video of the person that you

0:26:25.640 --> 0:26:28.520
<v Speaker 1>want to manipulate. In the example they used, it was

0:26:28.760 --> 0:26:33.600
<v Speaker 1>former US President George W. Bush. Their process also had

0:26:33.640 --> 0:26:38.399
<v Speaker 1>a source actor. This was the source of the expressions

0:26:38.440 --> 0:26:41.440
<v Speaker 1>and facial movements you would see in the target, so

0:26:41.960 --> 0:26:45.240
<v Speaker 1>kind of like a digital puppeteer in a way. But

0:26:45.320 --> 0:26:47.280
<v Speaker 1>the way they did it was really cool. They had

0:26:47.280 --> 0:26:51.160
<v Speaker 1>a camera trained on the source actor and it would

0:26:51.280 --> 0:26:54.919
<v Speaker 1>track specific points of movement on the source actor's face,

0:26:55.480 --> 0:26:58.600
<v Speaker 1>and then the system would manipulate the same points of

0:26:58.720 --> 0:27:02.520
<v Speaker 1>movement on the target actor's face in the video. So

0:27:02.720 --> 0:27:07.040
<v Speaker 1>if the source actor smiled, then the target smiled, so

0:27:07.160 --> 0:27:08.920
<v Speaker 1>the source actor would smile, and then you would see

0:27:08.920 --> 0:27:12.240
<v Speaker 1>George W. Bush in the video smile in real time.

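NOTE Editor's sketch, not from the episode: the "same points of movement" idea described above, as plain arithmetic on 2D points. Measure how far each tracked point on the source face has moved from its neutral position, then move the matching point on the target face by the same amount. The landmark names and the function transfer_expression are hypothetical, and this is only the idea as the episode tells it, not the actual Face2Face pipeline.
```python
def transfer_expression(source_neutral, source_now, target_neutral):
    # Each argument maps a landmark name to an (x, y) position.
    moved = {}
    for name, (tx, ty) in target_neutral.items():
        sx0, sy0 = source_neutral[name]
        sx1, sy1 = source_now[name]
        # Apply the source's offset from neutral to the target's neutral point.
        moved[name] = (tx + (sx1 - sx0), ty + (sy1 - sy0))
    return moved
```
For example, if the corners of the source actor's mouth move outward and upward into a smile, the same offsets applied to the target's neutral frame pull the target's mouth corners outward and upward too.
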
0:27:12.560 --> 0:27:17.919
<v Speaker 1>It was really strange. They used this looping video of

0:27:18.240 --> 0:27:21.560
<v Speaker 1>George W. Bush wearing a neutral expression. They had to

0:27:21.640 --> 0:27:26.359
<v Speaker 1>start with that as their sort of zero point,

0:27:26.880 --> 0:27:28.880
<v Speaker 1>and I gotta tell you, it really does look like

0:27:29.520 --> 0:27:31.920
<v Speaker 1>the former president George W. Bush is having a bit

0:27:31.920 --> 0:27:35.600
<v Speaker 1>of a freak out on a looping video because he

0:27:35.720 --> 0:27:40.159
<v Speaker 1>keeps on opening his mouth, closing his mouth, grimacing, raising

0:27:40.160 --> 0:27:43.040
<v Speaker 1>his eyebrows. You need to watch this video. It is

0:27:43.080 --> 0:27:48.080
<v Speaker 1>still available online to check out. In Students and faculty

0:27:48.119 --> 0:27:52.560
<v Speaker 1>over at the University of Washington created the Synthesizing Obama project,

0:27:52.800 --> 0:27:55.639
<v Speaker 1>in which they trained a computer model to generate a

0:27:55.760 --> 0:27:59.920
<v Speaker 1>synthetic video of former US President Barack Obama, and they

0:28:00.119 --> 0:28:03.520
<v Speaker 1>made it lip sync to a pre recorded audio clip

0:28:03.720 --> 0:28:08.320
<v Speaker 1>from one of Obama's addresses to the nation. They actually

0:28:08.400 --> 0:28:12.160
<v Speaker 1>had the original video of that address for comparison, so

0:28:12.359 --> 0:28:15.639
<v Speaker 1>they could look back at that and see how their

0:28:15.680 --> 0:28:19.400
<v Speaker 1>generated one compared to the real thing. And their approach

0:28:19.640 --> 0:28:23.560
<v Speaker 1>used a model that analyzed hundreds of hours of video

0:28:23.600 --> 0:28:28.600
<v Speaker 1>footage of Obama speaking, and it mapped specific mouth shapes

0:28:28.720 --> 0:28:33.400
<v Speaker 1>to specific sounds. It would also include some of Obama's mannerisms,

0:28:33.440 --> 0:28:35.439
<v Speaker 1>such as how he moves his head when he talks

0:28:35.560 --> 0:28:39.240
<v Speaker 1>or uses facial expressions to emphasize words. And watching the

0:28:39.360 --> 0:28:43.200
<v Speaker 1>video and that, you know the real one next to

0:28:43.240 --> 0:28:46.960
<v Speaker 1>the generated one is pretty strange. You can tell the

0:28:47.000 --> 0:28:51.680
<v Speaker 1>generated one isn't quite right. It's not matching the audio exactly,

0:28:51.960 --> 0:28:56.840
<v Speaker 1>at least not on the early versions, but it's fairly close,

0:28:56.920 --> 0:28:59.600
<v Speaker 1>and it might even pass casual inspection for a lot

0:28:59.640 --> 0:29:02.000
<v Speaker 1>of people who weren't, like, you know, actually paying attention.

0:29:02.600 --> 0:29:07.840
<v Speaker 1>Authors Maras and Alexandrou defined deep fakes as, quote, the

0:29:07.880 --> 0:29:13.200
<v Speaker 1>product of artificial intelligence applications that merge, combine, replace, and

0:29:13.320 --> 0:29:17.440
<v Speaker 1>superimpose images and video clips to create fake videos that

0:29:17.480 --> 0:29:22.600
<v Speaker 1>appear authentic, end quote. They first emerged in two thousand seventeen,

0:29:22.880 --> 0:29:26.760
<v Speaker 1>and so this is a pretty darn young application of technology.

0:29:27.400 --> 0:29:30.600
<v Speaker 1>One thing that is worrisome is that once someone has

0:29:30.640 --> 0:29:34.360
<v Speaker 1>access to the tools, it's not that difficult to create

0:29:34.440 --> 0:29:37.479
<v Speaker 1>a deep fake video. You pretty much just need a

0:29:37.520 --> 0:29:41.320
<v Speaker 1>decent computer, the tools, a bit of know how on

0:29:41.400 --> 0:29:44.560
<v Speaker 1>how to do it, and some time. You also need

0:29:44.720 --> 0:29:48.440
<v Speaker 1>some reference material, as in like videos and images of

0:29:48.480 --> 0:29:52.280
<v Speaker 1>the person that you are replicating, and like the machine

0:29:52.360 --> 0:29:55.640
<v Speaker 1>learning systems I've mentioned, the more reference material you have,

0:29:55.920 --> 0:29:59.200
<v Speaker 1>the better. That's why the deep fakes you encounter these

0:29:59.280 --> 0:30:03.280
<v Speaker 1>days tend to be of notable famous people like celebrities

0:30:03.280 --> 0:30:07.240
<v Speaker 1>and politicians. Mainly there's no shortage of reference material for

0:30:07.320 --> 0:30:10.680
<v Speaker 1>those types of individuals, and so they are easier to

0:30:10.720 --> 0:30:14.080
<v Speaker 1>replicate with deep fakes than someone who maintains a much

0:30:14.280 --> 0:30:17.240
<v Speaker 1>lower profile. Not to say that that will always be

0:30:17.320 --> 0:30:19.880
<v Speaker 1>the case, or that there aren't systems out there that

0:30:19.960 --> 0:30:25.400
<v Speaker 1>can accept smaller amounts of reference material. It's just harder

0:30:25.440 --> 0:30:31.920
<v Speaker 1>to make a convincing version with fewer samples. But in

0:30:32.040 --> 0:30:35.480
<v Speaker 1>order to make a convincing fake, the system really has

0:30:35.520 --> 0:30:39.640
<v Speaker 1>to learn how a person moves. All those facial expressions matter.

0:30:39.880 --> 0:30:42.920
<v Speaker 1>It also has to learn how a person sounds. We'll

0:30:42.960 --> 0:30:48.960
<v Speaker 1>get into sound a little bit later. But mannerisms, inflection, accent, emphasis, cadence,

0:30:49.080 --> 0:30:51.640
<v Speaker 1>quirks and tics, all of these things have to be

0:30:51.680 --> 0:30:55.480
<v Speaker 1>analyzed and replicated to make a convincing fake, and it

0:30:55.520 --> 0:30:57.840
<v Speaker 1>has to be done just right, or else it comes

0:30:57.840 --> 0:31:02.680
<v Speaker 1>off as creepy or unrealistic. Think about how impressionists will

0:31:02.720 --> 0:31:06.320
<v Speaker 1>take a celebrity's manner of speech and then heighten some

0:31:06.440 --> 0:31:09.920
<v Speaker 1>of it for comedic effect. You'll hear it all the time

0:31:09.960 --> 0:31:12.959
<v Speaker 1>with folks who do impressions of people like Jack Nicholson

0:31:13.160 --> 0:31:17.080
<v Speaker 1>or Christopher Walken or Barbra Streisand, people who have a

0:31:17.240 --> 0:31:21.760
<v Speaker 1>very particular way of speaking. Impressionists will take those as

0:31:21.880 --> 0:31:25.360
<v Speaker 1>markers and they really punch in on them. Well, a

0:31:25.480 --> 0:31:28.240
<v Speaker 1>deep fake can't really do that too much, or else

0:31:28.280 --> 0:31:30.560
<v Speaker 1>it won't come across as genuine. It'll feel like you're

0:31:30.560 --> 0:31:35.760
<v Speaker 1>watching a famous person impersonating themselves, which is weird. Now,

0:31:35.760 --> 0:31:38.240
<v Speaker 1>the earliest mention of deep fakes I can find dates

0:31:38.280 --> 0:31:41.480
<v Speaker 1>to a two thousand seventeen Reddit forum in which a

0:31:41.600 --> 0:31:45.680
<v Speaker 1>user shared deep faked videos that appeared to show female

0:31:45.680 --> 0:31:50.720
<v Speaker 1>celebrities in sexual situations. Heads and faces had been replaced,

0:31:50.960 --> 0:31:54.720
<v Speaker 1>and the actors in pornographic movies had their heads or

0:31:54.760 --> 0:31:58.840
<v Speaker 1>faces swapped out for these various celebrities. Now the fakes

0:31:59.080 --> 0:32:04.400
<v Speaker 1>can look fairly convincing, extremely convincing in some cases, which

0:32:04.600 --> 0:32:07.479
<v Speaker 1>can lead to some people assuming that the videos are

0:32:07.480 --> 0:32:10.880
<v Speaker 1>genuine and that the folks that they saw in the

0:32:10.960 --> 0:32:13.880
<v Speaker 1>videos are really the ones who are in it. And

0:32:14.040 --> 0:32:17.400
<v Speaker 1>obviously that's a real problem, right? I mean, with this

0:32:17.480 --> 0:32:21.800
<v Speaker 1>technology, given enough reference data to feed a system, someone could

0:32:21.840 --> 0:32:24.760
<v Speaker 1>fabricate a video that appears to put a person in

0:32:24.800 --> 0:32:28.760
<v Speaker 1>a compromising position, whether it's a sexual act or making

0:32:28.840 --> 0:32:32.400
<v Speaker 1>damaging statements or committing a crime or whatever. And there

0:32:32.440 --> 0:32:34.520
<v Speaker 1>are tools right now that allow you to do pretty

0:32:34.600 --> 0:32:37.440
<v Speaker 1>much what the Face2Face tool was doing back

0:32:37.440 --> 0:32:41.040
<v Speaker 1>in two thousand sixteen. A program called Avatarify,

0:32:41.640 --> 0:32:45.720
<v Speaker 1>which is not that easy to say anyway. It can

0:32:45.800 --> 0:32:49.520
<v Speaker 1>run on top of live streaming conference services like Zoom

0:32:49.520 --> 0:32:52.400
<v Speaker 1>and Skype, and you can swap out your face for

0:32:52.480 --> 0:32:57.200
<v Speaker 1>a celebrity's face. Your facial expressions map to the computer

0:32:57.840 --> 0:33:02.200
<v Speaker 1>manipulated celebrity face uh that just looks at you through

0:33:02.240 --> 0:33:06.680
<v Speaker 1>your webcam, and then if you smile, the celebrity image smiles, etcetera.

0:33:06.720 --> 0:33:09.240
<v Speaker 1>It's like that old Face2Face program. It does

0:33:09.360 --> 0:33:13.720
<v Speaker 1>need a pretty beefy PC to manage doing all this

0:33:13.760 --> 0:33:17.240
<v Speaker 1>because you're also running that live streaming service underneath it.

0:33:17.240 --> 0:33:20.320
<v Speaker 1>It's also not exactly user friendly. You need some programming

0:33:20.760 --> 0:33:24.000
<v Speaker 1>experience to really get it to work. But it is

0:33:24.280 --> 0:33:29.080
<v Speaker 1>widely accessible, as the source code is open source

0:33:29.480 --> 0:33:31.880
<v Speaker 1>and it's on GitHub, so anyone can get it.

0:33:32.640 --> 0:33:36.440
<v Speaker 1>Samantha Cole, who writes for Vice, has covered the topic

0:33:36.480 --> 0:33:39.880
<v Speaker 1>of deep fakes pretty extensively and the potential harm they

0:33:39.880 --> 0:33:42.920
<v Speaker 1>can cause, and I recommend you check out her work

0:33:43.080 --> 0:33:46.440
<v Speaker 1>if you're interested in learning more about that. Do be

0:33:46.520 --> 0:33:51.000
<v Speaker 1>warned that Cole covers some pretty adult themed topics, and

0:33:51.040 --> 0:33:53.880
<v Speaker 1>I think she does great work and very important work.

0:33:54.320 --> 0:33:57.000
<v Speaker 1>But as a guy who grew up in the Deep South,

0:33:57.040 --> 0:33:59.200
<v Speaker 1>it's also the kind of stuff that occasionally makes me

0:33:59.240 --> 0:34:01.600
<v Speaker 1>clutch my pearls. But that's more of a statement

0:34:01.600 --> 0:34:06.040
<v Speaker 1>about me than her work. She does great work. I

0:34:06.080 --> 0:34:09.400
<v Speaker 1>think most of us can imagine plenty of scenarios in

0:34:09.400 --> 0:34:12.239
<v Speaker 1>which this sort of technology could cause mischief on a

0:34:12.280 --> 0:34:16.080
<v Speaker 1>good day and catastrophe on a bad day, whether it's

0:34:16.160 --> 0:34:21.640
<v Speaker 1>spreading misinformation, creating fear, uncertainty, and doubt, FUD, or

0:34:21.760 --> 0:34:25.200
<v Speaker 1>by making people seem to say things they never actually said,

0:34:25.360 --> 0:34:28.759
<v Speaker 1>or contributing to an ugly subculture in which people try

0:34:28.800 --> 0:34:32.560
<v Speaker 1>to make their more base fantasies a reality by putting

0:34:32.600 --> 0:34:35.239
<v Speaker 1>one person's head on another person's body. You know, it's

0:34:35.280 --> 0:34:39.040
<v Speaker 1>not great. There are legitimate uses of the technology too,

0:34:39.120 --> 0:34:42.600
<v Speaker 1>of course, you know, tech itself is rarely good or bad.

0:34:42.719 --> 0:34:45.759
<v Speaker 1>It's all in how we use it. But this particular

0:34:45.800 --> 0:34:49.200
<v Speaker 1>technology has a lot of potentially harmful uses, and Samantha

0:34:49.239 --> 0:34:52.040
<v Speaker 1>Cole has done a great job explaining them. When we

0:34:52.120 --> 0:34:54.479
<v Speaker 1>come back, I'll talk a bit more about the war

0:34:54.600 --> 0:34:57.600
<v Speaker 1>against deep fakes and how people are trying to prepare

0:34:57.680 --> 0:35:00.880
<v Speaker 1>for a world that is increasingly filled with media we

0:35:01.080 --> 0:35:05.600
<v Speaker 1>can't really trust. But first, let's take a quick break.

0:35:13.120 --> 0:35:16.880
<v Speaker 1>Before the break, I mentioned Samantha Cole, who has written

0:35:16.920 --> 0:35:19.880
<v Speaker 1>extensively about deep fakes, and one point she makes that

0:35:19.920 --> 0:35:23.359
<v Speaker 1>I think is important for us to note is that

0:35:23.440 --> 0:35:28.320
<v Speaker 1>the vast majority of instances of deep fake videos haven't

0:35:28.480 --> 0:35:33.120
<v Speaker 1>been some manufactured video of a political leader saying inflammatory things.

0:35:33.840 --> 0:35:37.200
<v Speaker 1>That continues to be a big concern. There's a genuine

0:35:37.320 --> 0:35:40.400
<v Speaker 1>fear that someone is going to manufacture a video in

0:35:40.440 --> 0:35:43.920
<v Speaker 1>which a politician appears to say or do something truly

0:35:44.000 --> 0:35:47.480
<v Speaker 1>terrible in an effort to either discredit the politician or

0:35:47.520 --> 0:35:52.319
<v Speaker 1>perhaps instigate a conflict with some other group. There are

0:35:52.400 --> 0:35:56.960
<v Speaker 1>literal doomsday scenarios in which such a video would prompt

0:35:56.960 --> 0:36:01.160
<v Speaker 1>a massive military response, though it does seem like it

0:36:01.239 --> 0:36:04.120
<v Speaker 1>might be a little far fetched. Though heck, I don't know,

0:36:04.239 --> 0:36:06.279
<v Speaker 1>considering the world we live in, maybe it's not that

0:36:06.400 --> 0:36:10.200
<v Speaker 1>big of a stretch anyway. Cole's point is that so far,

0:36:10.640 --> 0:36:14.080
<v Speaker 1>that has not happened. She points out that the most

0:36:14.280 --> 0:36:17.000
<v Speaker 1>frequent use for the tech either tends to be people

0:36:17.120 --> 0:36:20.920
<v Speaker 1>goofing around or, disturbingly, using it to, in her words,

0:36:21.000 --> 0:36:25.160
<v Speaker 1>quote take ownership of women's bodies in non consensual porn

0:36:25.440 --> 0:36:28.920
<v Speaker 1>end quote. Cole argues that the reason we haven't really

0:36:28.920 --> 0:36:32.240
<v Speaker 1>seen deep fakes used much outside of these realms, apart

0:36:32.280 --> 0:36:36.400
<v Speaker 1>from a few advertising campaigns, is that people are pretty

0:36:36.440 --> 0:36:39.719
<v Speaker 1>good at spotting deep fakes. They aren't quite at a

0:36:39.840 --> 0:36:42.759
<v Speaker 1>level where they can easily pass for the real thing.

0:36:43.320 --> 0:36:46.400
<v Speaker 1>There's still something slightly off about them. They tend to

0:36:46.560 --> 0:36:49.880
<v Speaker 1>butt up against the uncanny valley. Now, for those of

0:36:49.880 --> 0:36:53.560
<v Speaker 1>you not familiar with that term, the uncanny valley describes

0:36:53.600 --> 0:36:57.320
<v Speaker 1>the feeling we humans get when we encounter a robot

0:36:57.520 --> 0:37:02.520
<v Speaker 1>or a computer generated figure that closely resembles a human

0:37:02.880 --> 0:37:06.600
<v Speaker 1>or human behavior, but you can still tell it's not

0:37:07.040 --> 0:37:10.400
<v Speaker 1>actually a person, and it's not a good feeling. It

0:37:10.440 --> 0:37:13.960
<v Speaker 1>tends to be described as repulsive and disturbing, or at

0:37:14.160 --> 0:37:18.720
<v Speaker 1>the very best, off putting. See also the animated film

0:37:18.760 --> 0:37:22.960
<v Speaker 1>Polar Express. There's a reason that when that film came out,

0:37:23.120 --> 0:37:27.839
<v Speaker 1>people kind of reacted negatively to the animation, and it's

0:37:27.840 --> 0:37:30.640
<v Speaker 1>also a reason why Pixar tends to prefer to

0:37:30.680 --> 0:37:34.479
<v Speaker 1>go with stylized human characters who are different enough from

0:37:34.600 --> 0:37:38.320
<v Speaker 1>the way real humans look to kind of bypass uncanny valley.

0:37:38.520 --> 0:37:40.880
<v Speaker 1>We just think of that as a cartoon, not something

0:37:40.920 --> 0:37:44.120
<v Speaker 1>that's trying to pass itself off as being human. But

0:37:44.200 --> 0:37:46.800
<v Speaker 1>while there hasn't really been a flood of fake videos

0:37:46.840 --> 0:37:50.319
<v Speaker 1>hitting the Internet with the intent to discredit politicians or

0:37:50.400 --> 0:37:54.280
<v Speaker 1>infuriate specific people or whatever, there remains a general sense

0:37:54.320 --> 0:37:58.040
<v Speaker 1>that this is coming. It's just not here now. The

0:37:58.120 --> 0:38:01.600
<v Speaker 1>sense I get is that people feel it's an inevitability,

0:38:01.680 --> 0:38:04.080
<v Speaker 1>and there are already folks working on tools that will

0:38:04.080 --> 0:38:07.160
<v Speaker 1>help us sort out the real stuff from the fakes.

0:38:07.719 --> 0:38:12.440
<v Speaker 1>Take Microsoft, for example. Their R and D division, fittingly

0:38:12.640 --> 0:38:17.680
<v Speaker 1>called Microsoft Research, developed a tool they call the Video Authenticator.

0:38:18.120 --> 0:38:21.960
<v Speaker 1>This tool analyzes video samples and looks for signs of

0:38:22.320 --> 0:38:25.440
<v Speaker 1>deep fakery. In a blog post written by Tom Burt

0:38:25.520 --> 0:38:30.040
<v Speaker 1>and Eric Horvitz, two Microsoft executives, they say, quote, it

0:38:30.080 --> 0:38:33.600
<v Speaker 1>works by detecting the blending boundary of the deep fake

0:38:33.760 --> 0:38:36.840
<v Speaker 1>and subtle fading or gray scale elements that might not

0:38:36.960 --> 0:38:40.759
<v Speaker 1>be detectable by the human eye. End quote. Now I'm

0:38:40.800 --> 0:38:44.360
<v Speaker 1>no expert, but to me, it sounds like the video

0:38:44.440 --> 0:38:48.600
<v Speaker 1>Authenticator is working in a way that's not too dissimilar

0:38:48.880 --> 0:38:53.719
<v Speaker 1>to a discriminator in a generative adversarial network. I mean,

0:38:54.040 --> 0:38:58.080
<v Speaker 1>the whole purpose of the discriminator is to discriminate or

0:38:58.160 --> 0:39:01.960
<v Speaker 1>to tell the difference between genuine, unaltered videos and

0:39:02.080 --> 0:39:06.440
<v Speaker 1>computer generated ones. So the video authenticator is looking for

0:39:06.520 --> 0:39:10.400
<v Speaker 1>telltale signs that a video was not produced through traditional

0:39:10.480 --> 0:39:14.560
<v Speaker 1>means but was computer generated. However, that's the very thing

0:39:14.840 --> 0:39:18.200
<v Speaker 1>that the generators in GAN systems are looking

0:39:18.239 --> 0:39:21.960
<v Speaker 1>out for. So when a generator receives feedback that a

0:39:22.080 --> 0:39:26.360
<v Speaker 1>video it generated did not slip past the discriminator, it

0:39:26.440 --> 0:39:30.000
<v Speaker 1>then tweaks those input weights and starts to shift its

0:39:30.040 --> 0:39:33.680
<v Speaker 1>approach in order to bypass whatever it was that gave

0:39:33.719 --> 0:39:37.600
<v Speaker 1>away its last attempt, and it does this again and again.

0:39:38.120 --> 0:39:41.880
<v Speaker 1>So the video authenticator might work well for a given

0:39:41.920 --> 0:39:44.759
<v Speaker 1>amount of time, but I would suspect that in the

0:39:44.880 --> 0:39:48.120
<v Speaker 1>long run, the deep fake systems will become sophisticated enough

0:39:48.440 --> 0:39:53.319
<v Speaker 1>to fool the authenticator. Of course, Microsoft will continue to

0:39:53.400 --> 0:39:56.720
<v Speaker 1>tweak the authenticator as well, and it will become something

0:39:56.760 --> 0:40:00.920
<v Speaker 1>of a seesaw battle as one side outperforms the other temporarily,

0:40:01.280 --> 0:40:04.000
<v Speaker 1>and then the balance will shift. Though there may come

0:40:04.000 --> 0:40:06.760
<v Speaker 1>a time where either the deep fakes are too good

0:40:07.120 --> 0:40:10.240
<v Speaker 1>and they don't set off any alarms from the discriminator,

0:40:11.080 --> 0:40:16.040
<v Speaker 1>or the discriminator gets so sensitive that it starts to

0:40:16.080 --> 0:40:19.200
<v Speaker 1>flag real videos and it hits a lot of false

0:40:19.280 --> 0:40:23.680
<v Speaker 1>positives and calls them generated videos instead. Either way, you

0:40:23.760 --> 0:40:26.720
<v Speaker 1>reach a point where a tool like this no longer

0:40:26.760 --> 0:40:29.839
<v Speaker 1>really serves a useful purpose, and the video authenticator will

0:40:29.840 --> 0:40:32.920
<v Speaker 1>be obsolete. Now, this is something we see in artificial

0:40:32.960 --> 0:40:36.080
<v Speaker 1>intelligence all the time. If you remember the good old

0:40:36.120 --> 0:40:39.000
<v Speaker 1>days of CAPTCHA, you know, the proving you're not a

0:40:39.120 --> 0:40:42.480
<v Speaker 1>robot stuff. The stuff we were told to do was

0:40:42.840 --> 0:40:45.960
<v Speaker 1>typically type in a series of letters and numbers, and

0:40:46.000 --> 0:40:48.960
<v Speaker 1>it wasn't that hard when it first started, at least

0:40:49.000 --> 0:40:53.000
<v Speaker 1>not at first. That's because the text recognition algorithms of

0:40:53.040 --> 0:40:58.160
<v Speaker 1>the time weren't very good. They couldn't decipher mildly deformed

0:40:58.280 --> 0:41:01.439
<v Speaker 1>text because the shape of the text fell too far

0:41:01.560 --> 0:41:05.399
<v Speaker 1>outside the parameters of what the system could recognize as

0:41:05.440 --> 0:41:08.480
<v Speaker 1>a legitimate letter or number. You make the number a little,

0:41:09.040 --> 0:41:12.239
<v Speaker 1>you know, deformed, and then suddenly the system's like, well,

0:41:12.239 --> 0:41:14.839
<v Speaker 1>that doesn't look like a three to me because it's

0:41:14.880 --> 0:41:17.400
<v Speaker 1>not in the shape of a three. But over time

0:41:17.560 --> 0:41:22.239
<v Speaker 1>people developed better text recognition programs that could recognize these

0:41:22.239 --> 0:41:25.360
<v Speaker 1>shapes even if they weren't in a standard three orientation,

0:41:25.960 --> 0:41:30.040
<v Speaker 1>and those systems began to defeat those simple early CAPTCHAs.

0:41:30.600 --> 0:41:34.800
<v Speaker 1>That required CAPTCHA designers to make tougher versions, and eventually

0:41:34.840 --> 0:41:37.239
<v Speaker 1>the machines got good enough that they could match or

0:41:37.320 --> 0:41:41.280
<v Speaker 1>even outperform humans. And at that point, those text based

0:41:41.360 --> 0:41:45.240
<v Speaker 1>CAPTCHAs proved to be more challenging for people than for machines,

0:41:45.280 --> 0:41:47.839
<v Speaker 1>which meant if you use them, you defeated the whole

0:41:47.880 --> 0:41:50.959
<v Speaker 1>purpose in the first place. So while this escalation proved

0:41:51.000 --> 0:41:53.800
<v Speaker 1>to be a challenge for security, it was a boon

0:41:54.120 --> 0:41:58.360
<v Speaker 1>for artificial intelligence. And while I focused almost exclusively on

0:41:58.440 --> 0:42:01.320
<v Speaker 1>the imagery of video here, the same sort of stuff

0:42:01.400 --> 0:42:04.880
<v Speaker 1>is going on with generated speech, including generated speech that

0:42:04.960 --> 0:42:09.920
<v Speaker 1>imitates specific voices. Like deep fake videos, this approach works

0:42:09.960 --> 0:42:12.680
<v Speaker 1>best if you have a really big data set of

0:42:12.760 --> 0:42:19.680
<v Speaker 1>recorded audio, so people like movie and TV stars, news reporters, politicians,

0:42:19.760 --> 0:42:24.880
<v Speaker 1>and, um, you know, podcasters are great targets for this stuff.

0:42:25.120 --> 0:42:27.280
<v Speaker 1>There might be hundreds or you know, in my case,

0:42:27.680 --> 0:42:32.440
<v Speaker 1>thousands of hours of recorded material to work from. Training

0:42:32.440 --> 0:42:38.439
<v Speaker 1>a model to use the frequencies, timbre, intonation, pronunciation, pauses,

0:42:38.520 --> 0:42:41.560
<v Speaker 1>and other mannerisms of speech can result in a system

0:42:41.640 --> 0:42:45.160
<v Speaker 1>that can generate vocals that sound like the target, sometimes

0:42:45.160 --> 0:42:49.640
<v Speaker 1>to a fairly convincing degree, and for a while to

0:42:49.640 --> 0:42:52.560
<v Speaker 1>peek behind the curtain here, we at Tech Stuff were

0:42:52.600 --> 0:42:54.520
<v Speaker 1>working with a company that I'm not going to name,

0:42:54.800 --> 0:42:57.399
<v Speaker 1>but they were going to do something like this as

0:42:57.480 --> 0:42:59.960
<v Speaker 1>an experiment. I was gonna do a whole episode on it,

0:43:00.520 --> 0:43:03.680
<v Speaker 1>and I had planned on crafting a segment of that

0:43:03.800 --> 0:43:07.800
<v Speaker 1>episode only through text. I was not going to actually

0:43:07.800 --> 0:43:10.880
<v Speaker 1>record it myself and then use a system that was

0:43:10.960 --> 0:43:16.120
<v Speaker 1>trained on my voice to replicate my voice and deliver

0:43:16.280 --> 0:43:19.520
<v Speaker 1>that segment on its own. I was curious if it

0:43:19.520 --> 0:43:22.479
<v Speaker 1>could nail not just the audio quality of my voice, which,

0:43:22.840 --> 0:43:27.200
<v Speaker 1>let's be honest, is amazing. That's sarcasm; I can't stand

0:43:27.200 --> 0:43:30.600
<v Speaker 1>listening to myself, but it would also have to replicate

0:43:30.640 --> 0:43:34.480
<v Speaker 1>how I actually make certain sounds. Like, would it get

0:43:34.480 --> 0:43:37.160
<v Speaker 1>the bit of the southern accent that's in my voice,

0:43:37.800 --> 0:43:40.960
<v Speaker 1>or the way I emphasize certain words? Would it pause

0:43:41.040 --> 0:43:44.399
<v Speaker 1>for effect at all or would it just robotically say

0:43:44.560 --> 0:43:47.279
<v Speaker 1>one word after the next and only pause when there

0:43:47.360 --> 0:43:50.759
<v Speaker 1>was some helpful punctuation that told it to do so?

0:43:51.280 --> 0:43:54.080
<v Speaker 1>Would it indicate a question by raising the pitch at

0:43:54.080 --> 0:43:58.239
<v Speaker 1>the end of its sentence? Sadly, we never got far

0:43:58.760 --> 0:44:01.640
<v Speaker 1>with that particular project, so I don't have any

0:44:01.680 --> 0:44:03.520
<v Speaker 1>answers for you. I don't know how it would have

0:44:03.600 --> 0:44:06.240
<v Speaker 1>turned out, but clearly one of the things I thought

0:44:06.280 --> 0:44:09.200
<v Speaker 1>of was that it's a bit of a red flag.

0:44:09.239 --> 0:44:11.839
<v Speaker 1>If you can train a computer to sound exactly like

0:44:11.960 --> 0:44:15.240
<v Speaker 1>a specific person, that means you can make that person

0:44:15.760 --> 0:44:19.840
<v Speaker 1>say anything you like, and obviously, like deep fake videos,

0:44:19.880 --> 0:44:22.919
<v Speaker 1>that could have some pretty devastating consequences if it were

0:44:23.000 --> 0:44:27.960
<v Speaker 1>at all, you know, believable or seemed realistic. Now, the

0:44:27.960 --> 0:44:31.000
<v Speaker 1>company we were working with was working hard to make

0:44:31.000 --> 0:44:33.440
<v Speaker 1>sure that the only person to have access to a

0:44:33.480 --> 0:44:36.600
<v Speaker 1>specific voice would be the owner of that voice, or

0:44:37.160 --> 0:44:40.600
<v Speaker 1>presumably the company employing that person. Though that does bring

0:44:40.680 --> 0:44:43.160
<v Speaker 1>up a whole bunch of other potential problems, like can

0:44:43.200 --> 0:44:47.480
<v Speaker 1>you imagine eliminating voice actors from a job because you've

0:44:47.480 --> 0:44:50.000
<v Speaker 1>got enough of their voice and you can just replicate it?

0:44:50.080 --> 0:44:53.160
<v Speaker 1>That wouldn't be great, But even so, it was something

0:44:53.200 --> 0:44:56.480
<v Speaker 1>I felt was both fascinating from a technology standpoint and

0:44:56.520 --> 0:45:01.319
<v Speaker 1>potentially problematic when it comes to an application of that technology.

0:45:01.719 --> 0:45:05.080
<v Speaker 1>One other thing I should mention is that the Internet

0:45:05.200 --> 0:45:08.240
<v Speaker 1>at large has been pretty active in fighting deep fakes,

0:45:08.280 --> 0:45:11.640
<v Speaker 1>not necessarily in detecting them, but in removing the platforms from

0:45:12.040 --> 0:45:14.839
<v Speaker 1>which they were being shared, Reddit being a big one.

0:45:14.960 --> 0:45:17.560
<v Speaker 1>The subreddit that was dedicated to deep fakes had

0:45:17.560 --> 0:45:20.960
<v Speaker 1>been shut down. So there have been some of those

0:45:21.000 --> 0:45:24.160
<v Speaker 1>moves as well. Now this is not directly against the technology,

0:45:24.160 --> 0:45:28.840
<v Speaker 1>it's more against the proliferation of the uh the output

0:45:29.280 --> 0:45:33.040
<v Speaker 1>of that technology. As for detecting deep fakes, it's interesting

0:45:33.080 --> 0:45:36.800
<v Speaker 1>to me that people are even developing tools to detect them,

0:45:36.840 --> 0:45:39.719
<v Speaker 1>because to me, the best tool so far seems to

0:45:39.760 --> 0:45:45.839
<v Speaker 1>be human perception. It's not that the images aren't really convincing,

0:45:46.000 --> 0:45:49.120
<v Speaker 1>or that we can suddenly detect these, you know, blending

0:45:49.239 --> 0:45:53.440
<v Speaker 1>lines like the video Authenticator tool. It's rather that it's

0:45:53.480 --> 0:45:56.160
<v Speaker 1>just not hard for us to spot a deep fake now.

0:45:56.200 --> 0:46:00.040
<v Speaker 1>Stuff just doesn't quite look right in the way that

0:46:00.200 --> 0:46:04.360
<v Speaker 1>people behave in these videos. The vocals and animation often

0:46:04.440 --> 0:46:09.280
<v Speaker 1>don't quite match. The expressions aren't really natural, the progression

0:46:09.320 --> 0:46:14.319
<v Speaker 1>of mannerisms feels synthetic and not genuine. It just it

0:46:14.360 --> 0:46:18.360
<v Speaker 1>looks off. It's that uncanny Valley thing, and so just

0:46:18.440 --> 0:46:21.640
<v Speaker 1>paying attention and thinking critically can really help us suss

0:46:21.640 --> 0:46:24.319
<v Speaker 1>out the fakes from the real thing. Even if we

0:46:24.400 --> 0:46:27.759
<v Speaker 1>reach a point where machines can create a convincing enough

0:46:27.800 --> 0:46:32.000
<v Speaker 1>fake to pass for reality, we can still apply critical thinking,

0:46:32.360 --> 0:46:35.440
<v Speaker 1>and we always should. Heck, we should be applying critical

0:46:35.480 --> 0:46:38.480
<v Speaker 1>thinking even when there's no doubt as to the validity

0:46:38.520 --> 0:46:42.200
<v Speaker 1>of the video, because there may be enough to doubt

0:46:42.280 --> 0:46:45.920
<v Speaker 1>the content of the video itself. If I listen to

0:46:46.000 --> 0:46:50.360
<v Speaker 1>a genuine scam artist in a genuine video, that doesn't

0:46:50.400 --> 0:46:53.799
<v Speaker 1>make the scam more legitimate. We always need to use

0:46:53.840 --> 0:46:57.200
<v Speaker 1>critical thinking. What I think is most important is that

0:46:57.239 --> 0:47:03.560
<v Speaker 1>we acknowledge the very real fact that there are numerous organizations, agencies, governments,

0:47:03.840 --> 0:47:08.160
<v Speaker 1>and other groups that are actively attempting to spread misinformation

0:47:08.400 --> 0:47:14.719
<v Speaker 1>and disinformation. There are entire intelligence agencies dedicated to this endeavor,

0:47:15.160 --> 0:47:18.640
<v Speaker 1>and then there are more independent groups that are doing

0:47:18.680 --> 0:47:22.000
<v Speaker 1>it for one reason or another, typically either to advance

0:47:22.040 --> 0:47:25.839
<v Speaker 1>a particular political agenda or just to make as much

0:47:25.920 --> 0:47:30.560
<v Speaker 1>money as quickly as possible. This is beyond doubt or question.

0:47:30.640 --> 0:47:34.600
<v Speaker 1>There are numerous misinformation campaigns that are actively going on

0:47:34.760 --> 0:47:38.080
<v Speaker 1>out there in the real world right now. Most of

0:47:38.120 --> 0:47:42.279
<v Speaker 1>them are not depending on deep fakes, because one, deep

0:47:42.320 --> 0:47:45.920
<v Speaker 1>fakes aren't really good enough to fool most people right now,

0:47:46.400 --> 0:47:49.600
<v Speaker 1>and two, they don't need the deep fakes in the

0:47:49.640 --> 0:47:52.400
<v Speaker 1>first place. There are other methods that are simpler, that

0:47:52.520 --> 0:47:56.280
<v Speaker 1>don't need nearly the processing power that work just fine.

0:47:56.600 --> 0:47:59.160
<v Speaker 1>Why would you go through the trouble of synthesizing a

0:47:59.280 --> 0:48:01.839
<v Speaker 1>video if you can get a better response with a

0:48:01.840 --> 0:48:05.920
<v Speaker 1>blog post filled with lies or half truths? It's just

0:48:06.000 --> 0:48:09.520
<v Speaker 1>not a great return on investment. So bottom line, be

0:48:09.680 --> 0:48:14.520
<v Speaker 1>vigilant out there, particularly on social media. Be aware that

0:48:14.560 --> 0:48:17.239
<v Speaker 1>there are plenty of people who will not hesitate to

0:48:17.360 --> 0:48:20.719
<v Speaker 1>mislead others in order to get what they want. Use

0:48:20.760 --> 0:48:26.000
<v Speaker 1>a critical eye to evaluate the information you encounter. Ask questions,

0:48:26.440 --> 0:48:31.160
<v Speaker 1>check sources, look for corroborating reports. It's a lot of work,

0:48:31.200 --> 0:48:34.080
<v Speaker 1>but trust me, it's way better that we do our

0:48:34.120 --> 0:48:37.120
<v Speaker 1>best to make sure the stuff we're depending on is

0:48:37.200 --> 0:48:40.759
<v Speaker 1>actually dependable. It'll turn out better for us in the

0:48:40.800 --> 0:48:43.919
<v Speaker 1>long run. Well, that wraps up this episode of Tech Stuff,

0:48:43.960 --> 0:48:47.399
<v Speaker 1>which yeah, I used as a backdoor to argue about

0:48:47.440 --> 0:48:51.000
<v Speaker 1>critical thinking again. Sue me. Don't, don't really sue me.

0:48:51.520 --> 0:48:55.560
<v Speaker 1>But I think that that's another instance where it's a

0:48:55.640 --> 0:48:58.680
<v Speaker 1>really clear example where we have to use that kind

0:48:58.680 --> 0:49:01.000
<v Speaker 1>of stuff. So I'm gonna keep on stressing it.

0:49:01.480 --> 0:49:05.080
<v Speaker 1>And you guys are awesome. I believe in you. I

0:49:05.120 --> 0:49:08.080
<v Speaker 1>think that when we start using these tools at our

0:49:08.080 --> 0:49:12.560
<v Speaker 1>disposal that everybody can develop just with some practice, that

0:49:13.040 --> 0:49:16.120
<v Speaker 1>things will be better. We'll be able to suss out

0:49:16.200 --> 0:49:20.719
<v Speaker 1>the nonsense from the real stuff, and we're all better

0:49:20.760 --> 0:49:22.439
<v Speaker 1>off in the long run if we can do that.

0:49:23.000 --> 0:49:25.680
<v Speaker 1>If you guys have suggestions for future topics I should

0:49:25.719 --> 0:49:28.960
<v Speaker 1>cover in episodes of Tech Stuff, let me know via Twitter.

0:49:29.280 --> 0:49:33.279
<v Speaker 1>The handle is TechStuffHSW, and I'll

0:49:33.280 --> 0:49:41.480
<v Speaker 1>talk to you again really soon. Tech Stuff is an

0:49:41.480 --> 0:49:45.200
<v Speaker 1>I Heart Radio production. For more podcasts from I Heart Radio,

0:49:45.520 --> 0:49:48.680
<v Speaker 1>visit the I Heart Radio app, Apple Podcasts, or wherever

0:49:48.760 --> 0:49:50.280
<v Speaker 1>you listen to your favorite shows.