WEBVTT - How AI Can Make You Look Like a Better Dancer

0:00:04.120 --> 0:00:07.160
<v Speaker 1>Get in touch with technology with TechStuff from

0:00:07.200 --> 0:00:13.720
<v Speaker 1>HowStuffWorks.com. Hey there, and welcome to TechStuff.

0:00:13.760 --> 0:00:16.799
<v Speaker 1>I'm your host, Jonathan Strickland. I'm an executive producer and

0:00:16.840 --> 0:00:20.400
<v Speaker 1>I love all things tech. And here's a fun fact

0:00:20.440 --> 0:00:25.799
<v Speaker 1>about getting older. As you age, stuff that you once

0:00:25.840 --> 0:00:30.160
<v Speaker 1>thought was impossible will become not only possible, but will

0:00:30.200 --> 0:00:34.040
<v Speaker 1>become the norm, and future generations won't even think about

0:00:34.040 --> 0:00:37.400
<v Speaker 1>what it must have been like before the impossible was commonplace.

0:00:37.880 --> 0:00:40.720
<v Speaker 1>Now this is true for every generation. It's not like

0:00:40.760 --> 0:00:45.880
<v Speaker 1>this is, you know, a brand new, groundbreaking observation. Plenty

0:00:45.920 --> 0:00:48.400
<v Speaker 1>of people have made it before me, but I want

0:00:48.400 --> 0:00:53.720
<v Speaker 1>to talk about a specific implementation. For example, with photography,

0:00:54.160 --> 0:00:58.480
<v Speaker 1>it used to be pretty difficult to manipulate pictures convincingly.

0:00:58.840 --> 0:01:03.280
<v Speaker 1>There have been photo editing tools for decades, but generally

0:01:03.320 --> 0:01:07.080
<v Speaker 1>it took a great deal of skill and training, plus

0:01:07.160 --> 0:01:10.440
<v Speaker 1>access to specialized equipment to pull it off, especially in

0:01:10.480 --> 0:01:15.240
<v Speaker 1>the old film days. Now, gradually, tools like Photoshop made

0:01:15.240 --> 0:01:19.160
<v Speaker 1>it easier to manipulate digital images. Now it still requires

0:01:19.200 --> 0:01:21.600
<v Speaker 1>a certain level of skill to pull off a really

0:01:21.680 --> 0:01:26.480
<v Speaker 1>convincing job, and it's easy to make really badly manipulated images.

0:01:26.880 --> 0:01:30.640
<v Speaker 1>But as these tools became widely available, people began to

0:01:30.720 --> 0:01:33.000
<v Speaker 1>learn how to use them. We had to come to

0:01:33.040 --> 0:01:36.840
<v Speaker 1>the realization that we cannot necessarily believe our own eyes

0:01:36.920 --> 0:01:40.479
<v Speaker 1>when we're looking at a digital image. Now, the same

0:01:40.520 --> 0:01:44.920
<v Speaker 1>thing is happening with video footage. It's quite possible to

0:01:45.080 --> 0:01:48.440
<v Speaker 1>fake video footage, though again, if you want to do

0:01:48.480 --> 0:01:52.000
<v Speaker 1>it really well, it requires some skill, some specialized tools,

0:01:52.680 --> 0:01:55.800
<v Speaker 1>and to really get some expertise in it in order

0:01:55.840 --> 0:01:58.120
<v Speaker 1>to do it in a way that's really convincing. But

0:01:58.760 --> 0:02:02.600
<v Speaker 1>it's a pretty recent technological capability. In fiction, however,

0:02:02.720 --> 0:02:05.919
<v Speaker 1>it's been around for a long time. I remember seeing

0:02:06.040 --> 0:02:09.880
<v Speaker 1>the movie Rising Sun back in the early nineteen nineties. Now,

0:02:09.880 --> 0:02:13.799
<v Speaker 1>in that film, Wesley Snipes plays a detective and Sean

0:02:13.880 --> 0:02:17.560
<v Speaker 1>Connery, in his most convincing role since he played an

0:02:17.560 --> 0:02:21.880
<v Speaker 1>Egyptian immortal posing as a Spaniard with a Scottish accent

0:02:21.919 --> 0:02:27.400
<v Speaker 1>in Highlander, would play a Japanese customs and culture expert.

0:02:28.800 --> 0:02:33.880
<v Speaker 1>Sean Connery. Anyway, the film is a mystery thriller, and

0:02:34.160 --> 0:02:38.000
<v Speaker 1>while the two are investigating a murder, they come into

0:02:38.040 --> 0:02:41.200
<v Speaker 1>possession of some video footage, and they find out that

0:02:41.320 --> 0:02:44.600
<v Speaker 1>video footage has actually been manipulated. It was planted for

0:02:44.639 --> 0:02:46.400
<v Speaker 1>them to find so that it would put them on the

0:02:46.440 --> 0:02:51.560
<v Speaker 1>wrong trail. One person's face was replaced with someone else's,

0:02:51.600 --> 0:02:54.920
<v Speaker 1>and in a somewhat comedic scene, the video editor who

0:02:55.000 --> 0:02:59.560
<v Speaker 1>is explaining this casually swaps the heads of Snipes and

0:02:59.600 --> 0:03:02.280
<v Speaker 1>Connery in real time in a video feed to show

0:03:02.280 --> 0:03:06.280
<v Speaker 1>off this capability, which might have been a tad unrealistic,

0:03:06.320 --> 0:03:08.440
<v Speaker 1>but today we are in a world in which video

0:03:08.480 --> 0:03:13.799
<v Speaker 1>manipulation of increasingly convincing quality is achievable in real time.

0:03:14.400 --> 0:03:17.360
<v Speaker 1>In fact, these days, it's possible to use sophisticated computer

0:03:17.400 --> 0:03:21.960
<v Speaker 1>algorithms to allow for manipulation of captured video, almost as

0:03:22.040 --> 0:03:26.040
<v Speaker 1>if the video was a computer generated cartoon reacting to

0:03:26.120 --> 0:03:29.600
<v Speaker 1>real time inputs like a video game controller, only instead

0:03:29.600 --> 0:03:32.360
<v Speaker 1>of it being a video game, it's a real person

0:03:32.480 --> 0:03:36.280
<v Speaker 1>on video. There are, of course, lots of ways this

0:03:36.360 --> 0:03:40.000
<v Speaker 1>technology could be used unethically, and one of the best

0:03:40.040 --> 0:03:44.400
<v Speaker 1>known has been the focus of this whole conversation around

0:03:44.440 --> 0:03:48.240
<v Speaker 1>video manipulation, and it comes from a former Reddit user

0:03:48.560 --> 0:03:52.400
<v Speaker 1>who went by the handle deepfakes. Now that handle

0:03:52.400 --> 0:03:56.920
<v Speaker 1>has become the shorthand for the general practice, which frequently,

0:03:57.240 --> 0:04:01.280
<v Speaker 1>but not exclusively, would involve replacing the face of an

0:04:01.280 --> 0:04:05.440
<v Speaker 1>actor in a pornographic scene with someone else's face like

0:04:05.560 --> 0:04:10.160
<v Speaker 1>that of a celebrity, which is pretty darn unethical and creepy.

0:04:10.240 --> 0:04:14.240
<v Speaker 1>The name itself was a reference to the technology used

0:04:14.280 --> 0:04:17.960
<v Speaker 1>in the approach, so it relies on a process called

0:04:18.279 --> 0:04:22.520
<v Speaker 1>deep learning. Deep learning is a type of machine learning.

0:04:22.520 --> 0:04:26.680
<v Speaker 1>It's a subtype of machine learning that utilizes artificial

0:04:26.720 --> 0:04:29.719
<v Speaker 1>neural networks. And I've talked an awful lot about those

0:04:29.800 --> 0:04:31.960
<v Speaker 1>kind of networks recently, so I'm not going to go

0:04:32.000 --> 0:04:34.480
<v Speaker 1>over the whole thing again. I'll give just a really

0:04:34.560 --> 0:04:39.640
<v Speaker 1>quick rundown to say, you have nodes, artificial neurons in

0:04:39.680 --> 0:04:44.400
<v Speaker 1>these networks that receive input from potentially multiple other nodes,

0:04:45.160 --> 0:04:49.360
<v Speaker 1>and then on that input, your artificial neuron that you're

0:04:49.360 --> 0:04:52.880
<v Speaker 1>looking at will perform some sort of weighted operation and

0:04:52.960 --> 0:04:56.200
<v Speaker 1>produce a single output. That single output can move on

0:04:56.279 --> 0:05:00.000
<v Speaker 1>to become one of many inputs for a different artificial

0:05:00.000 --> 0:05:02.559
<v Speaker 1>neuron in that network, and so on and so forth.
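
NOTE
Editor's sketch, not part of the spoken episode. The rundown above says each artificial neuron takes several inputs, performs "some sort of weighted operation," and produces a single output that can feed other neurons. One common way to picture that is a weighted sum followed by a squashing function. The function name, the bias term, the sigmoid activation, and all the numbers below are assumptions chosen for illustration, not details from the episode.
```python
import math
def neuron_output(inputs, weights, bias=0.0):
    # Weighted sum of every incoming value, plus a bias term (an assumption here).
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Squash the sum into the range 0..1 with a sigmoid, one common activation choice.
    return 1.0 / (1.0 + math.exp(-total))
# Three upstream nodes feed one neuron; its single output then feeds a second neuron.
hidden = neuron_output([0.5, 0.1, 0.9], [0.4, -0.6, 0.2])
final = neuron_output([hidden, 0.3], [1.5, -0.5])
```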

0:05:03.520 --> 0:05:08.400
<v Speaker 1>Deep learning networks are very very large artificial neural networks,

0:05:08.680 --> 0:05:11.640
<v Speaker 1>and they can accept a large amount of training data.

0:05:12.040 --> 0:05:15.520
<v Speaker 1>This is a scalable approach. This means the larger the

0:05:15.600 --> 0:05:18.479
<v Speaker 1>network and the more data you can feed it, the

0:05:18.600 --> 0:05:22.320
<v Speaker 1>better it performs. This is different from many other machine

0:05:22.400 --> 0:05:25.839
<v Speaker 1>learning models. Those tend to hit a performance plateau once

0:05:25.839 --> 0:05:28.640
<v Speaker 1>you hit a certain size, which means that if you

0:05:28.680 --> 0:05:32.279
<v Speaker 1>were to add more nodes to the network, you wouldn't

0:05:32.279 --> 0:05:35.560
<v Speaker 1>necessarily see a comparative increase in performance. You

0:05:35.560 --> 0:05:39.719
<v Speaker 1>would kind of flatten out over time. In 2014, a

0:05:39.800 --> 0:05:43.680
<v Speaker 1>deep learning expert named Andrew Ng gave a talk at

0:05:43.720 --> 0:05:48.159
<v Speaker 1>Stanford about the best use cases for deep learning, and

0:05:48.200 --> 0:05:54.000
<v Speaker 1>he mentioned that it was particularly good at supervised learning tasks. Now,

0:05:54.040 --> 0:05:56.800
<v Speaker 1>these are the types of computer problems in which we

0:05:56.960 --> 0:06:00.680
<v Speaker 1>humans already know the answer, such as is there a

0:06:00.800 --> 0:06:04.479
<v Speaker 1>cat in this photograph? Humans can pick up on that

0:06:04.800 --> 0:06:08.440
<v Speaker 1>right away, assuming someone has not carefully hidden a cat

0:06:08.600 --> 0:06:12.160
<v Speaker 1>in a very busy image. But for a computer, this

0:06:12.240 --> 0:06:14.760
<v Speaker 1>is a much more difficult problem. Even if the picture

0:06:14.800 --> 0:06:18.120
<v Speaker 1>has a cat center stage, it can be tough for

0:06:18.200 --> 0:06:21.880
<v Speaker 1>a computer to figure that out. Using a supervised learning approach

0:06:21.920 --> 0:06:24.960
<v Speaker 1>with a deep learning network, you can train a system

0:06:25.080 --> 0:06:28.440
<v Speaker 1>to recognize cats in images with a high degree of

0:06:28.520 --> 0:06:32.600
<v Speaker 1>success if you have a large amount of training data

0:06:32.880 --> 0:06:36.800
<v Speaker 1>to train the network to recognize cats. Now, in this

0:06:36.920 --> 0:06:39.239
<v Speaker 1>other case that I was talking about, the Reddit user

0:06:39.320 --> 0:06:44.280
<v Speaker 1>called deepfakes started posting on Reddit in late 2017.

0:06:44.720 --> 0:06:48.160
<v Speaker 1>The user made an open source code version of a

0:06:48.200 --> 0:06:51.720
<v Speaker 1>deep learning algorithm and made it available for the purposes

0:06:51.760 --> 0:06:56.200
<v Speaker 1>of video manipulation, and anyone could take advantage of it. Now, specifically,

0:06:56.680 --> 0:07:00.840
<v Speaker 1>this algorithm was designed for face swapping. The algorithm would

0:07:00.839 --> 0:07:03.400
<v Speaker 1>allow you to put the face of one person onto

0:07:03.480 --> 0:07:07.159
<v Speaker 1>the body of another in video form, and it wasn't

0:07:07.200 --> 0:07:11.000
<v Speaker 1>always convincing. In fact, it could often be easily detectable

0:07:11.120 --> 0:07:14.960
<v Speaker 1>as fake if someone had not trained the model properly

0:07:15.080 --> 0:07:18.800
<v Speaker 1>before creating the video. But it did open up a

0:07:18.840 --> 0:07:22.400
<v Speaker 1>can of worms once the practice started getting media coverage. However,

0:07:22.560 --> 0:07:25.760
<v Speaker 1>the actual technology to pull this off was already a

0:07:25.800 --> 0:07:29.360
<v Speaker 1>couple of years old when deepfakes shared it. Back

0:07:29.360 --> 0:07:33.560
<v Speaker 1>in 2016, there was a group of researchers from Stanford,

0:07:33.840 --> 0:07:37.280
<v Speaker 1>and also the University of Erlangen-Nuremberg and the Max

0:07:37.320 --> 0:07:41.360
<v Speaker 1>Planck Institute for Informatics who collectively published a paper that

0:07:41.480 --> 0:07:45.640
<v Speaker 1>was titled Face2Face: Real-time Face Capture and

0:07:45.680 --> 0:07:49.520
<v Speaker 1>Reenactment of RGB Videos, and that's Face, the

0:07:49.680 --> 0:07:54.560
<v Speaker 1>number two, and Face. The paper details the methodology the

0:07:54.600 --> 0:07:58.400
<v Speaker 1>group used to create a pretty incredible effect. The algorithm

0:07:58.600 --> 0:08:02.600
<v Speaker 1>could take the facial expressions from one person and transfer

0:08:02.720 --> 0:08:06.120
<v Speaker 1>them in real time to a video target. It was

0:08:06.160 --> 0:08:10.880
<v Speaker 1>like turning the video into a digital puppet. So you

0:08:10.960 --> 0:08:14.040
<v Speaker 1>might have a video loop of a celebrity running, and

0:08:14.080 --> 0:08:17.200
<v Speaker 1>preferably it's a loop that's easily repeatable without the repeat

0:08:17.200 --> 0:08:22.240
<v Speaker 1>being terribly noticeable, and that's your target video. So if

0:08:22.280 --> 0:08:25.240
<v Speaker 1>you just let it run, you just would see video

0:08:25.280 --> 0:08:27.760
<v Speaker 1>of someone sitting down, maybe looking around a little bit,

0:08:27.760 --> 0:08:31.200
<v Speaker 1>but that's it, nothing special. Then you would have a

0:08:31.240 --> 0:08:36.000
<v Speaker 1>source subject sitting in view of a consumer quality webcam,

0:08:36.120 --> 0:08:41.040
<v Speaker 1>no special equipment here, and that person could make different expressions,

0:08:41.080 --> 0:08:44.640
<v Speaker 1>including opening and closing their mouths, and the video target

0:08:44.679 --> 0:08:49.080
<v Speaker 1>would match them move for move, like a digital puppet. Moreover,

0:08:49.480 --> 0:08:52.040
<v Speaker 1>the source subject didn't have to wear any special gear.

0:08:52.160 --> 0:08:54.160
<v Speaker 1>They didn't have to have any special markers, none of

0:08:54.200 --> 0:08:56.640
<v Speaker 1>those dots that you would see with motion capture. None

0:08:56.679 --> 0:09:00.040
<v Speaker 1>of that was necessary. All the algorithm needed was a

0:09:00.120 --> 0:09:03.719
<v Speaker 1>video feed from a monocular camera, so you didn't even

0:09:03.800 --> 0:09:07.080
<v Speaker 1>need depth perception for this. There's a video of their

0:09:07.080 --> 0:09:10.400
<v Speaker 1>work on YouTube that shows off this process and includes

0:09:10.440 --> 0:09:13.320
<v Speaker 1>a loop of George W. Bush sitting for an interview.

0:09:13.800 --> 0:09:18.280
<v Speaker 1>The source subject can manipulate the face that Bush makes

0:09:18.360 --> 0:09:21.040
<v Speaker 1>just by making faces of his own, and the algorithm

0:09:21.080 --> 0:09:24.240
<v Speaker 1>would map those movements to the target video. And it's

0:09:24.280 --> 0:09:29.160
<v Speaker 1>pretty wild to see a moving image of

0:09:29.440 --> 0:09:32.800
<v Speaker 1>George W. Bush responding in real time to all of

0:09:32.840 --> 0:09:35.760
<v Speaker 1>these different facial expressions this guy is making. So how

0:09:35.760 --> 0:09:37.920
<v Speaker 1>did the team do this? It's one thing to say

0:09:37.960 --> 0:09:40.760
<v Speaker 1>a deep learning algorithm gave them this capability, but that's

0:09:40.800 --> 0:09:46.640
<v Speaker 1>not really an explanation. The paper definitively spells this out

0:09:47.120 --> 0:09:52.360
<v Speaker 1>in real technical detail. It starts off the explanation by saying, quote,

0:09:52.720 --> 0:09:56.120
<v Speaker 1>in our method, we first reconstruct the shape identity of

0:09:56.120 --> 0:09:59.760
<v Speaker 1>the target actor using a new global non-rigid

0:09:59.800 --> 0:10:03.600
<v Speaker 1>model-based bundling approach based on a prerecorded training sequence.

0:10:04.000 --> 0:10:06.800
<v Speaker 1>As this preprocess is performed globally on a set

0:10:06.840 --> 0:10:10.439
<v Speaker 1>of training frames, we can resolve geometric ambiguities common to

0:10:10.520 --> 0:10:14.560
<v Speaker 1>monocular reconstruction. At run time, we track both the expressions of

0:10:14.600 --> 0:10:17.760
<v Speaker 1>the source and target actor's video by a dense analysis

0:10:17.760 --> 0:10:22.320
<v Speaker 1>by synthesis approach based on a statistical facial prior, end quote,

0:10:22.760 --> 0:10:25.559
<v Speaker 1>and it goes on in that vein throughout the paper

0:10:26.000 --> 0:10:28.880
<v Speaker 1>which means it gets pretty dense. But I think we

0:10:28.920 --> 0:10:31.680
<v Speaker 1>can suss out what's going on from a high level

0:10:32.280 --> 0:10:36.160
<v Speaker 1>if we just take a moment. But first, I'm going

0:10:36.240 --> 0:10:39.080
<v Speaker 1>to take a moment of my own to thank my sponsor.

0:10:46.679 --> 0:10:50.559
<v Speaker 1>So how did the Face to Face team build this tool? Well,

0:10:50.640 --> 0:10:54.760
<v Speaker 1>for each target video, they would collect a large sample

0:10:55.200 --> 0:10:58.040
<v Speaker 1>of footage and images and feed it to this deep

0:10:58.160 --> 0:11:01.560
<v Speaker 1>learning algorithm. This would be necessary to identify all

0:11:01.600 --> 0:11:04.599
<v Speaker 1>the points on the face that would move with various expressions,

0:11:04.840 --> 0:11:08.400
<v Speaker 1>as well as to capture images of the inside of

0:11:08.440 --> 0:11:11.720
<v Speaker 1>the target's mouth when he or she spoke. This is

0:11:11.760 --> 0:11:15.080
<v Speaker 1>because the video loop they used to create the manipulated

0:11:15.160 --> 0:11:18.440
<v Speaker 1>video would feature the target subject, typically with his or

0:11:18.440 --> 0:11:21.600
<v Speaker 1>her mouth closed, so it might be a section in

0:11:21.640 --> 0:11:24.000
<v Speaker 1>which the subject was sitting down for an interview and

0:11:24.040 --> 0:11:27.600
<v Speaker 1>listening to an interviewer's questions but not responding yet they

0:11:27.600 --> 0:11:31.600
<v Speaker 1>were just listening. The additional video would provide information about

0:11:31.600 --> 0:11:33.840
<v Speaker 1>the inside of the target subject's mouth, which could be

0:11:33.880 --> 0:11:37.240
<v Speaker 1>rendered in time on the target video performance when it

0:11:37.360 --> 0:11:40.760
<v Speaker 1>came time to do that. Their approach improved the scanning

0:11:40.800 --> 0:11:44.720
<v Speaker 1>technique to build face templates for both the source subject

0:11:45.280 --> 0:11:48.720
<v Speaker 1>who provides all the expressions and the target subject, who

0:11:48.840 --> 0:11:52.679
<v Speaker 1>mimics all the expressions. As the source subject makes different

0:11:52.720 --> 0:11:57.120
<v Speaker 1>facial expressions, the computer face template detects how the subject's

0:11:57.240 --> 0:12:02.880
<v Speaker 1>face changes or deforms over time. The computer model then

0:12:02.960 --> 0:12:06.840
<v Speaker 1>takes that information, saying, all right, well the lips moved

0:12:06.840 --> 0:12:09.520
<v Speaker 1>in this way, there was a grimace here or a

0:12:09.559 --> 0:12:15.200
<v Speaker 1>smile there, and transfers those motions to the target's face template,

0:12:15.600 --> 0:12:19.280
<v Speaker 1>which is matched to the target's actual face. This process

0:12:19.360 --> 0:12:24.760
<v Speaker 1>transfers the expressions over to the target, and so when

0:12:24.800 --> 0:12:28.600
<v Speaker 1>the source subject grimaces, the target grimaces, if the source

0:12:28.640 --> 0:12:31.319
<v Speaker 1>subject just sits still, the target video will continue to

0:12:31.400 --> 0:12:34.960
<v Speaker 1>loop and the target's face won't change. The more video

0:12:34.960 --> 0:12:37.720
<v Speaker 1>footage you can get of your target and your source,

0:12:38.120 --> 0:12:41.880
<v Speaker 1>the better the computer algorithms are that create those face templates,

0:12:41.920 --> 0:12:44.720
<v Speaker 1>and the more natural the manipulation will appear on the

0:12:44.720 --> 0:12:48.280
<v Speaker 1>finished video. You also want a really good amount of

0:12:48.800 --> 0:12:51.680
<v Speaker 1>footage just to get all that extra information you need,

0:12:51.679 --> 0:12:53.120
<v Speaker 1>like the inside of the mouth, so that that can

0:12:53.160 --> 0:12:57.319
<v Speaker 1>all be extrapolated properly. You have to design a tool

0:12:57.400 --> 0:13:00.880
<v Speaker 1>that can encode an image called the training image, and

0:13:00.800 --> 0:13:05.720
<v Speaker 1>then decode this data to reconstruct the image. So imagine

0:13:05.760 --> 0:13:11.360
<v Speaker 1>you've got a picture. The encoder essentially creates data based

0:13:11.400 --> 0:13:14.400
<v Speaker 1>on that image. It's like a description of that image.

0:13:15.080 --> 0:13:19.280
<v Speaker 1>The decoder takes the description and tries to rebuild the

0:13:19.320 --> 0:13:22.520
<v Speaker 1>image based on the description. I think of this like

0:13:22.559 --> 0:13:25.720
<v Speaker 1>that scene in Willy Wonka where Mike TV gets broken

0:13:25.800 --> 0:13:28.479
<v Speaker 1>up in a million little pieces and then gets reconstructed

0:13:28.760 --> 0:13:31.960
<v Speaker 1>on the television screen. So the second image is not

0:13:32.040 --> 0:13:34.199
<v Speaker 1>a copy. It's not like you made a copy of

0:13:34.240 --> 0:13:36.880
<v Speaker 1>the first one. It's like you built a new image

0:13:36.920 --> 0:13:39.400
<v Speaker 1>based on the first one. By the way, when you

0:13:39.440 --> 0:13:45.200
<v Speaker 1>start off with these algorithms, those reconstructions tend

0:13:45.240 --> 0:13:49.120
<v Speaker 1>to look pretty bad. You have to continually train and

0:13:49.160 --> 0:13:51.680
<v Speaker 1>train and train and train the model so that it

0:13:51.720 --> 0:13:57.160
<v Speaker 1>gets better and better at producing a close representation of

0:13:57.200 --> 0:14:01.480
<v Speaker 1>the original image when it does its reconstruction. And

0:14:01.960 --> 0:14:06.319
<v Speaker 1>you would have, essentially, decoders for both your source

0:14:06.440 --> 0:14:10.720
<v Speaker 1>subject and your target subject. You use the same encoder for both,

0:14:11.120 --> 0:14:15.600
<v Speaker 1>but two different decoders, one dedicated to your source, one

0:14:15.679 --> 0:14:20.120
<v Speaker 1>dedicated to your target. Then you would feed the reconstructed

0:14:20.200 --> 0:14:22.720
<v Speaker 1>images through the system again and again. This is called

0:14:23.080 --> 0:14:27.240
<v Speaker 1>backpropagation. You do this millions of times, typically

0:14:27.520 --> 0:14:31.520
<v Speaker 1>to improve this process, and then you're ready to really

0:14:32.080 --> 0:14:34.880
<v Speaker 1>switch things up. So let's say we've got two people.

0:14:34.960 --> 0:14:37.800
<v Speaker 1>We've got person one and we've got person two, and

0:14:37.840 --> 0:14:41.160
<v Speaker 1>you've been feeding images of both of these people through

0:14:41.280 --> 0:14:44.680
<v Speaker 1>the same encoder, but of course you have dedicated decoders

0:14:44.720 --> 0:14:47.760
<v Speaker 1>to produce the reconstruction. So person one has decoder

0:14:47.840 --> 0:14:52.320
<v Speaker 1>one and person two has decoder two. Now let's say

0:14:52.320 --> 0:14:58.160
<v Speaker 1>you're ready to put person two's face on person one's body. Well,

0:14:58.200 --> 0:15:01.840
<v Speaker 1>you would feed an image of person one into the encoder,

0:15:02.200 --> 0:15:05.320
<v Speaker 1>but you use the decoder for person two to

0:15:05.480 --> 0:15:08.680
<v Speaker 1>reconstruct the image, and what you get is person two's

0:15:08.800 --> 0:15:13.480
<v Speaker 1>face but mimicking the expression from person one. You, or

0:15:13.600 --> 0:15:18.840
<v Speaker 1>rather the computer algorithm, does this frame by frame on video,

0:15:18.920 --> 0:15:20.920
<v Speaker 1>and you end up with a video appearing to feature

0:15:20.960 --> 0:15:24.680
<v Speaker 1>one person when in fact it's just their face on

0:15:24.760 --> 0:15:27.680
<v Speaker 1>top of someone else, and it's their face making the

0:15:27.680 --> 0:15:31.720
<v Speaker 1>exact same expressions as whoever was originally in that video.
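
NOTE
Editor's sketch, not part of the spoken episode. The swap described above comes down to routing: every frame goes through the one shared encoder, and the result is handed to the other person's decoder. The sketch below shows only that routing. Plain dictionaries stand in for images, and hand-written functions stand in for trained networks, so nothing here learns anything. All names (encode, make_decoder, swap_frames) are hypothetical.
```python
def encode(face):
    # Shared encoder: keep only what both people have in common, the expression.
    return {"expression": face["expression"]}
def make_decoder(identity):
    # Each decoder only knows how to draw one person's face.
    def decode(code):
        return {"identity": identity, "expression": code["expression"]}
    return decode
decoder_one = make_decoder("person one")
decoder_two = make_decoder("person two")
def swap_frames(frames, decoder):
    # Run every frame through the shared encoder, then the chosen person's decoder.
    return [decoder(encode(frame)) for frame in frames]
# Two frames of person one, rebuilt frame by frame with person two's decoder.
video_of_person_one = [{"identity": "person one", "expression": e} for e in ("smile", "grimace")]
fake_video = swap_frames(video_of_person_one, decoder_two)
```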

0:15:32.680 --> 0:15:36.280
<v Speaker 1>Now back over to deepfakes. Not long after the

0:15:36.320 --> 0:15:40.840
<v Speaker 1>Reddit user initially posted this code, folks over at Reddit

0:15:40.840 --> 0:15:43.600
<v Speaker 1>were taking this open source code and making more advanced

0:15:43.680 --> 0:15:46.960
<v Speaker 1>software based off of it. Soon there were desktop apps

0:15:46.960 --> 0:15:50.200
<v Speaker 1>that would take over all the hard parts of this process,

0:15:50.440 --> 0:15:54.080
<v Speaker 1>all the codey bits, if you will, of training a model.

0:15:54.480 --> 0:15:57.160
<v Speaker 1>Some of them would guide users into creating the data

0:15:57.280 --> 0:15:59.800
<v Speaker 1>that would be used to train the model and go

0:16:00.040 --> 0:16:02.360
<v Speaker 1>all the way through the process of creating the final

0:16:02.440 --> 0:16:05.960
<v Speaker 1>fake videos. Even with some of the more sophisticated versions,

0:16:06.000 --> 0:16:09.720
<v Speaker 1>there were telltale signs of tampering. Typically some blurring

0:16:09.760 --> 0:16:14.400
<v Speaker 1>around images, particularly near chins and mouths. Those would be signs.

0:16:14.960 --> 0:16:17.360
<v Speaker 1>If there was any flicker, that was a sign if

0:16:17.360 --> 0:16:21.720
<v Speaker 1>you didn't take enough time to train the model. Typically

0:16:21.720 --> 0:16:24.640
<v Speaker 1>you would want to do several days of training at least.

0:16:24.960 --> 0:16:26.840
<v Speaker 1>If you didn't take that time, you might see some

0:16:26.960 --> 0:16:29.840
<v Speaker 1>really nasty blurring and flickering, and it would be a

0:16:29.840 --> 0:16:35.480
<v Speaker 1>dead giveaway that this was tampered video. In 2018, writer, director,

0:16:35.520 --> 0:16:39.520
<v Speaker 1>and comedian Jordan Peele demonstrated the power of this technology.

0:16:39.600 --> 0:16:43.040
<v Speaker 1>He showed how, with his impersonation of Barack Obama and

0:16:43.200 --> 0:16:47.840
<v Speaker 1>some manipulation software, he could create a fake public service

0:16:47.840 --> 0:16:51.680
<v Speaker 1>address in which the president would appear to say things

0:16:52.080 --> 0:16:55.640
<v Speaker 1>that he normally would never say. The technology behind this

0:16:55.800 --> 0:16:58.640
<v Speaker 1>made use of what is called a long short term

0:16:58.680 --> 0:17:01.480
<v Speaker 1>memory network, or LSTM. To go into the

0:17:01.520 --> 0:17:05.199
<v Speaker 1>mechanics of that would require another podcast, but using an

0:17:05.200 --> 0:17:08.520
<v Speaker 1>approach similar to what I've already described, a team was

0:17:08.560 --> 0:17:12.240
<v Speaker 1>able to make a video of Obama apparently lip syncing

0:17:12.480 --> 0:17:16.159
<v Speaker 1>Peele's satirical message. The goal of this PSA

0:17:16.480 --> 0:17:20.080
<v Speaker 1>was: be on alert, because fakes are getting harder to spot.

0:17:20.600 --> 0:17:25.200
<v Speaker 1>The University of Washington showed off this in their Synthesizing

0:17:25.240 --> 0:17:28.880
<v Speaker 1>Obama project in which they took the audio from one

0:17:29.000 --> 0:17:32.600
<v Speaker 1>of President Obama's speeches and then used it to animate

0:17:32.760 --> 0:17:36.600
<v Speaker 1>his face in video from a different address that he

0:17:36.680 --> 0:17:40.240
<v Speaker 1>gave during his presidency. So in this example, the person

0:17:40.280 --> 0:17:43.680
<v Speaker 1>in the target video is the same person as the

0:17:43.800 --> 0:17:47.399
<v Speaker 1>source for the audio. But the point was pretty clear

0:17:47.600 --> 0:17:51.399
<v Speaker 1>that tech would soon make it possible to fake someone

0:17:51.440 --> 0:17:54.960
<v Speaker 1>saying or doing something. It just takes the right algorithms,

0:17:55.280 --> 0:17:58.320
<v Speaker 1>the right amount of training data, and the right amount

0:17:58.359 --> 0:18:00.720
<v Speaker 1>of time to get the model trained up enough to

0:18:00.760 --> 0:18:04.960
<v Speaker 1>do it smoothly. Now, this technology could be used to

0:18:05.080 --> 0:18:09.119
<v Speaker 1>do stuff that isn't related to malicious deception or for

0:18:09.359 --> 0:18:12.399
<v Speaker 1>pornography or anything along those lines. It could be used

0:18:12.760 --> 0:18:16.119
<v Speaker 1>in television and film for lots of stuff, including potentially

0:18:16.160 --> 0:18:20.159
<v Speaker 1>adding in actors who have passed away into a film.

0:18:20.200 --> 0:18:23.320
<v Speaker 1>Paired with similar work that's going on in voice synthesis,

0:18:23.320 --> 0:18:26.560
<v Speaker 1>you could end up with a convincing replacement, which means

0:18:26.960 --> 0:18:30.800
<v Speaker 1>we could make movies with dead actors taking on new

0:18:30.920 --> 0:18:35.000
<v Speaker 1>parts because we can synthesize their speech, we can synthesize

0:18:35.040 --> 0:18:38.000
<v Speaker 1>their appearance. You would still have someone else acting out

0:18:38.080 --> 0:18:41.680
<v Speaker 1>the part physically, but you would replace their image with

0:18:42.040 --> 0:18:46.920
<v Speaker 1>this actor's image. Or maybe you would want to use

0:18:46.960 --> 0:18:49.080
<v Speaker 1>this kind of technology just to make everyone think you

0:18:49.080 --> 0:18:51.880
<v Speaker 1>can cut a rug. This brings me to the University

0:18:51.880 --> 0:18:55.080
<v Speaker 1>of California, Berkeley, and the subject of a paper

0:18:55.119 --> 0:18:59.480
<v Speaker 1>titled Everybody Dance Now. The goal is a simple concept

0:18:59.560 --> 0:19:02.280
<v Speaker 1>that's actually really hard to pull off. What if you

0:19:02.320 --> 0:19:05.719
<v Speaker 1>were to take the movements of a professional dancer and

0:19:05.760 --> 0:19:09.160
<v Speaker 1>then map those movements onto the body of someone who

0:19:09.320 --> 0:19:12.399
<v Speaker 1>wasn't a dancer? What if you could create a video

0:19:12.720 --> 0:19:17.040
<v Speaker 1>in which literally anyone would appear to move like a skilled,

0:19:17.440 --> 0:19:21.280
<v Speaker 1>trained dancer? And how the heck would that be possible? Well,

0:19:21.320 --> 0:19:23.919
<v Speaker 1>at the heart of the team's efforts was something I

0:19:23.960 --> 0:19:27.120
<v Speaker 1>talked about in a recent episode of TechStuff about

0:19:27.200 --> 0:19:31.919
<v Speaker 1>an AI generated portrait, and that would be generative adversarial

0:19:32.040 --> 0:19:35.479
<v Speaker 1>networks, or GANs. These use a pair

0:19:35.600 --> 0:19:40.199
<v Speaker 1>of artificial neural networks in competition against each other. So

0:19:40.240 --> 0:19:42.600
<v Speaker 1>since I covered this recently, I'll just give again a

0:19:42.640 --> 0:19:46.240
<v Speaker 1>super quick high level summary. You've got one network that

0:19:46.320 --> 0:19:49.399
<v Speaker 1>has a specific job, such as trying to create an

0:19:49.400 --> 0:19:51.760
<v Speaker 1>original image of a cat. We'll go back to the

0:19:51.800 --> 0:19:54.560
<v Speaker 1>cat pictures. That's one of my favorite ones because it

0:19:54.600 --> 0:19:58.600
<v Speaker 1>was one of the early use cases of neural networks

0:19:58.600 --> 0:20:01.399
<v Speaker 1>that I remember encountering when I was doing research. Now,

0:20:01.480 --> 0:20:04.479
<v Speaker 1>let's say you've got your second network. Your second network

0:20:04.480 --> 0:20:08.159
<v Speaker 1>has the specific job of evaluating pictures of cats to

0:20:08.280 --> 0:20:12.280
<v Speaker 1>determine if they are valid, meaning is this a real

0:20:12.320 --> 0:20:16.000
<v Speaker 1>picture of a cat that's part of the training material

0:20:16.320 --> 0:20:20.000
<v Speaker 1>that I'm accepting, or is this, in fact a fake

0:20:20.400 --> 0:20:25.040
<v Speaker 1>that was created by a computer program, the other neural network?

0:20:25.400 --> 0:20:27.920
<v Speaker 1>So you've got one network trying to fool the other network.
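
NOTE
Editor's sketch, not from the episode: one common way to score the two competing networks
is with log-loss terms. The function names and the choice of this particular loss are
assumptions for illustration; the episode does not say which loss any given GAN uses.
```python
import math
#
def discriminator_loss(d_real: float, d_fake: float) -> float:
    # d_real / d_fake: the detector's estimated probability (strictly between 0 and 1)
    # that a real image / a generated image is real. Lower loss means a better detector.
    return -(math.log(d_real) + math.log(1.0 - d_fake))
#
def generator_loss(d_fake: float) -> float:
    # The counterfeiter does well when the detector rates its fake as likely real.
    return -math.log(d_fake)
```
A fake that the detector rates at 0.9 gives the generator a lower loss than one rated at 0.1,
which is the "try again and adjust" pressure the two networks put on each other.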

0:20:28.200 --> 0:20:31.560
<v Speaker 1>And these networks get better at what they do over time,

0:20:31.880 --> 0:20:38.159
<v Speaker 1>they improve. So your counterfeit network is getting better and

0:20:38.200 --> 0:20:41.959
<v Speaker 1>better at making fake pictures of cats, and your detector

0:20:42.040 --> 0:20:45.960
<v Speaker 1>network is getting better and better at detecting fake images

0:20:46.000 --> 0:20:49.440
<v Speaker 1>of cats. Now, typically this requires humans to give feedback

0:20:49.560 --> 0:20:52.879
<v Speaker 1>or tweaking weight values along the networks, but they do

0:20:52.920 --> 0:20:56.679
<v Speaker 1>get better over time. So if the network trying to

0:20:56.680 --> 0:20:59.359
<v Speaker 1>create a picture of a cat gets the feedback of sorry, buddy,

0:20:59.400 --> 0:21:01.960
<v Speaker 1>but they're onto you, then it can try again and

0:21:02.040 --> 0:21:04.480
<v Speaker 1>adjust its approach slightly in an effort to fool the

0:21:04.480 --> 0:21:07.880
<v Speaker 1>second network. If the second network gets the feedback, you

0:21:07.960 --> 0:21:10.320
<v Speaker 1>let this one slip by and it's fake, then it

0:21:10.359 --> 0:21:13.040
<v Speaker 1>will adjust or it will be adjusted to look out

0:21:13.040 --> 0:21:16.080
<v Speaker 1>for any telltale signs that it had missed in that

0:21:16.200 --> 0:21:20.360
<v Speaker 1>earlier evaluation. Over time, the two networks working against each

0:21:20.359 --> 0:21:24.480
<v Speaker 1>other will create the ultimate result of better and better

0:21:24.600 --> 0:21:28.679
<v Speaker 1>computer generated content, whether it's an image of a cat

0:21:29.440 --> 0:21:34.760
<v Speaker 1>or a sonnet, or a song or a video. Now

0:21:34.880 --> 0:21:39.680
<v Speaker 1>that doesn't mean that these computer generated things are at

0:21:39.680 --> 0:21:43.399
<v Speaker 1>the same level as human generated stuff, especially when it

0:21:43.440 --> 0:21:46.479
<v Speaker 1>comes to text. I've seen a lot of song lyrics

0:21:46.520 --> 0:21:51.000
<v Speaker 1>that were inscrutable even by my old man standards. So

0:21:51.280 --> 0:21:54.520
<v Speaker 1>I think that we're a long way away from getting

0:21:54.600 --> 0:21:57.520
<v Speaker 1>to a point where they can fool us in every case.

0:21:57.600 --> 0:22:01.040
<v Speaker 1>But with video they're getting pretty darn good. Now, this

0:22:01.119 --> 0:22:04.719
<v Speaker 1>team had two groups of subjects, and so you had

0:22:04.760 --> 0:22:08.520
<v Speaker 1>your source subjects and your target subjects. The source in

0:22:08.560 --> 0:22:11.399
<v Speaker 1>this case, were the people who could dance, so like

0:22:11.520 --> 0:22:14.720
<v Speaker 1>ballet dancers, hip hop dancers and that sort of stuff.

0:22:14.760 --> 0:22:19.000
<v Speaker 1>They legit know how to move. They would demonstrate various

0:22:19.080 --> 0:22:22.400
<v Speaker 1>dances on video. The second group of subjects were your

0:22:22.440 --> 0:22:27.439
<v Speaker 1>target subjects. These were not trained dancers. They were to

0:22:27.600 --> 0:22:31.560
<v Speaker 1>go through a series of moves and poses, essentially aping

0:22:31.720 --> 0:22:35.600
<v Speaker 1>as best they could the movements of trained dancers, and

0:22:35.680 --> 0:22:40.080
<v Speaker 1>the goal of this pair of networks was to smooth

0:22:40.119 --> 0:22:43.160
<v Speaker 1>the movements out and adjust the timing so that these

0:22:43.280 --> 0:22:46.600
<v Speaker 1>untrained dancers would appear to move more like their groovy

0:22:46.720 --> 0:22:50.800
<v Speaker 1>source subject counterparts. I'll explain more in just a moment,

0:22:50.800 --> 0:22:53.840
<v Speaker 1>but first let's take another quick break to thank our sponsor.

0:23:01.359 --> 0:23:05.280
<v Speaker 1>According to the Everybody Dance Now paper, the team would

0:23:05.280 --> 0:23:09.040
<v Speaker 1>transfer motion between the sources to the target through an

0:23:09.240 --> 0:23:14.399
<v Speaker 1>end to end pixel based pipeline. So here's how that's done.

0:23:14.480 --> 0:23:18.919
<v Speaker 1>Because if you're like me, that phrase meant next to

0:23:19.000 --> 0:23:22.760
<v Speaker 1>nothing to you. So specifically, the group used three stages

0:23:22.800 --> 0:23:26.280
<v Speaker 1>to take the movements of one person and transpose them

0:23:26.320 --> 0:23:31.200
<v Speaker 1>to a target person. Those three were pose detection, global

0:23:31.280 --> 0:23:35.800
<v Speaker 1>pose normalization, and mapping from normalized pose stick figures to

0:23:35.840 --> 0:23:41.040
<v Speaker 1>the target subject. Pose detection involves teaching machines, in other words,

0:23:41.040 --> 0:23:45.159
<v Speaker 1>computers how to interpret images to determine where key body

0:23:45.200 --> 0:23:50.480
<v Speaker 1>points are, like elbows, knees, hips, shoulders, the head, that

0:23:50.560 --> 0:23:53.679
<v Speaker 1>kind of stuff. That first requires that you teach the

0:23:53.720 --> 0:23:57.760
<v Speaker 1>machine to recognize those points in the first place. So

0:23:58.080 --> 0:24:00.160
<v Speaker 1>first you have to train a machine to recognize

0:24:00.240 --> 0:24:04.360
<v Speaker 1>those points and identify them with a target level of accuracy.

0:24:04.520 --> 0:24:08.000
<v Speaker 1>It's pretty typical to represent these joints as points

0:24:08.040 --> 0:24:11.320
<v Speaker 1>in a stick figure, so each point represents another joint

0:24:11.400 --> 0:24:15.159
<v Speaker 1>or point of articulation. The lines represent the trunk of

0:24:15.200 --> 0:24:18.199
<v Speaker 1>the body, the limbs, the head. You end up with

0:24:18.240 --> 0:24:21.159
<v Speaker 1>a stick figure. If your machine learning mechanism was a

0:24:21.160 --> 0:24:24.199
<v Speaker 1>good one, the machine should be able to overlay a

0:24:24.280 --> 0:24:27.600
<v Speaker 1>stick figure on top of any image of a person posing,

0:24:28.040 --> 0:24:30.439
<v Speaker 1>and the stick figure should more or less conform to

0:24:30.560 --> 0:24:34.280
<v Speaker 1>that image, including where the actual joints are. So if

0:24:34.280 --> 0:24:36.800
<v Speaker 1>you have someone standing there in the classic Peter Pan

0:24:36.880 --> 0:24:40.239
<v Speaker 1>pose of their fists on their hips, uh, and

0:24:40.280 --> 0:24:43.480
<v Speaker 1>their arms akimbo, then it should draw

0:24:43.520 --> 0:24:46.320
<v Speaker 1>a stick figure that's essentially aping the same thing and

0:24:46.359 --> 0:24:48.760
<v Speaker 1>be able to overlay it on top of the original image.

0:24:49.000 --> 0:24:52.160
<v Speaker 1>Now these days this can be done in real time. So,

0:24:52.240 --> 0:24:54.520
<v Speaker 1>for example, there's a team at Google Creative Lab that

0:24:54.600 --> 0:24:57.639
<v Speaker 1>used a machine learning model called PoseNet and created

0:24:57.680 --> 0:25:01.240
<v Speaker 1>a JavaScript version with TensorFlow, which is an open source

0:25:01.320 --> 0:25:04.760
<v Speaker 1>software library often used for machine learning. And with this

0:25:04.840 --> 0:25:08.080
<v Speaker 1>tool you can do real time pose estimation through a

0:25:08.119 --> 0:25:12.399
<v Speaker 1>browser and a webcam. The application doesn't have any technology

0:25:12.480 --> 0:25:15.040
<v Speaker 1>related to identifying the person in the image. It's just

0:25:15.119 --> 0:25:17.600
<v Speaker 1>quote unquote interested in what the person is doing, not

0:25:17.720 --> 0:25:19.879
<v Speaker 1>who the person is. So you can actually run this

0:25:20.000 --> 0:25:22.880
<v Speaker 1>on your own machine in a browser, and you can

0:25:22.880 --> 0:25:24.800
<v Speaker 1>pose in front of a webcam and you'll see the

0:25:24.840 --> 0:25:29.679
<v Speaker 1>little stick figure uh painted on top of your image

0:25:29.680 --> 0:25:32.400
<v Speaker 1>on the computer. Essentially, so every time you move, every

0:25:32.400 --> 0:25:34.480
<v Speaker 1>time you bend a joint, you will see the stick

0:25:34.520 --> 0:25:37.760
<v Speaker 1>figure doing the same thing, um mapped on top of you.

0:25:38.240 --> 0:25:42.200
<v Speaker 1>The Berkeley team made use of a pre-trained pose detector,

0:25:42.359 --> 0:25:45.040
<v Speaker 1>meaning they didn't build a new one, which helps save

0:25:45.080 --> 0:25:47.960
<v Speaker 1>a lot of time and expense on their project. Now

0:25:47.960 --> 0:25:51.639
<v Speaker 1>people come in all shapes and sizes. In the video

0:25:51.720 --> 0:25:54.600
<v Speaker 1>the team released, they showed off subjects who included a

0:25:54.640 --> 0:25:56.960
<v Speaker 1>woman who appeared to be of around average height and

0:25:57.000 --> 0:26:00.720
<v Speaker 1>a man who appeared to be pretty darn tall. A transfer

0:26:00.800 --> 0:26:03.639
<v Speaker 1>method that would only work between a subject and a

0:26:03.680 --> 0:26:07.280
<v Speaker 1>target who are of similar shape and size would be

0:26:07.280 --> 0:26:11.359
<v Speaker 1>pretty limited. So the purpose of the global pose normalization

0:26:11.440 --> 0:26:14.639
<v Speaker 1>stage is to account for all the differences between the

0:26:14.800 --> 0:26:18.480
<v Speaker 1>source and the target subjects and the locations within the

0:26:18.480 --> 0:26:22.520
<v Speaker 1>frame of the camera. Without this step, the motion transfer

0:26:22.640 --> 0:26:27.800
<v Speaker 1>might appear ghoulish. We don't have all the same proportions, right,

0:26:27.840 --> 0:26:31.240
<v Speaker 1>so a mismatch might mean a target's limbs would appear

0:26:31.280 --> 0:26:34.560
<v Speaker 1>to bend in places that were clearly not natural joints.

0:26:35.080 --> 0:26:36.919
<v Speaker 1>All you need to do is see an arm bend

0:26:36.960 --> 0:26:38.919
<v Speaker 1>where an arm isn't supposed to bend, and that's going

0:26:38.960 --> 0:26:41.560
<v Speaker 1>to skeeve you out quite a bit. Makes an effective

0:26:41.600 --> 0:26:44.920
<v Speaker 1>horror movie experience, but not one that would produce convincing

0:26:45.000 --> 0:26:47.760
<v Speaker 1>motion transfer. Now, there are a lot of ways that

0:26:47.760 --> 0:26:50.679
<v Speaker 1>the team could have gone about normalizing the poses, but

0:26:50.760 --> 0:26:54.320
<v Speaker 1>their choice seems particularly clever to me. They measured the

0:26:54.400 --> 0:26:58.600
<v Speaker 1>heights and ankle positions of the various subjects and used

0:26:58.720 --> 0:27:03.040
<v Speaker 1>linear mapping between the closest and farthest ankle positions in

0:27:03.080 --> 0:27:06.800
<v Speaker 1>both videos to normalize the stick figure for the target subjects.

0:27:07.440 --> 0:27:10.760
<v Speaker 1>The program would calculate the scale of the figure as

0:27:10.800 --> 0:27:13.720
<v Speaker 1>well as the scale of motion from frame to frame.
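
NOTE
Editor's sketch, not the paper's exact formula: one way to read "linear mapping between the
closest and farthest ankle positions" is to interpolate both position and scale from where
the source's ankles sit in the frame. SubjectRange, normalize_pose, and every field name
below are hypothetical names chosen for illustration.
```python
from dataclasses import dataclass
#
@dataclass
class SubjectRange:
    ankle_near: float   # ankle y (pixels) when the subject is closest to the camera
    ankle_far: float    # ankle y (pixels) when the subject is farthest from the camera
    height_near: float  # body height (pixels) at the closest position
    height_far: float   # body height (pixels) at the farthest position
#
def normalize_pose(points, ankle, source: SubjectRange, target: SubjectRange):
    # points: list of (x, y) keypoints for one source frame; ankle: (x, y) of the source ankles.
    ankle_x, ankle_y = ankle
    # How far the source is from its far position toward its near position (0 = far, 1 = near).
    t = (ankle_y - source.ankle_far) / (source.ankle_near - source.ankle_far)
    # Put the ankles at the matching spot in the target's own range.
    new_ankle_y = target.ankle_far + t * (target.ankle_near - target.ankle_far)
    # Interpolate the target-to-source height ratio between the far and near positions.
    scale_far = target.height_far / source.height_far
    scale_near = target.height_near / source.height_near
    scale = scale_far + t * (scale_near - scale_far)
    # Scale every keypoint about the ankles, then move the figure to the new ankle line.
    return [(ankle_x + (x - ankle_x) * scale, new_ankle_y + (y - ankle_y) * scale) for x, y in points]
```
With a target half the source's height, a head point 200 pixels above the ankles lands
100 pixels above them, so the gesture stays proportionate to the smaller body.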

0:27:14.080 --> 0:27:16.240
<v Speaker 1>And I think that's pretty darn cool because it wasn't

0:27:16.280 --> 0:27:19.640
<v Speaker 1>just accounting for the size of the subjects to get

0:27:19.640 --> 0:27:21.920
<v Speaker 1>all the joints right, but also to make sure the

0:27:21.960 --> 0:27:25.240
<v Speaker 1>scale of the movements with respect to the body size

0:27:25.280 --> 0:27:29.320
<v Speaker 1>and proportions would remain the same. So a tall person

0:27:29.400 --> 0:27:34.520
<v Speaker 1>with really long limbs moving their arms in really big, big,

0:27:34.560 --> 0:27:39.399
<v Speaker 1>bold gestures, if you tried to transfer that motion to

0:27:39.480 --> 0:27:42.960
<v Speaker 1>someone who was of smaller stature, it could really look disturbing.

0:27:43.440 --> 0:27:47.359
<v Speaker 1>But by using this scaling approach, the movements on the

0:27:47.480 --> 0:27:52.600
<v Speaker 1>smaller person would be proportionate in size to the movements

0:27:52.680 --> 0:27:56.240
<v Speaker 1>of the larger person. The team would use two of

0:27:56.359 --> 0:28:00.439
<v Speaker 1>the generative adversarial network setups to work on making a

0:28:00.440 --> 0:28:04.040
<v Speaker 1>convincing final video. The first was dedicated to image to

0:28:04.200 --> 0:28:07.840
<v Speaker 1>image translation, attempting to manipulate the image of the target

0:28:07.880 --> 0:28:10.760
<v Speaker 1>subjects that would follow the motions made from the pose

0:28:10.880 --> 0:28:14.919
<v Speaker 1>detection process, and like all GAN setups, this

0:28:15.000 --> 0:28:18.240
<v Speaker 1>included the generator, which would attempt to create a convincing

0:28:18.320 --> 0:28:21.879
<v Speaker 1>sequence of images, and the discriminators, which tried to weed

0:28:21.880 --> 0:28:25.440
<v Speaker 1>out the quote unquote fake sequences from the generator from

0:28:25.480 --> 0:28:28.080
<v Speaker 1>the ground truth data that was being fed to it.

0:28:28.840 --> 0:28:32.639
<v Speaker 1>The second GAN setup was specifically dedicated

0:28:33.000 --> 0:28:36.359
<v Speaker 1>to add detail and realism to the faces of the

0:28:36.400 --> 0:28:39.920
<v Speaker 1>target subjects. In some frames this appears to have worked

0:28:39.920 --> 0:28:42.600
<v Speaker 1>pretty well, and in others there's a bit of an uncanny

0:28:42.720 --> 0:28:45.840
<v Speaker 1>valley thing or maybe even horror movie type element going on,

0:28:46.480 --> 0:28:49.720
<v Speaker 1>similar to how some of the AI generated portraits that

0:28:49.800 --> 0:28:52.000
<v Speaker 1>I talked about in the previous episode introduced a bit

0:28:52.040 --> 0:28:56.200
<v Speaker 1>of unrealistic qualities to the various images. When shooting video

0:28:56.280 --> 0:29:00.080
<v Speaker 1>of the target subjects, the team captured images at one

0:29:00.160 --> 0:29:03.360
<v Speaker 1>hundred twenty frames per second to get enough data for

0:29:03.440 --> 0:29:07.360
<v Speaker 1>each subject. The sessions lasted for about twenty minutes. They

0:29:07.440 --> 0:29:10.560
<v Speaker 1>used smartphone cameras to do it, since many smartphones allow

0:29:10.600 --> 0:29:12.840
<v Speaker 1>you to shoot video at this kind of frame rate

0:29:12.920 --> 0:29:15.840
<v Speaker 1>these days. They had their target subjects wear close-fitting

0:29:15.840 --> 0:29:19.920
<v Speaker 1>clothing that wasn't prone to wrinkling because the pose recognition

0:29:19.920 --> 0:29:23.240
<v Speaker 1>tool they were using wasn't designed to encode information about clothing.

0:29:24.040 --> 0:29:27.040
<v Speaker 1>As for the source videos, the ones that would actually

0:29:27.080 --> 0:29:29.960
<v Speaker 1>create the motions that would be transferred to the targets,

0:29:30.240 --> 0:29:32.880
<v Speaker 1>the team didn't have to worry about capturing images at

0:29:32.960 --> 0:29:35.239
<v Speaker 1>such a high frame rate. They could use videos of

0:29:35.280 --> 0:29:39.479
<v Speaker 1>just reasonable quality, meaning decent resolution and frame rate, and

0:29:39.520 --> 0:29:42.400
<v Speaker 1>their pose detection tool would do its work and create

0:29:42.400 --> 0:29:45.080
<v Speaker 1>the stick figure that would serve as the guide for

0:29:45.160 --> 0:29:48.920
<v Speaker 1>the target motions later on. Because of that, the team

0:29:49.040 --> 0:29:52.920
<v Speaker 1>can really use any online video of sufficient quality to

0:29:52.960 --> 0:29:55.960
<v Speaker 1>act as the source information for motion transfer. It doesn't

0:29:55.960 --> 0:29:58.880
<v Speaker 1>have to be a video shot specifically for that purpose.

0:29:59.400 --> 0:30:02.760
<v Speaker 1>In fact, one of the example videos the team used

0:30:02.760 --> 0:30:06.120
<v Speaker 1>in their demonstration was from a Bruno Mars music video

0:30:06.240 --> 0:30:09.440
<v Speaker 1>for That's What I Like. Before applying the motion transfer,

0:30:09.720 --> 0:30:12.920
<v Speaker 1>the team smoothed pose key points to reduce jitter in

0:30:12.920 --> 0:30:16.960
<v Speaker 1>the final output, and then the team applied the motion transfer.
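
NOTE
Editor's sketch, not from the episode: the episode says the key points were smoothed to
reduce jitter but not how. A centered moving average over neighboring frames is one simple
option; the filter choice, the function name, and the window size are all assumptions.
```python
def smooth_keypoints(frames, window=5):
    # frames: list of per-frame keypoint lists; each keypoint is an (x, y) tuple and
    # every frame lists the same keypoints in the same order.
    half = window // 2
    smoothed = []
    for i in range(len(frames)):
        # Average over the frames around frame i, clipped at the start and end of the clip.
        neighbors = frames[max(0, i - half):min(len(frames), i + half + 1)]
        n = len(neighbors)
        smoothed.append([
            (sum(f[k][0] for f in neighbors) / n, sum(f[k][1] for f in neighbors) / n)
            for k in range(len(frames[i]))
        ])
    return smoothed
```
A larger window removes more jitter but also blurs fast, sharp moves, so the size is a trade-off.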

0:30:17.360 --> 0:30:21.640
<v Speaker 1>The stick figure motions were then transferred to the target

0:30:21.760 --> 0:30:25.800
<v Speaker 1>subjects and the result is pretty interesting. It is not seamless.

0:30:26.240 --> 0:30:29.080
<v Speaker 1>You can definitely tell something odd is going on, but

0:30:29.160 --> 0:30:31.800
<v Speaker 1>it is an indication of where things are going and

0:30:31.880 --> 0:30:36.080
<v Speaker 1>using adversarial networks could lead to more convincing motion transfers

0:30:36.120 --> 0:30:40.440
<v Speaker 1>in the future. Now, this could lead to all sorts

0:30:40.440 --> 0:30:43.960
<v Speaker 1>of stuff, nefarious and otherwise. You could imagine using it

0:30:44.000 --> 0:30:47.520
<v Speaker 1>to transform an average actor into a martial arts master,

0:30:48.480 --> 0:30:51.920
<v Speaker 1>or it might allow directors more freedom of casting, knowing

0:30:51.960 --> 0:30:55.920
<v Speaker 1>that if the actors they choose don't possess certain physical skills,

0:30:56.280 --> 0:30:59.480
<v Speaker 1>they can use this kind of technology to fake it,

0:30:59.520 --> 0:31:02.240
<v Speaker 1>but it could also be used to fake footage to make

0:31:02.520 --> 0:31:06.200
<v Speaker 1>it look like people, like specific people, are doing stuff

0:31:06.240 --> 0:31:09.160
<v Speaker 1>that they are not doing. It could be used to

0:31:09.200 --> 0:31:13.440
<v Speaker 1>spread misinformation and it likely will be, which means we'll

0:31:13.480 --> 0:31:15.760
<v Speaker 1>need to be on the lookout for signs of fakes,

0:31:15.800 --> 0:31:18.720
<v Speaker 1>which are going to get harder and harder to detect

0:31:18.880 --> 0:31:22.720
<v Speaker 1>as time goes on. And hey, you guys remember DARPA,

0:31:22.840 --> 0:31:25.240
<v Speaker 1>right? Because I just did a whole series of episodes

0:31:25.240 --> 0:31:29.280
<v Speaker 1>about them. Well, that agency has funded programs dedicated to

0:31:29.480 --> 0:31:34.240
<v Speaker 1>automating various forensic tools, including tools that could be used

0:31:34.240 --> 0:31:39.480
<v Speaker 1>to detect AI created forgeries in video and audio. Often

0:31:39.560 --> 0:31:43.040
<v Speaker 1>the secret is in the eyes. Most of these neural

0:31:43.080 --> 0:31:47.600
<v Speaker 1>networks are trained on still images, so you send thousands

0:31:47.680 --> 0:31:49.880
<v Speaker 1>or tens of thousands of images if you have them,

0:31:50.080 --> 0:31:54.000
<v Speaker 1>of your various subjects, your target, and your source. But

0:31:54.120 --> 0:31:58.320
<v Speaker 1>most published still images don't show people with their eyes closed.

0:31:59.440 --> 0:32:02.760
<v Speaker 1>So eye movements and blinking tend to be

0:32:02.800 --> 0:32:05.200
<v Speaker 1>a little wonky in these fake videos. You might watch

0:32:05.200 --> 0:32:08.120
<v Speaker 1>one for a while and think, huh, that's weird. This

0:32:08.160 --> 0:32:11.800
<v Speaker 1>guy hasn't blinked for like ten minutes, or when they

0:32:11.840 --> 0:32:15.080
<v Speaker 1>blink it looks really strange. Well, that's an indication that

0:32:15.160 --> 0:32:17.680
<v Speaker 1>it's a fake video. There are other ones as well,

0:32:17.720 --> 0:32:23.200
<v Speaker 1>but DARPA is understandably keeping those quiet because not you know,

0:32:23.240 --> 0:32:27.280
<v Speaker 1>if they publish how they figure out AI created

0:32:27.400 --> 0:32:31.600
<v Speaker 1>videos are in fact faked, then that gives the fakers

0:32:31.720 --> 0:32:35.440
<v Speaker 1>enough information to go back and improve their models. So

0:32:35.520 --> 0:32:38.760
<v Speaker 1>we're likely to see something akin to what happened with CAPTCHA.

0:32:39.240 --> 0:32:44.360
<v Speaker 1>Specialists will develop new tools to detect AI generated media.

0:32:45.200 --> 0:32:49.080
<v Speaker 1>AI developers will then create more sophisticated models, and so

0:32:49.160 --> 0:32:51.760
<v Speaker 1>it becomes kind of an arms race, a seesaw, and

0:32:51.840 --> 0:32:54.680
<v Speaker 1>one benefit is that AI as a whole will improve,

0:32:55.840 --> 0:32:57.920
<v Speaker 1>but we may not be able to believe it when

0:32:58.000 --> 0:33:02.200
<v Speaker 1>we see it. Well, that wraps up this episode on a fascinating,

0:33:02.800 --> 0:33:07.640
<v Speaker 1>somewhat disturbing topic, and uh, I'm sure we're gonna hear

0:33:07.760 --> 0:33:10.040
<v Speaker 1>a lot more about this in the years to come.

0:33:10.120 --> 0:33:14.680
<v Speaker 1>We've seen a lot of sites banning deepfakes

0:33:14.680 --> 0:33:20.080
<v Speaker 1>outright because of the misinformation that they can spread. So

0:33:20.160 --> 0:33:24.760
<v Speaker 1>we're already seeing a reaction to this in various online communities,

0:33:25.200 --> 0:33:28.840
<v Speaker 1>so that's very interesting to me. But we're definitely gonna

0:33:28.920 --> 0:33:32.520
<v Speaker 1>keep seeing this continue. It's a valid area

0:33:32.560 --> 0:33:36.360
<v Speaker 1>of AI research, so we will have to wait and

0:33:36.400 --> 0:33:38.680
<v Speaker 1>see how it all plays out. If you guys have

0:33:38.680 --> 0:33:41.440
<v Speaker 1>any suggestions for future episodes of tech Stuff, why not

0:33:41.520 --> 0:33:43.760
<v Speaker 1>send me a message. You can go over to our

0:33:43.800 --> 0:33:47.800
<v Speaker 1>website that is tech Stuff podcast dot com. You'll find

0:33:47.800 --> 0:33:50.560
<v Speaker 1>all the different ways to contact me. I look forward

0:33:50.560 --> 0:33:52.640
<v Speaker 1>to hearing from you. Make sure you check out our

0:33:52.680 --> 0:33:56.080
<v Speaker 1>store over at TeePublic dot com slash tech Stuff

0:33:56.360 --> 0:33:59.760
<v Speaker 1>buy some merchandise. You can make sure that you get

0:34:00.000 --> 0:34:03.160
<v Speaker 1>all the really cool T-shirts like Prove to Me

0:34:03.200 --> 0:34:05.720
<v Speaker 1>You're not a Robot. That one's pretty appropriate for this

0:34:05.760 --> 0:34:08.960
<v Speaker 1>particular episode. And remember every single purchase you make goes

0:34:09.000 --> 0:34:11.520
<v Speaker 1>to help the show, so we greatly appreciate it. Also,

0:34:11.920 --> 0:34:14.399
<v Speaker 1>if you haven't heard, we have been nominated for an

0:34:14.400 --> 0:34:18.120
<v Speaker 1>iHeartRadio Podcast Award. It's the first year

0:34:18.200 --> 0:34:20.760
<v Speaker 1>iHeartRadio is giving out podcast awards. We are nominated

0:34:20.760 --> 0:34:24.839
<v Speaker 1>in the Science and Technology category. You can go online

0:34:24.880 --> 0:34:27.880
<v Speaker 1>and visit the iHeartRadio Podcast Awards page and

0:34:27.960 --> 0:34:32.040
<v Speaker 1>vote up to five times a day for your favorite podcasts.

0:34:32.360 --> 0:34:34.560
<v Speaker 1>If you wanted to. You could dedicate all five of

0:34:34.600 --> 0:34:38.279
<v Speaker 1>those votes every single day to us. I would not

0:34:38.360 --> 0:34:40.560
<v Speaker 1>complain if you did that. It would be really cool

0:34:40.600 --> 0:34:43.359
<v Speaker 1>to win that award, but make sure you check it out.

0:34:43.440 --> 0:34:45.440
<v Speaker 1>There may be lots of shows there that you truly

0:34:45.440 --> 0:34:47.080
<v Speaker 1>love and you want to throw your support behind them.

0:34:47.080 --> 0:34:49.120
<v Speaker 1>That would be really cool with you, And I'll talk

0:34:49.160 --> 0:34:57.719
<v Speaker 1>to you again really soon. For more on this and

0:34:57.800 --> 0:35:00.000
<v Speaker 1>thousands of other topics, visit how stuff works

0:35:00.080 --> 0:35:00.520
<v Speaker 1>dot com.