WEBVTT - How AI Can Make You Look Like a Better Dancer 0:00:04.120 --> 0:00:07.160 Get in touch with technology with tech Stuff from how 0:00:07.200 --> 0:00:13.720 stuff works dot com. Hey there, and welcome to tech Stuff. 0:00:13.760 --> 0:00:16.799 I'm your host, Jonathan Strickland. I'm an executive producer and 0:00:16.840 --> 0:00:20.400 I love all things tech. And here's a fun fact 0:00:20.440 --> 0:00:25.799 about getting older. As you age, stuff that you once 0:00:25.840 --> 0:00:30.160 thought was impossible will become not only possible, but will 0:00:30.200 --> 0:00:34.040 become the norm, and future generations won't even think about 0:00:34.040 --> 0:00:37.400 what it must have been like before the impossible was commonplace. 0:00:37.880 --> 0:00:40.720 Now this is true for every generation. It's not like 0:00:40.760 --> 0:00:45.880 this is, you know, a brand new, groundbreaking observation. Plenty 0:00:45.920 --> 0:00:48.400 of people have made it before me, but I want 0:00:48.400 --> 0:00:53.720 to talk about a specific implementation. For example, with photography, 0:00:54.160 --> 0:00:58.480 it used to be pretty difficult to manipulate pictures convincingly. 0:00:58.840 --> 0:01:03.280 There have been photo and tools for decades, but generally 0:01:03.320 --> 0:01:07.080 it took a great deal of skill and training, plus 0:01:07.160 --> 0:01:10.440 access to specialized equipment to pull it off, especially in 0:01:10.480 --> 0:01:15.240 the old film days. Now, gradually, tools like Photoshop made 0:01:15.240 --> 0:01:19.160 it easier to manipulate digital images. Now it still requires 0:01:19.200 --> 0:01:21.600 a certain level of skill to pull off a really 0:01:21.680 --> 0:01:26.480 convincing job, and it's easy to make really badly manipulated images. 0:01:26.880 --> 0:01:30.640 But as these tools became widely available, people began to 0:01:30.720 --> 0:01:33.000 learn how to use them. 
We had to come to 0:01:33.040 --> 0:01:36.840 the realization that we cannot necessarily believe our own eyes 0:01:36.920 --> 0:01:40.479 when we're looking at a digital image. Now, the same 0:01:40.520 --> 0:01:44.920 thing is happening with video footage. It's quite possible to 0:01:45.080 --> 0:01:48.440 fake video footage, though again, if you want to do 0:01:48.480 --> 0:01:52.000 it really well, it requires some skill, some specialized tools, 0:01:52.680 --> 0:01:55.800 and you have to really get some expertise in it in order 0:01:55.840 --> 0:01:58.120 to do it in a way that's really convincing. But 0:01:58.760 --> 0:02:02.600 it's a pretty recent technological capability. In fiction, however, 0:02:02.720 --> 0:02:05.919 it's been around for a long time. I remember seeing 0:02:06.040 --> 0:02:09.880 the movie Rising Sun back in the early nineteen nineties. Now, 0:02:09.880 --> 0:02:13.799 in that film, Wesley Snipes plays a detective and Sean 0:02:13.880 --> 0:02:17.560 Connery, in his most convincing role since he played an 0:02:17.560 --> 0:02:21.880 Egyptian immortal posing as a Spaniard with a Scottish accent 0:02:21.919 --> 0:02:27.400 in Highlander, would play a Japanese customs and culture expert. 0:02:28.800 --> 0:02:33.880 Sean Connery. Anyway, the film is a mystery thriller, and 0:02:34.160 --> 0:02:38.000 while the two are investigating a murder, they come into 0:02:38.040 --> 0:02:41.200 possession of some video footage, and they find out that 0:02:41.320 --> 0:02:44.600 video footage has actually been manipulated. It was planted for 0:02:44.639 --> 0:02:46.400 them to find so that it would put them on the 0:02:46.440 --> 0:02:51.560 wrong trail. 
One person's face was replaced with someone else's, 0:02:51.600 --> 0:02:54.920 and in a somewhat comedic scene, the video editor who 0:02:55.000 --> 0:02:59.560 is explaining this casually swaps the heads of Snipes and 0:02:59.600 --> 0:03:02.280 Connery in real time in a video feed to show 0:03:02.280 --> 0:03:06.280 off this capability, which might have been a tad unrealistic, 0:03:06.320 --> 0:03:08.440 but today we are in a world in which video 0:03:08.480 --> 0:03:13.799 manipulation of increasingly convincing quality is achievable in real time. 0:03:14.400 --> 0:03:17.360 In fact, these days, it's possible to use sophisticated computer 0:03:17.400 --> 0:03:21.960 algorithms to allow for manipulation of captured video, almost as 0:03:22.040 --> 0:03:26.040 if the video was a computer generated cartoon reacting to 0:03:26.120 --> 0:03:29.600 real time inputs like a video game controller, only instead 0:03:29.600 --> 0:03:32.360 of it being a video game, it's a real person 0:03:32.480 --> 0:03:36.280 on video. There are, of course, lots of ways this 0:03:36.360 --> 0:03:40.000 technology could be used unethically, and one of the best 0:03:40.040 --> 0:03:44.400 known has been the focus of this whole conversation around 0:03:44.440 --> 0:03:48.240 video manipulation, and it comes from a former Reddit user 0:03:48.560 --> 0:03:52.400 who went by the handle deepfakes. Now that handle 0:03:52.400 --> 0:03:56.920 has become the shorthand for the general practice, which frequently, 0:03:57.240 --> 0:04:01.280 but not exclusively, would involve replacing the face of an 0:04:01.280 --> 0:04:05.440 actor in a pornographic scene with someone else's face, like 0:04:05.560 --> 0:04:10.160 that of a celebrity, which is pretty darn unethical and creepy. 0:04:10.240 --> 0:04:14.240 The name itself was a reference to the technology used 0:04:14.280 --> 0:04:17.960 in the approach, so it relies on a process called 0:04:18.279 --> 0:04:22.520 deep learning. 
Deep learning is a type of machine learning. 0:04:22.520 --> 0:04:26.680 It's a subtype of machine learning that utilizes artificial 0:04:26.720 --> 0:04:29.719 neural networks. And I've talked an awful lot about those 0:04:29.800 --> 0:04:31.960 kinds of networks recently, so I'm not going to go 0:04:32.000 --> 0:04:34.480 over the whole thing again. I'll give just a really 0:04:34.560 --> 0:04:39.640 quick rundown to say, you have nodes, artificial neurons in 0:04:39.680 --> 0:04:44.400 these networks that receive input from potentially multiple other nodes, 0:04:45.160 --> 0:04:49.360 and then on that input, your artificial neuron that you're 0:04:49.360 --> 0:04:52.880 looking at will perform some sort of weighted operation and 0:04:52.960 --> 0:04:56.200 produce a single output. That single output can move on 0:04:56.279 --> 0:05:00.000 to become one of many inputs for a different artificial 0:05:00.000 --> 0:05:02.559 neuron in that network, and so on and so forth. 0:05:03.520 --> 0:05:08.400 Deep learning networks are very very large artificial neural networks, 0:05:08.680 --> 0:05:11.640 and they can accept a large amount of training data. 0:05:12.040 --> 0:05:15.520 This is a scalable approach. This means the larger the 0:05:15.600 --> 0:05:18.479 network and the more data you can feed it, the 0:05:18.600 --> 0:05:22.320 better it performs. This is different from many other machine 0:05:22.400 --> 0:05:25.839 learning models. Those tend to hit a performance plateau once 0:05:25.839 --> 0:05:28.640 you hit a certain size, which means that if you 0:05:28.680 --> 0:05:32.279 were to add more nodes to the network, you wouldn't 0:05:32.279 --> 0:05:35.560 necessarily see a comparative increase in performance. You 0:05:35.560 --> 0:05:39.719 would kind of flatten out over time. 
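To make the quick rundown of artificial neurons above a little more concrete, here is a minimal Python sketch. It is an illustration only: the sigmoid activation, the weight values, and the names `neuron` and `tiny_network` are assumptions chosen for the example, not anything described in the episode.

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of its inputs, squashed to a single output."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation (an assumed choice)

def tiny_network(inputs):
    """Two neurons read the raw inputs; their outputs become inputs to a third neuron."""
    h1 = neuron(inputs, [0.5, -0.25], 0.0)   # hypothetical weights
    h2 = neuron(inputs, [-0.75, 1.0], 0.1)   # hypothetical weights
    return neuron([h1, h2], [1.0, -1.0], 0.0)

result = tiny_network([1.0, 2.0])
```

A deep learning network follows the same pattern, only with many layers and far more neurons, and with the weights learned from training data rather than typed in by hand.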
In twenty fourteen, a 0:05:39.800 --> 0:05:43.680 deep learning expert named Andrew Ng gave a talk at 0:05:43.720 --> 0:05:48.159 Stanford about the best use cases for deep learning, and 0:05:48.200 --> 0:05:54.000 he mentioned that it was particularly good at supervised learning tasks. Now, 0:05:54.040 --> 0:05:56.800 these are the types of computer problems in which we 0:05:56.960 --> 0:06:00.680 humans already know the answer, such as, is there a 0:06:00.800 --> 0:06:04.479 cat in this photograph? Humans can pick up on that 0:06:04.800 --> 0:06:08.440 right away, assuming someone has not carefully hidden a cat 0:06:08.600 --> 0:06:12.160 in a very busy image. But for a computer, this 0:06:12.240 --> 0:06:14.760 is a much more difficult problem. Even if the picture 0:06:14.800 --> 0:06:18.120 has a cat center stage, it can be tough for 0:06:18.200 --> 0:06:21.880 a computer to figure that out. Using a supervised learning approach 0:06:21.920 --> 0:06:24.960 with a deep learning network, you can train a system 0:06:25.080 --> 0:06:28.440 to recognize cats in images with a high degree of 0:06:28.520 --> 0:06:32.600 success if you have a large amount of training data 0:06:32.880 --> 0:06:36.800 to train the network to recognize cats. Now, in this 0:06:36.920 --> 0:06:39.239 other case that I was talking about, the Reddit user 0:06:39.320 --> 0:06:44.280 called deepfakes started posting on Reddit in late twenty seventeen. 0:06:44.720 --> 0:06:48.160 The user made an open source code version of a 0:06:48.200 --> 0:06:51.720 deep learning algorithm and made it available for the purposes 0:06:51.760 --> 0:06:56.200 of video manipulation, and anyone could take advantage of it. Now, specifically, 0:06:56.680 --> 0:07:00.840 this algorithm was designed for face swapping. 
The algorithm would 0:07:00.839 --> 0:07:03.400 allow you to put the face of one person onto 0:07:03.480 --> 0:07:07.159 the body of another in video form, and it wasn't 0:07:07.200 --> 0:07:11.000 always convincing. In fact, it could often be easily detectable 0:07:11.120 --> 0:07:14.960 as fake if someone had not trained the model properly 0:07:15.080 --> 0:07:18.800 before creating the video. But it did open up a 0:07:18.840 --> 0:07:22.400 can of worms once the practice started getting media coverage. However, 0:07:22.560 --> 0:07:25.760 the actual technology to pull this off was already a 0:07:25.800 --> 0:07:29.360 couple of years old when deepfakes shared it. Back 0:07:29.360 --> 0:07:33.560 then, there was a group of researchers from Stanford, 0:07:33.840 --> 0:07:37.280 and also the University of Erlangen-Nuremberg and the Max 0:07:37.320 --> 0:07:41.360 Planck Institute for Informatics, who collectively published a paper that 0:07:41.480 --> 0:07:45.640 was titled Face2Face: Real-time Face Capture and 0:07:45.680 --> 0:07:49.520 Reenactment of RGB Videos, and that's a face, the 0:07:49.680 --> 0:07:54.560 number two, and face. The paper details the methodology the 0:07:54.600 --> 0:07:58.400 group used to create a pretty incredible effect. The algorithm 0:07:58.600 --> 0:08:02.600 could take the facial expressions from one person and transfer 0:08:02.720 --> 0:08:06.120 them in real time to a video target. It was 0:08:06.160 --> 0:08:10.880 like turning the video into a digital puppet. So you 0:08:10.960 --> 0:08:14.040 might have a video loop of a celebrity running, and 0:08:14.080 --> 0:08:17.200 preferably it's a loop that's easily repeatable without the repeat 0:08:17.200 --> 0:08:22.240 being terribly noticeable, and that's your target video. 
So if 0:08:22.280 --> 0:08:25.240 you just let it run, you just would see video 0:08:25.280 --> 0:08:27.760 of someone sitting down, maybe looking around a little bit, 0:08:27.760 --> 0:08:31.200 but that's it, nothing special. Then you would have a 0:08:31.240 --> 0:08:36.000 source subject sitting in view of a consumer quality webcam, 0:08:36.120 --> 0:08:41.040 no special equipment here, and that person could make different expressions, 0:08:41.080 --> 0:08:44.640 including opening and closing their mouths, and the video target 0:08:44.679 --> 0:08:49.080 would match them move for move, like a digital puppet. Moreover, 0:08:49.480 --> 0:08:52.040 the source subject didn't have to wear any special gear. 0:08:52.160 --> 0:08:54.160 They didn't have to have any special markers, none of 0:08:54.200 --> 0:08:56.640 those dots that you would see with motion capture. None 0:08:56.679 --> 0:09:00.040 of that was necessary. All the algorithm needed was a 0:09:00.120 --> 0:09:03.719 video feed from a monocular camera, so you didn't even 0:09:03.800 --> 0:09:07.080 need depth perception for this. There's a video of their 0:09:07.080 --> 0:09:10.400 work on YouTube that shows off this process and includes 0:09:10.440 --> 0:09:13.320 a loop of George W. Bush sitting for an interview. 0:09:13.800 --> 0:09:18.280 The source subject can manipulate the face that Bush makes 0:09:18.360 --> 0:09:21.040 just by making faces of his own, and the algorithm 0:09:21.080 --> 0:09:24.240 would map those movements to the target video. And it's 0:09:24.280 --> 0:09:29.160 pretty wild to see an image of moving image of 0:09:29.440 --> 0:09:32.800 George W. Bush responding in real time to all of 0:09:32.840 --> 0:09:35.760 these different facial expressions this guy is making. So how 0:09:35.760 --> 0:09:37.920 did the team do this? 
It's one thing to say 0:09:37.960 --> 0:09:40.760 a deep learning algorithm gave them this capability, but that's 0:09:40.800 --> 0:09:46.640 not really an explanation. The paper definitively spells this out 0:09:47.120 --> 0:09:52.360 in real technical detail. It starts off the explanation by saying, quote, 0:09:52.720 --> 0:09:56.120 in our method, we first reconstruct the shape identity of 0:09:56.120 --> 0:09:59.760 the target actor using a new global non-rigid model-based 0:09:59.800 --> 0:10:03.600 bundling approach based on a prerecorded training sequence. 0:10:04.000 --> 0:10:06.800 As this preprocess is performed globally on a set 0:10:06.840 --> 0:10:10.439 of training frames, we can resolve geometric ambiguities common to 0:10:10.520 --> 0:10:14.560 monocular reconstruction. At run time, we track both the expressions of 0:10:14.600 --> 0:10:17.760 the source and target actor's video by a dense analysis 0:10:17.760 --> 0:10:22.320 by synthesis approach based on a statistical facial prior, end quote, 0:10:22.760 --> 0:10:25.559 and it goes on in that vein throughout the paper, 0:10:26.000 --> 0:10:28.880 which means it gets pretty dense. But I think we 0:10:28.920 --> 0:10:31.680 can suss out what's going on from a high level 0:10:32.280 --> 0:10:36.160 if we just take a moment. But first, I'm going 0:10:36.240 --> 0:10:39.080 to take a moment of my own to thank my sponsor. 0:10:46.679 --> 0:10:50.559 So how did the Face2Face team build this tool? Well, 0:10:50.640 --> 0:10:54.760 for each target video, they would collect a large sample 0:10:55.200 --> 0:10:58.040 of footage and images and feed it to this deep 0:10:58.160 --> 0:11:01.560 learning algorithm. This would be necessary to identify all 0:11:01.600 --> 0:11:04.599 the points on the face that would move with various expressions, 0:11:04.840 --> 0:11:08.400 as well as to capture images of the inside of 0:11:08.440 --> 0:11:11.720 the target's mouth when he or she spoke. 
This is 0:11:11.760 --> 0:11:15.080 because the video loop they used to create the manipulated 0:11:15.160 --> 0:11:18.440 video would feature the target subject, typically with his or 0:11:18.440 --> 0:11:21.600 her mouth closed, so it might be a section in 0:11:21.640 --> 0:11:24.000 which the subject was sitting down for an interview and 0:11:24.040 --> 0:11:27.600 listening to an interviewer's questions but not responding yet. They 0:11:27.600 --> 0:11:31.600 were just listening. The additional video would provide information about 0:11:31.600 --> 0:11:33.840 the inside of the target subject's mouth, which could be 0:11:33.880 --> 0:11:37.240 rendered in time on the target video performance when it 0:11:37.360 --> 0:11:40.760 came time to do that. Their approach improved the scanning 0:11:40.800 --> 0:11:44.720 technique to build face templates for both the source subject, 0:11:45.280 --> 0:11:48.720 who provides all the expressions, and the target subject, who 0:11:48.840 --> 0:11:52.679 mimics all the expressions. As the source subject makes different 0:11:52.720 --> 0:11:57.120 facial expressions, the computer face template detects how the subject's 0:11:57.240 --> 0:12:02.880 face changes or deforms over time. The computer model then 0:12:02.960 --> 0:12:06.840 takes that information, saying, all right, well the lips moved 0:12:06.840 --> 0:12:09.520 in this way, there was a grimace here or a 0:12:09.559 --> 0:12:15.200 smile there, and transfers those motions to the target's face template, 0:12:15.600 --> 0:12:19.280 which is matched to the target's actual face. This process 0:12:19.360 --> 0:12:24.760 transfers the expressions over to the target, and so when 0:12:24.800 --> 0:12:28.600 the source subject grimaces, the target grimaces. If the source 0:12:28.640 --> 0:12:31.319 subject just sits still, the target video will continue to 0:12:31.400 --> 0:12:34.960 loop and the target's face won't change. 
The more video 0:12:34.960 --> 0:12:37.720 footage you can get of your target and your source, 0:12:38.120 --> 0:12:41.880 the better the computer algorithms are that create those face templates, 0:12:41.920 --> 0:12:44.720 and the more natural the manipulation will appear on the 0:12:44.720 --> 0:12:48.280 finished video. You also want a really good amount of 0:12:48.800 --> 0:12:51.680 footage just to get all that extra information you need, 0:12:51.679 --> 0:12:53.120 like the inside of the mouth, so that that can 0:12:53.160 --> 0:12:57.319 all be extrapolated properly. You have to design a tool 0:12:57.400 --> 0:13:00.880 that can encode an image called the training image, and 0:13:00.800 --> 0:13:05.720 then decodes this data to reconstruct the image. So imagine 0:13:05.760 --> 0:13:11.360 you've got a picture. The encoder essentially creates data based 0:13:11.400 --> 0:13:14.400 on that image. It's like a description of that image. 0:13:15.080 --> 0:13:19.280 The decoder takes the description and tries to rebuild the 0:13:19.320 --> 0:13:22.520 image based on the description. I think of this like 0:13:22.559 --> 0:13:25.720 that scene in Willy Wonka where Mike TV gets broken 0:13:25.800 --> 0:13:28.479 up in a million little pieces and then gets reconstructed 0:13:28.760 --> 0:13:31.960 on the television screen. So the second image is not 0:13:32.040 --> 0:13:34.199 a copy. It's not like you made a copy of 0:13:34.240 --> 0:13:36.880 the first one. It's like you built a new image 0:13:36.920 --> 0:13:39.400 based on the first one. By the way, when you 0:13:39.440 --> 0:13:45.200 start off with these uh these algorithms, those reconstructions tend 0:13:45.240 --> 0:13:49.120 to look pretty bad. You have to continually train and 0:13:49.160 --> 0:13:51.680 train and train and train the model so that it 0:13:51.720 --> 0:13:57.160 gets better and better at producing a close representation of 0:13:57.200 --> 0:14:01.480 the original image. 
When it does its reconstruction, 0:14:01.960 --> 0:14:06.319 you would essentially have decoders for both your source 0:14:06.440 --> 0:14:10.720 subject and your target subject. You use the same encoder for both, 0:14:11.120 --> 0:14:15.600 but two different decoders, one dedicated to your source, one 0:14:15.679 --> 0:14:20.120 dedicated to your target. Then you would compare the reconstructed 0:14:20.200 --> 0:14:22.720 images against the originals again and again, adjusting the network with a technique called 0:14:23.080 --> 0:14:27.240 back propagation. You do this millions of times, typically, 0:14:27.520 --> 0:14:31.520 to improve this process, and then you're ready to really 0:14:32.080 --> 0:14:34.880 switch things up. So let's say we've got two people. 0:14:34.960 --> 0:14:37.800 We've got person one and we've got person two, and 0:14:37.840 --> 0:14:41.160 you've been feeding images of both of these people through 0:14:41.280 --> 0:14:44.680 the same encoder, but of course you have dedicated decoders 0:14:44.720 --> 0:14:47.760 to produce the reconstruction. So person one has decoder 0:14:47.840 --> 0:14:52.320 one and person two has decoder two. Now let's say 0:14:52.320 --> 0:14:58.160 you're ready to put person two's face on person one's body. Well, 0:14:58.200 --> 0:15:01.840 you would feed an image of person one into the encoder, 0:15:02.200 --> 0:15:05.320 but you use the decoder for person two to 0:15:05.480 --> 0:15:08.680 reconstruct the image, and what you get is person two's 0:15:08.800 --> 0:15:13.480 face but mimicking the expression from person one. 
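The shared-encoder, two-decoder idea described above can be sketched in a few lines of plain Python. This is a toy under loud assumptions: each "face" is just a pair of numbers rather than an image, the encoder and decoders are single linear layers, and every name and value here (`encode`, `decode`, `train_step`, the learning rate, the made-up face data) is hypothetical. It is not the code the Reddit user released.

```python
def encode(enc, face):
    """Shared encoder: squeeze a face down to one descriptive number."""
    return sum(w * x for w, x in zip(enc["w"], face)) + enc["b"]

def decode(dec, z):
    """Person-specific decoder: rebuild a face from that description."""
    return [z * u + c for u, c in zip(dec["u"], dec["c"])]

def loss(enc, dec, faces):
    """Mean squared difference between the originals and their reconstructions."""
    total = 0.0
    for face in faces:
        rebuilt = decode(dec, encode(enc, face))
        total += sum((r - x) ** 2 for r, x in zip(rebuilt, face))
    return total / len(faces)

def train_step(enc, dec, face, lr=0.01):
    """One weight update from one face, with the gradients worked out by hand."""
    z = encode(enc, face)
    rebuilt = decode(dec, z)
    grad_out = [2.0 * (r - x) for r, x in zip(rebuilt, face)]
    grad_z = sum(g * u for g, u in zip(grad_out, dec["u"]))
    for i, g in enumerate(grad_out):
        dec["u"][i] -= lr * g * z
        dec["c"][i] -= lr * g
    for j, x in enumerate(face):
        enc["w"][j] -= lr * grad_z * x
    enc["b"] -= lr * grad_z

# Made-up data: each person has a fixed look plus a shared "expression" value.
expressions = [-1.0, -0.5, 0.0, 0.5, 1.0]
faces_one = [[1.0 + e, e] for e in expressions]
faces_two = [[e, 1.0 + e] for e in expressions]

encoder = {"w": [0.1, 0.1], "b": 0.0}
decoder_one = {"u": [0.1, 0.1], "c": [0.0, 0.0]}
decoder_two = {"u": [0.1, 0.1], "c": [0.0, 0.0]}

loss_before = loss(encoder, decoder_one, faces_one)
for _ in range(3000):
    for face_one, face_two in zip(faces_one, faces_two):
        train_step(encoder, decoder_one, face_one)  # same encoder...
        train_step(encoder, decoder_two, face_two)  # ...different decoder
loss_after = loss(encoder, decoder_one, faces_one)

# The swap: encode a face of person one, rebuild it with person two's decoder.
swapped = decode(decoder_two, encode(encoder, faces_one[4]))
```

The last line is the whole trick in miniature: the encoder's description comes from person one, but the decoder only knows how to draw person two. A real system would do this with deep convolutional networks on image frames and would need far more training.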
You, or 0:15:13.600 --> 0:15:18.840 rather the computer algorithm, does this frame by frame on video, 0:15:18.920 --> 0:15:20.920 and you end up with a video appearing to feature 0:15:20.960 --> 0:15:24.680 one person when in fact it's just their face on 0:15:24.760 --> 0:15:27.680 top of someone else, and it's their face making the 0:15:27.680 --> 0:15:31.720 exact same expressions as whoever was originally in that video. 0:15:32.680 --> 0:15:36.280 Now back over to deepfakes. Not long after the 0:15:36.320 --> 0:15:40.840 Reddit user initially posted this code, folks over at Reddit 0:15:40.840 --> 0:15:43.600 were taking this open source code and making more advanced 0:15:43.680 --> 0:15:46.960 software based off of it. Soon there were desktop apps 0:15:46.960 --> 0:15:50.200 that would take over all the hard parts of this process, 0:15:50.440 --> 0:15:54.080 all the codey bits, if you will, of training a model. 0:15:54.480 --> 0:15:57.160 Some of them would guide users into creating the data 0:15:57.280 --> 0:15:59.800 that would be used to train the model and go 0:16:00.040 --> 0:16:02.360 all the way through the process of creating the final 0:16:02.440 --> 0:16:05.960 fake videos. Even with some of the more sophisticated versions, 0:16:06.000 --> 0:16:09.720 there were telltale signs of tampering. Typically some blurring 0:16:09.760 --> 0:16:14.400 around images, particularly near chins and mouths. Those would be signs. 0:16:14.960 --> 0:16:17.360 If there was any flicker, that was a sign 0:16:17.360 --> 0:16:21.720 you didn't take enough time to train the model. Typically 0:16:21.720 --> 0:16:24.640 you would want to do several days of training at least. 0:16:24.960 --> 0:16:26.840 If you didn't take that time, you might see some 0:16:26.960 --> 0:16:29.840 really nasty blurring and flickering, and it would be a 0:16:29.840 --> 0:16:35.480 dead giveaway that this was a tampered 
video. Writer, director, 0:16:35.520 --> 0:16:39.520 and comedian Jordan Peele demonstrated the power of this technology. 0:16:39.600 --> 0:16:43.040 He showed how, with his impersonation of Barack Obama and 0:16:43.200 --> 0:16:47.840 some manipulation software, he could create a fake public service 0:16:47.840 --> 0:16:51.680 address in which the president would appear to say things 0:16:52.080 --> 0:16:55.640 that he normally would never say. The technology behind this 0:16:55.800 --> 0:16:58.640 made use of what is called a long short term 0:16:58.680 --> 0:17:01.480 memory network, or LSTM. To go into the 0:17:01.520 --> 0:17:05.199 mechanics of that would require another podcast, but using an 0:17:05.200 --> 0:17:08.520 approach similar to what I've already described, a team was 0:17:08.560 --> 0:17:12.240 able to make a video of Obama apparently lip syncing 0:17:12.480 --> 0:17:16.159 Peele's satirical message. The goal of this PSA 0:17:16.480 --> 0:17:20.080 was to be on alert, because fakes are getting harder to spot. 0:17:20.600 --> 0:17:25.200 The University of Washington showed off this in their Synthesizing 0:17:25.240 --> 0:17:28.880 Obama project, in which they took the audio from one 0:17:29.000 --> 0:17:32.600 of President Obama's speeches and then used it to animate 0:17:32.760 --> 0:17:36.600 his face in video from a different address that he 0:17:36.680 --> 0:17:40.240 gave during his presidency. So in this example, the person 0:17:40.280 --> 0:17:43.680 in the target video is the same person as the 0:17:43.800 --> 0:17:47.399 source for the audio. But the point was pretty clear 0:17:47.600 --> 0:17:51.399 that tech would soon make it possible to fake someone 0:17:51.440 --> 0:17:54.960 saying or doing something. 
It just takes the right algorithms, 0:17:55.280 --> 0:17:58.320 the right amount of training data, and the right amount 0:17:58.359 --> 0:18:00.720 of time to get the model trained up enough to 0:18:00.760 --> 0:18:04.960 do it smoothly. Now, this technology could be used to 0:18:05.080 --> 0:18:09.119 do stuff that isn't related to malicious deception or for 0:18:09.359 --> 0:18:12.399 pornography or anything along those lines. It could be used 0:18:12.760 --> 0:18:16.119 in television and film for lots of stuff, including potentially 0:18:16.160 --> 0:18:20.159 adding in actors who have passed away into a film. 0:18:20.200 --> 0:18:23.320 Paired with similar work that's going on in voice synthesis, 0:18:23.320 --> 0:18:26.560 you could end up with a convincing replacement, which means 0:18:26.960 --> 0:18:30.800 we could make movies with dead actors taking on new 0:18:30.920 --> 0:18:35.000 parts because we can synthesize their speech, we can synthesize 0:18:35.040 --> 0:18:38.000 their appearance. You would still have someone else acting out 0:18:38.080 --> 0:18:41.680 the part physically, but you would replace their image with 0:18:42.040 --> 0:18:46.920 this actor's image. Or maybe you would want to use 0:18:46.960 --> 0:18:49.080 this kind of technology just to make everyone think you 0:18:49.080 --> 0:18:51.880 can cut a rug. This brings me to the University 0:18:51.880 --> 0:18:55.080 of California, Berkeley and is the subject of a paper 0:18:55.119 --> 0:18:59.480 titled Everybody Dance Now. The goal is a simple concept 0:18:59.560 --> 0:19:02.280 that's actually really hard to pull off. What if you 0:19:02.320 --> 0:19:05.719 were to take the movements of a professional dancer and 0:19:05.760 --> 0:19:09.160 then map those movements onto the body of someone who 0:19:09.320 --> 0:19:12.399 wasn't a dancer. 
What if you could create a video 0:19:12.720 --> 0:19:17.040 in which literally anyone would appear to move like a skilled, 0:19:17.440 --> 0:19:21.280 trained dancer? And how the heck would that be possible? Well, 0:19:21.320 --> 0:19:23.919 at the heart of the team's efforts was something I 0:19:23.960 --> 0:19:27.120 talked about in a recent episode of TechStuff about 0:19:27.200 --> 0:19:31.919 an AI generated portrait, and that would be generative adversarial 0:19:32.040 --> 0:19:35.479 networks, or GANs. These use a pair 0:19:35.600 --> 0:19:40.199 of artificial neural networks in competition against each other. So 0:19:40.240 --> 0:19:42.600 since I covered this recently, I'll just give again a 0:19:42.640 --> 0:19:46.240 super quick high level summary. You've got one network that 0:19:46.320 --> 0:19:49.399 has a specific job, such as trying to create an 0:19:49.400 --> 0:19:51.760 original image of a cat. We'll go back to the 0:19:51.800 --> 0:19:54.560 cat pictures. That's one of my favorite ones because it 0:19:54.600 --> 0:19:58.600 was one of the early use cases of neural networks 0:19:58.600 --> 0:20:01.399 that I remember encountering when I was doing research. Now, 0:20:01.480 --> 0:20:04.479 let's say you've got your second network. Your second network 0:20:04.480 --> 0:20:08.159 has the specific job of evaluating pictures of cats to 0:20:08.280 --> 0:20:12.280 determine if they are valid, meaning, is this a real 0:20:12.320 --> 0:20:16.000 picture of a cat that's part of the training material 0:20:16.320 --> 0:20:20.000 that I'm accepting, or is this, in fact, a fake 0:20:20.400 --> 0:20:25.040 that was created by a computer program, the other neural network? 0:20:25.400 --> 0:20:27.920 So you've got one network trying to fool the other network. 
0:20:28.200 --> 0:20:31.560 And these networks get better at what they do over time, 0:20:31.880 --> 0:20:38.159 they improve. So your counterfeit network is getting better and 0:20:38.200 --> 0:20:41.959 better at making fake pictures of cats, and your detector 0:20:42.040 --> 0:20:45.960 network is getting better and better at detecting fake images 0:20:46.000 --> 0:20:49.440 of cats. Now, typically this requires humans to give feedback 0:20:49.560 --> 0:20:52.879 or tweak weight values along the networks, but they do 0:20:52.920 --> 0:20:56.679 get better over time. So if the network trying to 0:20:56.680 --> 0:20:59.359 create a picture of a cat gets the feedback of sorry, buddy, 0:20:59.400 --> 0:21:01.960 but they're onto you, then it can try again and 0:21:02.040 --> 0:21:04.480 adjust its approach slightly in an effort to fool the 0:21:04.480 --> 0:21:07.880 second network. If the second network gets the feedback you 0:21:07.960 --> 0:21:10.320 let this one slip by and it's fake, then it 0:21:10.359 --> 0:21:13.040 will adjust or it will be adjusted to look out 0:21:13.040 --> 0:21:16.080 for any telltale signs that it had missed in that 0:21:16.200 --> 0:21:20.360 earlier evaluation. Over time, the two networks working against each 0:21:20.359 --> 0:21:24.480 other will create the ultimate result of better and better 0:21:24.600 --> 0:21:28.679 computer generated content, whether it's an image of a cat 0:21:29.440 --> 0:21:34.760 or a sonnet, or a song or a video. Now 0:21:34.880 --> 0:21:39.680 that doesn't mean that these computer generated things are at 0:21:39.680 --> 0:21:43.399 the same level as human generated stuff, especially when it 0:21:43.440 --> 0:21:46.479 comes to text. I've seen a lot of song lyrics 0:21:46.520 --> 0:21:51.000 that were inscrutable even by my old man standards. So 0:21:51.280 --> 0:21:54.520 I think that we're a long way away from getting 0:21:54.600 --> 0:21:57.520 to a point where they can fool us in every case. 
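The tug of war between the counterfeit network and the detector network can be sketched with a very small toy in Python. Everything here is an assumption for illustration: the "pictures" are single numbers drawn from around 4.0, each network is reduced to two adjustable values, and the names (`generate`, `discriminate`, `train_gan`) and settings are hypothetical. In this sketch the feedback is computed automatically from each network's error rather than supplied by a person.

```python
import math
import random

def sigmoid(t):
    t = max(-50.0, min(50.0, t))  # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-t))

def generate(gen, noise):
    """Counterfeit network: turn random noise into a fake sample."""
    return gen["a"] * noise + gen["b"]

def discriminate(disc, x):
    """Detector network: estimated probability that x is a real sample."""
    return sigmoid(disc["w"] * x + disc["c"])

def train_gan(steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    gen = {"a": 1.0, "b": 0.0}
    disc = {"w": 0.1, "c": 0.0}
    for _ in range(steps):
        real = rng.gauss(4.0, 0.5)               # a "real" sample
        noise = rng.gauss(0.0, 1.0)
        fake = generate(gen, noise)              # a counterfeit sample

        # Detector update: push its score toward 1 on real, toward 0 on fake.
        d_real = discriminate(disc, real)
        d_fake = discriminate(disc, fake)
        disc["w"] -= lr * ((d_real - 1.0) * real + d_fake * fake)
        disc["c"] -= lr * ((d_real - 1.0) + d_fake)

        # Counterfeiter update: nudge the fake so the detector scores it higher.
        d_fake = discriminate(disc, fake)
        grad_fake = (d_fake - 1.0) * disc["w"]
        gen["a"] -= lr * grad_fake * noise
        gen["b"] -= lr * grad_fake
    return gen, disc

gen, disc = train_gan()
score = discriminate(disc, generate(gen, 0.0))
```

Setups like this are known to be finicky to train, so the toy makes no promise about how convincing the fakes end up; it only shows the alternating pattern of one network adjusting to catch fakes and the other adjusting to slip past it.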
0:21:57.600 --> 0:22:01.040 But with video they're getting pretty darn good. Now, this 0:22:01.119 --> 0:22:04.719 team had two groups of subjects, and so you had 0:22:04.760 --> 0:22:08.520 your source subjects and your target subjects. The source in 0:22:08.560 --> 0:22:11.399 this case, were the people who could dance, so like 0:22:11.520 --> 0:22:14.720 ballet dancers, hip hop dancers and that sort of stuff. 0:22:14.760 --> 0:22:19.000 They legit know how to move. They would demonstrate various 0:22:19.080 --> 0:22:22.400 dances on video. The second group of subjects were your 0:22:22.440 --> 0:22:27.439 target subjects. These were not trained dancers. They were to 0:22:27.600 --> 0:22:31.560 go through a series of moves and poses, essentially aping 0:22:31.720 --> 0:22:35.600 as best they could the movements of trained dancers, and 0:22:35.680 --> 0:22:40.080 the goal of this pair of networks was to smooth 0:22:40.119 --> 0:22:43.160 the movements out and adjust the timing so that these 0:22:43.280 --> 0:22:46.600 untrained dancers would appear to move more like their groovy 0:22:46.720 --> 0:22:50.800 source subject counterparts. I'll explain more in just a moment, 0:22:50.800 --> 0:22:53.840 but first let's take another quick break to thank our sponsor. 0:23:01.359 --> 0:23:05.280 According to the Everybody Dance Now paper, the team would 0:23:05.280 --> 0:23:09.040 transfer motion between the sources to the target through an 0:23:09.240 --> 0:23:14.399 end to end pixel based pipeline. So here's how that's done. 0:23:14.480 --> 0:23:18.919 Because if you're like me, that phrase meant next to 0:23:19.000 --> 0:23:22.760 nothing to you. So specifically, the group used three stages 0:23:22.800 --> 0:23:26.280 to take the movements of one person and transpose them 0:23:26.320 --> 0:23:31.200 to a target person. 
Those three were pose detection, global 0:23:31.280 --> 0:23:35.800 pose normalization, and mapping from normalized pose stick figures to 0:23:35.840 --> 0:23:41.040 the target subject. Pose detection involves teaching machines, in other words, 0:23:41.040 --> 0:23:45.159 computers, how to interpret images to determine where key body 0:23:45.200 --> 0:23:50.480 points are, like elbows, knees, hips, shoulders, the head, that 0:23:50.560 --> 0:23:53.679 kind of stuff. That first requires that you teach the 0:23:53.720 --> 0:23:57.760 machine to recognize those points in the first place. So 0:23:58.080 --> 0:24:00.160 first you have to train a machine to recognize 0:24:00.240 --> 0:24:04.360 those points and identify them with a target level of accuracy. 0:24:04.520 --> 0:24:08.000 It's pretty typical to represent these joints as points 0:24:08.040 --> 0:24:11.320 in a stick figure, so each point represents another joint 0:24:11.400 --> 0:24:15.159 or point of articulation. The lines represent the trunk of 0:24:15.200 --> 0:24:18.199 the body, the limbs, the head. You end up with 0:24:18.240 --> 0:24:21.159 a stick figure. If your machine learning mechanism was a 0:24:21.160 --> 0:24:24.199 good one, the machine should be able to overlay a 0:24:24.280 --> 0:24:27.600 stick figure on top of any image of a person posing, 0:24:28.040 --> 0:24:30.439 and the stick figure should more or less conform to 0:24:30.560 --> 0:24:34.280 that image, including where the actual joints are. So if 0:24:34.280 --> 0:24:36.800 you have someone standing there in the classic Peter Pan 0:24:36.880 --> 0:24:40.239 pose, with their fists on their hips and 0:24:40.280 --> 0:24:43.480 their arms akimbo, then it should draw 0:24:43.520 --> 0:24:46.320 a stick figure that's essentially aping the same thing and 0:24:46.359 --> 0:24:48.760 be able to overlay it on top of the original image. 0:24:49.000 --> 0:24:52.160 Now these days this can be done in real time. 
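To picture the stick figure and the normalization stage in code, here is a rough Python sketch. It assumes a pose is a dictionary of named keypoints in pixel coordinates and uses a deliberately simple rule, scaling by body height and shifting by ankle position, to fit a source pose into the target's frame. The keypoint names, the `SKELETON` list, and both functions are hypothetical; the paper's own normalization is more involved than this.

```python
# Hypothetical keypoint pairs that get joined by lines to draw the stick figure.
SKELETON = [("head", "hip"), ("hip", "ankle")]

def pose_height(pose):
    """Vertical extent of a pose in pixels (image y grows downward)."""
    ys = [y for _, y in pose.values()]
    return max(ys) - min(ys)

def normalize_pose(source_pose, source_ref, target_ref):
    """Rescale and shift source keypoints so they line up with the target's frame."""
    scale = pose_height(target_ref) / pose_height(source_ref)
    sx, sy = source_ref["ankle"]
    tx, ty = target_ref["ankle"]
    return {
        name: (tx + (x - sx) * scale, ty + (y - sy) * scale)
        for name, (x, y) in source_pose.items()
    }

def stick_figure(pose):
    """Line segments (pairs of points) that make up the stick figure."""
    return [(pose[a], pose[b]) for a, b in SKELETON]

# A tall dancer close to the camera, and a target who appears half as tall in frame.
source_ref = {"head": (100, 50), "hip": (100, 150), "ankle": (100, 250)}
target_ref = {"head": (300, 100), "hip": (300, 150), "ankle": (300, 200)}

fitted = normalize_pose(source_ref, source_ref, target_ref)
segments = stick_figure(fitted)
```

Fitting the source's own reference pose lands it exactly on the target's reference pose here, which is the point of the step: after normalization, a stick figure from the dancer's video is at a size and position that makes sense for the target's video.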
So, 0:24:52.240 --> 0:24:54.520 for example, there's a team at Google Creative Lab that 0:24:54.600 --> 0:24:57.639 used a machine learning model called PoseNet and created 0:24:57.680 --> 0:25:01.240 a JavaScript version with TensorFlow, which is an open source 0:25:01.320 --> 0:25:04.760 software library often used for machine learning. And with this 0:25:04.840 --> 0:25:08.080 tool you can do real time pose estimation through a 0:25:08.119 --> 0:25:12.399 browser and a webcam. The application doesn't have any technology 0:25:12.480 --> 0:25:15.040 related to identifying the person in the image. It's just 0:25:15.119 --> 0:25:17.600