Speaker 1: Get in touch with technology with TechStuff from howstuffworks dot com. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer and I love all things tech. And here's a fun fact about getting older: as you age, stuff that you once thought was impossible will become not only possible but the norm, and future generations won't even think about what it must have been like before the impossible was commonplace. Now, this is true for every generation. It's not like this is, you know, a brand new, groundbreaking observation. Plenty of people have made it before me, but I want to talk about a specific example.

With photography, for instance, it used to be pretty difficult to manipulate pictures convincingly. There have been photo editing tools for decades, but generally it took a great deal of skill and training, plus access to specialized equipment, to pull it off, especially in the old film days. Gradually, tools like Photoshop made it easier to manipulate digital images. It still requires a certain level of skill to pull off a really convincing job, and it's easy to make really badly manipulated images. But as these tools became widely available, people began to learn how to use them, and we had to come to the realization that we cannot necessarily believe our own eyes when we're looking at a digital image.

Now, the same thing is happening with video footage. It's quite possible to fake video footage, though again, if you want to do it really well, it requires some skill, some specialized tools, and some real expertise to do it in a way that's really convincing. It's a pretty recent technological capability. In fiction, however, it's been around for a long time. I remember seeing the movie Rising Sun back in the early nineteen nineties.
Now, in that film, Wesley Snipes plays a detective, and Sean Connery, in his most convincing role since he played an Egyptian immortal posing as a Spaniard with a Scottish accent in Highlander, plays an expert on Japanese customs and culture. Sean Connery. Anyway, the film is a mystery thriller, and while the two are investigating a murder, they come into possession of some video footage, and they find out that the footage has actually been manipulated; it was planted for them to find so that it would put them on the wrong trail. One person's face was replaced with someone else's, and in a somewhat comedic scene, the video editor who is explaining this casually swaps the heads of Snipes and Connery in real time in a video feed to show off the capability. That might have been a tad unrealistic, but today we are in a world in which video manipulation of increasingly convincing quality is achievable in real time. In fact, these days it's possible to use sophisticated computer algorithms to manipulate captured video almost as if the video were a computer-generated cartoon reacting to real-time inputs like a video game controller, only instead of it being a video game, it's a real person on video.

There are, of course, lots of ways this technology could be used unethically, and one of the best known has been the focus of this whole conversation around video manipulation. It comes from a former Reddit user who went by the handle deepfakes. That handle has become the shorthand for the general practice, which frequently, but not exclusively, involves replacing the face of an actor in a pornographic scene with someone else's face, like that of a celebrity, which is pretty darn unethical and creepy. The name itself was a reference to the technology used in the approach: it relies on a process called deep learning. Deep learning is a type of machine learning, a subtype that utilizes artificial neural networks.
And I've talked an awful lot about those kinds of networks recently, so I'm not going to go over the whole thing again. I'll give just a really quick rundown: you have nodes, artificial neurons, in these networks that receive input from potentially multiple other nodes. Then, based on that input, the artificial neuron you're looking at performs some sort of weighted operation and produces a single output. That single output can move on to become one of many inputs for a different artificial neuron in the network, and so on and so forth.

Deep learning networks are very, very large artificial neural networks, and they can accept a large amount of training data. This is a scalable approach, which means the larger the network and the more data you can feed it, the better it performs. That's different from many other machine learning models, which tend to hit a performance plateau once they reach a certain size: if you were to add more nodes to the network, you wouldn't necessarily see a comparable increase in performance. You would kind of flatten out over time.

In 2014, a deep learning expert named Andrew Ng gave a talk at Stanford about the best use cases for deep learning, and he mentioned that it was particularly good at supervised learning tasks. These are the types of computer problems in which we humans already know the answer, such as: is there a cat in this photograph? Humans can pick up on that right away, assuming someone has not carefully hidden a cat in a very busy image. But for a computer, this is a much more difficult problem. Even if the picture has a cat center stage, it can be tough for a computer to figure that out. Using a supervised learning approach with a deep learning network, you can train a system to recognize cats in images with a high degree of success, if you have a large amount of training data to train the network with.
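To make that "weighted operation" idea concrete, here is a tiny Python sketch of a few artificial neurons wired together. It isn't from any particular library, and the weights and inputs are made-up numbers; it just shows inputs flowing through weighted sums and activations to produce a single output.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of the incoming signals, squashed by a sigmoid activation.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Two neurons in a first layer feed one neuron in a second layer.
x = [0.5, 0.8, 0.1]                     # hypothetical input features
h1 = neuron(x, [0.4, -0.2, 0.9], 0.1)   # hidden neuron 1
h2 = neuron(x, [-0.7, 0.3, 0.5], -0.3)  # hidden neuron 2
y = neuron([h1, h2], [1.2, -0.6], 0.0)  # output neuron
print(y)  # a single value between 0 and 1, e.g. "how likely there is a cat"
```

A deep learning network is, loosely speaking, many layers of these stacked on top of each other, with the weights learned from training data rather than typed in by hand.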
Now, in that other case I was talking about: the Reddit user who went by deepfakes started posting on Reddit in late 2017. The user made an open-source version of a deep learning algorithm available for the purposes of video manipulation, and anyone could take advantage of it. Specifically, this algorithm was designed for face swapping. It would allow you to put the face of one person onto the body of another in video form, and it wasn't always convincing. In fact, it could often be easily detectable as fake if someone had not trained the model properly before creating the video. But it did open up a can of worms once the practice started getting media coverage.

However, the actual technology to pull this off was already a couple of years old when deepfakes shared it. Back in 2016, a group of researchers from Stanford, the University of Erlangen-Nuremberg, and the Max Planck Institute for Informatics collectively published a paper titled "Face2Face: Real-Time Face Capture and Reenactment of RGB Videos," and that's "face," the number two, and "face." The paper details the methodology the group used to create a pretty incredible effect. The algorithm could take the facial expressions from one person and transfer them in real time to a video target. It was like turning the video into a digital puppet.

So you might have a video loop of a celebrity running, preferably a loop that's easily repeatable without the repeat being terribly noticeable, and that's your target video. If you just let it run, you would see video of someone sitting down, maybe looking around a little bit, but that's it, nothing special. Then you would have a source subject sitting in view of a consumer-quality webcam, no special equipment here, and that person could make different expressions, including opening and closing their mouth, and the video target would match them move for move, like a digital puppet. Moreover, the source subject didn't have to wear any special gear.
They didn't have to have any special markers, none of those dots that you would see with motion capture. None of that was necessary. All the algorithm needed was a video feed from a monocular camera, so you didn't even need depth perception for this. There's a video of their work on YouTube that shows off this process and includes a loop of George W. Bush sitting for an interview. The source subject can manipulate the face that Bush makes just by making faces of his own, and the algorithm maps those movements to the target video. It's pretty wild to see a moving image of George W. Bush responding in real time to all of these different facial expressions this guy is making.

So how did the team do this? It's one thing to say a deep learning algorithm gave them this capability, but that's not really an explanation. The paper definitely spells this out in real technical detail. It starts off the explanation by saying, quote, "In our method, we first reconstruct the shape identity of the target actor using a new global non-rigid model-based bundling approach based on a prerecorded training sequence. As this preprocess is performed globally on a set of training frames, we can resolve geometric ambiguities common to monocular reconstruction. At runtime, we track both the expressions of the source and target actors' video by a dense analysis-by-synthesis approach based on a statistical facial prior," end quote. And it goes on in that vein throughout the paper, which means it gets pretty dense. But I think we can suss out what's going on from a high level if we just take a moment. But first, I'm going to take a moment of my own to thank my sponsor.

So how did the Face2Face team build this tool? Well, for each target video, they would collect a large sample of footage and images and feed it to this deep learning algorithm.
This would be necessary to identify all the points on the face that would move with various expressions, as well as to capture images of the inside of the target's mouth when he or she spoke. That's because the video loop they used to create the manipulated video would feature the target subject typically with his or her mouth closed; it might be a section in which the subject was sitting down for an interview and listening to an interviewer's questions but not responding yet, just listening. The additional video would provide information about the inside of the target subject's mouth, which could be rendered onto the target video when it came time to do that.

Their approach improved the scanning technique to build face templates for both the source subject, who provides all the expressions, and the target subject, who mimics all the expressions. As the source subject makes different facial expressions, the computer face template detects how the subject's face changes, or deforms, over time. The computer model then takes that information, saying, all right, the lips moved in this way, there was a grimace here or a smile there, and transfers those motions to the target's face template, which is matched to the target's actual face. This process transfers the expressions over to the target, so when the source subject grimaces, the target grimaces; if the source subject just sits still, the target video will continue to loop and the target's face won't change. The more video footage you can get of your target and your source, the better the computer algorithms are at creating those face templates, and the more natural the manipulation will appear in the finished video. You also want a really good amount of footage just to get all that extra information you need, like the inside of the mouth, so that all of that can be extrapolated properly.
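One rough way to picture those face templates is as a small set of numbers describing a face: some that capture identity, which stay fixed for a given person, and some that capture expression, which change every frame. The sketch below only illustrates that transfer step under those assumptions; the real Face2Face system fits a full parametric 3D face model with photometric optimization, and every number here is made up.

```python
import numpy as np

# Toy "face template": a face is a mean shape plus identity and expression
# deformations, loosely in the style of a blendshape model.
def render_face(identity, expression, identity_basis, expression_basis, mean_face):
    return mean_face + identity_basis @ identity + expression_basis @ expression

rng = np.random.default_rng(0)
num_points = 30                                  # pretend we track 30 face points
mean_face = rng.normal(size=num_points * 3)      # x, y, z for each point
identity_basis = rng.normal(size=(num_points * 3, 5))
expression_basis = rng.normal(size=(num_points * 3, 4))

source_identity = rng.normal(size=5)  # fitted once from the source's footage
target_identity = rng.normal(size=5)  # fitted once from the target's footage

# Each frame: estimate the source's expression coefficients, then reenact the
# target by combining the TARGET's identity with the SOURCE's expression.
source_expression = np.array([0.8, 0.0, 0.3, 0.1])   # e.g. an open-mouth smile
reenacted = render_face(target_identity, source_expression,
                        identity_basis, expression_basis, mean_face)
# "reenacted" keeps the target's identity but mimics the source's expression;
# source_identity would only be used while tracking the source's own face.
```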
You have to design a tool that can encode an image, called the training image, and then decode that data to reconstruct the image. So imagine you've got a picture. The encoder essentially creates data based on that image; it's like a description of that image. The decoder takes the description and tries to rebuild the image based on it. I think of this like that scene in Willy Wonka where Mike Teavee gets broken up into a million little pieces and then gets reconstructed on the television screen. The second image is not a copy. It's not like you made a copy of the first one; it's like you built a new image based on the first one.

By the way, when you start off with these algorithms, those reconstructions tend to look pretty bad. You have to train and train and train the model so that it gets better and better at producing a close representation of the original image when it does its reconstruction. And you would have, essentially, decoders for both your source subject and your target subject. You use the same encoder for both, but two different decoders: one dedicated to your source, one dedicated to your target. Then you feed images through the system again and again, and the network's weights get adjusted each time through a process called backpropagation. You typically do this millions of times to improve the process, and then you're ready to really switch things up.
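Here is a minimal sketch of that setup: one shared encoder, two decoders, trained to reconstruct each person's faces. It assumes PyTorch, tiny made-up layer sizes, and 64-by-64 input images; it's meant to illustrate the idea, not reproduce the actual deepfakes code.

```python
import torch
import torch.nn as nn

def make_encoder():
    # Turns a 64x64 RGB face image into a 256-number "description" of it.
    return nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU())

def make_decoder():
    # Rebuilds a (flattened) 64x64 RGB image from that description.
    return nn.Sequential(nn.Linear(256, 64 * 64 * 3), nn.Sigmoid())

encoder = make_encoder()        # shared by both people
decoder_one = make_decoder()    # learns to rebuild person one's face
decoder_two = make_decoder()    # learns to rebuild person two's face

params = (list(encoder.parameters()) + list(decoder_one.parameters())
          + list(decoder_two.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(faces_one, faces_two):
    # faces_one, faces_two: batches of shape (N, 3, 64, 64), values in [0, 1].
    # Each person's images go through the SAME encoder but their OWN decoder,
    # and backpropagation nudges the weights toward faithful reconstructions.
    recon_one = decoder_one(encoder(faces_one))
    recon_two = decoder_two(encoder(faces_two))
    loss = (loss_fn(recon_one, faces_one.flatten(1))
            + loss_fn(recon_two, faces_two.flatten(1)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The swap described next: encode a frame of person one, but decode it with
# person two's decoder.
# swapped = decoder_two(encoder(frame_of_person_one))
```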
So let's say we've got two people, person one and person two, and you've been feeding images of both of these people through the same encoder, but of course you have dedicated decoders to produce the reconstructions: person one has decoder one and person two has decoder two. Now let's say you're ready to put person two's face on person one's body. Well, you would feed an image of person one into the encoder, but you use the decoder for person two to reconstruct the image, and what you get is person two's face, but mimicking the expression from person one. You, or rather the computer algorithm, do this frame by frame on video, and you end up with a video appearing to feature one person when in fact it's just their face on top of someone else, making the exact same expressions as whoever was originally in that video.

Now, back over to deepfakes. Not long after the Reddit user initially posted this code, folks over at Reddit were taking the open-source code and making more advanced software based off of it. Soon there were desktop apps that would take over all the hard parts of this process, all the codey bits, if you will, of training a model. Some of them would guide users through creating the data that would be used to train the model and go all the way through the process of creating the final fake videos. Even with some of the more sophisticated versions, there were telltale signs of tampering, typically some blurring in the images, particularly near chins and mouths. If there was any flicker, that was a sign you didn't take enough time to train the model. Typically you would want to do several days of training at least; if you didn't take that time, you might see some really nasty blurring and flickering, and it would be a dead giveaway that this was tampered video.

Writer, director, and comedian Jordan Peele demonstrated the power of this technology. He showed how, with his impersonation of Barack Obama and some manipulation software, he could create a fake public service announcement in which the president would appear to say things that he normally would never say.
The technology behind this made use of what is called a long short-term memory network, or LSTM. To go into the mechanics of that would require another podcast, but using an approach similar to what I've already described, a team was able to make a video of Obama apparently lip-syncing Peele's satirical message. The goal of the PSA was to act as an alert, because fakes are getting harder to spot. The University of Washington showed this off in their Synthesizing Obama project, in which they took the audio from one of President Obama's speeches and then used it to animate his face in video from a different address that he gave during his presidency. So in that example, the person in the target video is the same person as the source of the audio, but the point was pretty clear: the tech would soon make it possible to fake someone saying or doing just about anything. It just takes the right algorithms, the right amount of training data, and the right amount of time to get the model trained up enough to do it smoothly.

Now, this technology could be used to do stuff that isn't related to malicious deception or pornography or anything along those lines. It could be used in television and film for lots of purposes, including potentially adding actors who have passed away into a film. Paired with similar work that's going on in voice synthesis, you could end up with a convincing replacement, which means we could make movies with dead actors taking on new parts, because we can synthesize their speech and we can synthesize their appearance. You would still have someone else acting out the part physically, but you would replace their image with this actor's image.

Or maybe you would want to use this kind of technology just to make everyone think you can cut a rug. This brings me to the University of California, Berkeley, and the subject of a paper titled Everybody Dance Now. The goal is a simple concept that's actually really hard to pull off.
What if you were to take the movements of a professional dancer and then map those movements onto the body of someone who wasn't a dancer? What if you could create a video in which literally anyone would appear to move like a skilled, trained dancer? And how the heck would that be possible? Well, at the heart of the team's efforts was something I talked about in a recent episode of TechStuff about an AI-generated portrait, and that would be generative adversarial networks, or GANs. These use a pair of artificial neural networks in competition against each other. Since I covered this recently, I'll just give a super quick, high-level summary.

You've got one network that has a specific job, such as trying to create an original image of a cat. We'll go back to the cat pictures; that's one of my favorite examples because it was one of the early use cases of neural networks that I remember encountering when I was doing research. Now, let's say you've got your second network. Your second network has the specific job of evaluating pictures of cats to determine if they are valid, meaning: is this a real picture of a cat that's part of the training material I'm accepting, or is this in fact a fake that was created by a computer program, the other neural network? So you've got one network trying to fool the other network. And these networks get better at what they do over time; they improve. Your counterfeiting network is getting better and better at making fake pictures of cats, and your detector network is getting better and better at detecting fake images of cats. Now, typically this requires humans to give feedback or tweak weight values along the networks, but they do get better over time. So if the network trying to create a picture of a cat gets the feedback of "sorry, buddy, but they're onto you," then it can try again and adjust its approach slightly in an effort to fool the second network.
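Here's a stripped-down sketch of that generator-versus-discriminator loop (the back-and-forth the host is describing continues below). It assumes PyTorch and toy layer sizes, and the cat images are just flattened 64-by-64 vectors; it's an illustration of the training idea, not any specific system from the episode.

```python
import torch
import torch.nn as nn

# One network invents images, the other judges real versus fake.
generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                          nn.Linear(256, 64 * 64), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_cats):
    # real_cats: a batch of real cat images, flattened to shape (N, 64*64).
    n = real_cats.size(0)
    fakes = generator(torch.randn(n, 100))

    # Detector network: learn to call the real images real, the fakes fake.
    d_loss = (bce(discriminator(real_cats), torch.ones(n, 1))
              + bce(discriminator(fakes.detach()), torch.zeros(n, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Counterfeiting network: adjust so its fakes get judged as "real."
    g_loss = bce(discriminator(fakes), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```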
If the second network gets the feedback "you let this one slip by and it's a fake," then it will adjust, or it will be adjusted, to look out for any telltale signs that it had missed in that earlier evaluation. Over time, the two networks working against each other produce the ultimate result: better and better computer-generated content, whether it's an image of a cat, or a sonnet, or a song, or a video. Now, that doesn't mean these computer-generated things are at the same level as human-generated stuff, especially when it comes to text. I've seen a lot of computer-generated song lyrics that were inscrutable even by my old-man standards. So I think we're a long way away from getting to a point where they can fool us in every case, but with video they're getting pretty darn good.

Now, this team had two groups of subjects: your source subjects and your target subjects. The sources in this case were the people who could dance, like ballet dancers, hip-hop dancers, and that sort of thing. They legit know how to move, and they would demonstrate various dances on video. The second group, your target subjects, were not trained dancers. They were to go through a series of moves and poses, essentially aping as best they could the movements of trained dancers, and the goal of this pair of networks was to smooth the movements out and adjust the timing so that these untrained dancers would appear to move more like their groovy source-subject counterparts. I'll explain more in just a moment, but first let's take another quick break to thank our sponsor.

According to the Everybody Dance Now paper, the team would transfer motion from the sources to the targets through an end-to-end pixel-based pipeline. So here's how that's done, because if you're like me, that phrase meant next to nothing to you. Specifically, the group used three stages to take the movements of one person and transpose them onto a target person.
Those three were pose detection, global pose normalization, and mapping from normalized pose stick figures to the target subject. Pose detection involves teaching machines, in other words computers, how to interpret images to determine where key body points are: elbows, knees, hips, shoulders, the head, that kind of stuff. That first requires that you teach the machine to recognize those points in the first place, so you have to train a machine to recognize those points and identify them with a target level of accuracy. It's pretty typical to represent these joints as points in a stick figure, so each point represents a joint or point of articulation, and the lines represent the trunk of the body, the limbs, and the head. You end up with a stick figure. If your machine learning mechanism was a good one, the machine should be able to overlay a stick figure on top of any image of a person posing, and the stick figure should more or less conform to that image, including where the actual joints are. So if you have someone standing there in the classic Peter Pan pose, fists on their hips and arms akimbo, then it should draw a stick figure that's essentially aping the same thing and be able to overlay it on top of the original image.

Now, these days this can be done in real time. For example, there's a team at Google Creative Lab that took a machine learning model called PoseNet and created a JavaScript version with TensorFlow, which is an open-source software library often used for machine learning. With this tool you can do real-time pose estimation through a browser and a webcam. The application doesn't have any technology related to identifying the person in the image; it's just, quote unquote, interested in what the person is doing, not who the person is.
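To make the stick figure idea concrete, here's a tiny sketch of that representation: a handful of named keypoints plus the "bones" connecting them. Real systems get these coordinates from a pretrained pose detector such as PoseNet; the numbers below are simply made up for illustration.

```python
# Names of the tracked body points and the segments ("bones") joining them.
KEYPOINTS = ["head", "neck", "r_shoulder", "r_elbow", "r_wrist",
             "l_shoulder", "l_elbow", "l_wrist", "r_hip", "r_knee",
             "r_ankle", "l_hip", "l_knee", "l_ankle"]
BONES = [("head", "neck"), ("neck", "r_shoulder"), ("r_shoulder", "r_elbow"),
         ("r_elbow", "r_wrist"), ("neck", "l_shoulder"), ("l_shoulder", "l_elbow"),
         ("l_elbow", "l_wrist"), ("neck", "r_hip"), ("r_hip", "r_knee"),
         ("r_knee", "r_ankle"), ("neck", "l_hip"), ("l_hip", "l_knee"),
         ("l_knee", "l_ankle")]

# One detected pose: (x, y) pixel coordinates for each keypoint (made up here).
pose = dict(zip(KEYPOINTS, [(320, 80), (320, 140), (280, 150), (260, 210),
                            (250, 270), (360, 150), (380, 210), (390, 270),
                            (300, 300), (295, 380), (290, 460), (340, 300),
                            (345, 380), (350, 460)]))

def stick_figure_segments(pose):
    # The line segments you would draw over the image to show the stick figure.
    return [(*pose[a], *pose[b]) for a, b in BONES]

for x1, y1, x2, y2 in stick_figure_segments(pose):
    print(x1, y1, x2, y2)
```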
So you can actually run this on your own machine in a browser: you can pose in front of a webcam and you'll see the little stick figure painted on top of your image on the computer. Every time you move, every time you bend a joint, you'll see the stick figure doing the same thing, mapped on top of you. The Berkeley team made use of a pretrained pose detector, meaning they didn't build a new one, which helped save a lot of time and expense on their project.

Now, people come in all shapes and sizes. In the video the team released, they showed off subjects who included a woman who appeared to be around average height and a man who appeared to be pretty darn tall. A motion transfer method that only worked between a subject and a target of similar shape and size would be pretty limited. So the purpose of the global pose normalization stage is to account for all the differences between the source and target subjects and their locations within the frame of the camera. Without this step, the motion transfer might appear ghoulish. We don't all have the same proportions, so a mismatch might mean a target's limbs would appear to bend in places that were clearly not natural joints. All you need to do is see an arm bend where an arm isn't supposed to bend, and that's going to skeeve you out quite a bit. It makes for an effective horror movie experience, but not one that produces convincing motion transfer.

Now, there are a lot of ways the team could have gone about normalizing the poses, but their choice seems particularly clever to me. They measured the heights and ankle positions of the various subjects and used a linear mapping between the closest and farthest ankle positions in both videos to normalize the stick figure for the target subjects. The program would calculate the scale of the figure as well as the scale of motion from frame to frame.
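Here's a rough Python sketch of that ankle-based normalization idea: figure out where the source's ankles sit between the "closest" and "farthest" positions observed in the source video, map that to the corresponding spot in the target video, and rescale the whole stick figure to the target's height at that depth. The exact formulation in the paper is more involved, and all of the inputs here are assumptions for illustration.

```python
import numpy as np

def normalize_pose(source_pose, src_close, src_far, tgt_close, tgt_far,
                   src_height_close, src_height_far,
                   tgt_height_close, tgt_height_far):
    """source_pose: (num_keypoints, 2) array of (x, y) image coordinates.
    *_close / *_far: ankle y-positions when each person is nearest to and
    farthest from the camera; *_height_*: body height in pixels there."""
    ankle_y = source_pose[:, 1].max()   # lowest point of the figure ~ the ankles

    # Where does this frame's ankle position fall between "far" and "close"?
    # Use the same fraction to pick an ankle position in the target video.
    t = (ankle_y - src_far) / (src_close - src_far)
    target_ankle_y = tgt_far + t * (tgt_close - tgt_far)

    # Scale the figure so the source's height at that depth becomes the
    # target's height at that depth, then drop the ankles onto the target spot.
    src_height = src_height_far + t * (src_height_close - src_height_far)
    tgt_height = tgt_height_far + t * (tgt_height_close - tgt_height_far)
    scaled = source_pose * (tgt_height / src_height)
    scaled[:, 1] += target_ankle_y - scaled[:, 1].max()
    return scaled
```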
And I think that's pretty darn cool, because it wasn't just accounting for the size of the subjects to get all the joints right, but also making sure the scale of the movements with respect to body size and proportions would remain the same. If a tall person with really long limbs was moving their arms in really big, bold gestures and you tried to transfer that motion to someone of smaller stature, it could really look disturbing. But by using this scaling approach, the movements on the smaller person would be proportionate in size to the movements of the larger person.

The team used two generative adversarial network setups to work on making a convincing final video. The first was dedicated to image-to-image translation, attempting to manipulate the image of the target subjects to follow the motions that came from the pose detection process. Like all GAN setups, this included the generator, which would attempt to create a convincing sequence of images, and the discriminator, which tried to weed out the quote unquote fake sequences from the generator from the ground-truth data being fed to it. The second GAN setup was specifically dedicated to adding detail and realism to the faces of the target subjects. In some frames this appears to have worked pretty well; in others there's a bit of an uncanny valley thing, or maybe even a horror movie element, going on, similar to how some of the AI-generated portraits I talked about in the previous episode introduced unrealistic qualities to the various images.

When shooting video of the target subjects, the team captured images at one hundred and twenty frames per second to get enough data for each subject, and the sessions lasted for about twenty minutes. They used smartphone cameras to do it, since many smartphones allow you to shoot video at that kind of frame rate these days.
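Just to put a rough number on how much target footage that works out to, and how it would typically be turned into training material, here's a quick sketch. The file name is hypothetical, and the pose detection step is left as a comment rather than a real call.

```python
import cv2  # OpenCV, used here only to step through a captured clip

# 120 frames per second for roughly 20 minutes per target subject:
print(120 * 60 * 20)  # about 144,000 frames of the target to learn from

cap = cv2.VideoCapture("target_subject.mp4")   # hypothetical file name
frame_count = 0
ok, frame = cap.read()
while ok:
    # In the real pipeline, each frame would be paired with its detected stick
    # figure, so the networks can learn the mapping stick figure -> target frame.
    frame_count += 1
    ok, frame = cap.read()
cap.release()
print(frame_count)
```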
They had their target subjects wear close-fitting clothing that wasn't prone to wrinkling, because the pose recognition tool they were using wasn't designed to encode information about clothing. As for the source videos, the ones that would actually provide the motions to be transferred to the targets, the team didn't have to worry about capturing images at such a high frame rate. They could use videos of just reasonable quality, meaning decent resolution and frame rate, and their pose detection tool would do its work and create the stick figure that would serve as the guide for the target's motions later on. Because of that, the team could really use any online video of sufficient quality to act as the source for motion transfer; it doesn't have to be a video shot specifically for that purpose. In fact, one of the example videos the team used in their demonstration was from the Bruno Mars music video for That's What I Like. Before applying the motion transfer, the team smoothed the pose keypoints to reduce jitter in the final output, and then the team applied the motion transfer. The stick figure motions were transferred to the target subjects, and the result is pretty interesting. It is not seamless; you can definitely tell something odd is going on. But it is an indication of where things are going, and using adversarial networks could lead to more convincing motion transfers in the future.

Now, this could lead to all sorts of stuff, nefarious and otherwise. You could imagine using it to transform an average actor into a martial arts master, or it might allow directors more freedom in casting, knowing that if the actors they choose don't possess certain physical skills, they can use this kind of technology to fake it. But it could also be used to fake footage to make it look like specific people are doing stuff that they are not doing.
It could be used to spread misinformation, and it likely will be, which means we'll need to be on the lookout for signs of fakes, which are going to get harder and harder to detect as time goes on. And hey, you guys remember DARPA, right? Because I just did a whole series of episodes about them. Well, that agency has funded programs dedicated to automating various forensic tools, including tools that could be used to detect AI-created forgeries in video and audio.

Often the secret is in the eyes. Most of these neural networks are trained on still images, so you feed in thousands or tens of thousands of images, if you have them, of your various subjects, your target and your source. But most published still images don't show people with their eyes closed, so eye movements and blinking tend to be a little wonky in these fake videos. You might watch one for a while and think, huh, that's weird, this guy hasn't blinked for like ten minutes, or when they blink it looks really strange. Well, that's an indication that it's a fake video. There are other tells as well, but DARPA is understandably keeping those quiet, because, you know, if they publish how they figure out that AI-created videos are in fact faked, that gives the fakers enough information to go back and improve their models. So we're likely to see something akin to what happened with CAPTCHAs: specialists will develop new tools to detect AI-generated media, AI developers will then create more sophisticated models, and it becomes kind of an arms race, a seesaw. One benefit is that AI as a whole will improve, but we may not be able to believe it when we see it.
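To make that blinking tell a little more concrete, here's a toy sketch of the kind of check a forensic tool might run: count blinks over a stretch of video and flag clips where the rate is implausibly low. The per-frame "eye openness" values are assumed to come from some facial landmark detector, and the thresholds are made up; real detectors are far more sophisticated than this.

```python
def count_blinks(eye_openness, closed_threshold=0.2):
    # eye_openness: one value per frame; low values mean the eyes look closed.
    blinks, eyes_were_open = 0, True
    for value in eye_openness:
        if eyes_were_open and value < closed_threshold:
            blinks += 1            # the eyes just closed: count one blink
            eyes_were_open = False
        elif value >= closed_threshold:
            eyes_were_open = True
    return blinks

def looks_suspicious(eye_openness, fps, min_blinks_per_minute=2.0):
    minutes = len(eye_openness) / (fps * 60)
    return count_blinks(eye_openness) / minutes < min_blinks_per_minute

# Example: 30 seconds of video at 30 fps where the eyes never close.
print(looks_suspicious([0.9] * 900, fps=30))  # True -> worth a closer look
```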
Well, that wraps up this episode on a fascinating, somewhat disturbing topic, and I'm sure we're going to hear a lot more about this in the years to come. We've already seen a lot of sites banning deepfakes outright because of the misinformation they can spread, so we're already seeing a reaction to this in various online communities, and that's very interesting to me. But we're definitely going to keep seeing this continue; it's a valid area of AI research, so we will have to wait and see how it all plays out.

If you guys have any suggestions for future episodes of TechStuff, why not send me a message? You can go over to our website, that's tech stuff podcast dot com, and you'll find all the different ways to contact me. I look forward to hearing from you. Make sure you check out our store over at teepublic dot com slash tech stuff and buy some merchandise. You can make sure that you get all the really cool T-shirts, like "Prove to Me You're Not a Robot." That one's pretty appropriate for this particular episode. And remember, every single purchase you make goes to help the show, so we greatly appreciate it.

Also, if you haven't heard, we have been nominated for an iHeartRadio Podcast Award. It's the first year iHeartRadio is giving out podcast awards, and we are nominated in the Science and Technology category. You can go online and visit the iHeartRadio Podcast Awards page and vote up to five times a day for your favorite podcasts. If you wanted to, you could dedicate all five of those votes every single day to us. I would not complain if you did that; it would be really cool to win that award. But make sure you check it out. There may be lots of shows there that you truly love and want to throw your support behind, and that would be really cool too. And I'll talk to you again really soon.

For more on this and thousands of other topics, visit howstuffworks dot com.