Speaker 1: Welcome to TechStuff, a production from iHeartRadio.

Speaker 1: Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio, and how the tech are you? I am currently on vacation celebrating my anniversary, and I didn't want to leave you without an episode. So the episode we're going to play for you was recorded and published on September seventh, twenty twenty. It is called Deep Learning and Deep Fakes, and recent developments in the deep fakes field include researchers creating tools that can detect tells in artificial voices, for example. But really, when you think about that, it's just a seesaw-like pattern where we'll see deep fake technology improve over time, and then our ability to detect deep fakes will improve, and this will keep going until one side or the other has the edge permanently. We kind of talk about that in this episode. In fact, deep fakes are very much in the spotlight right now, literally, on the popular TV series America's Got Talent: a team from the startup Metaphysic made it all the way to the final round of the competition by creating deep fake copies of the famous judges on the show, all in real time. It's equal parts entertaining and terrifying. Okay, maybe not quite equal parts. Anyway, enjoy this episode, Deep Learning and Deep Fakes. Now, before I get into today's episode, I want to give a little listener warning here. The topic at hand involves some adult content, including the use of technology to do stuff that can be unethical, illegal, hurtful, and just plain awful. Now, I think this is an important topic, but I wanted to give a bit of a heads up at this part of the episode, just in case any of you are listening to this podcast on, like, a family road trip or something. I think this is an important topic and I think everyone should know about it and think about it. But I also respect that for some people this subject might be a bit taboo.
Speaker 1: So let's go on with the episode. Back in nineteen ninety-three, a movie called Rising Sun, directed by Philip Kaufman, based on a Michael Crichton novel, and starring Wesley Snipes and Sean Connery, came out in theaters. Now, I didn't see it in theaters, but I did catch it when it came on, you know, HBO or Cinemax or something later on. The movie included a sequence that I found to be totally unbelievable. And I'm not talking about buying into Sean Connery being an expert on Japanese culture and business practices. Actually, side note, Sean Connery has an interesting history of playing unlikely characters, such as in Highlander, where he played an immortal who is supposedly Egyptian, who then lived in feudal Japan and ended up in Spain, where he became known as Ramirez, and all the while he's talking to a Scottish Highlander who's played by a Belgian actor. But I'm getting way off track here. Besides, I've heard Crichton actually wrote the character while thinking of Connery, so, you know, what the heck do I know? In the film, Snipes and Connery are investigators, and they're looking into a homicide that happened at a Japanese business but on American soil. The security system in the building captured video of the homicide, and the identity of the killer appears to be a pretty open and shut case. But that's not how it all turns out. The investigators talk to a security expert played by Tia Carrere, and she demonstrates in real time how video footage can be altered. She records a short video of Connery and Snipes, loads that onto a computer, freezes a frame of the video, and essentially performs a cut and paste job, swapping the heads of our two lead characters. Then she resumes the video and the head swap remains in place. And that head swap stuff is possible. I mean, clearly it has to be possible, because you actually do see that effect in the film itself. But it takes a bit more than a quick cut and paste job. We'll leave off of that for now.
Speaker 1: The whole point of that sequence, apart from showing off some cinema magic, is to demonstrate to the investigators that video, like photographs, can be altered. The expert has detected a blue halo around the face of the supposed murderer in the footage, indicating that some sort of trickery has happened. She also reveals that she cannot magically restore the video to its previous unaltered state, which I think was actually a nice change of pace for a movie. By the way, I think this movie is really, you know, not good, like not worth your time, but that's just my opinion. Anyway, for years this kind of video sorcery was pretty much limited to the film and TV industries. It usually required a lot of planning beforehand, so it wasn't as simple as just taking footage that was already shot and changing it in post on a whim with a couple of clicks of a button. If it were, we would see a lot fewer mistakes left in movies and television, because you could catch them later and just fix them. The tricks were possible, they were just difficult to pull off. It just wasn't something you or I would ever encounter in our day to day lives. But today we live in a different world, a world that has examples of synthetic media, commonly referred to as deep fakes. These are videos that have been altered or generated so that the subject of the video is doing something that they probably really would or could never do. They've brought into question whether or not video evidence is even reliable, much as the film Rising Sun was suggesting. We already know that eyewitness testimony is terribly unreliable. Our perception and memories play tricks on us, and we can, quote unquote, remember stuff that just didn't happen the way things actually unfolded in reality. But now we're looking at video evidence in potentially the same light. I mean, it's scary.
Speaker 1: So today we're going to learn about synthetic media, how it can be generated, the implications that follow from that sort of reality, and ways that people are trying to counteract a potentially dangerous threat. You know, fun stuff. Now, first, the term synthetic media has a particular meaning. It refers to art created through some sort of automated process, so it's a largely hands-off approach to creating the final art piece. Now, under that definition, the example of Rising Sun would not apply here, because we see in the film, and presumably this happens in the book as well, but I haven't read the book, that a human being actually makes the changes. People have used tools to alter the video footage. This would be more like using Photoshop to touch up a still image, with the computer system presumably doing some of the work in the background to keep things matched up. Either that, or you would need to alter each image in the footage frame by frame, or use some sort of matte approach. To learn more about mattes, you can listen to my episode about how blue and green screens work. Synthetic media as a general practice has been around for centuries. Artists have set up various contraptions to create works with little or no human guidance. In the twentieth century we started to see a movement called generative art take form. This type of art is all about creating a system that then creates or generates the finished art piece. That would mean that the finished work, such as a painting, wouldn't reflect the feelings or thoughts of the artist who created the system. In fact, it starts to raise the question: what is the art? Is it the painting that came about due to a machine following a program of some sort, or is the art the program itself? Is the art the process by which the painting was made? Now, I'm not here to answer that question. I just think it is an interesting question to ask.
Speaker 1: Sometimes people ask much less polite questions, such as: is it art at all? Some art critics went out of their way to dismiss generative art in the early days. They found it insulting, but hey, that's kind of the history of art in general. Each new movement in art inevitably finds both supporters and critics as it emerges. If anything, you might argue that such a response legitimizes the movement in, you know, a weird way. If people hate it, it must be something. In two thousand eighteen, an artist collective called Obvious, located out of Paris, France, submitted portrait-style paintings that were created not by an actual human painter, but by an artificially intelligent system. Now, they looked a lot like typical eighteenth century style portraits. There was no attempt to pass off the portrait as if it were actually made by a human artist. In fact, the appeal of the piece was largely due to it being synthetically generated. It went to auction at Christie's, and the AI-created painting fetched more than four hundred thousand dollars. And the way the group trained their AI is relevant to our discussion about deep fakes. The collective relied on a type of machine learning called generative adversarial networks, or GAN, which in turn depends on deep learning. So it looks like we've got a few things we're gonna have to define here. Now, I'm going to keep things fairly high level, because, as it turns out, there are a few different ways to create machine learning models, and to go through all of them in exhaustive detail would represent a university-level course in machine learning. I have neither the time for that nor the expertise. I would do a terrible job, so we'll go with a high-level perspective here. First, a generative adversarial network uses two systems. You have a generator and you have a discriminator. Both of these systems are a type of neural network. A neural network is a computing model that is inspired by the way our brains work.
Speaker 1: Our brains contain billions of neurons, and these neurons work together, communicating through electrical and chemical signals, controlling and coordinating pretty much everything in our bodies. With computers, the neurons are nodes. The job of a node is, you know, supposed to be kind of like that of a neuron cell in the brain. It's to take in multiple weighted input values and then generate a single output value. Now, the word weighted, W-E-I-G-H-T-E-D, weighted, is really important here, because the larger an input's weight, the more that input will have an effect on whatever the output is. So it kind of comes down to which inputs are the most important for that node's particular function. Now, if I were to make an analogy, I would say your boss hands you three tasks to do. One of those tasks has the label extremely important, the second task has the label critically important, and the third task has a label saying you should have finished that one before it was handed to you. Okay, so that's just some sort of snarky office humor that I needed to get off my chest. But more seriously, imagine a node accepting three inputs. In this example, input one has a fifty percent weight, input two has a forty percent weight, and input three has a ten percent weight. That adds up to one hundred percent, and that would tell you that the output that node generates will be most affected by input one, followed by input two, and then input three would have a smaller effect on whatever the output is. Each node applies a nonlinear transformation to the input values, again affected by each input's weight value, and that generates the output value. The details of that really are not important for our episode. It involves performing changes on variables that in turn change the correlation between variables, and it gets a bit mathy, and we would get lost in the weeds pretty quickly.
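To make that weighted-input idea concrete, here is a minimal sketch of a single node in Python. The fifty, forty, and ten percent weights mirror the example above; the sigmoid squashing function, the bias term, and the sample input values are illustrative choices, not anything specified in the episode.

```python
import math

def node_output(inputs, weights, bias=0.0):
    """One artificial 'neuron': take the weighted sum of the inputs, then squash it."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The sigmoid is one common nonlinear transformation; it keeps the output between 0 and 1.
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Three inputs with weights of 0.5, 0.4, and 0.1, so input one affects the output the most.
print(node_output([1.0, 0.2, 0.9], [0.5, 0.4, 0.1]))
```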
Speaker 1: The important thing to remember is that a node within a neural network takes in a weighted sum of inputs, then performs a process on those inputs before passing the result on as an output. Then some other node a layer down will accept that output, along with outputs from a couple of other nodes one layer up, and will then perform an operation based on those weighted inputs and pass that on to the next layer, and so on. So these nodes are in layers, like, you know, a cake. One layer of nodes processes some inputs and sends the results on to the next layer of nodes, and then that one passes its results on to the next one, and the next one, and so on. This isn't a new idea. Computer scientists began theorizing and experimenting with neural network approaches as far back as the nineteen fifties with the perceptron, which was a hypothetical system that was described by Frank Rosenblatt of Cornell University. But it wasn't until the last decade that computing power and our ability to handle a lot of data reached a point where these sorts of learning models could really take off. The goal of this system is to train it to perform a particular task within a certain level of precision. The weights I mentioned are adjustable, so you can think of it as teaching a system which bits are the most important in order to do whatever it is the system is supposed to do to achieve your task. These are the bits that are the most important and therefore should matter the most when you weigh a decision. This is a bit easier if we talk about a similar system: the version of IBM's Watson that played on Jeopardy. That system famously was not connected to the Internet. It had to rely on all the information that was stored within itself. When the system encountered a clue in Jeopardy, it would analyze the clue, and then it would reference its database to look for possible answers to whatever that clue was.
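Here is a rough Python sketch of that cake-like layering: one layer of nodes processes the inputs and hands its outputs to the next layer, and so on down to a final output. The layer sizes and the weight values are invented purely for illustration, and the node function reuses the same weighted-sum-plus-squash idea from the earlier sketch.

```python
import math

def node(inputs, weights, bias=0.0):
    """Weighted sum of inputs, squashed by a sigmoid."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))

def forward(inputs, layers):
    """Push values through layers of nodes: each layer's outputs feed the next layer."""
    activations = inputs
    for layer in layers:  # a layer is just a list of (weights, bias) pairs, one per node
        activations = [node(activations, w, b) for w, b in layer]
    return activations

# Two inputs, a hidden layer of two nodes, then a single output node at the bottom.
layers = [
    [([0.9, 0.1], 0.0), ([0.3, 0.7], 0.0)],
    [([0.6, 0.4], 0.0)],
]
print(forward([1.0, 0.5], layers))  # the final layer's output, a value between 0 and 1
```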
Speaker 1: The system would weigh those possible answers and attempt to determine which, if any, were the most likely to be correct. If the certainty was over a certain threshold, the system would buzz in with its answer. If no response rose above that threshold, the system would not buzz in. So you could say that Watson was playing the game with a best guess sort of approach. Neural networks do essentially that sort of processing. With this particular type of approach, we know what we want the outcome to be, so we can judge whether or not the system was successful. After each attempt, we can adjust the weights on the inputs between nodes to refine the decision making process and get more accurate results. If the system succeeds in its task, we can increase the weights that contributed to the system picking the correct answer and decrease the weights of the inputs that did not contribute to the successful response. If the system done messed up and gave the wrong answer, then we do the opposite. We look at the inputs that contributed to the wrong answer, we diminish their weights, and we increase the weights of the other inputs, and then we run the test again. A lot. I'll explain a bit more about this process when we come back, but first let's take a quick break. Early in the history of neural networks, computer scientists were hitting some pretty hard stops due to the limitations of computing power at the time. Early networks were only a couple of layers deep, which really meant they weren't terribly powerful, and they could only tackle rudimentary tasks, like figuring out whether or not a square is drawn on a piece of paper. That isn't terribly sophisticated. In nineteen eighty-six, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper titled Learning Representations by Back-propagating Errors. This was a big breakthrough for deep learning. It all has to do with a deep learning system improving its ability to complete a specific task.
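Before the back-propagation idea gets unpacked in the next bit, here is a very small sketch of the simpler adjust-the-weights-after-each-attempt loop described before the break, using a classic perceptron-style update rule as a stand-in. One simplification worth flagging: in this particular rule the weights only move when the answer was wrong; they are left alone on a correct answer. The toy data, the learning rate, and the threshold are all arbitrary illustrative values.

```python
def predict(inputs, weights, threshold=0.5):
    """Answer 1 if the weighted sum of the inputs clears the threshold, else 0."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) > threshold else 0

def train(examples, weights, learning_rate=0.1, rounds=20):
    for _ in range(rounds):
        for inputs, target in examples:
            error = target - predict(inputs, weights)  # 0 if right, +1 or -1 if wrong
            # Wrong answer: nudge each weight up or down in proportion to how much
            # that input contributed. Right answer: error is 0, so nothing changes.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    return weights

# Toy task: the "correct answer" is simply whether the first input is switched on.
examples = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]
print(train(examples, weights=[0.0, 0.0]))
```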
Speaker 1: And basically the algorithm's job is to go from the output layer, you know, where the system has made a decision, and then work backward through the neural network, adjusting the weights that led to an incorrect decision. So let's say it's a system that is looking to figure out whether or not a cat is in a photograph, and it says there's a cat in this picture, and you look at the picture and there is no cat there. Then you would look at the inputs one level back, just before the system said here's a picture of a cat, and you'd say, all right, which of these inputs led the system to believe this was a picture of a cat? And then you would adjust those. Then you would go back one layer up, so you're working your way up the model, and say which inputs here led to it giving the outputs that led to the mistake, and you do this all the way up until you get to the input level at the top of the computer model. You are back propagating, and then you run the test again to see if you've got improvement. It's exhaustive, but it also drastically improved neural network performance, much faster than just throwing more brute force at it. The algorithm essentially is checking to see if a small change in each input value received by a layer of nodes would have led to a more accurate result. So it's all about going from that output and working your way backward. In two thousand twelve, Alex Krizhevsky published a paper that gave us the next big breakthrough. He argued that a really deep neural network with a lot of layers could give really great results if you paired it with enough data to train the system. So you needed to throw lots of data at these models, and it needed to be an enormous amount of data. However, once trained, the system would produce lower error rates. So yeah, it would take a long time, but you would get better results.
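For anyone who wants to see the backward pass in code, here is a toy example in Python with NumPy: a tiny two-layer network trained on a made-up cat-or-no-cat style label, where the error measured at the output is walked back through the layers to nudge the weights. The network size, the fabricated data, the learning rate, and the number of steps are all arbitrary choices for the sketch, not anything from the papers mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((200, 3))                                    # 200 examples, 3 input features each
y = (X[:, 0] + X[:, 1] > 1.0).astype(float).reshape(-1, 1)  # toy "cat or no cat" label

W1 = rng.normal(scale=0.5, size=(3, 4)); b1 = np.zeros(4)   # hidden layer weights
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)   # output layer weights

learning_rate = 1.0
for step in range(2000):
    # Forward pass: inputs to hidden layer to a single "is it a cat?" score.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: start with the error at the output, then walk it back a layer.
    d_output = (output - y) * output * (1 - output)          # error signal at the output layer
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)     # how much each hidden node contributed

    # Nudge every weight in the direction that shrinks the error.
    W2 -= learning_rate * (hidden.T @ d_output) / len(X)
    b2 -= learning_rate * d_output.mean(axis=0)
    W1 -= learning_rate * (X.T @ d_hidden) / len(X)
    b1 -= learning_rate * d_hidden.mean(axis=0)

print("wrong answers:", int(((output > 0.5) != y).sum()), "out of", len(X))
```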
Speaker 1: Now, at the time, a good error rate for such a system was twenty-five percent. That means one out of four conclusions the system would come to would be wrong. If you ran it across a long enough number of decisions, you would find that one out of every four wasn't right. The system that Alex's team worked on produced results that had an error rate of sixteen percent, so much lower. And then in just five years, with more improvements to this process, the classification error rate had dropped down to two point three percent for deep learning systems. So from twenty-five percent all the way down to two point three percent. It was really powerful stuff. Okay, so you've got your artificial neural network. You've got your layers and layers of nodes. You've adjusted the weights of the inputs into each node to see if your system can identify, you know, pictures of cats, and you start feeding images to this system, lots of them. This is the domain that you are feeding to your system. The more images you can feed to it, the better. And you want a wide variety of images of all sorts of stuff, not just of different types of cats, but stuff that most certainly is not a cat, like dogs, or cars, or chartered public accountants. You name it. And you look to see which images the system identifies correctly and which ones it screws up, both images it says have cats in them that actually don't have cats in them, and images the system has identified as having no cat when there is a cat there. This guides you in adjusting the weights again and again, and you start over and you do it again, and that's your basic deep learning system, and it gets better over time as you train it. It learns. Now, let's transition over to the adversarial systems I mentioned earlier, because they take this and twist it a little bit. So you've got two artificial neural networks and they are using this general approach to deep learning, and you're setting them up so that they feed into each other.
Speaker 1: One network, the generator, has the task of learning how to do something, such as create an eighteenth century style portrait, based off lots and lots of examples of the real thing, the domain, or problem domain. The second network, the discriminator, has a different job. It has to tell the difference between authentic portraits that came from the problem domain and computer generated portraits that came from the generator itself. So essentially, the discriminator is like the model I mentioned earlier that was identifying pictures of cats. It's doing the same sort of thing, except instead of saying cat or no cat, it's saying real portrait or computer generated portrait. So there are essentially two outcomes the discriminator could reach, and that's whether an image is computer generated or it isn't. So do you see where this is going? You train up both models. You have the generator attempt to make its own version of something, such as that eighteenth century portrait. It does so, it designs the portrait, based on what the model believes are the key elements of a portrait, so things like colors, shapes, the ratios of sizes, like, you know, how large should the head be in relation to the body. All of these factors and many more come into play. The generator creates its own idea of what a portrait is supposed to look like, and chances are the early rounds of this will not be terribly convincing. The results are then fed to the discriminator, which tries to suss out which of the images fed to it are computer generated and which ones aren't. After that round, both models are tweaked. The generator adjusts input weights to get closer to the genuine article, and the discriminator adjusts weights to reduce false positives, or to catch more of the computer generated images. And then you go again and again and again and again, and they both get better over time.
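Here is a compact sketch of that back-and-forth in Python using PyTorch, on a deliberately tiny problem: instead of portraits, the "real thing" is just numbers drawn from one bell curve, which keeps the example short. The network sizes, learning rates, and round counts are arbitrary, and this is not a reconstruction of how the Obvious collective actually trained their system; it only shows the discriminator round and the generator round taking turns.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
# Generator: turns random noise into a candidate "fake" sample.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
# Discriminator: outputs a probability that a sample is real rather than generated.
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_G = optim.Adam(G.parameters(), lr=1e-3)
opt_D = optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(3000):
    real = torch.randn(64, 1) * 2 + 5      # the "problem domain": a bell curve around 5
    fake = G(torch.randn(64, 8))           # the generator's current attempt

    # Discriminator round: label real samples 1 and generated samples 0, then adjust D.
    d_loss = loss_fn(D(real), torch.ones(64, 1)) + loss_fn(D(fake.detach()), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator round: try to make the discriminator call the fakes real, then adjust G.
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

samples = G(torch.randn(1000, 8))
print("generated mean:", samples.mean().item())  # should drift toward the real mean of 5
```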
Speaker 1: So, assuming everything is working properly, over time the adjustment of input weights will lead to more convincing results, and given enough time and enough repetition, you'll end up with a computer generated painting that you can auction off for nearly half a million dollars. Though keep in mind that huge price relates back to the novelty of it being an early AI generated painting. It would be shocking to me if we saw that actually become a trend. Also, the painting, while interesting, isn't exactly so astounding as to make you think there's no way a machine did that. You'd look at it and go, yeah, I can imagine a machine did that one. A group of computer scientists first described the generative adversarial network architecture in a paper in two thousand fourteen, and like other neural networks, these models require a lot of data. The more the better. In fact, smaller data sets mean the models have to make some pretty big assumptions, and you tend to get pretty lousy results. More data, as in more examples, teaches the models more about the parameters of the domain, whatever it is they are trying to generate. It refines the approach. So if you have a sophisticated enough pair of models and you have enough data to fill up a domain, you can generate some convincing material, and that includes video. And this brings us around to deep fakes. In addition to generative adversarial networks, a couple of other things really converged to create the techniques and trends and technology that would allow for deep fakes proper. In nineteen ninety-seven, Malcolm Slaney, Michele Covell, and Christoph Bregler wrote some software that they called the Video Rewrite program. The software would analyze faces and then create or synthesize lip animation, which could be matched to pre-recorded audio. So you could take some film footage of a person and reanimate their lips so that they could appear to say all sorts of things, which in some ways set the stage for deep fakes.
Speaker 1: In this case, it was really just focusing on the lips and the general area around the lips, so you weren't changing the rest of the expression of the face, and you would have to, you know, keep your recording about the same length as whatever the film clip was, or you would have to loop the film clip over and over, which would make it, you know, far more obvious that this was a fake. In addition, motion tracking technology was advancing over time too, and this also became an important tool in computer animation. This tool would also be used by deep fake algorithms to create facial expressions, manipulating the digital image just as it would if it were a video game character or a Pixar animated character. Typically, you need to start with some existing video in order to manipulate it. You're not actually computer generating the animation. Like, you're not creating a computer generated version of whomever it is you're doing the fake of. You're using existing imagery and then manipulating that existing imagery, so it's a little different from computer animation. In two thousand sixteen, students and faculty at the Technical University of Munich created the Face2Face project, that's face, the numeral two, and then face, and this was particularly jaw dropping to me at the time. When I first saw these videos, I was floored. They created a system that had a target actor. This would be the video of the person that you want to manipulate. In the example they used, it was former US President George W. Bush. Their process also had a source actor. This was the source of the expressions and facial movements you would see in the target. So it's kind of like a digital puppeteer in a way. But the way they did it was really cool.
Speaker 1: They had a camera trained on the source actor, and it would track specific points of movement on the source actor's face, and then the system would manipulate the same points of movement on the target actor's face in the video. So if the source actor smiled, then the target smiled. The source actor would smile, and then you would see George W. Bush in the video smile in real time. It was really strange. They used this looping video of George W. Bush wearing a neutral expression. They had to start with that as their sort of zero point, and I gotta tell you, it really does look like former president George W. Bush is having a bit of a freak out on a looping video, because he keeps opening his mouth, closing his mouth, grimacing, raising his eyebrows. You need to watch this video. It is still available online to check out. In twenty seventeen, students and faculty over at the University of Washington created the Synthesizing Obama project, in which they trained a computer model to generate a synthetic video of former US President Barack Obama, and they made it lip sync to a pre-recorded audio clip from one of Obama's addresses to the nation. They actually had the original video of that address for comparison, so they could look back at that and see how their generated one compared to the real thing. Their approach used a model that analyzed hundreds of hours of video footage of Obama speaking, and it mapped specific mouth shapes to specific sounds. It would also include some of Obama's mannerisms, such as how he moves his head when he talks or uses facial expressions to emphasize words. And watching the real one next to the generated one is pretty strange. You can tell the generated one isn't quite right. It's not matching the audio exactly, at least not in the early versions, but it's fairly close, and it might even pass casual inspection for a lot of people who weren't, like, you know, actually paying attention.
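To give a flavor of that point-tracking idea, here is a tiny illustrative Python snippet: measure how far each tracked point on the source actor's face has moved from its neutral position, then apply that same movement to the matching points on the target's neutral face. Every coordinate and the scale factor here are invented, and a real system like Face2Face does far more sophisticated modeling and rendering than this; it is only meant to show the transfer step.

```python
import numpy as np

# Made-up landmark positions (x, y) for a few tracked points, e.g. mouth corners and chin.
source_neutral = np.array([[100.0, 200.0], [140.0, 200.0], [120.0, 230.0]])  # source at rest
source_current = np.array([[ 97.0, 195.0], [143.0, 195.0], [120.0, 232.0]])  # source smiling
target_neutral = np.array([[310.0, 410.0], [360.0, 410.0], [335.0, 450.0]])  # target at rest

# How far each tracked point moved on the source actor's face this frame.
displacement = source_current - source_neutral

# Apply the same movement to the target's points, crudely scaled for a bigger face.
scale = 1.25
target_current = target_neutral + scale * displacement
print(target_current)
```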
478 00:30:20,920 --> 00:30:26,120 Speaker 1: Authors Morass and Alexandro defined deep fakes as quote the 479 00:30:26,160 --> 00:30:31,480 Speaker 1: product of artificial intelligence applications that merge, combine, replace, and 480 00:30:31,600 --> 00:30:35,719 Speaker 1: superimpose images and video clips to create fake videos that 481 00:30:35,760 --> 00:30:41,280 Speaker 1: appear authentic end quote. They first emerged in seventeen and 482 00:30:41,360 --> 00:30:45,040 Speaker 1: so this is a pretty darn young application of technology. 483 00:30:45,680 --> 00:30:48,880 Speaker 1: One thing that is worrisome is that once someone has 484 00:30:48,920 --> 00:30:52,640 Speaker 1: access to the tools, it's not that difficult to create 485 00:30:52,720 --> 00:30:55,760 Speaker 1: a deep fake video. You pretty much just need a 486 00:30:55,800 --> 00:30:59,560 Speaker 1: decent computer, the tools, a bit of know how on 487 00:30:59,640 --> 00:31:02,840 Speaker 1: how to do it, and some time you also need 488 00:31:03,000 --> 00:31:06,720 Speaker 1: some reference material, as in like videos and images of 489 00:31:06,760 --> 00:31:10,560 Speaker 1: the person that you are replicating, and like the machine 490 00:31:10,640 --> 00:31:13,960 Speaker 1: learning systems I've mentioned, the more reference material you have, 491 00:31:14,200 --> 00:31:17,480 Speaker 1: the better. That's why the deep fakes you encounter these 492 00:31:17,560 --> 00:31:21,560 Speaker 1: days tend to be of notable famous people like celebrities 493 00:31:21,560 --> 00:31:25,560 Speaker 1: and politicians. Mainly there's no shortage of reference material for 494 00:31:25,600 --> 00:31:28,960 Speaker 1: those types of individuals, and so they are easier to 495 00:31:29,000 --> 00:31:32,360 Speaker 1: replicate with deep fakes than someone who maintains a much 496 00:31:32,560 --> 00:31:35,520 Speaker 1: lower profile. Not to say that that will always be 497 00:31:35,600 --> 00:31:38,160 Speaker 1: the case, or that there aren't systems out there that 498 00:31:38,240 --> 00:31:43,680 Speaker 1: can accept smaller amounts of reference material. It's just harder 499 00:31:43,720 --> 00:31:50,200 Speaker 1: to make a convincing version with fewer samples. But in 500 00:31:50,320 --> 00:31:53,760 Speaker 1: order to make a convincing fake, the system really has 501 00:31:53,800 --> 00:31:57,920 Speaker 1: to learn how a person moves. All those facial expressions matter. 502 00:31:58,160 --> 00:32:01,200 Speaker 1: It also has to learn how a person sounds. Will 503 00:32:01,240 --> 00:32:07,240 Speaker 1: get into sound a little bit later. But mannerisms, inflection, accent, emphasis, cadence, 504 00:32:07,360 --> 00:32:09,920 Speaker 1: quirks and ticks, all of these things have to be 505 00:32:09,960 --> 00:32:13,760 Speaker 1: analyzed and replicated to make a convincing fake, and it 506 00:32:13,800 --> 00:32:16,120 Speaker 1: has to be done just right or else it comes 507 00:32:16,120 --> 00:32:20,960 Speaker 1: off as creepy or unrealistic. Think about how impressionists will 508 00:32:21,000 --> 00:32:24,600 Speaker 1: take a celebrity's manner of speech and then heighten some 509 00:32:24,720 --> 00:32:28,200 Speaker 1: of it in comedic effect. 
Speaker 1: You'll hear it all the time with folks who do impressions of people like Jack Nicholson or Christopher Walken or Barbra Streisand, people who have a very particular way of speaking. Impressionists will take those as markers and they really punch in on them. Well, a deep fake can't really do that too much, or else it won't come across as genuine. It'll feel like you're watching a famous person impersonating themselves, which is weird. Now, the earliest mention of deep fakes I can find dates to a two thousand seventeen Reddit forum in which users shared deep faked videos that appeared to show female celebrities in sexual situations. Heads and faces had been replaced, and the actors in pornographic movies had their heads or faces swapped out for these various celebrities. Now, the fakes can look fairly convincing, extremely convincing in some cases, which can lead to some people assuming that the videos are genuine and that the folks that they saw in the videos are really the ones who were in them. And obviously that's a real problem, right? I mean, with this technology, given enough reference data to feed a system, someone could fabricate a video that appears to put a person in a compromising position, whether it's a sexual act, or making damaging statements, or committing a crime, or whatever. And there are tools right now that allow you to do pretty much what the Face2Face tool was doing back in two thousand sixteen, a program called Avatarify, which is just not that easy to say. Anyway, it can run on top of live streaming conference services like Zoom and Skype, and you can swap out your face for a celebrity's face. Your facial expressions map to the computer manipulated celebrity face. It just looks at you through your webcam, and then if you smile, the celebrity image smiles, et cetera. It's like that old Face2Face program. It does need a pretty beefy PC to manage doing all this, because you're also running that live streaming service underneath it.
Speaker 1: It's also not exactly user friendly. You need some programming experience to really get it to work. But it is widely accessible, as the source code is open source and it's on GitHub, so anyone can get it. Samantha Cole, who writes for Vice, has covered the topic of deep fakes pretty extensively, and the potential harm they can cause, and I recommend you check out her work if you're interested in learning more about that. Do be warned that Cole covers some pretty adult themed topics. I think she does great work and very important work, but as a guy who grew up in the Deep South, it's also the kind of stuff that occasionally makes me clutch my pearls. But that's more of a statement about me than her work. She does great work. I think most of us can imagine plenty of scenarios in which this sort of technology could cause mischief on a good day and catastrophe on a bad day, whether it's spreading misinformation, creating fear, uncertainty, and doubt, FUD, or making people seem to say things they never actually said, or contributing to an ugly subculture in which people try to make their more base fantasies a reality by putting one person's head on another person's body. You know, it's not great. There are legitimate uses of the technology too, of course. You know, tech itself is rarely good or bad. It's all in how we use it. But this particular technology has a lot of potentially harmful uses, and Samantha Cole has done a great job explaining them. When we come back, I'll talk a bit more about the war against deep fakes and how people are trying to prepare for a world that is increasingly filled with media we can't really trust. But first let's take a quick break.
Before the break, 575 00:36:33,680 --> 00:36:37,680 Speaker 1: I mentioned Samantha Cole, who has written extensively about deep fakes, 576 00:36:37,719 --> 00:36:40,480 Speaker 1: and one point she makes that I think is important 577 00:36:40,520 --> 00:36:44,880 Speaker 1: for us to note is that the vast majority of 578 00:36:44,960 --> 00:36:49,600 Speaker 1: instances of deep fake videos haven't been some manufactured video 579 00:36:49,640 --> 00:36:53,960 Speaker 1: of a political leader saying inflammatory things. That continues to 580 00:36:53,960 --> 00:36:57,480 Speaker 1: be a big concern. There's a genuine fear that someone 581 00:36:57,560 --> 00:37:01,040 Speaker 1: is going to manufacture a video in which a politician 582 00:37:01,080 --> 00:37:04,359 Speaker 1: appears to say or do something truly terrible in an 583 00:37:04,360 --> 00:37:08,560 Speaker 1: effort to either discredit the politician or perhaps instigate a 584 00:37:08,680 --> 00:37:13,600 Speaker 1: conflict with some other group. There are literal doomsday scenarios 585 00:37:13,600 --> 00:37:18,440 Speaker 1: in which such a video would prompt a massive military response, 586 00:37:18,719 --> 00:37:21,320 Speaker 1: though that does seem like it might be a little 587 00:37:21,440 --> 00:37:24,239 Speaker 1: far-fetched. Then again, heck, I don't know, considering the world 588 00:37:24,239 --> 00:37:26,040 Speaker 1: we live in, maybe it's not that big of a 589 00:37:26,080 --> 00:37:30,640 Speaker 1: stretch. Anyway, Cole's point is that so far, that has 590 00:37:30,800 --> 00:37:34,239 Speaker 1: not happened. She points out that the most frequent use 591 00:37:34,400 --> 00:37:37,160 Speaker 1: for the tech tends to be either people goofing around 592 00:37:37,320 --> 00:37:41,040 Speaker 1: or, disturbingly, using it to, in her words, quote, take 593 00:37:41,080 --> 00:37:45,240 Speaker 1: ownership of women's bodies in non-consensual porn, end quote. 594 00:37:45,560 --> 00:37:48,759 Speaker 1: Cole argues that the reason we haven't really seen deep 595 00:37:48,760 --> 00:37:52,000 Speaker 1: fakes used much outside of these realms, apart from a 596 00:37:52,040 --> 00:37:56,040 Speaker 1: few advertising campaigns, is that people are pretty good at 597 00:37:56,120 --> 00:37:59,879 Speaker 1: spotting deep fakes. They aren't quite at a level where 598 00:38:00,000 --> 00:38:03,040 Speaker 1: they can easily pass for the real thing. There's still 599 00:38:03,080 --> 00:38:06,399 Speaker 1: something slightly off about them. They tend to butt up 600 00:38:06,440 --> 00:38:09,440 Speaker 1: against the uncanny valley. Now, for those of you not 601 00:38:09,600 --> 00:38:13,520 Speaker 1: familiar with that term, the uncanny valley describes the feeling 602 00:38:13,719 --> 00:38:17,000 Speaker 1: we humans get when we encounter a robot or a 603 00:38:17,040 --> 00:38:23,640 Speaker 1: computer-generated figure that closely resembles a human or human behavior, 604 00:38:24,239 --> 00:38:27,760 Speaker 1: but you can still tell it's not actually a person, 605 00:38:28,040 --> 00:38:30,200 Speaker 1: and it's not a good feeling. It tends to be 606 00:38:30,239 --> 00:38:34,120 Speaker 1: described as repulsive and disturbing, or at the very best, 607 00:38:34,640 --> 00:38:39,879 Speaker 1: off-putting. See also the animated film The Polar Express.
There's 608 00:38:39,920 --> 00:38:43,399 Speaker 1: a reason that when that film came out, people kind 609 00:38:43,440 --> 00:38:47,440 Speaker 1: of reacted negatively to the animation, and it's also a 610 00:38:47,480 --> 00:38:51,200 Speaker 1: reason why Pixar tends to prefer to go with stylized 611 00:38:51,280 --> 00:38:54,560 Speaker 1: human characters who are different enough from the way real 612 00:38:54,680 --> 00:38:58,040 Speaker 1: humans look to kind of bypass the uncanny valley. We just 613 00:38:58,120 --> 00:39:00,680 Speaker 1: think of that as a cartoon, not as something that's trying to 614 00:39:00,760 --> 00:39:04,280 Speaker 1: pass itself off as being human. But while there hasn't 615 00:39:04,320 --> 00:39:06,800 Speaker 1: really been a flood of fake videos hitting the Internet 616 00:39:06,920 --> 00:39:11,200 Speaker 1: with the intent to discredit politicians or infuriate specific people 617 00:39:11,320 --> 00:39:14,720 Speaker 1: or whatever, there remains a general sense that this is coming. 618 00:39:15,239 --> 00:39:18,480 Speaker 1: It's just not here now. The sense I get is 619 00:39:18,480 --> 00:39:21,840 Speaker 1: that people feel it's an inevitability, and there are already 620 00:39:21,880 --> 00:39:24,480 Speaker 1: folks working on tools that will help us sort out 621 00:39:24,480 --> 00:39:29,000 Speaker 1: the real stuff from the fakes. Take Microsoft, for example. 622 00:39:29,520 --> 00:39:34,240 Speaker 1: Their R and D division, fittingly called Microsoft Research, developed 623 00:39:34,239 --> 00:39:38,600 Speaker 1: a tool they called the Video Authenticator. This tool analyzes 624 00:39:38,760 --> 00:39:42,960 Speaker 1: video samples and looks for signs of deep fakery. In 625 00:39:43,000 --> 00:39:45,800 Speaker 1: a blog post written by Tom Burt and Eric Horvitz, 626 00:39:45,840 --> 00:39:50,520 Speaker 1: two Microsoft executives, they say, quote, it works by detecting 627 00:39:50,560 --> 00:39:54,160 Speaker 1: the blending boundary of the deep fake and subtle fading 628 00:39:54,280 --> 00:39:57,120 Speaker 1: or grayscale elements that might not be detectable by 629 00:39:57,120 --> 00:40:01,279 Speaker 1: the human eye. End quote. Now I'm no expert, but 630 00:40:01,480 --> 00:40:05,560 Speaker 1: to me, it sounds like the Video Authenticator is working 631 00:40:05,560 --> 00:40:09,720 Speaker 1: in a way that's not too dissimilar to a discriminator 632 00:40:10,040 --> 00:40:14,240 Speaker 1: in a generative adversarial network. I mean, the whole purpose 633 00:40:14,560 --> 00:40:18,000 Speaker 1: of the discriminator is to discriminate, or to tell the 634 00:40:18,040 --> 00:40:23,319 Speaker 1: difference between genuine, unaltered videos and computer-generated ones. So 635 00:40:23,520 --> 00:40:27,200 Speaker 1: the Video Authenticator is looking for telltale signs that a 636 00:40:27,320 --> 00:40:32,720 Speaker 1: video was not produced through traditional means but was computer generated. However, 637 00:40:32,760 --> 00:40:36,040 Speaker 1: that's the very thing that the generators in GAN 638 00:40:36,239 --> 00:40:39,080 Speaker 1: systems are looking out for.
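Since this whole back-and-forth rests on the generator-versus-discriminator idea, here is a minimal, illustrative sketch of one GAN training step, assuming PyTorch and two deliberately tiny stand-in networks. It is not Microsoft's tool or the architecture of any real deepfake system; it only shows the bare adversarial pattern: the discriminator learns to flag generated samples as fake, and the generator is then updated using the discriminator's verdict so its next fakes are harder to flag.

import torch
import torch.nn as nn

# Tiny stand-ins: a real deepfake system would use large image or video networks.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())  # generator
D = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1))              # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

def training_step(real_batch):
    n = real_batch.size(0)

    # 1) Discriminator update: real frames should score "real" (1),
    #    generated frames should score "fake" (0).
    fake = G(torch.randn(n, 64)).detach()  # detach so this pass doesn't update the generator
    d_loss = loss_fn(D(real_batch), torch.ones(n, 1)) + loss_fn(D(fake), torch.zeros(n, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator update: the generator is rewarded when the discriminator
    #    labels its output "real", so its weights shift toward whatever
    #    slips past the current detector.
    fake = G(torch.randn(n, 64))
    g_loss = loss_fn(D(fake), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy usage: random tensors stand in for flattened "real" video frames.
real_batch = torch.rand(32, 784) * 2 - 1
print(training_step(real_batch))

Keep that second, generator-side update in mind, because it is exactly the feedback loop described next.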
So when a generator 639 00:40:39,120 --> 00:40:43,760 Speaker 1: receives feedback that a video it generated did not slip 640 00:40:43,800 --> 00:40:47,960 Speaker 1: past the discriminator, it then tweaks those input weights and 641 00:40:48,080 --> 00:40:51,800 Speaker 1: starts to shift its approach in order to bypass whatever 642 00:40:51,840 --> 00:40:54,840 Speaker 1: it was that gave away its last attempt, and it 643 00:40:54,920 --> 00:40:59,440 Speaker 1: does this again and again. So the Video Authenticator might 644 00:40:59,480 --> 00:41:02,719 Speaker 1: work well for a given amount of time, but I 645 00:41:02,719 --> 00:41:05,319 Speaker 1: would suspect that in the long run, the deep fake 646 00:41:05,400 --> 00:41:10,440 Speaker 1: systems will become sophisticated enough to fool the authenticator. Of course, 647 00:41:10,960 --> 00:41:14,960 Speaker 1: Microsoft will continue to tweak the authenticator as well, and 648 00:41:15,040 --> 00:41:17,919 Speaker 1: it will become something of a seesaw battle as one 649 00:41:18,000 --> 00:41:22,040 Speaker 1: side outperforms the other temporarily, and then the balance will shift. 650 00:41:22,440 --> 00:41:24,719 Speaker 1: Though there may come a time where either the deep 651 00:41:24,760 --> 00:41:27,680 Speaker 1: fakes are too good and they don't set off any 652 00:41:27,719 --> 00:41:34,239 Speaker 1: alarms from the discriminator, or the discriminator gets so sensitive 653 00:41:34,640 --> 00:41:37,759 Speaker 1: that it starts to flag real videos and hits a 654 00:41:37,840 --> 00:41:41,640 Speaker 1: lot of false positives and calls them generated videos instead. 655 00:41:42,040 --> 00:41:44,719 Speaker 1: Either way, you reach a point where a tool like 656 00:41:44,760 --> 00:41:47,600 Speaker 1: this no longer really serves a useful purpose, and the 657 00:41:47,680 --> 00:41:51,239 Speaker 1: Video Authenticator will be obsolete. Now, this is something we 658 00:41:51,280 --> 00:41:54,680 Speaker 1: see in artificial intelligence all the time. If you remember 659 00:41:54,719 --> 00:41:57,760 Speaker 1: the good old days of CAPTCHA, you know, the proving 660 00:41:57,840 --> 00:42:00,399 Speaker 1: you're not a robot stuff. The stuff we were 661 00:42:00,400 --> 00:42:03,759 Speaker 1: told to do was typically to type in a series of 662 00:42:03,920 --> 00:42:06,680 Speaker 1: letters and numbers, and it wasn't that hard, 663 00:42:06,760 --> 00:42:10,320 Speaker 1: at least not at first. That's because the 664 00:42:10,560 --> 00:42:14,600 Speaker 1: text recognition algorithms of the time weren't very good. They 665 00:42:14,640 --> 00:42:19,480 Speaker 1: couldn't decipher mildly deformed text because the shapes of the 666 00:42:19,520 --> 00:42:22,920 Speaker 1: text fell too far outside the parameters of what the 667 00:42:22,960 --> 00:42:26,759 Speaker 1: system could recognize as a legitimate letter or number. You 668 00:42:26,800 --> 00:42:30,120 Speaker 1: make the number a little, you know, deformed, and then 669 00:42:30,160 --> 00:42:32,279 Speaker 1: suddenly the system's like, well, that doesn't look like a 670 00:42:32,360 --> 00:42:34,920 Speaker 1: three to me, because it's not in the shape of 671 00:42:34,920 --> 00:42:39,319 Speaker 1: a three.
But over time, people developed better text recognition 672 00:42:39,400 --> 00:42:42,600 Speaker 1: programs that could recognize these shapes even if they weren't 673 00:42:42,600 --> 00:42:46,480 Speaker 1: in a standard three orientation, and those systems began to 674 00:42:46,520 --> 00:42:51,560 Speaker 1: defeat those simple early CAPTCHAs, which required CAPTCHA designers to 675 00:42:51,640 --> 00:42:55,359 Speaker 1: make tougher versions. And eventually the machines got good enough 676 00:42:55,400 --> 00:42:58,920 Speaker 1: that they could match or even outperform humans, and at 677 00:42:58,960 --> 00:43:01,920 Speaker 1: that point those text-based CAPTCHAs proved to be more 678 00:43:02,000 --> 00:43:05,680 Speaker 1: challenging for people than for machines, which meant if you 679 00:43:05,800 --> 00:43:08,440 Speaker 1: used them, you defeated the whole purpose in the first place. 680 00:43:08,600 --> 00:43:11,640 Speaker 1: So while this escalation proved to be a challenge for security, 681 00:43:12,280 --> 00:43:15,680 Speaker 1: it was a boon for artificial intelligence. And while I 682 00:43:15,719 --> 00:43:19,680 Speaker 1: focused almost exclusively on the imagery of video here, the 683 00:43:19,760 --> 00:43:22,400 Speaker 1: same sort of stuff is going on with generated speech, 684 00:43:22,560 --> 00:43:28,040 Speaker 1: including generated speech that imitates specific voices. Like deep fake videos, 685 00:43:28,280 --> 00:43:31,080 Speaker 1: this approach works best if you have a really big 686 00:43:31,160 --> 00:43:35,600 Speaker 1: data set of recorded audio, so people like movie and 687 00:43:35,680 --> 00:43:41,640 Speaker 1: TV stars, news reporters, politicians, and, um, you know, podcasters, 688 00:43:42,400 --> 00:43:45,480 Speaker 1: we're great targets for this stuff. There might be hundreds 689 00:43:45,560 --> 00:43:48,880 Speaker 1: or, you know, in my case, thousands of hours of 690 00:43:48,920 --> 00:43:52,680 Speaker 1: recording material to work from. Training a model to use 691 00:43:52,760 --> 00:43:59,040 Speaker 1: the frequencies, timbre, intonation, pronunciation, pauses, and other mannerisms of 692 00:43:59,040 --> 00:44:02,560 Speaker 1: speech can result in a system that can generate vocals 693 00:44:02,640 --> 00:44:06,680 Speaker 1: that sound like the target, sometimes to a fairly convincing degree. 694 00:44:07,360 --> 00:44:10,160 Speaker 1: And for a while, to peek behind the curtain here, 695 00:44:10,760 --> 00:44:12,880 Speaker 1: we at tech Stuff were working with a company that 696 00:44:12,960 --> 00:44:15,080 Speaker 1: I'm not going to name, but they were going to 697 00:44:15,120 --> 00:44:17,680 Speaker 1: do something like this as an experiment. I was going 698 00:44:17,719 --> 00:44:20,200 Speaker 1: to do a whole episode on it, and I had 699 00:44:20,280 --> 00:44:25,640 Speaker 1: planned on crafting a segment of that episode only through text. 700 00:44:25,800 --> 00:44:28,520 Speaker 1: I was not going to actually record it myself, but 701 00:44:28,520 --> 00:44:32,240 Speaker 1: instead use a system that was trained on my voice 702 00:44:32,680 --> 00:44:37,320 Speaker 1: to replicate my voice and deliver that segment on its own. 703 00:44:37,680 --> 00:44:40,080 Speaker 1: I was curious if it could nail not just the 704 00:44:40,120 --> 00:44:44,239 Speaker 1: audio quality of my voice, which, let's be honest, is amazing. 705 00:44:44,920 --> 00:44:48,560 Speaker 1: That's sarcasm.
I can't stand listening to myself, but it 706 00:44:48,600 --> 00:44:53,000 Speaker 1: would also have to replicate how I actually make certain sounds. 707 00:44:53,080 --> 00:44:55,160 Speaker 1: Like, would it get the bit of the Southern accent 708 00:44:55,440 --> 00:44:59,200 Speaker 1: that's in my voice, or the way I emphasize certain words? 709 00:44:59,480 --> 00:45:01,960 Speaker 1: Would it pause for effect at all? Or would it 710 00:45:02,040 --> 00:45:05,759 Speaker 1: just robotically say one word after the next and only 711 00:45:05,840 --> 00:45:09,400 Speaker 1: pause when there was some helpful punctuation that told it 712 00:45:09,480 --> 00:45:12,880 Speaker 1: to do so? Would it indicate a question by raising 713 00:45:12,920 --> 00:45:16,040 Speaker 1: the pitch at the end of a sentence? Sadly, we 714 00:45:16,560 --> 00:45:20,600 Speaker 1: never got far with that particular project, so I don't 715 00:45:20,600 --> 00:45:22,440 Speaker 1: have any answers for you. I don't know how it 716 00:45:22,480 --> 00:45:25,040 Speaker 1: would have turned out. But clearly one of the things 717 00:45:25,080 --> 00:45:27,799 Speaker 1: I thought of was that it's a bit of a 718 00:45:27,840 --> 00:45:30,360 Speaker 1: red flag. If you can train a computer to sound 719 00:45:30,400 --> 00:45:33,839 Speaker 1: exactly like a specific person, that means you could make 720 00:45:33,920 --> 00:45:38,279 Speaker 1: that person say anything you like, and obviously, like deep 721 00:45:38,320 --> 00:45:41,839 Speaker 1: fake videos, that could have some pretty devastating consequences if 722 00:45:41,840 --> 00:45:47,120 Speaker 1: it were at all, you know, believable or seemed realistic. Now, 723 00:45:47,160 --> 00:45:50,120 Speaker 1: the company we were working with was working hard to 724 00:45:50,120 --> 00:45:52,360 Speaker 1: make sure that the only person to have access to 725 00:45:52,600 --> 00:45:55,520 Speaker 1: a specific voice would be the owner of that voice, 726 00:45:55,640 --> 00:45:59,600 Speaker 1: or presumably the company employing that person, though that does 727 00:45:59,640 --> 00:46:02,239 Speaker 1: bring up a whole bunch of other potential problems. Like, 728 00:46:02,280 --> 00:46:06,560 Speaker 1: can you imagine eliminating voice actors from a job because 729 00:46:06,600 --> 00:46:08,400 Speaker 1: you've got enough of their voice and you can just 730 00:46:08,560 --> 00:46:11,960 Speaker 1: replicate it? That wouldn't be great. But even so, it 731 00:46:12,080 --> 00:46:14,920 Speaker 1: was something I felt was both fascinating from a technology 732 00:46:14,960 --> 00:46:19,160 Speaker 1: standpoint and potentially problematic when it comes to an application 733 00:46:19,440 --> 00:46:22,880 Speaker 1: of that technology. One other thing I should mention is 734 00:46:22,960 --> 00:46:26,239 Speaker 1: that the Internet at large has been pretty active in 735 00:46:26,400 --> 00:46:29,799 Speaker 1: fighting deep fakes, not necessarily by detecting them, but by removing 736 00:46:29,840 --> 00:46:33,560 Speaker 1: the platforms from which they were being shared, Reddit being 737 00:46:33,600 --> 00:46:36,160 Speaker 1: a big one. The subreddit that was dedicated to deep 738 00:46:36,160 --> 00:46:39,640 Speaker 1: fakes has been shut down. So there have been 739 00:46:39,719 --> 00:46:41,600 Speaker 1: some of those moves as well.
Now, this is not 740 00:46:41,960 --> 00:46:46,080 Speaker 1: directly against the technology; it's more against the proliferation of 741 00:46:46,120 --> 00:46:51,120 Speaker 1: the output of that technology. As for detecting 742 00:46:51,160 --> 00:46:53,919 Speaker 1: deep fakes, it's interesting to me that people are even 743 00:46:54,000 --> 00:46:57,319 Speaker 1: developing tools to detect them, because to me, the best 744 00:46:57,360 --> 00:47:00,839 Speaker 1: tool so far seems to be human perception. It's not 745 00:47:01,080 --> 00:47:06,160 Speaker 1: that the images aren't really convincing, or that we can 746 00:47:06,200 --> 00:47:09,799 Speaker 1: suddenly detect these, you know, blending lines like the Video 747 00:47:09,840 --> 00:47:13,719 Speaker 1: Authenticator tool does. It's rather that it's just not hard for 748 00:47:13,800 --> 00:47:16,640 Speaker 1: us to spot a deep fake. Stuff just doesn't quite 749 00:47:16,960 --> 00:47:21,400 Speaker 1: look right in the way that people behave in these videos. 750 00:47:21,400 --> 00:47:25,960 Speaker 1: The vocals and animation often don't quite match. The expressions 751 00:47:26,320 --> 00:47:31,200 Speaker 1: aren't really natural, the progression of mannerisms feels synthetic and 752 00:47:31,280 --> 00:47:36,120 Speaker 1: not genuine. It just looks off. It's that uncanny 753 00:47:36,200 --> 00:47:39,760 Speaker 1: valley thing, and so just paying attention and thinking critically 754 00:47:39,760 --> 00:47:41,880 Speaker 1: can really help you suss out the fakes from the 755 00:47:41,920 --> 00:47:45,200 Speaker 1: real thing. Even if we reach a point where machines 756 00:47:45,320 --> 00:47:49,080 Speaker 1: can create a convincing enough fake to pass for reality, 757 00:47:49,360 --> 00:47:53,120 Speaker 1: we can still apply critical thinking, and we always should. Heck, 758 00:47:53,440 --> 00:47:55,960 Speaker 1: we should be applying critical thinking even when there's no 759 00:47:56,080 --> 00:47:59,399 Speaker 1: doubt as to the validity of the video, because there 760 00:47:59,400 --> 00:48:03,960 Speaker 1: may be reason enough to doubt the content of the video itself. 761 00:48:04,360 --> 00:48:07,600 Speaker 1: If I listen to a genuine scam artist in a 762 00:48:07,680 --> 00:48:12,200 Speaker 1: genuine video, that doesn't make the scam more legitimate. We 763 00:48:12,239 --> 00:48:15,080 Speaker 1: always need to use critical thinking. What I think is 764 00:48:15,120 --> 00:48:18,600 Speaker 1: most important is that we acknowledge the very real fact 765 00:48:18,880 --> 00:48:23,880 Speaker 1: that there are numerous organizations, agencies, governments, and other groups 766 00:48:23,920 --> 00:48:29,520 Speaker 1: that are actively attempting to spread misinformation and disinformation. There 767 00:48:29,560 --> 00:48:34,799 Speaker 1: are entire intelligence agencies dedicated to this endeavor, and then 768 00:48:35,200 --> 00:48:38,440 Speaker 1: there are more independent groups that are doing it for 769 00:48:38,520 --> 00:48:41,960 Speaker 1: one reason or another, typically either to advance a particular 770 00:48:42,160 --> 00:48:45,879 Speaker 1: political agenda or just to make as much money as 771 00:48:46,000 --> 00:48:50,080 Speaker 1: quickly as possible. This is beyond doubt or question.
There 772 00:48:50,120 --> 00:48:54,279 Speaker 1: are numerous misinformation campaigns that are actively going on out 773 00:48:54,320 --> 00:48:57,560 Speaker 1: there in the real world right now. Most of them 774 00:48:57,840 --> 00:49:01,920 Speaker 1: are not depending on deep fakes, because one, deep fakes 775 00:49:01,960 --> 00:49:05,200 Speaker 1: aren't really good enough to fool most people right now, 776 00:49:05,640 --> 00:49:08,840 Speaker 1: and two, they don't need the deep fakes in the 777 00:49:08,880 --> 00:49:11,640 Speaker 1: first place. There are other, simpler methods that 778 00:49:11,760 --> 00:49:15,600 Speaker 1: don't need nearly the processing power and that work just fine. 779 00:49:15,880 --> 00:49:18,440 Speaker 1: Why would you go through the trouble of synthesizing a 780 00:49:18,560 --> 00:49:21,080 Speaker 1: video if you can get a better response with a 781 00:49:21,120 --> 00:49:25,160 Speaker 1: blog post filled with lies or half truths? It's just 782 00:49:25,280 --> 00:49:28,759 Speaker 1: not a great return on investment. So bottom line, be 783 00:49:28,960 --> 00:49:33,799 Speaker 1: vigilant out there, particularly on social media. Be aware that 784 00:49:33,840 --> 00:49:36,520 Speaker 1: there are plenty of people who will not hesitate to 785 00:49:36,640 --> 00:49:40,000 Speaker 1: mislead others in order to get what they want. Use 786 00:49:40,000 --> 00:49:45,279 Speaker 1: a critical eye to evaluate the information you encounter. Ask questions, 787 00:49:45,719 --> 00:49:50,440 Speaker 1: check sources, look for corroborating reports. It's a lot of work, 788 00:49:50,480 --> 00:49:53,359 Speaker 1: but trust me, it's way better that we do our 789 00:49:53,400 --> 00:49:56,400 Speaker 1: best to make sure the stuff we're depending on is 790 00:49:56,480 --> 00:50:00,600 Speaker 1: actually dependable. It'll turn out better for us in the long run. 791 00:50:00,880 --> 00:50:04,319 Speaker 1: Well, that wraps up this episode of tech Stuff, which, yeah, 792 00:50:04,600 --> 00:50:07,640 Speaker 1: I used as a backdoor to argue for critical thinking again. 793 00:50:07,719 --> 00:50:12,040 Speaker 1: Sue me. Don't, don't really sue me. But I think 794 00:50:12,040 --> 00:50:16,360 Speaker 1: that's another really clear example of a case 795 00:50:16,400 --> 00:50:18,520 Speaker 1: where we have to use that kind of stuff. So 796 00:50:18,680 --> 00:50:22,680 Speaker 1: I'm gonna keep on stressing it. And you guys are awesome. 797 00:50:22,960 --> 00:50:25,840 Speaker 1: I believe in you. I think that when we start 798 00:50:25,920 --> 00:50:29,400 Speaker 1: using these tools at our disposal, which everybody can develop 799 00:50:29,840 --> 00:50:33,919 Speaker 1: just with some practice, things will be better. We'll 800 00:50:33,960 --> 00:50:37,720 Speaker 1: be able to suss out the nonsense from the real stuff, 801 00:50:38,400 --> 00:50:40,960 Speaker 1: and we're all better off in the long run if 802 00:50:41,000 --> 00:50:43,719 Speaker 1: we can do that. If you guys have suggestions for 803 00:50:43,840 --> 00:50:46,600 Speaker 1: future topics I should cover in episodes of tech Stuff, 804 00:50:46,719 --> 00:50:50,360 Speaker 1: let me know via Twitter. The handle is tech Stuff 805 00:50:50,719 --> 00:50:55,000 Speaker 1: H S W and I'll talk to you again really soon. 806 00:51:01,200 --> 00:51:04,239 Speaker 1: Tech Stuff is an I Heart Radio production.
For more 807 00:51:04,320 --> 00:51:07,720 Speaker 1: podcasts from I Heart Radio, visit the I Heart Radio app, 808 00:51:07,840 --> 00:51:11,000 Speaker 1: Apple Podcasts, or wherever you listen to your favorite shows.