Speaker 1: Welcome to Tech Stuff, a production from iHeartRadio.

Speaker 1: Hey there, and welcome to Tech Stuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio and I love all things tech. Now, before I get into today's episode, I want to give a little listener warning here. The topic at hand involves some adult content, including the use of technology to do stuff that can be unethical, illegal, hurtful, and just plain awful. Now, I think this is an important topic, but I wanted to give a bit of a heads up at the start of the episode, just in case any of you guys are listening to a podcast on, like, a family road trip or something. I think this is an important topic and I think everyone should know about it and think about it. But I also respect that for some people this subject might get a bit taboo. So let's go on with the episode. Back in nineteen ninety-three, a movie called Rising Sun, directed by Philip Kaufman, based on a Michael Crichton novel and starring Wesley Snipes and Sean Connery, came out in theaters. Now, I didn't see it in theaters, but I did catch it when it came on, you know, HBO or Cinemax or something later on. The movie included a sequence that I found to be totally unbelievable. And I'm not talking about buying into Sean Connery being an expert on Japanese culture and business practices. Actually, side note, Sean Connery has an interesting history of playing unlikely characters, such as in Highlander, where he played an immortal who was supposedly Egyptian, who then lived in feudal Japan and ended up in Spain, where he became known as Ramirez. And all the while he's talking to a Scottish Highlander who's played by a Belgian actor. But I'm getting way off track here. Besides, I've heard Crichton actually wrote the character while thinking of Connery, so, you know, what the heck do I know?
Speaker 1: In the film, Snipes and Connery are investigators, and they're looking into a homicide that happened at a Japanese business but on American soil. The security system in the building captured video of the homicide, and the identity of the killer appears to make it a pretty open and shut case. But that's not how it all turns out. The investigators talk to a security expert played by Tia Carrere, and she demonstrates in real time how video footage can be altered. She records a short video of Connery and Snipes, loads that onto a computer, freezes a frame of the video, and essentially performs a cut and paste job, swapping the heads of our two lead characters. Then she resumes the video and the head swap remains in place. And that head swap stuff is possible. I mean, clearly it has to be possible, because you actually do see that effect in the film itself. But it takes a bit more than a quick cut and paste job. But we'll leave off of that for now. The whole point of that sequence, apart from showing off some cinema magic, is to demonstrate to the investigators that video, like photographs, can be altered. The expert has detected a blue halo around the face of the supposed murderer in the footage, indicating that some sort of trickery has happened. She also reveals that she cannot magically restore the video to its previous unaltered state, which I think was actually a nice change of pace for a movie. By the way, I think this movie is really, you know, not good, like not worth your time, but that's my opinion anyway. For years, this kind of video sorcery was pretty much limited to the film and TV industries. It usually required a lot of pre-planning, so it wasn't as simple as just taking footage that was already shot and changing it in post on a whim with a couple of clicks of a button. If it were, we would see a lot fewer mistakes left in movies and television, because you could catch it later and just fix it.
Speaker 1: But the tricks were possible, they were just difficult to pull off. It just wasn't something you or I would ever encounter in our day to day lives. But today we live in a different world, a world that has examples of synthetic media, commonly referred to as deep fakes. These are videos that have been altered or generated so that the subject of the video is doing something that they probably would or could never do. They've brought into question whether or not video evidence is even reliable, much as the film Rising Sun was talking about. We already know that eyewitness testimony is terribly unreliable. Our perception and memory play tricks on us, and we can quote unquote remember stuff that just didn't happen the way things actually unfolded in reality. But now we're looking at video evidence in potentially the same light. I mean, it's scary. So today we're going to learn about synthetic media, how it can be generated, the implications that follow with that sort of reality, and ways that people are trying to counteract a potentially dangerous threat. You know, fun stuff. Now, first, the term synthetic media has a particular meaning. It refers to art created through some sort of automated process, so it's a largely hands off approach to creating the final art piece. Now, under that definition, the example of Rising Sun would not apply here, because we see in the film (and presumably this happens in the book as well, but I haven't read the book) that a human being actually makes the change. People have used tools to alter the video footage. This would be more like using Photoshop to touch up a still image, with the computer system presumably doing some of the work in the background to keep things matched up. Either that, or you would need to alter each image in the footage frame by frame, or use some sort of matte approach. To learn more about mattes, you can listen to my episode about how blue and green screens work. Synthetic media as a general practice has been around for centuries.
Speaker 1: Artists have set up various contraptions to create works with little or no human guidance. In the twentieth century we started to see a movement called generative art take form. This type of art is all about creating a system that then creates or generates the finished art piece. That would mean that the finished work, such as a painting, wouldn't reflect the feelings or thoughts of the artist who created the system. In fact, it starts to raise the question: what is the art? Is it the painting that came about due to a machine following a program of some sort, or is the art the program itself? Is the art the process by which the painting was made? Now, I'm not here to answer that question. I just think it is an interesting question to ask. Sometimes people ask much less polite questions, such as, is it art at all? Some art critics went out of their way to dismiss generative art in the early days. They found it insulting. But hey, that's kind of the history of art in general. Each new movement in art inevitably finds both supporters and critics as it emerges. If anything, you might argue that such a response legitimizes the movement in, you know, a weird way. If people hate it, it must be something. In two thousand eighteen, an artist collective called Obvious, based out of Paris, France, submitted portrait style paintings that were created not by an actual human painter, but by an artificially intelligent system. Now, they looked a lot like typical eighteenth century style portraits. There was no attempt to pass off the portrait as if it were actually made by a human artist. In fact, the appeal of the piece was largely due to it being synthetically generated. It went to auction at Christie's, and the AI-created painting fetched more than four hundred thousand dollars. And the way the group trained their AI is relevant to our discussion about deep fakes.
Speaker 1: The collective relied on a type of machine learning called generative adversarial networks, or GAN, which in turn depends on deep learning. So it looks like we've got a few things we're going to have to define here. Now, I'm going to keep things fairly high level, because as it turns out there are a few different ways to create machine learning models, and to go through all of them in exhaustive detail would represent a university level course in machine learning. I have neither the time for that nor the expertise. I would do a terrible job. So we'll go with a high level perspective here. First, a generative adversarial network uses two systems: you have a generator and you have a discriminator. Both of these systems are a type of neural network. A neural network is a computing model that is inspired by the way our brains work. Our brains contain billions of neurons, and these neurons work together, communicating through electrical and chemical signals, controlling and coordinating pretty much everything in our bodies. With computers, the neurons are nodes. The job of a node is, you know, supposed to be kind of like a neuron cell in the brain. It's to take in multiple weighted input values and then generate a single output value. Now, the word weighted, W-E-I-G-H-T-E-D, weighted, is really important here, because the larger an input's weight, the more that input will have an effect on whatever the output is. So it kind of comes down to which inputs are the most important for that node's particular function. Now, if I were to make an analogy, I would say your boss hands you three tasks to do. One of those tasks has the label extremely important, and the second task has the label critically important, and the third task has a label saying you should have finished that one before it was handed to you. Okay, so that's just some sort of snarky office humor that I need to get off my chest. But more seriously, imagine a node accepting three inputs.
Speaker 1: In this example, input one has a fifty percent weight, input two has a forty percent weight, and input three has a ten percent weight. That adds up to one hundred percent, and that would tell you that the output that node generates will be most affected by input one, followed by input two, and then input three would have a smaller effect on whatever the output is. Each node applies a nonlinear transformation on the input values, again affected by each input's weight value, and that generates the output value. The details of that really are not important for our episode. It involves performing changes on variables that in turn change the correlation between variables, and it gets a bit math-y, and we would get lost in the weeds pretty quickly. The important thing to remember is that a node within a neural network takes in a weighted sum of inputs, then performs a process on those inputs before passing the result on as an output. Then some other node a layer down will accept that output, along with outputs from a couple of other nodes one layer up, and will then perform an operation based on those weighted inputs and pass that on to the next layer, and so on. So these nodes are in layers, like, you know, a cake. One layer of nodes processes some inputs, they send it on to the next layer of nodes, and then that one passes it on to the next one, and the next one, and so on. This isn't a new idea. Computer scientists began theorizing and experimenting with neural network approaches as far back as the nineteen fifties with the perceptron, which was a hypothetical system that was described by Frank Rosenblatt of Cornell University. But it wasn't until the last decade that computing power and our ability to handle a lot of data reached a point where these sorts of learning models could really take off. The goal of this system is to train it to perform a particular task within a certain level of precision.
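To make that concrete, here is a minimal Python sketch of a single node, assuming the fifty, forty, and ten percent weights from the example and using a sigmoid as the nonlinear transformation. The input values are invented for illustration; this is not pulled from any particular library.

```python
import math

def node_output(inputs, weights):
    """One artificial node: take a weighted sum of the inputs, then squash it."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid, one common nonlinear transform

# Three inputs with fifty, forty, and ten percent weights, as in the example above.
weights = [0.5, 0.4, 0.1]
inputs = [0.9, 0.2, 0.7]

print(node_output(inputs, weights))
# Input one sways the output the most because it carries the largest weight.
```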
Speaker 1: The weights I mentioned are adjustable, so you can think of it as teaching a system which bits are the most important for doing whatever it is the system is supposed to do. In order to achieve your task, these are the bits that are the most important and therefore should matter the most when you weigh a decision. This is a bit easier if we talk about a similar system, the version of IBM's Watson that played on Jeopardy. That system famously was not connected to the Internet. It had to rely on all the information that was stored within itself. When the system encountered a clue in Jeopardy, it would analyze the clue, and then it would reference its database to look for possible answers to whatever that clue was. The system would weigh those possible answers and attempt to determine which, if any, were the most likely to be correct. If the certainty was over a certain threshold, the system would buzz in with its answer. If no response rose above that threshold, the system would not buzz in. So you could say that Watson was playing the game with a best guess sort of approach. Neural networks do essentially that sort of processing. With this particular type of approach, we know what we want the outcome to be, so we can judge whether or not the system was successful. After each attempt, we can adjust the weight on the inputs between nodes to refine the decision making process to get more accurate results. If the system succeeds in its task, we can increase the weights that contributed to the system picking the correct answer, and decrease the weights of the inputs that did not contribute to the successful response. If the system done messed up and gave the wrong answer, then we do the opposite. We look at the inputs that contributed to the wrong answer, we diminish their weights, and we increase the weights of the other inputs. And then we run the test again. A lot.
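Here is a toy sketch of those two ideas in Python, assuming made-up numbers: answer only when confidence clears a threshold, and after each attempt nudge the weights up or down depending on whether they helped. This is an illustration of the description above, not how Watson or any production system is actually implemented.

```python
CONFIDENCE_THRESHOLD = 0.8  # hypothetical value: only answer when at least this certain

def maybe_buzz_in(confidence, answer):
    """Buzz in only if certainty clears the threshold, like the best-guess approach above."""
    return answer if confidence >= CONFIDENCE_THRESHOLD else None

def adjust_weights(weights, contributed, was_correct, step=0.05):
    """Naive update: reward inputs that helped a right answer, penalize them after a miss."""
    updated = []
    for weight, helped in zip(weights, contributed):
        if helped == was_correct:
            updated.append(weight + step)   # this input pushed in the right direction
        else:
            updated.append(weight - step)   # this input pushed toward the mistake
    return updated

print(maybe_buzz_in(0.91, "What is Mercury?"))   # confident enough, so it answers
print(maybe_buzz_in(0.42, "What is Venus?"))     # not confident enough, stays quiet

weights = [0.5, 0.4, 0.1]
# Inputs one and three pushed toward the chosen answer, and the answer was wrong.
print(adjust_weights(weights, contributed=[True, False, True], was_correct=False))
```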
Speaker 1: I'll explain a bit more about this process when we come back, but first let's take a quick break.

Speaker 1: Early in the history of neural networks, computer scientists were hitting some pretty hard stops due to the limitations of computing power at the time. Early networks were only a couple of layers deep, which really meant they weren't terribly powerful, and they could only tackle rudimentary tasks, like figuring out whether or not a square is drawn on a piece of paper. That isn't terribly sophisticated. In nineteen eighty-six, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper titled "Learning Representations by Back-Propagating Errors." This was a big breakthrough for deep learning. This all has to do with a deep learning system improving its ability to complete a specific task. And basically the algorithm's job is to go from the output layer, you know, where the system has made a decision, and then work backward through the neural network, adjusting the weights that led to an incorrect decision. So let's say it's a system that is looking to figure out whether or not a cat is in a photograph, and it says there's a cat in this picture, and you look at the picture and there is no cat there. Then you would look at the inputs one level back, just before the system said here's a picture of a cat, and you'd say, all right, which of these inputs led the system to believe this was a picture of a cat? And then you would adjust those. Then you would go back one more layer, so you're working your way up the model, and say which inputs here led to it giving the outputs that led to the mistake. And you do this all the way up until you get to the input level at the top of the computer model. You are back propagating, and then you run the test again to see if you've got improvement. It's exhaustive, but it also drastically improved neural network performance, much faster than just throwing more brute force at it.
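Here is a very small sketch of that backward pass in Python. It assumes a made-up two-layer network with one output and uses textbook sigmoid gradients; it is a toy illustration of the idea, not the algorithm from the paper verbatim or any library's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny network: 2 inputs -> 2 hidden nodes -> 1 output ("cat" score). Weights are invented.
w_hidden = [[0.3, -0.2], [0.8, 0.5]]   # one weight list per hidden node
w_output = [0.6, -0.4]                  # weights from the hidden nodes to the output
learning_rate = 0.5

def train_step(x, target):
    global w_hidden, w_output
    # Forward pass: each layer takes weighted sums and squashes them.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_output, hidden)))

    # Backward pass: start at the output and push the error back, layer by layer.
    out_delta = (out - target) * out * (1 - out)                 # error signal at the output
    hidden_deltas = [out_delta * w_output[j] * hidden[j] * (1 - hidden[j])
                     for j in range(len(hidden))]                 # error signal per hidden node

    # Adjust the weights that contributed to the mistake.
    w_output = [w - learning_rate * out_delta * hidden[j] for j, w in enumerate(w_output)]
    w_hidden = [[w - learning_rate * hidden_deltas[j] * x[i] for i, w in enumerate(ws)]
                for j, ws in enumerate(w_hidden)]
    return out

# The picture has no cat (target 0), but the network starts out saying "maybe cat".
for _ in range(100):
    score = train_step([0.9, 0.1], target=0.0)
print(score)  # the "cat" score drifts toward zero as the weights get corrected
```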
Speaker 1: The algorithm essentially is checking to see if a small change in each input value received by a layer of nodes would have led to a more accurate result. So it's all about going from that output and working your way backward. In two thousand twelve, Alex Krizhevsky published a paper that gave us the next big breakthrough. He argued that a really deep neural network with a lot of layers could give really great results if you paired it with enough data to train the system. So you needed to throw lots of data at these models, and it needed to be an enormous amount of data. However, once trained, the system would produce lower error rates. So yeah, it would take a long time, but you would get better results. Now, at the time, a good classification error rate for such a system was roughly twenty-five percent. That means one out of four conclusions the system would come to would be wrong. If you ran it across a long enough number of decisions, you would find that one out of every four wasn't right. The system that Alex's team worked on produced results that had an error rate of around sixteen percent, so much lower. And then in just five years, with more improvements to this process, the classification error rate had dropped down to two point three percent for deep learning systems. So, from around twenty-five percent down to two point three. It was really powerful stuff. Okay, so you've got your artificial neural network. You've got your layers and layers of nodes. You've adjusted the weights of the inputs into each node to see if your system can identify, you know, pictures of cats, and you start feeding images to this system, lots of them. This is the domain that you are feeding to your system. The more images you can feed to it, the better.
Speaker 1: And you want a wide variety of images of all sorts of stuff, not just of different types of cats, but stuff that most certainly is not a cat, like dogs, or cars, or chartered public accountants, you name it. And you look to see which images the system identifies correctly and which ones it screws up, both images it says have cats in them that actually don't have cats in them, and images the system has identified as saying there is no cat here, but there is a cat there. This guides you into adjusting the weights again and again, and you start over and you do it again. And that's your basic deep learning system, and it gets better over time as you train it. It learns. Now, let's transition over to the adversarial systems I mentioned earlier, because they take this and twist it a little bit. So you've got two artificial neural networks, and they are using this general approach to deep learning, and you're setting them up so that they feed into each other. One network, the generator, has the task of learning how to do something, such as create an eighteenth century style portrait, based off lots and lots of examples of the real thing, the domain, the problem domain. The second network, the discriminator, has a different job. It has to tell the difference between authentic portraits that came from the problem domain and computer generated portraits that came from the generator itself. So essentially the discriminator is like the model I mentioned earlier that was identifying pictures of cats. It's doing the same sort of thing, except instead of saying cat or no cat, it's saying real portrait or computer generated portrait. So there are essentially two outcomes the discriminator could reach, and that's whether an image is computer generated or it isn't. So do you see where this is going? You train up both models. You have the generator attempt to make its own version of something, such as that eighteenth century portrait.
Speaker 1: It does so by designing the portrait based on what the model believes are the key elements of a portrait, so things like colors, shapes, the ratio of sizes, like, you know, how large the head should be in relation to the body. All of these factors and many more come into play. The generator creates its own idea of what a portrait is supposed to look like, and chances are the early rounds of this will not be terribly convincing. The results are then fed to the discriminator, which tries to suss out which of the images fed to it are computer generated and which ones aren't. After that round, both models are tweaked. The generator adjusts input weights to get closer to the genuine article, and the discriminator adjusts weights to reduce false positives or to catch computer generated images. And then you go again and again and again and again, and they both get better over time. So, assuming everything is working properly, over time the adjustment of input weights will lead to more convincing results, and given enough time and enough repetition, you'll end up with a computer generated painting that you can auction off for nearly half a million dollars. Though keep in mind that huge price tag dates back to the novelty of it being an early AI generated painting. It would be shocking to me if we saw that actually become a trend. Also, the painting, while interesting, isn't exactly so astounding as to make you think there's no way a machine did that. You'd look at it and go, yeah, I can imagine a machine did that one. A group of computer scientists first described the generative adversarial network architecture in a paper in two thousand fourteen, and like other neural networks, these models require a lot of data. The more the better. In fact, smaller data sets mean the models have to make some pretty big assumptions, and you tend to get pretty lousy results.
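As a rough illustration of that back-and-forth, here is a toy version of the loop in Python. Real GANs use two neural networks trained with gradients; in this stand-in each side is reduced to a single number, and every value is invented, so treat it as a sketch of the structure rather than an actual GAN.

```python
import random

REAL_MEAN = 10.0          # pretend real portraits share some measurable statistic near 10
gen_mean = 0.0            # the generator's current guess at that statistic
disc_center = 0.0         # the discriminator's current idea of what "real" looks like

def realness(x):
    """Discriminator's score: the closer x is to its idea of real, the higher."""
    return -abs(x - disc_center)

for _ in range(300):
    real = [random.gauss(REAL_MEAN, 1.0) for _ in range(16)]
    fake = [random.gauss(gen_mean, 1.0) for _ in range(16)]

    # Discriminator step: refine its picture of the genuine article from the real examples.
    disc_center += 0.1 * (sum(real) / len(real) - disc_center)

    # Generator step: follow the discriminator's feedback, moving toward whichever of its
    # own fakes scored as most "real" this round (a stand-in for training through the
    # discriminator's gradients).
    best_fake = max(fake, key=realness)
    gen_mean += 0.1 * (best_fake - gen_mean)

print(round(gen_mean, 2), round(disc_center, 2))
# Both numbers end up near 10: the generator's fakes now resemble the real statistic,
# and the discriminator can no longer separate the two piles by this measure alone.
```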
Speaker 1: More data, as in more examples, teaches the models more about the parameters of the domain, whatever it is they are trying to generate. It refines the approach. So if you have a sophisticated enough pair of models and you have enough data to fill up a domain, you can generate some convincing material. And that includes video, and this brings us around to deep fakes. And in addition to generative adversarial networks, a couple of other things really converged to create the techniques and trends and technology that would allow for deep fakes proper. In nineteen ninety-seven, Malcolm Slaney, Michele Covell, and Christoph Bregler wrote some software that they called the Video Rewrite program. The software would analyze faces and then create or synthesize lip animation which could be matched to pre-recorded audio. So you could take some film footage of a person and then reanimate their lips so that they could appear to say all sorts of things, which in some ways set the stage for deep fakes. In this case, it was really just focusing on the lips and the general area around the lips, so you weren't changing the rest of the expression of the face, and you would have to, you know, keep your recording to about the same length as whatever the film clip was, or you would have to loop the film clip over and over, which would make it, you know, far more obvious that this was a fake. In addition, motion tracking technology was advancing over time too, and this also became an important tool in computer animation. This tool would also be used by deep fake algorithms to create facial expressions, manipulating the digital image just as it would if it were a video game character or a Pixar animated character. Typically, you need to start with some existing video in order to manipulate it.
Speaker 1: You're not actually computer generating the animation. Like, you're not creating a computer generated version of whomever it is you're doing the fake of. You're using existing imagery in order to do that, and then manipulating that existing imagery, so it's a little different from computer animation. In two thousand sixteen, students and faculty at the Technical University of Munich created the Face2Face project, that would be "face," the numeral two, and then "face," and this was particularly jaw dropping to me at the time. When I first saw these videos back then, I was floored. They created a system that had a target actor. This would be the video of the person that you want to manipulate. In the example they used, it was former US President George W. Bush. Their process also had a source actor. This was the source of the expressions and facial movements you would see in the target, so kind of like a digital puppeteer in a way. But the way they did it was really cool. They had a camera trained on the source actor, and it would track specific points of movement on the source actor's face, and then the system would manipulate the same points of movement on the target actor's face in the video. So if the source actor smiled, then the target smiled. The source actor would smile, and then you would see George W. Bush in the video smile in real time. It was really strange. They used this looping video of George W. Bush wearing a neutral expression. They had to start with that as their sort of zero point. And I gotta tell you, it really does look like the former president George W. Bush is having a bit of a freak out on a looping video, because he keeps on opening his mouth, closing his mouth, grimacing, raising his eyebrows. You need to watch this video. It is still available online to check out.
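The core trick, copying tracked points of movement from one face onto another, can be sketched in a few lines of Python. This is only a conceptual outline with made-up landmark coordinates; the actual Face2Face system fits full 3D face models and re-renders the target video.

```python
# Conceptual sketch of expression transfer via tracked facial landmarks.
# Each face is reduced to a handful of (x, y) points; real systems track many more
# points per frame and fit a 3D model, so treat this purely as an outline.

def transfer_expression(source_neutral, source_current, target_neutral):
    """Apply the source actor's movement (current minus neutral) to the target's points."""
    deltas = [(cx - nx, cy - ny)
              for (nx, ny), (cx, cy) in zip(source_neutral, source_current)]
    return [(tx + dx, ty + dy)
            for (tx, ty), (dx, dy) in zip(target_neutral, deltas)]

# Hypothetical landmarks: two mouth corners and the chin, in pixel coordinates.
source_neutral = [(100, 200), (140, 200), (120, 230)]
source_smiling = [(96, 192), (144, 192), (120, 231)]   # corners pulled up and out
target_neutral = [(300, 400), (342, 400), (321, 432)]  # e.g. the looping neutral video

print(transfer_expression(source_neutral, source_smiling, target_neutral))
# The target's mouth corners move the same way the source actor's did, frame by frame;
# a renderer then warps the target video to match the new point positions.
```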
Speaker 1: In twenty seventeen, students and faculty over at the University of Washington created the Synthesizing Obama project, in which they trained a computer model to generate a synthetic video of former US President Barack Obama, and they made it lip sync to a pre-recorded audio clip from one of Obama's addresses to the nation. They actually had the original video of that address for comparison, so they could look back at that and see how their generated one compared to the real thing. And their approach used a model that analyzed hundreds of hours of video footage of Obama speaking, and it mapped specific mouth shapes to specific sounds. It would also include some of Obama's mannerisms, such as how he moves his head when he talks or uses facial expressions to emphasize words. And watching the videos, you know, the real one next to the generated one, is pretty strange. You can tell the generated one isn't quite right. It's not matching the audio exactly, at least not in the early versions, but it's fairly close, and it might even pass casual inspection for a lot of people who weren't, like, you know, actually paying attention. Authors Maras and Alexandrou defined deep fakes as, quote, "the product of artificial intelligence applications that merge, combine, replace, and superimpose images and video clips to create fake videos that appear authentic," end quote. Deep fakes first emerged in two thousand seventeen, and so this is a pretty darn young application of technology. One thing that is worrisome is that once someone has access to the tools, it's not that difficult to create a deep fake video. You pretty much just need a decent computer, the tools, a bit of know-how on how to do it, and some time. You also need some reference material, as in, like, videos and images of the person that you are replicating, and like the machine learning systems I've mentioned, the more reference material you have, the better.
Speaker 1: That's why the deep fakes you encounter these days tend to be of notable famous people, like celebrities and politicians. Mainly, there's no shortage of reference material for those types of individuals, and so they are easier to replicate with deep fakes than someone who maintains a much lower profile. Not to say that that will always be the case, or that there aren't systems out there that can accept smaller amounts of reference material. It's just harder to make a convincing version with fewer samples. But in order to make a convincing fake, the system really has to learn how a person moves. All those facial expressions matter. It also has to learn how a person sounds. We'll get into sound a little bit later. But mannerisms, inflection, accent, emphasis, cadence, quirks and tics, all of these things have to be analyzed and replicated to make a convincing fake, and it has to be done just right, or else it comes off as creepy or unrealistic. Think about how impressionists will take a celebrity's manner of speech and then heighten some of it for comedic effect. You'll hear it all the time with folks who do impressions of people like Jack Nicholson or Christopher Walken or Barbra Streisand, people who have a very particular way of speaking. Impressionists will take those as markers and they really punch in on them. Well, a deep fake can't really do that too much, or else it won't come across as genuine. It'll feel like you're watching a famous person impersonating themselves, which is weird. Now, the earliest mention of deep fakes I can find dates to a two thousand seventeen Reddit forum in which a user shared deep faked videos that appeared to show female celebrities in sexual situations. Heads and faces had been replaced, and the actors in pornographic movies had their heads or faces swapped out for those of these various celebrities.
Speaker 1: Now, the fakes can look fairly convincing, extremely convincing in some cases, which can lead to some people assuming that the videos are genuine and that the folks that they saw in the videos are really the ones who are in them. And obviously that's a real problem, right? I mean, with this technology, given enough reference data to feed a system, someone could fabricate a video that appears to put a person in a compromising position, whether it's a sexual act or making damaging statements or committing a crime or whatever. And there are tools right now that allow you to do pretty much what the Face2Face tool was doing back in two thousand sixteen. A program called Avatarify, which is not that easy to say. Anyway, it can run on top of live streaming conference services like Zoom and Skype, and you can swap out your face for a celebrity's face. Your facial expressions map to the computer manipulated celebrity face. It just looks at you through your webcam, and then if you smile, the celebrity image smiles, etcetera. It's like that old Face2Face program. It does need a pretty beefy PC to manage doing all this, because you're also running that live streaming service underneath it. It's also not exactly user friendly. You need some programming experience to really get it to work. But it is widely accessible, as the source code is open source and it's on GitHub, so anyone can get it. Samantha Cole, who writes for Vice, has covered the topic of deep fakes pretty extensively, and the potential harm they can cause, and I recommend you check out her work if you're interested in learning more about that. Do be warned that Cole covers some pretty adult themed topics. I think she does great work and very important work, but as a guy who grew up in the Deep South, it's also the kind of stuff that occasionally makes me clutch my pearls. But that's more of a statement about me than her work. She does great work.
Speaker 1: I think most of us can imagine plenty of scenarios in which this sort of technology could cause mischief on a good day and catastrophe on a bad day, whether it's spreading misinformation, creating fear, uncertainty, and doubt (FUD), making people seem to say things they never actually said, or contributing to an ugly subculture in which people try to make their more base fantasies a reality by putting one person's head on another person's body. You know, it's not great. There are legitimate uses of the technology too, of course. You know, tech itself is rarely good or bad. It's all in how we use it. But this particular technology has a lot of potentially harmful uses, and Samantha Cole has done a great job explaining them. When we come back, I'll talk a bit more about the war against deep fakes and how people are trying to prepare for a world that is increasingly filled with media we can't really trust. But first, let's take a quick break.

Speaker 1: Before the break, I mentioned Samantha Cole, who has written extensively about deep fakes, and one point she makes that I think is important for us to note is that the vast majority of instances of deep fake videos haven't been some manufactured video of a political leader saying inflammatory things. That continues to be a big concern. There's a genuine fear that someone is going to manufacture a video in which a politician appears to say or do something truly terrible in an effort to either discredit the politician or perhaps instigate a conflict with some other group. There are literal doomsday scenarios in which such a video would prompt a massive military response, though it does seem like it might be a little far-fetched. Though heck, I don't know, considering the world we live in, maybe it's not that big of a stretch. Anyway, Cole's point is that so far, that has not happened. She points out that the most frequent use for the tech tends to be either people goofing around or, disturbingly, using it to,
in her words, 573 00:36:21,000 --> 00:36:25,160 Speaker 1: quote, take ownership of women's bodies in non-consensual porn, 574 00:36:25,440 --> 00:36:28,920 Speaker 1: end quote. Cole argues that the reason we haven't really 575 00:36:28,920 --> 00:36:32,240 Speaker 1: seen deep fakes used much outside of these realms, apart 576 00:36:32,280 --> 00:36:36,400 Speaker 1: from a few advertising campaigns, is that people are pretty 577 00:36:36,440 --> 00:36:39,719 Speaker 1: good at spotting deep fakes. They aren't quite at a 578 00:36:39,840 --> 00:36:42,759 Speaker 1: level where they can easily pass for the real thing. 579 00:36:43,320 --> 00:36:46,400 Speaker 1: There's still something slightly off about them. They tend to 580 00:36:46,560 --> 00:36:49,880 Speaker 1: butt up against the uncanny valley. Now, for those of 581 00:36:49,880 --> 00:36:53,560 Speaker 1: you not familiar with that term, the uncanny valley describes 582 00:36:53,600 --> 00:36:57,320 Speaker 1: the feeling we humans get when we encounter a robot 583 00:36:57,520 --> 00:37:02,520 Speaker 1: or a computer-generated figure that closely resembles a human 584 00:37:02,880 --> 00:37:06,600 Speaker 1: or human behavior, but you can still tell it's not 585 00:37:07,040 --> 00:37:10,400 Speaker 1: actually a person, and it's not a good feeling. It 586 00:37:10,440 --> 00:37:13,960 Speaker 1: tends to be described as repulsive and disturbing, or at 587 00:37:14,160 --> 00:37:18,720 Speaker 1: the very best, off-putting. See also the animated film 588 00:37:18,760 --> 00:37:22,960 Speaker 1: The Polar Express. There's a reason that when that film came out, 589 00:37:23,120 --> 00:37:27,839 Speaker 1: people kind of reacted negatively to the animation, and it's 590 00:37:27,840 --> 00:37:30,640 Speaker 1: also a reason why Pixar tends to prefer to 591 00:37:30,680 --> 00:37:34,479 Speaker 1: go with stylized human characters who are different enough from 592 00:37:34,600 --> 00:37:38,320 Speaker 1: the way real humans look to kind of bypass the uncanny valley. 593 00:37:38,520 --> 00:37:40,880 Speaker 1: We just think of that as a cartoon, not something 594 00:37:40,920 --> 00:37:44,120 Speaker 1: that's trying to pass itself off as being human. But 595 00:37:44,200 --> 00:37:46,800 Speaker 1: while there hasn't really been a flood of fake videos 596 00:37:46,840 --> 00:37:50,319 Speaker 1: hitting the Internet with the intent to discredit politicians or 597 00:37:50,400 --> 00:37:54,280 Speaker 1: infuriate specific people or whatever, there remains a general sense 598 00:37:54,320 --> 00:37:58,040 Speaker 1: that this is coming. It's just not here yet. The 599 00:37:58,120 --> 00:38:01,600 Speaker 1: sense I get is that people feel it's an inevitability, 600 00:38:01,680 --> 00:38:04,080 Speaker 1: and there are already folks working on tools that will 601 00:38:04,080 --> 00:38:07,160 Speaker 1: help us sort out the real stuff from the fakes. 602 00:38:07,719 --> 00:38:12,440 Speaker 1: Take Microsoft, for example. Their R and D division, fittingly 603 00:38:12,640 --> 00:38:17,680 Speaker 1: called Microsoft Research, developed a tool they call the Video Authenticator. 604 00:38:18,120 --> 00:38:21,960 Speaker 1: This tool analyzes video samples and looks for signs of 605 00:38:22,320 --> 00:38:25,440 Speaker 1: deep fakery.
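To give a rough sense of what one "sign of deep fakery" can look like in practice, here is a tiny, hypothetical Python sketch. It builds a crude cut-and-paste composite out of random numbers and then measures gradient energy along the pasted seam, the kind of compositing boundary the Microsoft blog post quoted next describes. Every number here is an arbitrary assumption for illustration; real detectors are far more sophisticated.

```python
# Toy, hypothetical illustration of seam ("blending boundary") detection:
# paste a foreign patch into an image and compare gradient energy along
# the seam with gradient energy elsewhere. Illustration only.
import numpy as np

rng = np.random.default_rng(0)
frame = rng.normal(0.5, 0.02, size=(128, 128))    # smooth-ish "background" frame
patch = rng.normal(0.7, 0.02, size=(48, 48))      # pasted "face" with different statistics
frame[40:88, 40:88] = patch                       # crude cut-and-paste composite

# Gradients: a pasted seam shows up as a line of unusually large jumps.
gy, gx = np.gradient(frame)
grad_mag = np.hypot(gx, gy)

seam_energy = grad_mag[39:41, 40:88].mean()       # along the top edge of the paste
background_energy = grad_mag[10:30, 10:30].mean() # a region far from the paste

print("gradient energy on the seam:", round(float(seam_energy), 4))
print("gradient energy elsewhere:  ", round(float(background_energy), 4))
# The seam's energy is much higher, which is the kind of telltale a detector hunts for.
```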
In a blog post written by Tom Burt 605 00:38:25,520 --> 00:38:30,040 Speaker 1: and Eric Horvitz, two Microsoft executives, they say, quote, it 606 00:38:30,080 --> 00:38:33,600 Speaker 1: works by detecting the blending boundary of the deep fake 607 00:38:33,760 --> 00:38:36,840 Speaker 1: and subtle fading or grayscale elements that might not 608 00:38:36,960 --> 00:38:40,759 Speaker 1: be detectable by the human eye. End quote. Now I'm 609 00:38:40,800 --> 00:38:44,360 Speaker 1: no expert, but to me, it sounds like the Video 610 00:38:44,440 --> 00:38:48,600 Speaker 1: Authenticator is working in a way that's not too dissimilar 611 00:38:48,880 --> 00:38:53,719 Speaker 1: to a discriminator in a generative adversarial network. I mean, 612 00:38:54,040 --> 00:38:58,080 Speaker 1: the whole purpose of the discriminator is to discriminate, or 613 00:38:58,160 --> 00:39:01,960 Speaker 1: to tell the difference between genuine, unaltered videos and 614 00:39:02,080 --> 00:39:06,440 Speaker 1: computer-generated ones. So the Video Authenticator is looking for 615 00:39:06,520 --> 00:39:10,400 Speaker 1: telltale signs that a video was not produced through traditional 616 00:39:10,480 --> 00:39:14,560 Speaker 1: means but was computer generated. However, that's the very thing 617 00:39:14,840 --> 00:39:18,200 Speaker 1: that the generators in GAN systems are looking 618 00:39:18,239 --> 00:39:21,960 Speaker 1: out for. So when a generator receives feedback that a 619 00:39:22,080 --> 00:39:26,360 Speaker 1: video it generated did not slip past the discriminator, it 620 00:39:26,440 --> 00:39:30,000 Speaker 1: then tweaks its input weights and starts to shift its 621 00:39:30,040 --> 00:39:33,680 Speaker 1: approach in order to bypass whatever it was that gave 622 00:39:33,719 --> 00:39:37,600 Speaker 1: away its last attempt, and it does this again and again. 623 00:39:38,120 --> 00:39:41,880 Speaker 1: So the Video Authenticator might work well for a given 624 00:39:41,920 --> 00:39:44,759 Speaker 1: amount of time, but I would suspect that in the 625 00:39:44,880 --> 00:39:48,120 Speaker 1: long run, the deep fake systems will become sophisticated enough 626 00:39:48,440 --> 00:39:53,319 Speaker 1: to fool the authenticator. Of course, Microsoft will continue to 627 00:39:53,400 --> 00:39:56,720 Speaker 1: tweak the authenticator as well, and it will become something 628 00:39:56,760 --> 00:40:00,920 Speaker 1: of a seesaw battle as one side outperforms the other temporarily, 629 00:40:01,280 --> 00:40:04,000 Speaker 1: and then the balance will shift. Though there may come 630 00:40:04,000 --> 00:40:06,760 Speaker 1: a time where either the deep fakes are too good 631 00:40:07,120 --> 00:40:10,240 Speaker 1: and they don't set off any alarms from the discriminator, 632 00:40:11,080 --> 00:40:16,040 Speaker 1: or the discriminator gets so sensitive that it starts to 633 00:40:16,080 --> 00:40:19,200 Speaker 1: flag real videos and it hits a lot of false 634 00:40:19,280 --> 00:40:23,680 Speaker 1: positives and calls them generated videos instead. Either way, you 635 00:40:23,760 --> 00:40:26,720 Speaker 1: reach a point where a tool like this no longer 636 00:40:26,760 --> 00:40:29,839 Speaker 1: really serves a useful purpose, and the Video Authenticator will 637 00:40:29,840 --> 00:40:32,920 Speaker 1: be obsolete. Now, this is something we see in artificial 638 00:40:32,960 --> 00:40:36,080 Speaker 1: intelligence all the time.
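To make that generator-versus-discriminator feedback loop concrete, here is a minimal, hypothetical sketch in Python with PyTorch. The toy data, layer sizes, and training settings are all invented for illustration; this is not the Video Authenticator or any real deepfake system, just the basic GAN pattern in which the discriminator tries to flag fakes and the generator updates its weights based on that feedback, again and again.

```python
# Minimal, hypothetical GAN training loop (illustration only, not a real
# deepfake system). The "videos" here are just 64-dimensional toy vectors.
import torch
import torch.nn as nn

DATA_DIM, NOISE_DIM = 64, 16

# Generator: turns random noise into a fake "video" vector.
generator = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(), nn.Linear(128, DATA_DIM))
# Discriminator: outputs a probability that its input is genuine.
discriminator = nn.Sequential(nn.Linear(DATA_DIM, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, DATA_DIM) + 2.0        # stand-in for genuine footage
    fake = generator(torch.randn(32, NOISE_DIM))  # the generator's latest attempt

    # 1. Discriminator learns to score real footage as 1 and fakes as 0.
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(32, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(32, 1))
    d_loss.backward()
    d_opt.step()

    # 2. Generator gets its feedback: it is rewarded when its fakes are
    #    scored as "real," so it shifts its weights to slip past whatever
    #    gave away its last attempt. Repeat, step after step.
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_loss.backward()
    g_opt.step()
```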
If you remember the good old 639 00:40:36,120 --> 00:40:39,000 Speaker 1: days of CAPTCHA, you know, the proving-you're-not-a 640 00:40:39,120 --> 00:40:42,480 Speaker 1: robot stuff. The stuff we were told to do was 641 00:40:42,840 --> 00:40:45,960 Speaker 1: typically type in a series of letters and numbers, and 642 00:40:46,000 --> 00:40:48,960 Speaker 1: it wasn't that hard, at least 643 00:40:49,000 --> 00:40:53,000 Speaker 1: not at first. That's because the text recognition algorithms of 644 00:40:53,040 --> 00:40:58,160 Speaker 1: the time weren't very good. They couldn't decipher mildly deformed 645 00:40:58,280 --> 00:41:01,439 Speaker 1: text because the shape of the text fell too far 646 00:41:01,560 --> 00:41:05,399 Speaker 1: outside the parameters of what the system could recognize as 647 00:41:05,440 --> 00:41:08,480 Speaker 1: a legitimate letter or number. You make the number a little, 648 00:41:09,040 --> 00:41:12,239 Speaker 1: you know, deformed, and then suddenly the system's like, well, 649 00:41:12,239 --> 00:41:14,839 Speaker 1: that doesn't look like a three to me because it's 650 00:41:14,880 --> 00:41:17,400 Speaker 1: not in the shape of a three. But over time 651 00:41:17,560 --> 00:41:22,239 Speaker 1: people developed better text recognition programs that could recognize these 652 00:41:22,239 --> 00:41:25,360 Speaker 1: shapes even if they weren't in the standard orientation for a three, 653 00:41:25,960 --> 00:41:30,040 Speaker 1: and those systems began to defeat those simple early CAPTCHAs, 654 00:41:30,600 --> 00:41:34,800 Speaker 1: which required CAPTCHA designers to make tougher versions, and eventually 655 00:41:34,840 --> 00:41:37,239 Speaker 1: the machines got good enough that they could match or 656 00:41:37,320 --> 00:41:41,280 Speaker 1: even outperform humans. And at that point, those text-based 657 00:41:41,360 --> 00:41:45,240 Speaker 1: CAPTCHAs proved to be more challenging for people than for machines, 658 00:41:45,280 --> 00:41:47,839 Speaker 1: which meant if you used them, you defeated the whole 659 00:41:47,880 --> 00:41:50,959 Speaker 1: purpose in the first place. So while this escalation proved 660 00:41:51,000 --> 00:41:53,800 Speaker 1: to be a challenge for security, it was a boon 661 00:41:54,120 --> 00:41:58,360 Speaker 1: for artificial intelligence. And while I focused almost exclusively on 662 00:41:58,440 --> 00:42:01,320 Speaker 1: the imagery of video here, the same sort of stuff 663 00:42:01,400 --> 00:42:04,880 Speaker 1: is going on with generated speech, including generated speech that 664 00:42:04,960 --> 00:42:09,920 Speaker 1: imitates specific voices. Like deep fake videos, this approach works 665 00:42:09,960 --> 00:42:12,680 Speaker 1: best if you have a really big data set of 666 00:42:12,760 --> 00:42:19,680 Speaker 1: recorded audio, so people like movie and TV stars, news reporters, politicians, 667 00:42:19,760 --> 00:42:24,880 Speaker 1: and, you know, podcasters are great targets for this stuff. 668 00:42:25,120 --> 00:42:27,280 Speaker 1: There might be hundreds or, you know, in my case, 669 00:42:27,680 --> 00:42:32,440 Speaker 1: thousands of hours of recorded material to work from.
Training 670 00:42:32,440 --> 00:42:38,439 Speaker 1: a model to use the frequencies, timbre, intonation, pronunciation, pauses, 671 00:42:38,520 --> 00:42:41,560 Speaker 1: and other mannerisms of speech can result in a system 672 00:42:41,640 --> 00:42:45,160 Speaker 1: that can generate vocals that sound like the target, sometimes 673 00:42:45,160 --> 00:42:49,640 Speaker 1: to a fairly convincing degree. And for a while, to 674 00:42:49,640 --> 00:42:52,560 Speaker 1: peek behind the curtain here, we at Tech Stuff were 675 00:42:52,600 --> 00:42:54,520 Speaker 1: working with a company that I'm not going to name, 676 00:42:54,800 --> 00:42:57,399 Speaker 1: but they were going to do something like this as 677 00:42:57,480 --> 00:42:59,960 Speaker 1: an experiment. I was gonna do a whole episode on it, 678 00:43:00,520 --> 00:43:03,680 Speaker 1: and I had planned on crafting a segment of that 679 00:43:03,800 --> 00:43:07,800 Speaker 1: episode only through text. I was not going to actually 680 00:43:07,800 --> 00:43:10,880 Speaker 1: record it myself, but instead use a system that was 681 00:43:10,960 --> 00:43:16,120 Speaker 1: trained on my voice to replicate my voice and deliver 682 00:43:16,280 --> 00:43:19,520 Speaker 1: that segment on its own. I was curious if it 683 00:43:19,520 --> 00:43:22,479 Speaker 1: could nail not just the audio quality of my voice, which, 684 00:43:22,840 --> 00:43:27,200 Speaker 1: let's be honest, is amazing. That's sarcasm. I can't stand 685 00:43:27,200 --> 00:43:30,600 Speaker 1: listening to myself. But it would also have to replicate 686 00:43:30,640 --> 00:43:34,480 Speaker 1: how I actually make certain sounds. Like, would it get 687 00:43:34,480 --> 00:43:37,160 Speaker 1: the bit of the Southern accent that's in my voice, 688 00:43:37,800 --> 00:43:40,960 Speaker 1: or the way I emphasize certain words? Would it pause 689 00:43:41,040 --> 00:43:44,399 Speaker 1: for effect at all, or would it just robotically say 690 00:43:44,560 --> 00:43:47,279 Speaker 1: one word after the next and only pause when there 691 00:43:47,360 --> 00:43:50,759 Speaker 1: was some helpful punctuation that told it to do so? 692 00:43:51,280 --> 00:43:54,080 Speaker 1: Would it indicate a question by raising the pitch at 693 00:43:54,080 --> 00:43:58,239 Speaker 1: the end of a sentence? Sadly, we never got far 694 00:43:58,760 --> 00:44:01,640 Speaker 1: with that particular project, so I don't have any 695 00:44:01,680 --> 00:44:03,520 Speaker 1: answers for you. I don't know how it would have 696 00:44:03,600 --> 00:44:06,240 Speaker 1: turned out, but clearly one of the things I thought 697 00:44:06,280 --> 00:44:09,200 Speaker 1: of was that it's a bit of a red flag. 698 00:44:09,239 --> 00:44:11,839 Speaker 1: If you can train a computer to sound exactly like 699 00:44:11,960 --> 00:44:15,240 Speaker 1: a specific person, that means you can make that person 700 00:44:15,760 --> 00:44:19,840 Speaker 1: say anything you like, and obviously, like deep fake videos, 701 00:44:19,880 --> 00:44:22,919 Speaker 1: that could have some pretty devastating consequences if it were 702 00:44:23,000 --> 00:44:27,960 Speaker 1: at all, you know, believable or seemed realistic.
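As a hedged illustration of the speech properties just listed, here is a small, hypothetical Python sketch using the librosa audio library: MFCCs as a rough proxy for timbre, pitch tracking for intonation, and low-energy frames as a crude stand-in for pauses. The file name and thresholds are invented, and this is only the analysis step, not a voice-cloning model.

```python
# Hypothetical sketch: extract the kinds of speech features a voice-cloning
# system might train on. "my_episode.wav" is a made-up file name.
import librosa
import numpy as np

audio, sr = librosa.load("my_episode.wav", sr=22050)

# Timbre: mel-frequency cepstral coefficients summarize the "color" of the voice.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

# Intonation: fundamental frequency (pitch) over time, e.g. rising at questions.
f0, voiced_flag, _ = librosa.pyin(audio, fmin=65, fmax=300, sr=sr)

# Pauses: frames whose energy falls below a simple (arbitrary) threshold.
rms = librosa.feature.rms(y=audio)[0]
pause_ratio = float(np.mean(rms < 0.01))

print("MFCC shape:", mfcc.shape)
print("Median pitch (Hz):", float(np.nanmedian(f0)))
print("Fraction of low-energy (pause-like) frames:", pause_ratio)
```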
Now, the 703 00:44:27,960 --> 00:44:31,000 Speaker 1: company we were working with was working hard to make 704 00:44:31,000 --> 00:44:33,440 Speaker 1: sure that the only person to have access to a 705 00:44:33,480 --> 00:44:36,600 Speaker 1: specific voice would be the owner of that voice, or 706 00:44:37,160 --> 00:44:40,600 Speaker 1: presumably the company employing that person. Though that does bring 707 00:44:40,680 --> 00:44:43,160 Speaker 1: up a whole bunch of other potential problems, like can 708 00:44:43,200 --> 00:44:47,480 Speaker 1: you imagine eliminating voice actors from a job because you've 709 00:44:47,480 --> 00:44:50,000 Speaker 1: got enough recordings of their voice and you can just replicate it? 710 00:44:50,080 --> 00:44:53,160 Speaker 1: That wouldn't be great. But even so, it was something 711 00:44:53,200 --> 00:44:56,480 Speaker 1: I felt was both fascinating from a technology standpoint and 712 00:44:56,520 --> 00:45:01,319 Speaker 1: potentially problematic when it comes to an application of that technology. 713 00:45:01,719 --> 00:45:05,080 Speaker 1: One other thing I should mention is that the Internet 714 00:45:05,200 --> 00:45:08,240 Speaker 1: at large has been pretty active in fighting deep fakes, 715 00:45:08,280 --> 00:45:11,640 Speaker 1: not necessarily in detecting them, but in removing the platforms from 716 00:45:12,040 --> 00:45:14,839 Speaker 1: which they were being shared, Reddit being a big one. 717 00:45:14,960 --> 00:45:17,560 Speaker 1: The subreddit that was dedicated to deep fakes has 718 00:45:17,560 --> 00:45:20,960 Speaker 1: been shut down. So there have been some of those 719 00:45:21,000 --> 00:45:24,160 Speaker 1: moves as well. Now this is not directly against the technology, 720 00:45:24,160 --> 00:45:28,840 Speaker 1: it's more against the proliferation of the output 721 00:45:29,280 --> 00:45:33,040 Speaker 1: of that technology. As for detecting deep fakes, it's interesting 722 00:45:33,080 --> 00:45:36,800 Speaker 1: to me that people are even developing tools to detect them, 723 00:45:36,840 --> 00:45:39,719 Speaker 1: because to me, the best tool so far seems to 724 00:45:39,760 --> 00:45:45,839 Speaker 1: be human perception. It's not that the images aren't really convincing, 725 00:45:46,000 --> 00:45:49,120 Speaker 1: or that we can suddenly detect these, you know, blending 726 00:45:49,239 --> 00:45:53,440 Speaker 1: lines like the Video Authenticator tool can. It's rather that it's 727 00:45:53,480 --> 00:45:56,160 Speaker 1: just not hard for us to spot a deep fake. Now, 728 00:45:56,200 --> 00:46:00,040 Speaker 1: stuff just doesn't quite look right in the way that 729 00:46:00,200 --> 00:46:04,360 Speaker 1: people behave in these videos. The vocals and animation often 730 00:46:04,440 --> 00:46:09,280 Speaker 1: don't quite match. The expressions aren't really natural, the progression 731 00:46:09,320 --> 00:46:14,319 Speaker 1: of mannerisms feels synthetic and not genuine. It just 732 00:46:14,360 --> 00:46:18,360 Speaker 1: looks off. It's that uncanny valley thing, and so just 733 00:46:18,440 --> 00:46:21,640 Speaker 1: paying attention and thinking critically can really help us suss 734 00:46:21,640 --> 00:46:24,319 Speaker 1: out the fakes from the real thing. Even if we 735 00:46:24,400 --> 00:46:27,759 Speaker 1: reach a point where machines can create a convincing enough 736 00:46:27,800 --> 00:46:32,000 Speaker 1: fake to pass for reality,
we can still apply critical thinking, 737 00:46:32,360 --> 00:46:35,440 Speaker 1: and we always should. Heck, we should be applying critical 738 00:46:35,480 --> 00:46:38,480 Speaker 1: thinking even when there's no doubt as to the validity 739 00:46:38,520 --> 00:46:42,200 Speaker 1: of the video, because there may be plenty to doubt 740 00:46:42,280 --> 00:46:45,920 Speaker 1: in the content of the video itself. If I listen to 741 00:46:46,000 --> 00:46:50,360 Speaker 1: a genuine scam artist in a genuine video, that doesn't 742 00:46:50,400 --> 00:46:53,799 Speaker 1: make the scam more legitimate. We always need to use 743 00:46:53,840 --> 00:46:57,200 Speaker 1: critical thinking. What I think is most important is that 744 00:46:57,239 --> 00:47:03,560 Speaker 1: we acknowledge the very real fact that there are numerous organizations, agencies, governments, 745 00:47:03,840 --> 00:47:08,160 Speaker 1: and other groups that are actively attempting to spread misinformation 746 00:47:08,400 --> 00:47:14,719 Speaker 1: and disinformation. There are entire intelligence agencies dedicated to this endeavor, 747 00:47:15,160 --> 00:47:18,640 Speaker 1: and then there are more independent groups that are doing 748 00:47:18,680 --> 00:47:22,000 Speaker 1: it for one reason or another, typically either to advance 749 00:47:22,040 --> 00:47:25,839 Speaker 1: a particular political agenda or just to make as much 750 00:47:25,920 --> 00:47:30,560 Speaker 1: money as quickly as possible. This is beyond doubt or question. 751 00:47:30,640 --> 00:47:34,600 Speaker 1: There are numerous misinformation campaigns that are actively going on 752 00:47:34,760 --> 00:47:38,080 Speaker 1: out there in the real world right now. Most of 753 00:47:38,120 --> 00:47:42,279 Speaker 1: them are not depending on deep fakes, because one, deep 754 00:47:42,320 --> 00:47:45,920 Speaker 1: fakes aren't really good enough to fool most people right now, 755 00:47:46,400 --> 00:47:49,600 Speaker 1: and two, they don't need deep fakes in the 756 00:47:49,640 --> 00:47:52,400 Speaker 1: first place. There are other methods that are simpler, that 757 00:47:52,520 --> 00:47:56,280 Speaker 1: don't need nearly the processing power, and that work just fine. 758 00:47:56,600 --> 00:47:59,160 Speaker 1: Why would you go through the trouble of synthesizing a 759 00:47:59,280 --> 00:48:01,839 Speaker 1: video if you can get a better response with a 760 00:48:01,840 --> 00:48:05,920 Speaker 1: blog post filled with lies or half-truths? It's just 761 00:48:06,000 --> 00:48:09,520 Speaker 1: not a great return on investment. So bottom line, be 762 00:48:09,680 --> 00:48:14,520 Speaker 1: vigilant out there, particularly on social media. Be aware that 763 00:48:14,560 --> 00:48:17,239 Speaker 1: there are plenty of people who will not hesitate to 764 00:48:17,360 --> 00:48:20,719 Speaker 1: mislead others in order to get what they want. Use 765 00:48:20,760 --> 00:48:26,000 Speaker 1: a critical eye to evaluate the information you encounter. Ask questions, 766 00:48:26,440 --> 00:48:31,160 Speaker 1: check sources, look for corroborating reports. It's a lot of work, 767 00:48:31,200 --> 00:48:34,080 Speaker 1: but trust me, it's way better that we do our 768 00:48:34,120 --> 00:48:37,120 Speaker 1: best to make sure the stuff we're depending on is 769 00:48:37,200 --> 00:48:40,759 Speaker 1: actually dependable. It'll turn out better for us in the 770 00:48:40,800 --> 00:48:43,919 Speaker 1: long run.
Well, that wraps up this episode of Tech Stuff, 771 00:48:43,960 --> 00:48:47,399 Speaker 1: which, yeah, I used as a backdoor to argue about 772 00:48:47,440 --> 00:48:51,000 Speaker 1: critical thinking. Again, sue me. Don't, don't really sue me. 773 00:48:51,520 --> 00:48:55,560 Speaker 1: But I think this is another instance where it's a 774 00:48:55,640 --> 00:48:58,680 Speaker 1: really clear example of where we have to use that kind 775 00:48:58,680 --> 00:49:01,000 Speaker 1: of thinking. So I'm gonna keep on stressing it. 776 00:49:01,480 --> 00:49:05,080 Speaker 1: And you guys are awesome. I believe in you. I 777 00:49:05,120 --> 00:49:08,080 Speaker 1: think that when we start using these tools at our 778 00:49:08,080 --> 00:49:12,560 Speaker 1: disposal, which everybody can develop just with some practice, 779 00:49:13,040 --> 00:49:16,120 Speaker 1: things will be better. We'll be able to suss out 780 00:49:16,200 --> 00:49:20,719 Speaker 1: the nonsense from the real stuff, and we're all better 781 00:49:20,760 --> 00:49:22,439 Speaker 1: off in the long run if we can do that. 782 00:49:23,000 --> 00:49:25,680 Speaker 1: If you guys have suggestions for future topics I should 783 00:49:25,719 --> 00:49:28,960 Speaker 1: cover in episodes of Tech Stuff, let me know via Twitter. 784 00:49:29,280 --> 00:49:33,279 Speaker 1: The handle is TechStuffHSW, and I'll 785 00:49:33,280 --> 00:49:41,480 Speaker 1: talk to you again really soon. Tech Stuff is an 786 00:49:41,480 --> 00:49:45,200 Speaker 1: I Heart Radio production. For more podcasts from I Heart Radio, 787 00:49:45,520 --> 00:49:48,680 Speaker 1: visit the I Heart Radio app, Apple Podcasts, or wherever 788 00:49:48,760 --> 00:49:50,280 Speaker 1: you listen to your favorite shows.