WEBVTT - Deep Learning and Deepfakes 0:00:04.400 --> 0:00:07.800 Welcome to TechStuff, a production from iHeartRadio. 0:00:12.400 --> 0:00:15.200 Hey there, and welcome to TechStuff. I'm your host, 0:00:15.360 --> 0:00:18.400 Jonathan Strickland. I'm an executive producer with iHeartRadio 0:00:18.440 --> 0:00:21.720 and I love all things tech. Now, before I get 0:00:21.760 --> 0:00:25.000 into today's episode, I want to give a little listener 0:00:25.239 --> 0:00:29.280 warning here. The topic at hand involves some adult content, 0:00:29.760 --> 0:00:33.040 including the use of technology to do stuff that can 0:00:33.120 --> 0:00:37.960 be unethical, illegal, hurtful, and just plain awful. Now, I 0:00:38.000 --> 0:00:40.800 think this is an important topic, but I wanted to 0:00:40.840 --> 0:00:42.800 give a bit of a heads up at the start 0:00:42.840 --> 0:00:45.440 of the episode, just in case any of you guys 0:00:45.479 --> 0:00:48.880 are listening to a podcast on, like, a family road 0:00:48.920 --> 0:00:51.880 trip or something. I think this is an important topic 0:00:52.320 --> 0:00:55.120 and I think everyone should know about it and think 0:00:55.160 --> 0:00:57.360 about it. But I also respect that for some people 0:00:57.360 --> 0:01:00.680 this subject might get a bit taboo. So let's go 0:01:00.880 --> 0:01:06.360 on with the episode. Back in nineteen ninety three, a movie called 0:01:06.720 --> 0:01:11.319 Rising Sun, directed by Philip Kaufman, based on a Michael 0:01:11.360 --> 0:01:15.119 Crichton novel and starring Wesley Snipes and Sean Connery, came 0:01:15.120 --> 0:01:18.360 out in theaters. Now, I didn't see it in theaters, 0:01:19.040 --> 0:01:21.360 but I did catch it when it came on, you know, 0:01:21.920 --> 0:01:25.760 HBO or Cinemax or something later on. The movie included 0:01:25.760 --> 0:01:28.959 a sequence that I found to be totally unbelievable.
And 0:01:29.000 --> 0:01:32.720 I'm not talking about buying into Sean Connery being an 0:01:32.720 --> 0:01:37.319 expert on Japanese culture and business practices. Actually, side note, 0:01:37.480 --> 0:01:41.720 Sean Connery has an interesting history of playing unlikely characters, 0:01:41.760 --> 0:01:44.759 such as in Highlander, where he played an immortal who 0:01:44.840 --> 0:01:49.080 was supposedly Egyptian, then who lived in feudal Japan and 0:01:49.200 --> 0:01:51.840 ended up in Spain, where he became known as Ramirez. 0:01:52.200 --> 0:01:54.760 And all the while he's talking to a Scottish Highlander 0:01:54.960 --> 0:01:58.200 who's played by a Belgian actor. But I'm getting way 0:01:58.240 --> 0:02:02.040 off track here. Besides, I've heard Crichton actually wrote the 0:02:02.160 --> 0:02:05.000 character while thinking of Connery, so, you know, what the 0:02:05.000 --> 0:02:08.320 heck do I know? In the film, Snipes and Connery 0:02:08.440 --> 0:02:12.240 are investigators, and they're looking into a homicide that happened 0:02:12.280 --> 0:02:16.760 at a Japanese business but on American soil. The security 0:02:16.800 --> 0:02:21.080 system in the building captured video of the homicide, and 0:02:21.120 --> 0:02:23.480 the identity of the killer appears to be a pretty 0:02:23.560 --> 0:02:26.880 open and shut case. But that's not how it all 0:02:26.960 --> 0:02:30.520 turns out. The investigators talk to a security expert played 0:02:30.520 --> 0:02:34.440 by Tia Carrere, and she demonstrates in real time how 0:02:34.560 --> 0:02:39.160 video footage can be altered. She records a short video 0:02:39.520 --> 0:02:43.800 of Connery and Snipes, loads that onto a computer, freezes 0:02:43.960 --> 0:02:47.600 a frame of the video, and essentially performs a cut 0:02:47.639 --> 0:02:51.440 and paste job, swapping the heads of our two lead characters.
0:02:51.880 --> 0:02:55.000 Then she resumes the video and the head swap remains 0:02:55.000 --> 0:02:59.960 in place, and that head swap stuff is possible. I mean, 0:03:00.040 --> 0:03:02.440 clearly it has to be possible, because you actually do 0:03:02.600 --> 0:03:05.680 see that effect in the film itself. But it takes 0:03:05.800 --> 0:03:08.680 a bit more than a quick cut and paste job. 0:03:08.760 --> 0:03:11.320 But we'll leave off of that for now. The whole 0:03:11.360 --> 0:03:15.720 point of that sequence, apart from showing off some cinema magic, 0:03:16.240 --> 0:03:20.560 is to demonstrate to the investigators that video, like photographs, 0:03:20.880 --> 0:03:24.560 can be altered. The expert has detected a blue halo 0:03:24.720 --> 0:03:27.919 around the face of the supposed murderer in the footage, 0:03:28.240 --> 0:03:31.680 indicating that some sort of trickery has happened. She also 0:03:31.760 --> 0:03:34.800 reveals that she cannot magically restore the video to its 0:03:34.800 --> 0:03:37.920 previous unaltered state, which I think was actually a nice 0:03:38.000 --> 0:03:41.040 change of pace for a movie. By the way, I 0:03:41.080 --> 0:03:44.760 think this movie is really, you know, not good, like 0:03:45.480 --> 0:03:50.360 not worth your time, but that's my opinion anyway. For years, 0:03:50.680 --> 0:03:54.040 this kind of video sorcery was pretty much limited to 0:03:54.160 --> 0:03:57.760 the film and TV industries. It usually required a lot 0:03:57.800 --> 0:04:01.720 of pre planning beforehand, so it wasn't as simple as 0:04:01.760 --> 0:04:04.920 just taking footage that was already shot and changing it 0:04:04.960 --> 0:04:07.720 in post on a whim with a couple of clicks 0:04:07.760 --> 0:04:09.800 of a button. If it were, we would see a 0:04:09.800 --> 0:04:13.920 lot fewer mistakes left in movies and television because you 0:04:13.960 --> 0:04:16.599 could catch it later and just fix it. 
But the 0:04:16.640 --> 0:04:20.359 tricks were possible, they were just difficult to pull off. 0:04:20.880 --> 0:04:23.840 It just wasn't something you or I would ever encounter 0:04:23.960 --> 0:04:27.560 in our day to day lives. But today we live 0:04:27.680 --> 0:04:30.880 in a different world, a world that has examples of 0:04:31.000 --> 0:04:35.960 synthetic media, commonly referred to as deepfakes. These are 0:04:36.080 --> 0:04:39.839 videos that have been altered or generated so that the 0:04:39.920 --> 0:04:42.839 subject of the video is doing something that they probably 0:04:42.960 --> 0:04:47.200 would or could never do. They've brought into question whether 0:04:47.320 --> 0:04:50.800 or not video evidence is even reliable, much as the 0:04:50.839 --> 0:04:54.440 film Rising Sun was talking about. We already know that 0:04:54.520 --> 0:05:00.559 eyewitness testimony is terribly unreliable. Our perception and memory play 0:05:00.600 --> 0:05:04.560 tricks on us, and we can quote unquote remember stuff 0:05:04.600 --> 0:05:09.560 that just didn't happen the way things actually unfolded in reality. 0:05:09.600 --> 0:05:13.359 But now we're looking at video evidence in potentially the 0:05:13.440 --> 0:05:17.600 same light. I mean, it's scary. So today we're going 0:05:17.680 --> 0:05:21.760 to learn about synthetic media, how it can be generated, 0:05:22.080 --> 0:05:26.000 the implications that follow with that sort of reality, and 0:05:26.120 --> 0:05:29.559 ways that people are trying to counteract a potentially dangerous threat. 0:05:30.040 --> 0:05:34.800 You know, fun stuff. Now, first, the term synthetic media 0:05:35.120 --> 0:05:39.400 has a particular meaning. It refers to art created through 0:05:39.520 --> 0:05:43.760 some sort of automated process, so it's a largely hands 0:05:43.839 --> 0:05:49.000 off approach to creating the final art piece.
Now, under 0:05:49.040 --> 0:05:52.880 that definition, the example of Rising Sun would not apply 0:05:53.080 --> 0:05:56.400 here, because we see in the film, and presumably this 0:05:56.480 --> 0:05:58.599 happens in the book as well, but I haven't read 0:05:58.680 --> 0:06:03.200 the book, that a human being actually changes that. People 0:06:03.279 --> 0:06:06.880 have used tools to alter the video footage. This would 0:06:06.880 --> 0:06:10.280 be more like using Photoshop to touch up a still image, 0:06:10.279 --> 0:06:14.039 with the computer system presumably doing some of the work 0:06:14.040 --> 0:06:16.800 in the background to keep things matched up. Either that 0:06:16.960 --> 0:06:19.640 or you would need to alter each image in the 0:06:19.640 --> 0:06:23.760 footage frame by frame, or use some sort of matte approach. 0:06:24.360 --> 0:06:26.880 To learn more about mattes, you can listen to my 0:06:26.920 --> 0:06:30.760 episode about how blue and green screens work. Synthetic media 0:06:31.040 --> 0:06:35.200 as a general practice has been around for centuries. Artists 0:06:35.200 --> 0:06:38.640 have set up various contraptions to create works with little 0:06:38.880 --> 0:06:43.039 or no human guidance. In the twentieth century we started 0:06:43.120 --> 0:06:46.960 to see a movement called generative art take form. This 0:06:47.000 --> 0:06:49.560 type of art is all about creating a system that 0:06:49.680 --> 0:06:53.880 then creates or generates the finished art piece. That would 0:06:53.920 --> 0:06:57.080 mean that the finished work, such as a painting, wouldn't 0:06:57.400 --> 0:07:00.400 reflect the feelings or thoughts of the artist who 0:07:00.440 --> 0:07:03.919 created the system. In fact, it starts to raise the 0:07:04.000 --> 0:07:07.120 question, what is the art? Is it the painting that 0:07:07.200 --> 0:07:11.000 came about due to a machine following a program of 0:07:11.080 --> 0:07:15.400 some sort, or is the art the program itself?
Is 0:07:15.440 --> 0:07:19.000 the art the process by which the painting was made? 0:07:19.320 --> 0:07:22.000 Now I'm not here to answer that question. I just 0:07:22.320 --> 0:07:26.640 think it is an interesting question to ask. Sometimes people 0:07:26.680 --> 0:07:30.600 ask much less polite questions, such as, is it art 0:07:30.640 --> 0:07:34.280 at all? Some art critics went out of their way 0:07:34.320 --> 0:07:37.520 to dismiss generative art in the early days. They found 0:07:37.520 --> 0:07:42.000 it insulting, but hey, that's kind of the history of 0:07:42.200 --> 0:07:46.560 art in general. Each new movement in art inevitably finds 0:07:46.600 --> 0:07:51.080 both supporters and critics as it emerges. If anything, you 0:07:51.200 --> 0:07:55.360 might argue that such a response legitimizes the movement in, 0:07:55.560 --> 0:07:58.640 you know, a weird way. If people hate it, it 0:07:58.720 --> 0:08:02.720 must be something. In two thousand eighteen, an artist collective 0:08:03.040 --> 0:08:07.920 called Obvious, located out of Paris, France, submitted portrait 0:08:08.000 --> 0:08:11.920 style paintings that were created not by an actual human painter, 0:08:12.440 --> 0:08:16.440 but by an artificially intelligent system. Now they looked a 0:08:16.480 --> 0:08:21.720 lot like typical eighteenth century style portraits. There was no 0:08:21.800 --> 0:08:24.640 attempt to pass off the portrait as if it were 0:08:24.720 --> 0:08:28.120 actually made by a human artist. In fact, the appeal 0:08:28.320 --> 0:08:32.760 of the piece was largely due to it being synthetically generated. 0:08:33.200 --> 0:08:36.720 It went to auction at Christie's and the AI created 0:08:36.800 --> 0:08:42.000 painting fetched more than four hundred thousand dollars. And the 0:08:42.040 --> 0:08:45.280 way the group trained their AI is relevant to our 0:08:45.320 --> 0:08:49.960 discussion about deepfakes.
The collective relied on a type 0:08:49.960 --> 0:08:55.560 of machine learning called generative adversarial networks, or GANs, 0:08:56.080 --> 0:08:59.319 which in turn depend on deep learning. So it 0:08:59.360 --> 0:09:00.959 looks like we've got a few things we're going to 0:09:01.080 --> 0:09:03.840 have to define here. Now, I'm going to keep things 0:09:04.160 --> 0:09:07.719 fairly high level, because as it turns out there are 0:09:07.760 --> 0:09:11.439 a few different ways to create machine learning models, and 0:09:11.520 --> 0:09:14.280 to go through all of them in exhaustive detail would 0:09:14.280 --> 0:09:17.600 represent a university level course in machine learning. I have 0:09:17.760 --> 0:09:21.240 neither the time for that nor the expertise. I would 0:09:21.320 --> 0:09:24.960 do a terrible job. So we'll go with a high 0:09:25.040 --> 0:09:31.480 level perspective here. First, a generative adversarial network uses two systems. 0:09:31.520 --> 0:09:35.280 You have a generator and you have a discriminator. Both 0:09:35.360 --> 0:09:38.760 of these systems are a type of neural network. A 0:09:38.840 --> 0:09:42.480 neural network is a computing model that is inspired by 0:09:42.480 --> 0:09:47.520 the way our brains work. Our brains contain billions of neurons, 0:09:47.760 --> 0:09:52.200 and these neurons work together, communicating through electrical and chemical signals, 0:09:52.440 --> 0:09:57.680 controlling and coordinating pretty much everything in our bodies. With computers, 0:09:58.040 --> 0:10:02.720 the neurons are nodes. The job of a node is, 0:10:03.120 --> 0:10:05.400 you know, supposed to be kind of like a neuron 0:10:05.640 --> 0:10:08.960 cell in the brain. It's to take in multiple weighted 0:10:09.080 --> 0:10:14.360 input values and then generate a single output value.
Now, 0:10:14.400 --> 0:10:18.000 the word weighted, W E I G H T E 0:10:18.080 --> 0:10:21.840 D, weighted, is really important here, because the larger an 0:10:21.960 --> 0:10:26.120 input's weight, the more that input will have an effect 0:10:26.360 --> 0:10:29.000 on whatever the output is. So it kind of comes 0:10:29.040 --> 0:10:32.679 down to which inputs are the most important for that 0:10:32.800 --> 0:10:36.760 node's particular function. Now, if I were to make an analogy, 0:10:36.840 --> 0:10:40.560 I would say, your boss hands you three tasks to do. 0:10:41.240 --> 0:10:45.360 One of those tasks has the label extremely important, and 0:10:45.440 --> 0:10:49.320 the second task has the label critically important, and the 0:10:49.400 --> 0:10:52.240 third task has a label saying you should have finished 0:10:52.280 --> 0:10:55.040 that one before it was handed to you. Okay, so 0:10:55.080 --> 0:10:57.800 that's just some sort of snarky office humor that I 0:10:57.840 --> 0:11:00.520 need to get off my chest. But more seriously, imagine 0:11:00.559 --> 0:11:05.000 a node accepting three inputs. In this example, input one 0:11:05.280 --> 0:11:09.680 has a fifty percent weight, input two has a forty percent weight, and 0:11:09.720 --> 0:11:12.360 input three has a ten percent weight. That adds up 0:11:12.400 --> 0:11:16.200 to one hundred percent, and that would tell you that the output that 0:11:16.280 --> 0:11:21.160 node generates will be most affected by input one, followed 0:11:21.200 --> 0:11:24.199 by input two, and then input three would have a 0:11:24.280 --> 0:11:29.120 smaller effect on whatever the output is. Each node applies 0:11:29.200 --> 0:11:34.080 a nonlinear transformation on the input values, again affected by 0:11:34.240 --> 0:11:39.000 each input's weight value, and that generates the output value.
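To make the three-input node a little more concrete, here is a minimal sketch in Python. It assumes the three weights are 0.5, 0.4, and 0.1 (so they sum to one) and uses a sigmoid as the nonlinear transformation; the episode does not name a particular function, so that choice and the names `node_output` and `effects` are illustrative only.

```python
import math

def node_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, followed by a nonlinear squashing step."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid is an assumed choice of nonlinearity; it maps any sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-weighted_sum))

weights = [0.5, 0.4, 0.1]  # assumed weights for inputs one, two, and three
baseline = node_output([1.0, 1.0, 1.0], weights)

# Nudge each input by the same amount and record how far the output moves.
effects = []
for i in range(3):
    nudged = [1.0, 1.0, 1.0]
    nudged[i] += 1.0
    effects.append(node_output(nudged, weights) - baseline)

for i, effect in enumerate(effects, start=1):
    print(f"input {i}: output moves by {effect:.4f}")
```

The same nudge moves the output most when it is applied to input one and least when it is applied to input three, which is the point of the weighting.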
0:11:39.480 --> 0:11:43.520 The details of that really are not important for our episode. 0:11:43.520 --> 0:11:46.920 It involves performing changes on variables that in turn change 0:11:46.960 --> 0:11:50.360 the correlation between variables, and it gets a bit mathy, 0:11:50.559 --> 0:11:53.360 and we would get lost in the weeds pretty quickly. 0:11:53.679 --> 0:11:56.480 The important thing to remember is that a node within 0:11:56.520 --> 0:12:01.280 a neural network takes in a weighted sum of inputs, then 0:12:01.320 --> 0:12:06.680 performs a process on those inputs before passing the result 0:12:06.800 --> 0:12:10.520 on as an output. Then some other node a layer 0:12:10.640 --> 0:12:14.400 down will accept that output, along with outputs from a 0:12:14.440 --> 0:12:17.600 couple of other nodes one layer up, and then will 0:12:17.640 --> 0:12:21.400 perform an operation based on those weighted inputs and pass 0:12:21.480 --> 0:12:23.840 that on to the next layer, and so on. So 0:12:23.920 --> 0:12:27.000 these nodes are in layers, like, you know, a cake. 0:12:27.600 --> 0:12:30.520 One layer of nodes processes some inputs, they send it 0:12:30.559 --> 0:12:33.440 on to the next layer of nodes, and then that 0:12:33.480 --> 0:12:35.320 one sends it on to the next one, and the next one, 0:12:35.360 --> 0:12:40.880 and so on. This isn't a new idea. Computer scientists 0:12:41.040 --> 0:12:45.679 began theorizing and experimenting with neural network approaches as far 0:12:45.760 --> 0:12:49.360 back as the nineteen fifties with the perceptron, which was 0:12:49.400 --> 0:12:53.280 a hypothetical system that was described by Frank Rosenblatt of 0:12:53.320 --> 0:12:57.160 Cornell University. But it wasn't until the last decade that 0:12:57.280 --> 0:13:00.400 computing power and our ability to handle a lot of 0:13:00.520 --> 0:13:04.040 data reached a point where these sorts of learning models 0:13:04.040 --> 0:13:08.280 could really take off.
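The layer-to-layer hand-off described above can be sketched as a short Python example. The weights, the layer sizes (three inputs, two middle nodes, one output node), and the sigmoid nonlinearity are all assumptions made for illustration, not details from the episode.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def layer_forward(inputs, layer_weights):
    """Each row of layer_weights holds one node's weights, one per input."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, node_weights)))
            for node_weights in layer_weights]

def network_forward(inputs, layers):
    """Pass the outputs of each layer on as the inputs of the next layer."""
    values = inputs
    for layer_weights in layers:
        values = layer_forward(values, layer_weights)
    return values

# Hypothetical network: 3 inputs -> 2 middle nodes -> 1 output node.
layers = [
    [[0.5, 0.4, 0.1], [-0.3, 0.8, 0.2]],
    [[1.0, -1.0]],
]
result = network_forward([1.0, 0.0, 1.0], layers)
print(result)
```

Adding depth here just means appending more weight tables to `layers`; each one consumes whatever the layer before it produced.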
The goal of this system is 0:13:08.320 --> 0:13:12.080 to train it to perform a particular task within a 0:13:12.120 --> 0:13:16.880 certain level of precision. The weights I mentioned are adjustable, 0:13:17.040 --> 0:13:19.360 so you can think of it as teaching a system 0:13:19.480 --> 0:13:22.840 which bits are the most important in order to do 0:13:23.040 --> 0:13:25.760 whatever it is the system is supposed to do. In 0:13:25.840 --> 0:13:28.880 order to achieve your task, these are the bits that 0:13:28.920 --> 0:13:32.320 are the most important and therefore should matter the most 0:13:32.320 --> 0:13:35.240 when you weigh a decision. This is a bit easier 0:13:35.280 --> 0:13:38.319 if we talk about a similar system with the version 0:13:38.360 --> 0:13:42.679 of IBM's Watson that played on Jeopardy. That system 0:13:42.800 --> 0:13:46.280 famously was not connected to the Internet. It had to 0:13:46.320 --> 0:13:50.319 rely on all the information that was stored within itself. 0:13:50.960 --> 0:13:55.000 When the system encountered a clue in Jeopardy, it would 0:13:55.000 --> 0:13:57.959 analyze the clue, and then it would reference its 0:13:57.960 --> 0:14:01.320 database to look for possible answers to whatever that clue was. 0:14:01.800 --> 0:14:05.160 The system would weigh those possible answers and attempt to 0:14:05.160 --> 0:14:08.760 determine which, if any, were the most likely to be correct. 0:14:09.200 --> 0:14:13.920 If the certainty was over a certain threshold, such as sure, 0:14:14.200 --> 0:14:16.720 the system would buzz in with its answer. If no 0:14:16.880 --> 0:14:20.920 response rose above that threshold, the system would not buzz in. 0:14:21.280 --> 0:14:23.480 So you could say that Watson was playing the game 0:14:23.520 --> 0:14:27.680 with a best guess sort of approach. Neural networks do 0:14:28.240 --> 0:14:33.000 essentially that sort of processing.
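The best guess idea can be illustrated with a few lines of Python. This is only a sketch of thresholded answering in general, not a description of how Watson was built; the function name `best_guess`, the candidate labels, and every confidence score below are hypothetical.

```python
def best_guess(candidates, threshold):
    """Return the highest-scoring response, or None to stay silent."""
    if not candidates:
        return None
    response, confidence = max(candidates.items(), key=lambda item: item[1])
    # Only "buzz in" when the top candidate rises above the threshold.
    return response if confidence > threshold else None

# Hypothetical confidence scores for three candidate responses.
scores = {"response A": 0.12, "response B": 0.83, "response C": 0.05}
confident = best_guess(scores, threshold=0.5)
cautious = best_guess(scores, threshold=0.9)
print(confident, cautious)
```

Raising the threshold trades fewer wrong answers for more silence, which is the design choice the passage describes.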
With this particular type of approach, 0:14:33.400 --> 0:14:36.640 we know what we want the outcome to be, so 0:14:36.840 --> 0:14:39.880 we can judge whether or not the system was successful. 0:14:40.200 --> 0:14:43.760 After each attempt, we can adjust the weights on the 0:14:43.800 --> 0:14:47.760 inputs between nodes to refine the decision making process to 0:14:47.840 --> 0:14:51.880 get more accurate results. If the system succeeds in its task, 0:14:52.360 --> 0:14:55.720 we can increase the weights that contributed to the system 0:14:55.760 --> 0:15:00.240 picking the correct answer and thus decrease the input weights 0:15:00.320 --> 0:15:05.280 that did not contribute to the successful response. If the 0:15:05.280 --> 0:15:09.320 system done messed up and gave the wrong answer, then 0:15:09.360 --> 0:15:11.720 we do the opposite. We look at the inputs that 0:15:11.760 --> 0:15:16.000 contributed to the wrong answer, we diminish their weights, and 0:15:16.080 --> 0:15:18.440 we increase the weights of the other inputs, and then 0:15:18.440 --> 0:15:23.120 we run the test again. A lot. I'll explain a 0:15:23.160 --> 0:15:25.600 bit more about this process when we come back, but 0:15:25.680 --> 0:15:36.400 first let's take a quick break. Early in the history 0:15:36.520 --> 0:15:40.760 of neural networks, computer scientists were hitting some pretty hard 0:15:40.880 --> 0:15:44.400 stops due to the limitations of computing power at the time. 0:15:44.720 --> 0:15:48.080 Early networks were only a couple of layers deep, which 0:15:48.080 --> 0:15:50.720 really meant they weren't terribly powerful, and they could only 0:15:50.760 --> 0:15:54.400 tackle rudimentary tasks like figuring out whether or not a 0:15:54.520 --> 0:15:59.160 square is drawn on a piece of paper. That isn't 0:15:59.240 --> 0:16:05.560 terribly sophisticated.
In nineteen eighty six, David Rumelhart, Geoffrey Hinton, and Ronald 0:16:05.600 --> 0:16:12.120 Williams published a paper titled Learning Representations by Back-Propagating Errors. 0:16:12.160 --> 0:16:16.840 This was a big breakthrough with deep learning. This all 0:16:16.880 --> 0:16:19.360 has to do with a deep learning system improving its 0:16:19.360 --> 0:16:22.760 ability to complete a specific task. And basically the algorithm's 0:16:22.840 --> 0:16:25.840 job is to go from the output layer, you know, 0:16:25.960 --> 0:16:29.000 where the system has made a decision, and then work 0:16:29.160 --> 0:16:32.680 backward through the neural network, adjusting the weights that led 0:16:32.720 --> 0:16:38.480 to an incorrect decision. So let's say it's a system 0:16:38.520 --> 0:16:41.680 that is looking to figure out whether or not a 0:16:41.720 --> 0:16:45.000 cat is in a photograph, and it says, there's a 0:16:45.040 --> 0:16:47.400 cat in this picture, and you look at the picture 0:16:47.400 --> 0:16:50.440 and there is no cat there. Then you would look 0:16:50.560 --> 0:16:54.720 at the inputs one level back, just before the system 0:16:54.800 --> 0:16:57.160 said here's a picture of a cat, and you'd say, 0:16:57.200 --> 0:16:59.720 all right, which of these inputs led the system to 0:17:00.120 --> 0:17:03.200 believe this was a picture of a cat? And then 0:17:03.280 --> 0:17:06.200 you would adjust those. Then you would go back one 0:17:06.320 --> 0:17:10.159 layer up, so you're working your way up the model, 0:17:10.520 --> 0:17:14.240 and say, which inputs here led to it giving the 0:17:14.280 --> 0:17:18.400 outputs that led to the mistake? And you do this 0:17:18.640 --> 0:17:21.760 all the way up until you get up to the 0:17:21.800 --> 0:17:24.639 input level at the top of the computer model. You 0:17:24.680 --> 0:17:28.040 are back propagating, and then you run the test again 0:17:28.160 --> 0:17:32.720 to see if you've got improvement.
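As a rough illustration of working backward from the output, here is a small Python sketch of back propagation on a tiny network (two inputs, two middle nodes, one output). The network shape, the sigmoid nonlinearity, the squared-error measure, the starting weights, and the learning rate are all assumptions chosen to keep the example short; real systems are far larger.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def forward(x, hidden_w, output_w):
    hidden = [sigmoid(sum(xi * w for xi, w in zip(x, row))) for row in hidden_w]
    output = sigmoid(sum(h * w for h, w in zip(hidden, output_w)))
    return hidden, output

def loss(x, target, hidden_w, output_w):
    _, output = forward(x, hidden_w, output_w)
    return 0.5 * (output - target) ** 2

def backprop(x, target, hidden_w, output_w):
    """Gradient of the loss for every weight, computed from the output back."""
    hidden, output = forward(x, hidden_w, output_w)
    # Output layer: how the error changes with the output node's weighted sum.
    delta_out = (output - target) * output * (1.0 - output)
    grad_output_w = [delta_out * h for h in hidden]
    # One layer up: pass the blame back through each connection.
    grad_hidden_w = []
    for h, w_out in zip(hidden, output_w):
        delta_hidden = delta_out * w_out * h * (1.0 - h)
        grad_hidden_w.append([delta_hidden * xi for xi in x])
    return grad_hidden_w, grad_output_w

def train_step(x, target, hidden_w, output_w, learning_rate=0.5):
    """Shift every weight a little in the direction that reduces the error."""
    grad_hidden_w, grad_output_w = backprop(x, target, hidden_w, output_w)
    new_hidden = [[w - learning_rate * g for w, g in zip(row, grad_row)]
                  for row, grad_row in zip(hidden_w, grad_hidden_w)]
    new_output = [w - learning_rate * g for w, g in zip(output_w, grad_output_w)]
    return new_hidden, new_output

# Hypothetical "no cat" example: the right answer is 0, the network says ~0.7.
x, target = [1.0, 0.0], 0.0
hidden_w = [[0.2, -0.4], [0.7, 0.1]]
output_w = [0.6, 0.9]
before = loss(x, target, hidden_w, output_w)
hidden_w2, output_w2 = train_step(x, target, hidden_w, output_w)
after = loss(x, target, hidden_w2, output_w2)
print(f"error before: {before:.4f}, after one adjustment: {after:.4f}")
```

Running the test again after the adjustment shows a slightly smaller error; repeating that loop many times over many examples is the training process.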
It's exhaustive, but it's 0:17:32.800 --> 0:17:38.000 also drastically improved neural network performance, much faster than just 0:17:38.520 --> 0:17:42.080 throwing more brute force at it. The algorithm essentially is 0:17:42.160 --> 0:17:44.920 checking to see if a small change in each input 0:17:45.040 --> 0:17:48.640 value received by a layer of nodes would have led 0:17:48.680 --> 0:17:51.200 to a more accurate result. So it's all about going 0:17:51.240 --> 0:17:54.679 from that output, working your way backward. In two thousand twelve, 0:17:54.720 --> 0:17:57.920 Alex Krizhevsky published a paper that gave us the next 0:17:58.000 --> 0:18:02.480 big breakthrough. He argued that a really deep neural network 0:18:02.760 --> 0:18:06.040 with a lot of layers could give really great results 0:18:06.200 --> 0:18:09.960 if you paired it with enough data to train the system. 0:18:10.000 --> 0:18:13.600 So you needed to throw lots of data at these models, 0:18:14.320 --> 0:18:17.720 and it needed to be an enormous amount of data. However, 0:18:17.880 --> 0:18:22.120 once trained, the system would produce lower error rates. So yeah, 0:18:22.160 --> 0:18:24.040 it would take a long time, but you would get 0:18:24.080 --> 0:18:27.560 better results. Now, at the time, a good error rate 0:18:27.720 --> 0:18:31.840 for such a system was twenty five percent. That means one out of 0:18:31.920 --> 0:18:35.159 four conclusions the system would come to would be wrong. 0:18:35.600 --> 0:18:39.800 If you ran it across a long enough number of decisions, 0:18:39.800 --> 0:18:43.120 you would find that one out of every four wasn't right. 0:18:43.880 --> 0:18:47.520 The system that Alex's team worked on produced results that 0:18:47.560 --> 0:18:50.880 had an error rate of sixteen percent, so much lower.
0:18:51.040 --> 0:18:54.720 And then in just five years, with more improvements to 0:18:54.800 --> 0:18:58.919 this process, the classification error rate had dropped down to 0:18:59.080 --> 0:19:02.760 two point three percent for deep learning systems. So from 0:19:04.160 --> 0:19:09.080 twenty five percent to two point three. It was really powerful stuff. Okay, 0:19:09.119 --> 0:19:12.960 so you've got your artificial neural network. You've got your 0:19:13.080 --> 0:19:17.359 layers and layers of nodes. You've adjusted the weights of 0:19:17.400 --> 0:19:20.439 the inputs into each node to see if your system 0:19:20.520 --> 0:19:25.240 can identify, you know, pictures of cats, and you start 0:19:25.320 --> 0:19:29.479 feeding images to this system, lots of them. This is 0:19:29.520 --> 0:19:32.439 the domain that you are feeding to your system. The 0:19:32.480 --> 0:19:34.919 more images you can feed to it, the better. And 0:19:34.960 --> 0:19:37.120 you want a wide variety of images of all sorts 0:19:37.160 --> 0:19:39.879 of stuff, not just of different types of cats, but 0:19:40.000 --> 0:19:43.760 stuff that most certainly is not a cat, like dogs, 0:19:43.880 --> 0:19:48.000 or cars or chartered public accountants, you name it. And 0:19:48.080 --> 0:19:51.400 you look to see which images the system identifies correctly 0:19:51.800 --> 0:19:55.080 and which ones it screws up, both images it says have 0:19:55.320 --> 0:19:58.400 cats in them that actually don't have cats in them, 0:19:58.840 --> 0:20:01.639 and images the system has identified as saying there is 0:20:01.680 --> 0:20:04.719 no cat here, but there is a cat there. This 0:20:04.800 --> 0:20:08.480 guides you into adjusting the weights again and again, and 0:20:08.560 --> 0:20:10.360 you start over and you do it again, and that's 0:20:10.400 --> 0:20:13.920 your basic deep learning system, and it gets better over 0:20:14.000 --> 0:20:17.560 time as you train it. It learns.
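The two kinds of mistakes mentioned above, and the overall error rate, can be tallied with a short Python sketch. The eight predictions and labels below are made-up values chosen so the error rate works out to one in four; the function name `tally` is hypothetical.

```python
def tally(predictions, labels):
    """Count correct answers and the two kinds of cat-detector mistakes."""
    counts = {"correct": 0, "false_positive": 0, "false_negative": 0}
    for predicted_cat, actual_cat in zip(predictions, labels):
        if predicted_cat == actual_cat:
            counts["correct"] += 1
        elif predicted_cat:
            counts["false_positive"] += 1   # said "cat", but there is no cat
        else:
            counts["false_negative"] += 1   # said "no cat", but there is one
    return counts

# Made-up results for eight images (True means "cat").
predictions = [True, True, False, False, True, False, True, False]
labels      = [True, False, False, True, True, False, True, False]

counts = tally(predictions, labels)
error_rate = 1 - counts["correct"] / len(labels)
print(counts, error_rate)
```

Here two of the eight answers are wrong, one of each kind, so the error rate is twenty five percent.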
Now, let's transition 0:20:17.600 --> 0:20:21.480 over to the adversarial systems I mentioned earlier, because they 0:20:21.520 --> 0:20:25.160 take this and twist it a little bit. So you've 0:20:25.200 --> 0:20:30.040 got two artificial neural networks and they are using this 0:20:30.160 --> 0:20:33.840 general approach to deep learning, and you're setting them up 0:20:34.160 --> 0:20:38.800 so that they feed into each other. One network, the generator, 0:20:39.320 --> 0:20:42.880 has the task to learn how to do something, such 0:20:42.920 --> 0:20:47.000 as create an eighteenth century style portrait based off lots 0:20:47.080 --> 0:20:50.400 and lots of examples of the real thing, the domain, 0:20:50.560 --> 0:20:55.800 the problem domain. The second network, the discriminator, has a 0:20:55.840 --> 0:20:59.639 different job. It has to tell the difference between authentic 0:20:59.720 --> 0:21:03.840 portraits that came from the problem domain and computer 0:21:04.080 --> 0:21:08.120 generated portraits that came from the generator itself. So essentially 0:21:08.200 --> 0:21:11.600 the discriminator is like the model I mentioned earlier that 0:21:11.680 --> 0:21:14.480 was identifying pictures of cats. It's doing the same sort 0:21:14.480 --> 0:21:17.000 of thing, except instead of saying cat or no cat, 0:21:17.080 --> 0:21:21.359 it's saying real portrait or computer generated portrait. So 0:21:21.400 --> 0:21:25.120 there are essentially two outcomes the discriminator could reach, and 0:21:25.240 --> 0:21:28.679 that's whether an image is computer generated or it isn't. 0:21:29.520 --> 0:21:31.720 So do you see where this is going? You train 0:21:31.880 --> 0:21:35.680 up both models. You have the generator attempt to make 0:21:35.720 --> 0:21:39.080 its own version of something, such as that eighteenth century portrait.
0:21:39.680 --> 0:21:42.680 It does so. It designs the portrait based on what 0:21:42.760 --> 0:21:46.440 the model believes are the key elements of a portrait, 0:21:47.160 --> 0:21:51.960 so things like colors, shapes, the ratio of size, like, 0:21:52.200 --> 0:21:54.480 you know, how large should the head be in relation 0:21:54.520 --> 0:21:58.080 to the body. All of these factors and many more 0:21:58.440 --> 0:22:03.200 come into play. The generator creates its own idea of 0:22:03.240 --> 0:22:06.080 what a portrait is supposed to look like, and chances 0:22:06.080 --> 0:22:09.800 are the early rounds of this will not be terribly convincing. 0:22:10.560 --> 0:22:14.560 The results are then fed to the discriminator, which tries 0:22:14.600 --> 0:22:17.520 to suss out which of the images fed to it 0:22:17.560 --> 0:22:20.879 are computer generated and which ones aren't. After that round, 0:22:21.280 --> 0:22:26.320 both models are tweaked. The generator adjusts input weights to 0:22:26.359 --> 0:22:29.880 get closer to the genuine article, and the discriminator adjusts 0:22:29.960 --> 0:22:34.720 weights to reduce false positives or to catch computer generated images. 0:22:34.960 --> 0:22:39.280 And then you go again and again and again and again, 0:22:39.560 --> 0:22:43.199 and they both get better over time. So, assuming everything 0:22:43.280 --> 0:22:46.560 is working properly, over time, the adjustment of input weights 0:22:46.600 --> 0:22:50.040 will lead to more convincing results, and given enough time 0:22:50.240 --> 0:22:53.200 and enough repetition, you'll end up with a computer generated 0:22:53.240 --> 0:22:55.639 painting that you can auction off for nearly half a 0:22:55.680 --> 0:22:59.920 million dollars. Though keep in mind that huge price 0:23:00.040 --> 0:23:02.760 dates back to the novelty of it being an early 0:23:02.960 --> 0:23:06.960 AI generated painting.
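The back and forth between generator and discriminator described above can be sketched as a toy Python loop. This is a deliberately tiny stand-in, not how portrait or video models are built: the "real thing" is just numbers drawn near 4.0, the generator is a two-parameter line, the discriminator is a single sigmoid node, and every name and setting (`train_toy_gan`, the learning rate, the step count) is an assumption for illustration. The sketch shows the alternating adjustments only and makes no claim about how well it converges.

```python
import math
import random

def sigmoid(value):
    # Numerically stable form so large negative values do not overflow.
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    e = math.exp(value)
    return e / (1.0 + e)

def train_toy_gan(steps=1000, learning_rate=0.02, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator: fake = a * noise + b
    w, c = 0.0, 0.0   # discriminator: p(real) = sigmoid(w * sample + c)
    for _ in range(steps):
        real = rng.gauss(4.0, 0.5)        # stand-in for a genuine example
        noise = rng.gauss(0.0, 1.0)
        fake = a * noise + b              # the generator's attempt

        # Discriminator round: raise p(real) on real data, lower it on fakes.
        p_real = sigmoid(w * real + c)
        p_fake = sigmoid(w * fake + c)
        grad_w = (p_real - 1.0) * real + p_fake * fake
        grad_c = (p_real - 1.0) + p_fake
        w -= learning_rate * grad_w
        c -= learning_rate * grad_c

        # Generator round: adjust a and b so the fake is rated as more real.
        p_fake = sigmoid(w * fake + c)
        grad_fake = (p_fake - 1.0) * w
        a -= learning_rate * grad_fake * noise
        b -= learning_rate * grad_fake
    return a, b, w, c

a, b, w, c = train_toy_gan()
print(f"generator: fake = {a:.3f} * noise + {b:.3f}")
print(f"discriminator: p(real) = sigmoid({w:.3f} * sample + {c:.3f})")
```

Each pass through the loop is one round of the contest: the discriminator is tweaked to catch the current fake, then the generator is tweaked to fool the updated discriminator.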
It would be shocking to me if 0:23:07.000 --> 0:23:10.320 we saw that actually become a trend. Also, the painting, 0:23:10.359 --> 0:23:13.800 while interesting, isn't exactly so astounding as to make you 0:23:13.840 --> 0:23:16.840 think there's no way a machine did that. You'd look 0:23:16.880 --> 0:23:19.240 at them and go, yeah, I can imagine a machine 0:23:19.240 --> 0:23:23.400 did that one. A group of computer scientists first described 0:23:23.520 --> 0:23:26.879 the generative adversarial network architecture in a paper in two 0:23:26.920 --> 0:23:30.560 thousand and fourteen, and like other neural networks, these models 0:23:30.600 --> 0:23:33.399 require a lot of data. The more the better. In fact, 0:23:33.480 --> 0:23:36.040 smaller data sets mean the models have to make some 0:23:36.119 --> 0:23:40.960 pretty big assumptions, and you tend to get pretty lousy results. 0:23:41.440 --> 0:23:45.160 More data, as in more examples, teaches the models more 0:23:45.200 --> 0:23:48.320 about the parameters of the domain, whatever it is they 0:23:48.320 --> 0:23:52.080 are trying to generate. It refines the approach. So if 0:23:52.119 --> 0:23:54.600 you have a sophisticated enough pair of models and you 0:23:54.640 --> 0:23:57.280 have enough data to fill up a domain, you can 0:23:57.359 --> 0:24:01.439 generate some convincing material. And that includes video, and 0:24:01.560 --> 0:24:05.080 this brings us around to deepfakes. And in addition 0:24:05.200 --> 0:24:09.679 to generative adversarial networks, a couple of other things really 0:24:10.200 --> 0:24:15.520 converged to create the techniques and trends and technology that 0:24:15.560 --> 0:24:22.160 would allow for deepfakes proper. In nineteen ninety seven, Malcolm Slaney, Michele Covell, 0:24:22.440 --> 0:24:26.360 and Christoph Bregler wrote some software that they called the 0:24:26.440 --> 0:24:30.960 Video Rewrite Program.
The software would analyze faces and then 0:24:31.040 --> 0:24:35.920 create or synthesize lip animation which could be matched to 0:24:36.320 --> 0:24:39.800 pre recorded audio. So you could take some film footage 0:24:40.280 --> 0:24:44.040 of a person and then reanimate their lips so that 0:24:44.080 --> 0:24:47.000 they could appear to say all sorts of things, which 0:24:47.000 --> 0:24:50.840 in some ways set the stage for deepfakes. In this case, 0:24:50.920 --> 0:24:53.840 it was really just focusing on the lips and the 0:24:53.880 --> 0:24:57.879 general area around the lips, so you weren't changing the 0:24:57.960 --> 0:25:00.920 rest of the expression of the face, and you would 0:25:00.960 --> 0:25:04.800 have to, you know, keep your recording to be about 0:25:04.880 --> 0:25:07.359 the same length as whatever the film clip was, or 0:25:07.400 --> 0:25:09.719 you would have to loop the film clip over and over, 0:25:09.800 --> 0:25:12.040 which would make it, you know, far more obvious that 0:25:12.160 --> 0:25:16.720 this was a fake. In addition, motion tracking technology was 0:25:16.760 --> 0:25:19.440 advancing over time too, and this also became an important 0:25:19.480 --> 0:25:22.800 tool in computer animation. This tool would also be used 0:25:23.119 --> 0:25:27.479