Speaker 1: Welcome to TechStuff, a production from iHeartRadio.

Speaker 1: Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio, and how the tech are you? I am currently on vacation celebrating my anniversary, and I didn't want to leave you without an episode. So the episode we're going to play for you was recorded and published on September seventh, twenty twenty. It is called Deep Learning and Deep Fakes, and recent developments in the deep fakes field include researchers creating tools that can detect tells in artificial voices, for example. But really, when you think about that, it's just a seesaw-like pattern where we'll see deep fake technology improve over time, and then our ability to detect deep fakes will improve, and this will keep going until one side or the other has the edge permanently. We kind of talk about that in this episode. In fact, deep fakes are very much in the spotlight right now, literally, on the popular TV series America's Got Talent: a team from the startup Metaphysic made it all the way to the final round of the competition by creating deep fake copies of the famous judges on the show, all in real time. It's equal parts entertaining and terrifying. Okay, maybe not quite equal parts. Anyway, enjoy this episode, Deep Learning and Deep Fakes. Now, before I get into today's episode, I want to give a little listener warning here. The topic at hand involves some adult content, including the use of technology to do stuff that can be unethical, illegal, hurtful, and just plain awful. Now, I think this is an important topic, but I wanted to give a bit of a heads up at this part of the episode, just in case any of you are listening to this podcast on, like, a family road trip or something. I think this is an important topic and I think everyone should know about it and think about it. But I also respect that for some people this subject might be a bit taboo.
Speaker 1: So let's go on with the episode. Back in nineteen ninety-three, a movie called Rising Sun, directed by Philip Kaufman, based on a Michael Crichton novel, and starring Wesley Snipes and Sean Connery, came out in theaters. Now, I didn't see it in theaters, but I did catch it when it came on, you know, HBO or Cinemax or something later on. The movie included a sequence that I found to be totally unbelievable. And I'm not talking about buying into Sean Connery being an expert on Japanese culture and business practices. Actually, side note, Sean Connery has an interesting history of playing unlikely characters, such as in Highlander, where he played an immortal who is supposedly Egyptian, who then lived in feudal Japan and ended up in Spain, where he became known as Ramirez, and all the while he's talking to a Scottish Highlander who's played by a Belgian actor. But I'm getting way off track here. Besides, I've heard Crichton actually wrote the character while thinking of Connery, so, you know, what the heck do I know? In the film, Snipes and Connery are investigators, and they're looking into a homicide that happened at a Japanese business but on American soil. The security system in the building captured video of the homicide, and the identity of the killer appears to be a pretty open and shut case. But that's not how it all turns out. The investigators talk to a security expert played by Tia Carrere, and she demonstrates in real time how video footage can be altered. She records a short video of Connery and Snipes, loads that onto a computer, freezes a frame of the video, and essentially performs a cut and paste job, swapping the heads of our two lead characters. Then she resumes the video and the head swap remains in place. And that head swap stuff is possible. I mean, clearly it has to be possible, because you actually do see that effect in the film itself. But it takes a bit more than a quick cut and paste job. We'll leave off of that for now.
Speaker 1: The whole point of that sequence, apart from showing off some cinema magic, is to demonstrate to the investigators that video, like photographs, can be altered. The expert has detected a blue halo around the face of the supposed murderer in the footage, indicating that some sort of trickery has happened. She also reveals that she cannot magically restore the video to its previous unaltered state, which I think was actually a nice change of pace for a movie. By the way, I think this movie is really, you know, not good, like not worth your time, but that's just my opinion. Anyway, for years this kind of video sorcery was pretty much limited to the film and TV industries. It usually required a lot of planning beforehand, so it wasn't as simple as just taking footage that was already shot and changing it in post on a whim with a couple of clicks of a button. If it were, we would see a lot fewer mistakes left in movies and television, because you could catch them later and just fix them. The tricks were possible, they were just difficult to pull off. It just wasn't something you or I would ever encounter in our day to day lives. But today we live in a different world, a world that has examples of synthetic media, commonly referred to as deep fakes. These are videos that have been altered or generated so that the subject of the video is doing something that they probably really would or could never do. They've brought into question whether or not video evidence is even reliable, much as the film Rising Sun was suggesting. We already know that eyewitness testimony is terribly unreliable. Our perception and memories play tricks on us, and we can, quote unquote, remember stuff that just didn't happen the way things actually unfolded in reality. But now we're looking at video evidence in potentially the same light. I mean, it's scary.
Speaker 1: So today we're going to learn about synthetic media, how it can be generated, the implications that follow from that sort of reality, and ways that people are trying to counteract a potentially dangerous threat. You know, fun stuff. Now, first, the term synthetic media has a particular meaning. It refers to art created through some sort of automated process, so it's a largely hands-off approach to creating the final art piece. Now, under that definition, the example of Rising Sun would not apply here, because we see in the film, and presumably this happens in the book as well, but I haven't read the book, that a human being actually makes the changes. People have used tools to alter the video footage. This would be more like using Photoshop to touch up a still image, with the computer system presumably doing some of the work in the background to keep things matched up. Either that, or you would need to alter each image in the footage frame by frame, or use some sort of matte approach. To learn more about mattes, you can listen to my episode about how blue and green screens work. Synthetic media as a general practice has been around for centuries. Artists have set up various contraptions to create works with little or no human guidance. In the twentieth century we started to see a movement called generative art take form. This type of art is all about creating a system that then creates or generates the finished art piece. That would mean that the finished work, such as a painting, wouldn't reflect the feelings or thoughts of the artist who created the system. In fact, it starts to raise the question: what is the art? Is it the painting that came about due to a machine following a program of some sort, or is the art the program itself? Is the art the process by which the painting was made? Now, I'm not here to answer that question. I just think it is an interesting question to ask.
Speaker 1: Sometimes people ask much less polite questions, such as: is it art at all? Some art critics went out of their way to dismiss generative art in the early days. They found it insulting, but hey, that's kind of the history of art in general. Each new movement in art inevitably finds both supporters and critics as it emerges. If anything, you might argue that such a response legitimizes the movement in, you know, a weird way. If people hate it, it must be something. In two thousand eighteen, an artist collective called Obvious, located out of Paris, France, submitted portrait-style paintings that were created not by an actual human painter, but by an artificially intelligent system. Now, they looked a lot like typical eighteenth century style portraits. There was no attempt to pass off the portrait as if it were actually made by a human artist. In fact, the appeal of the piece was largely due to it being synthetically generated. It went to auction at Christie's, and the AI-created painting fetched more than four hundred thousand dollars. And the way the group trained their AI is relevant to our discussion about deep fakes. The collective relied on a type of machine learning called generative adversarial networks, or GAN, which in turn depends on deep learning. So it looks like we've got a few things we're gonna have to define here. Now, I'm going to keep things fairly high level, because, as it turns out, there are a few different ways to create machine learning models, and to go through all of them in exhaustive detail would represent a university-level course in machine learning. I have neither the time for that nor the expertise. I would do a terrible job, so we'll go with a high-level perspective here. First, a generative adversarial network uses two systems. You have a generator and you have a discriminator. Both of these systems are a type of neural network. A neural network is a computing model that is inspired by the way our brains work.
Speaker 1: Our brains contain billions of neurons, and these neurons work together, communicating through electrical and chemical signals, controlling and coordinating pretty much everything in our bodies. With computers, the neurons are nodes. The job of a node is, you know, supposed to be kind of like that of a neuron cell in the brain. It's to take in multiple weighted input values and then generate a single output value. Now, the word weighted, W-E-I-G-H-T-E-D, weighted, is really important here, because the larger an input's weight, the more that input will have an effect on whatever the output is. So it kind of comes down to which inputs are the most important for that node's particular function. Now, if I were to make an analogy, I would say your boss hands you three tasks to do. One of those tasks has the label extremely important, the second task has the label critically important, and the third task has a label saying you should have finished that one before it was handed to you. Okay, so that's just some sort of snarky office humor that I needed to get off my chest. But more seriously, imagine a node accepting three inputs. In this example, input one has a fifty percent weight, input two has a forty percent weight, and input three has a ten percent weight. That adds up to one hundred percent, and that would tell you that the output that node generates will be most affected by input one, followed by input two, and then input three would have a smaller effect on whatever the output is. Each node applies a nonlinear transformation to the input values, again affected by each input's weight value, and that generates the output value. The details of that really are not important for our episode. It involves performing changes on variables that in turn change the correlation between variables, and it gets a bit mathy, and we would get lost in the weeds pretty quickly.
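To make that weighted-input idea concrete, here is a minimal sketch of a single node in Python. The fifty, forty, and ten percent weights mirror the example above; the sigmoid squashing function, the bias term, and the sample input values are illustrative choices, not anything specified in the episode.

```python
import math

def node_output(inputs, weights, bias=0.0):
    """One artificial 'neuron': take the weighted sum of the inputs, then squash it."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The sigmoid is one common nonlinear transformation; it keeps the output between 0 and 1.
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Three inputs with weights of 0.5, 0.4, and 0.1, so input one affects the output the most.
print(node_output([1.0, 0.2, 0.9], [0.5, 0.4, 0.1]))
```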
Speaker 1: The important thing to remember is that a node within a neural network takes in a weighted sum of inputs, then performs a process on those inputs before passing the result on as an output. Then some other node a layer down will accept that output, along with outputs from a couple of other nodes one layer up, and will then perform an operation based on those weighted inputs and pass that on to the next layer, and so on. So these nodes are in layers, like, you know, a cake. One layer of nodes processes some inputs and sends the results on to the next layer of nodes, and then that one passes its results on to the next one, and the next one, and so on. This isn't a new idea. Computer scientists began theorizing and experimenting with neural network approaches as far back as the nineteen fifties with the perceptron, which was a hypothetical system that was described by Frank Rosenblatt of Cornell University. But it wasn't until the last decade that computing power and our ability to handle a lot of data reached a point where these sorts of learning models could really take off. The goal of this system is to train it to perform a particular task within a certain level of precision. The weights I mentioned are adjustable, so you can think of it as teaching a system which bits are the most important in order to do whatever it is the system is supposed to do to achieve your task. These are the bits that are the most important and therefore should matter the most when you weigh a decision. This is a bit easier if we talk about a similar system: the version of IBM's Watson that played on Jeopardy. That system famously was not connected to the Internet. It had to rely on all the information that was stored within itself. When the system encountered a clue in Jeopardy, it would analyze the clue, and then it would reference its database to look for possible answers to whatever that clue was.
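Here is a rough Python sketch of that cake-like layering: one layer of nodes processes the inputs and hands its outputs to the next layer, and so on down to a final output. The layer sizes and the weight values are invented purely for illustration, and the node function reuses the same weighted-sum-plus-squash idea from the earlier sketch.

```python
import math

def node(inputs, weights, bias=0.0):
    """Weighted sum of inputs, squashed by a sigmoid."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))

def forward(inputs, layers):
    """Push values through layers of nodes: each layer's outputs feed the next layer."""
    activations = inputs
    for layer in layers:  # a layer is just a list of (weights, bias) pairs, one per node
        activations = [node(activations, w, b) for w, b in layer]
    return activations

# Two inputs, a hidden layer of two nodes, then a single output node at the bottom.
layers = [
    [([0.9, 0.1], 0.0), ([0.3, 0.7], 0.0)],
    [([0.6, 0.4], 0.0)],
]
print(forward([1.0, 0.5], layers))  # the final layer's output, a value between 0 and 1
```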
Speaker 1: The system would weigh those possible answers and attempt to determine which, if any, were the most likely to be correct. If the certainty was over a certain threshold, the system would buzz in with its answer. If no response rose above that threshold, the system would not buzz in. So you could say that Watson was playing the game with a best guess sort of approach. Neural networks do essentially that sort of processing. With this particular type of approach, we know what we want the outcome to be, so we can judge whether or not the system was successful. After each attempt, we can adjust the weights on the inputs between nodes to refine the decision making process and get more accurate results. If the system succeeds in its task, we can increase the weights that contributed to the system picking the correct answer and decrease the weights of the inputs that did not contribute to the successful response. If the system done messed up and gave the wrong answer, then we do the opposite. We look at the inputs that contributed to the wrong answer, we diminish their weights, and we increase the weights of the other inputs, and then we run the test again. A lot. I'll explain a bit more about this process when we come back, but first let's take a quick break. Early in the history of neural networks, computer scientists were hitting some pretty hard stops due to the limitations of computing power at the time. Early networks were only a couple of layers deep, which really meant they weren't terribly powerful, and they could only tackle rudimentary tasks, like figuring out whether or not a square is drawn on a piece of paper. That isn't terribly sophisticated. In nineteen eighty-six, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper titled Learning Representations by Back-propagating Errors. This was a big breakthrough for deep learning. It all has to do with a deep learning system improving its ability to complete a specific task.
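Before the back-propagation idea gets unpacked in the next bit, here is a very small sketch of the simpler adjust-the-weights-after-each-attempt loop described before the break, using a classic perceptron-style update rule as a stand-in. One simplification worth flagging: in this particular rule the weights only move when the answer was wrong; they are left alone on a correct answer. The toy data, the learning rate, and the threshold are all arbitrary illustrative values.

```python
def predict(inputs, weights, threshold=0.5):
    """Answer 1 if the weighted sum of the inputs clears the threshold, else 0."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) > threshold else 0

def train(examples, weights, learning_rate=0.1, rounds=20):
    for _ in range(rounds):
        for inputs, target in examples:
            error = target - predict(inputs, weights)  # 0 if right, +1 or -1 if wrong
            # Wrong answer: nudge each weight up or down in proportion to how much
            # that input contributed. Right answer: error is 0, so nothing changes.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    return weights

# Toy task: the "correct answer" is simply whether the first input is switched on.
examples = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]
print(train(examples, weights=[0.0, 0.0]))
```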
Speaker 1: And basically the algorithm's job is to go from the output layer, you know, where the system has made a decision, and then work backward through the neural network, adjusting the weights that led to an incorrect decision. So let's say it's a system that is looking to figure out whether or not a cat is in a photograph, and it says there's a cat in this picture, and you look at the picture and there is no cat there. Then you would look at the inputs one level back, just before the system said here's a picture of a cat, and you'd say, all right, which of these inputs led the system to believe this was a picture of a cat? And then you would adjust those. Then you would go back one layer up, so you're working your way up the model, and say which inputs here led to it giving the outputs that led to the mistake, and you do this all the way up until you get to the input level at the top of the computer model. You are back propagating, and then you run the test again to see if you've got improvement. It's exhaustive, but it also drastically improved neural network performance, much faster than just throwing more brute force at it. The algorithm essentially is checking to see if a small change in each input value received by a layer of nodes would have led to a more accurate result. So it's all about going from that output and working your way backward. In two thousand twelve, Alex Krizhevsky published a paper that gave us the next big breakthrough. He argued that a really deep neural network with a lot of layers could give really great results if you paired it with enough data to train the system. So you needed to throw lots of data at these models, and it needed to be an enormous amount of data. However, once trained, the system would produce lower error rates. So yeah, it would take a long time, but you would get better results.
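For anyone who wants to see the backward pass in code, here is a toy example in Python with NumPy: a tiny two-layer network trained on a made-up cat-or-no-cat style label, where the error measured at the output is walked back through the layers to nudge the weights. The network size, the fabricated data, the learning rate, and the number of steps are all arbitrary choices for the sketch, not anything from the papers mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((200, 3))                                    # 200 examples, 3 input features each
y = (X[:, 0] + X[:, 1] > 1.0).astype(float).reshape(-1, 1)  # toy "cat or no cat" label

W1 = rng.normal(scale=0.5, size=(3, 4)); b1 = np.zeros(4)   # hidden layer weights
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)   # output layer weights

learning_rate = 1.0
for step in range(2000):
    # Forward pass: inputs to hidden layer to a single "is it a cat?" score.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: start with the error at the output, then walk it back a layer.
    d_output = (output - y) * output * (1 - output)          # error signal at the output layer
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)     # how much each hidden node contributed

    # Nudge every weight in the direction that shrinks the error.
    W2 -= learning_rate * (hidden.T @ d_output) / len(X)
    b2 -= learning_rate * d_output.mean(axis=0)
    W1 -= learning_rate * (X.T @ d_hidden) / len(X)
    b1 -= learning_rate * d_hidden.mean(axis=0)

print("wrong answers:", int(((output > 0.5) != y).sum()), "out of", len(X))
```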
Speaker 1: Now, at the time, a good error rate for such a system was twenty-five percent. That means one out of four conclusions the system would come to would be wrong. If you ran it across a long enough number of decisions, you would find that one out of every four wasn't right. The system that Alex's team worked on produced results that had an error rate of sixteen percent, so much lower. And then in just five years, with more improvements to this process, the classification error rate had dropped down to two point three percent for deep learning systems. So from twenty-five percent all the way down to two point three percent. It was really powerful stuff. Okay, so you've got your artificial neural network. You've got your layers and layers of nodes. You've adjusted the weights of the inputs into each node to see if your system can identify, you know, pictures of cats, and you start feeding images to this system, lots of them. This is the domain that you are feeding to your system. The more images you can feed to it, the better. And you want a wide variety of images of all sorts of stuff, not just of different types of cats, but stuff that most certainly is not a cat, like dogs, or cars, or chartered public accountants. You name it. And you look to see which images the system identifies correctly and which ones it screws up, both images it says have cats in them that actually don't have cats in them, and images the system has identified as having no cat when there is a cat there. This guides you in adjusting the weights again and again, and you start over and you do it again, and that's your basic deep learning system, and it gets better over time as you train it. It learns. Now, let's transition over to the adversarial systems I mentioned earlier, because they take this and twist it a little bit. So you've got two artificial neural networks and they are using this general approach to deep learning, and you're setting them up so that they feed into each other.
Speaker 1: One network, the generator, has the task of learning how to do something, such as create an eighteenth century style portrait, based off lots and lots of examples of the real thing, the domain, or problem domain. The second network, the discriminator, has a different job. It has to tell the difference between authentic portraits that came from the problem domain and computer generated portraits that came from the generator itself. So essentially, the discriminator is like the model I mentioned earlier that was identifying pictures of cats. It's doing the same sort of thing, except instead of saying cat or no cat, it's saying real portrait or computer generated portrait. So there are essentially two outcomes the discriminator could reach, and that's whether an image is computer generated or it isn't. So do you see where this is going? You train up both models. You have the generator attempt to make its own version of something, such as that eighteenth century portrait. It does so, it designs the portrait, based on what the model believes are the key elements of a portrait, so things like colors, shapes, the ratios of sizes, like, you know, how large should the head be in relation to the body. All of these factors and many more come into play. The generator creates its own idea of what a portrait is supposed to look like, and chances are the early rounds of this will not be terribly convincing. The results are then fed to the discriminator, which tries to suss out which of the images fed to it are computer generated and which ones aren't. After that round, both models are tweaked. The generator adjusts input weights to get closer to the genuine article, and the discriminator adjusts weights to reduce false positives, or to catch more of the computer generated images. And then you go again and again and again and again, and they both get better over time.
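Here is a compact sketch of that back-and-forth in Python using PyTorch, on a deliberately tiny problem: instead of portraits, the "real thing" is just numbers drawn from one bell curve, which keeps the example short. The network sizes, learning rates, and round counts are arbitrary, and this is not a reconstruction of how the Obvious collective actually trained their system; it only shows the discriminator round and the generator round taking turns.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
# Generator: turns random noise into a candidate "fake" sample.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
# Discriminator: outputs a probability that a sample is real rather than generated.
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_G = optim.Adam(G.parameters(), lr=1e-3)
opt_D = optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(3000):
    real = torch.randn(64, 1) * 2 + 5      # the "problem domain": a bell curve around 5
    fake = G(torch.randn(64, 8))           # the generator's current attempt

    # Discriminator round: label real samples 1 and generated samples 0, then adjust D.
    d_loss = loss_fn(D(real), torch.ones(64, 1)) + loss_fn(D(fake.detach()), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator round: try to make the discriminator call the fakes real, then adjust G.
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

samples = G(torch.randn(1000, 8))
print("generated mean:", samples.mean().item())  # should drift toward the real mean of 5
```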
Speaker 1: So, assuming everything is working properly, over time the adjustment of input weights will lead to more convincing results, and given enough time and enough repetition, you'll end up with a computer generated painting that you can auction off for nearly half a million dollars. Though keep in mind that huge price relates back to the novelty of it being an early AI generated painting. It would be shocking to me if we saw that actually become a trend. Also, the painting, while interesting, isn't exactly so astounding as to make you think there's no way a machine did that. You'd look at it and go, yeah, I can imagine a machine did that one. A group of computer scientists first described the generative adversarial network architecture in a paper in two thousand fourteen, and like other neural networks, these models require a lot of data. The more the better. In fact, smaller data sets mean the models have to make some pretty big assumptions, and you tend to get pretty lousy results. More data, as in more examples, teaches the models more about the parameters of the domain, whatever it is they are trying to generate. It refines the approach. So if you have a sophisticated enough pair of models and you have enough data to fill up a domain, you can generate some convincing material, and that includes video. And this brings us around to deep fakes. In addition to generative adversarial networks, a couple of other things really converged to create the techniques and trends and technology that would allow for deep fakes proper. In nineteen ninety-seven, Malcolm Slaney, Michele Covell, and Christoph Bregler wrote some software that they called the Video Rewrite program. The software would analyze faces and then create or synthesize lip animation, which could be matched to pre-recorded audio. So you could take some film footage of a person and reanimate their lips so that they could appear to say all sorts of things, which in some ways set the stage for deep fakes.
Speaker 1: In this case, it was really just focusing on the lips and the general area around the lips, so you weren't changing the rest of the expression of the face, and you would have to, you know, keep your recording about the same length as whatever the film clip was, or you would have to loop the film clip over and over, which would make it, you know, far more obvious that this was a fake. In addition, motion tracking technology was advancing over time too, and this also became an important tool in computer animation. This tool would also be used by deep fake algorithms to create facial expressions, manipulating the digital image just as it would if it were a video game character or a Pixar animated character. Typically, you need to start with some existing video in order to manipulate it. You're not actually computer generating the animation. Like, you're not creating a computer generated version of whomever it is you're doing the fake of. You're using existing imagery and then manipulating that existing imagery, so it's a little different from computer animation. In two thousand sixteen, students and faculty at the Technical University of Munich created the Face2Face project, that's face, the numeral two, and then face, and this was particularly jaw dropping to me at the time. When I first saw these videos, I was floored. They created a system that had a target actor. This would be the video of the person that you want to manipulate. In the example they used, it was former US President George W. Bush. Their process also had a source actor. This was the source of the expressions and facial movements you would see in the target. So it's kind of like a digital puppeteer in a way. But the way they did it was really cool.
Speaker 1: They had a camera trained on the source actor, and it would track specific points of movement on the source actor's face, and then the system would manipulate the same points of movement on the target actor's face in the video. So if the source actor smiled, then the target smiled. The source actor would smile, and then you would see George W. Bush in the video smile in real time. It was really strange. They used this looping video of George W. Bush wearing a neutral expression. They had to start with that as their sort of zero point, and I gotta tell you, it really does look like former president George W. Bush is having a bit of a freak out on a looping video, because he keeps opening his mouth, closing his mouth, grimacing, raising his eyebrows. You need to watch this video. It is still available online to check out. In twenty seventeen, students and faculty over at the University of Washington created the Synthesizing Obama project, in which they trained a computer model to generate a synthetic video of former US President Barack Obama, and they made it lip sync to a pre-recorded audio clip from one of Obama's addresses to the nation. They actually had the original video of that address for comparison, so they could look back at that and see how their generated one compared to the real thing. Their approach used a model that analyzed hundreds of hours of video footage of Obama speaking, and it mapped specific mouth shapes to specific sounds. It would also include some of Obama's mannerisms, such as how he moves his head when he talks or uses facial expressions to emphasize words. And watching the real one next to the generated one is pretty strange. You can tell the generated one isn't quite right. It's not matching the audio exactly, at least not in the early versions, but it's fairly close, and it might even pass casual inspection for a lot of people who weren't, like, you know, actually paying attention.
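To give a flavor of that point-tracking idea, here is a tiny illustrative Python snippet: measure how far each tracked point on the source actor's face has moved from its neutral position, then apply that same movement to the matching points on the target's neutral face. Every coordinate and the scale factor here are invented, and a real system like Face2Face does far more sophisticated modeling and rendering than this; it is only meant to show the transfer step.

```python
import numpy as np

# Made-up landmark positions (x, y) for a few tracked points, e.g. mouth corners and chin.
source_neutral = np.array([[100.0, 200.0], [140.0, 200.0], [120.0, 230.0]])  # source at rest
source_current = np.array([[ 97.0, 195.0], [143.0, 195.0], [120.0, 232.0]])  # source smiling
target_neutral = np.array([[310.0, 410.0], [360.0, 410.0], [335.0, 450.0]])  # target at rest

# How far each tracked point moved on the source actor's face this frame.
displacement = source_current - source_neutral

# Apply the same movement to the target's points, crudely scaled for a bigger face.
scale = 1.25
target_current = target_neutral + scale * displacement
print(target_current)
```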
478 00:30:20,920 --> 00:30:26,120 Speaker 1: Authors Morass and Alexandro defined deep fakes as quote the 479 00:30:26,160 --> 00:30:31,480 Speaker 1: product of artificial intelligence applications that merge, combine, replace, and 480 00:30:31,600 --> 00:30:35,719 Speaker 1: superimpose images and video clips to create fake videos that 481 00:30:35,760 --> 00:30:41,280 Speaker 1: appear authentic end quote. They first emerged in seventeen and 482 00:30:41,360 --> 00:30:45,040 Speaker 1: so this is a pretty darn young application of technology. 483 00:30:45,680 --> 00:30:48,880 Speaker 1: One thing that is worrisome is that once someone has 484 00:30:48,920 --> 00:30:52,640 Speaker 1: access to the tools, it's not that difficult to create 485 00:30:52,720 --> 00:30:55,760 Speaker 1: a deep fake video. You pretty much just need a 486 00:30:55,800 --> 00:30:59,560 Speaker 1: decent computer, the tools, a bit of know how on 487 00:30:59,640 --> 00:31:02,840 Speaker 1: how to do it, and some time you also need 488 00:31:03,000 --> 00:31:06,720 Speaker 1: some reference material, as in like videos and images of 489 00:31:06,760 --> 00:31:10,560 Speaker 1: the person that you are replicating, and like the machine 490 00:31:10,640 --> 00:31:13,960 Speaker 1: learning systems I've mentioned, the more reference material you have, 491 00:31:14,200 --> 00:31:17,480 Speaker 1: the better. That's why the deep fakes you encounter these 492 00:31:17,560 --> 00:31:21,560 Speaker 1: days tend to be of notable famous people like celebrities 493 00:31:21,560 --> 00:31:25,560 Speaker 1: and politicians. Mainly there's no shortage of reference material for 494 00:31:25,600 --> 00:31:28,960 Speaker 1: those types of individuals, and so they are easier to 495 00:31:29,000 --> 00:31:32,360 Speaker 1: replicate with deep fakes than someone who maintains a much 496 00:31:32,560 --> 00:31:35,520 Speaker 1: lower profile. Not to say that that will always be 497 00:31:35,600 --> 00:31:38,160 Speaker 1: the case, or that there aren't systems out there that 498 00:31:38,240 --> 00:31:43,680 Speaker 1: can accept smaller amounts of reference material. It's just harder 499 00:31:43,720 --> 00:31:50,200 Speaker 1: to make a convincing version with fewer samples. But in 500 00:31:50,320 --> 00:31:53,760 Speaker 1: order to make a convincing fake, the system really has 501 00:31:53,800 --> 00:31:57,920 Speaker 1: to learn how a person moves. All those facial expressions matter. 502 00:31:58,160 --> 00:32:01,200 Speaker 1: It also has to learn how a person sounds. Will 503 00:32:01,240 --> 00:32:07,240 Speaker 1: get into sound a little bit later. But mannerisms, inflection, accent, emphasis, cadence, 504 00:32:07,360 --> 00:32:09,920 Speaker 1: quirks and ticks, all of these things have to be 505 00:32:09,960 --> 00:32:13,760 Speaker 1: analyzed and replicated to make a convincing fake, and it 506 00:32:13,800 --> 00:32:16,120 Speaker 1: has to be done just right or else it comes 507 00:32:16,120 --> 00:32:20,960 Speaker 1: off as creepy or unrealistic. Think about how impressionists will 508 00:32:21,000 --> 00:32:24,600 Speaker 1: take a celebrity's manner of speech and then heighten some 509 00:32:24,720 --> 00:32:28,200 Speaker 1: of it in comedic effect. 
Speaker 1: You'll hear it all the time with folks who do impressions of people like Jack Nicholson or Christopher Walken or Barbra Streisand, people who have a very particular way of speaking. Impressionists will take those as markers and they really punch in on them. Well, a deep fake can't really do that too much, or else it won't come across as genuine. It'll feel like you're watching a famous person impersonating themselves, which is weird. Now, the earliest mention of deep fakes I can find dates to a two thousand seventeen Reddit forum in which users shared deep faked videos that appeared to show female celebrities in sexual situations. Heads and faces had been replaced, and the actors in pornographic movies had their heads or faces swapped out for these various celebrities. Now, the fakes can look fairly convincing, extremely convincing in some cases, which can lead to some people assuming that the videos are genuine and that the folks that they saw in the videos are really the ones who were in them. And obviously that's a real problem, right? I mean, with this technology, given enough reference data to feed a system, someone could fabricate a video that appears to put a person in a compromising position, whether it's a sexual act, or making damaging statements, or committing a crime, or whatever. And there are tools right now that allow you to do pretty much what the Face2Face tool was doing back in two thousand sixteen, a program called Avatarify, which is just not that easy to say. Anyway, it can run on top of live streaming conference services like Zoom and Skype, and you can swap out your face for a celebrity's face. Your facial expressions map to the computer manipulated celebrity face. It just looks at you through your webcam, and then if you smile, the celebrity image smiles, et cetera. It's like that old Face2Face program. It does need a pretty beefy PC to manage doing all this, because you're also running that live streaming service underneath it.
Speaker 1: It's also not exactly user friendly. You need some programming experience to really get it to work. But it is widely accessible, as the source code is open source and it's on GitHub, so anyone can get it. Samantha Cole, who writes for Vice, has covered the topic of deep fakes pretty extensively, and the potential harm they can cause, and I recommend you check out her work if you're interested in learning more about that. Do be warned that Cole covers some pretty adult themed topics. I think she does great work and very important work, but as a guy who grew up in the Deep South, it's also the kind of stuff that occasionally makes me clutch my pearls. But that's more of a statement about me than her work. She does great work. I think most of us can imagine plenty of scenarios in which this sort of technology could cause mischief on a good day and catastrophe on a bad day, whether it's spreading misinformation, creating fear, uncertainty, and doubt, FUD, or making people seem to say things they never actually said, or contributing to an ugly subculture in which people try to make their more base fantasies a reality by putting one person's head on another person's body. You know, it's not great. There are legitimate uses of the technology too, of course. You know, tech itself is rarely good or bad. It's all in how we use it. But this particular technology has a lot of potentially harmful uses, and Samantha Cole has done a great job explaining them. When we come back, I'll talk a bit more about the war against deep fakes and how people are trying to prepare for a world that is increasingly filled with media we can't really trust. But first let's take a quick break.
Before the break, 575 00:36:33,680 --> 00:36:37,680 Speaker 1: I mentioned Samantha Cole, who has written extensively about deep fakes, 576 00:36:37,719 --> 00:36:40,480 Speaker 1: and one point she makes that I think is important 577 00:36:40,520 --> 00:36:44,880 Speaker 1: for us to note is that the vast majority of 578 00:36:44,960 --> 00:36:49,600 Speaker 1: instances of deep fake videos haven't been some manufactured video 579 00:36:49,640 --> 00:36:53,960 Speaker 1: of a political leader saying inflammatory things. That continues to 580 00:36:53,960 --> 00:36:57,480 Speaker 1: be a big concern. There's a genuine fear that someone 581 00:36:57,560 --> 00:37:01,040 Speaker 1: is going to manufacture a video in which a politician 582 00:37:01,080 --> 00:37:04,359 Speaker 1: appears to say or do something truly terrible in an 583 00:37:04,360 --> 00:37:08,560 Speaker 1: effort to either discredit the politician or perhaps instigate a 584 00:37:08,680 --> 00:37:13,600 Speaker 1: conflict with some other group. There are literal doomsday scenarios 585 00:37:13,600 --> 00:37:18,440 Speaker 1: in which such a video would prompt a massive military response, 586 00:37:18,719 --> 00:37:21,320 Speaker 1: though that does seem like it might be a little 587 00:37:21,440 --> 00:37:24,239 Speaker 1: far-fetched. Then again, heck, I don't know, considering the world 588 00:37:24,239 --> 00:37:26,040 Speaker 1: we live in, maybe it's not that big of a 589 00:37:26,080 --> 00:37:30,640 Speaker 1: stretch. Anyway, Cole's point is that so far, that has 590 00:37:30,800 --> 00:37:34,239 Speaker 1: not happened. She points out that the most frequent use 591 00:37:34,400 --> 00:37:37,160 Speaker 1: for the tech tends to be either people goofing around 592 00:37:37,320 --> 00:37:41,040 Speaker 1: or, disturbingly, using it to, in her words, quote, take 593 00:37:41,080 --> 00:37:45,240 Speaker 1: ownership of women's bodies in non-consensual porn, end quote. 594 00:37:45,560 --> 00:37:48,759 Speaker 1: Cole argues that the reason we haven't really seen deep 595 00:37:48,760 --> 00:37:52,000 Speaker 1: fakes used much outside of these realms, apart from a 596 00:37:52,040 --> 00:37:56,040 Speaker 1: few advertising campaigns, is that people are pretty good at 597 00:37:56,120 --> 00:37:59,879 Speaker 1: spotting deep fakes. They aren't quite at a level where 598 00:38:00,000 --> 00:38:03,040 Speaker 1: they can easily pass for the real thing. There's still 599 00:38:03,080 --> 00:38:06,399 Speaker 1: something slightly off about them. They tend to butt up 600 00:38:06,440 --> 00:38:09,440 Speaker 1: against the uncanny valley. Now, for those of you not 601 00:38:09,600 --> 00:38:13,520 Speaker 1: familiar with that term, the uncanny valley describes the feeling 602 00:38:13,719 --> 00:38:17,000 Speaker 1: we humans get when we encounter a robot or a 603 00:38:17,040 --> 00:38:23,640 Speaker 1: computer-generated figure that closely resembles a human or human behavior, 604 00:38:24,239 --> 00:38:27,760 Speaker 1: but you can still tell it's not actually a person, 605 00:38:28,040 --> 00:38:30,200 Speaker 1: and it's not a good feeling. It tends to be 606 00:38:30,239 --> 00:38:34,120 Speaker 1: described as repulsive and disturbing, or at the very best, 607 00:38:34,640 --> 00:38:39,879 Speaker 1: off-putting. See also the animated film The Polar Express.
There's 608 00:38:39,920 --> 00:38:43,399 Speaker 1: a reason that when that film came out, people kind 609 00:38:43,440 --> 00:38:47,440 Speaker 1: of reacted negatively to the animation, and it's also a 610 00:38:47,480 --> 00:38:51,200 Speaker 1: reason why Pixar tends to prefer to go with stylized 611 00:38:51,280 --> 00:38:54,560 Speaker 1: human characters who are different enough from the way real 612 00:38:54,680 --> 00:38:58,040 Speaker 1: humans look to kind of bypass the uncanny valley. We just 613 00:38:58,120 --> 00:39:00,680 Speaker 1: think of that as a cartoon, not as something that's trying to 614 00:39:00,760 --> 00:39:04,280 Speaker 1: pass itself off as being human. But while there hasn't 615 00:39:04,320 --> 00:39:06,800 Speaker 1: really been a flood of fake videos hitting the Internet 616 00:39:06,920 --> 00:39:11,200 Speaker 1: with the intent to discredit politicians or infuriate specific people 617 00:39:11,320 --> 00:39:14,720 Speaker 1: or whatever, there remains a general sense that this is coming. 618 00:39:15,239 --> 00:39:18,480 Speaker 1: It's just not here now. The sense I get is 619 00:39:18,480 --> 00:39:21,840 Speaker 1: that people feel it's an inevitability, and there are already 620 00:39:21,880 --> 00:39:24,480 Speaker 1: folks working on tools that will help us sort out 621 00:39:24,480 --> 00:39:29,000 Speaker 1: the real stuff from the fakes. Take Microsoft, for example. 622 00:39:29,520 --> 00:39:34,240 Speaker 1: Their R and D division, fittingly called Microsoft Research, developed 623 00:39:34,239 --> 00:39:38,600 Speaker 1: a tool they called the Video Authenticator. This tool analyzes 624 00:39:38,760 --> 00:39:42,960 Speaker 1: video samples and looks for signs of deep fakery. In 625 00:39:43,000 --> 00:39:45,800 Speaker 1: a blog post written by Tom Burt and Eric Horvitz, 626 00:39:45,840 --> 00:39:50,520 Speaker 1: two Microsoft executives, they say, quote, it works by detecting 627 00:39:50,560 --> 00:39:54,160 Speaker 1: the blending boundary of the deep fake and subtle fading 628 00:39:54,280 --> 00:39:57,120 Speaker 1: or grayscale elements that might not be detectable by 629 00:39:57,120 --> 00:40:01,279 Speaker 1: the human eye. End quote. Now I'm no expert, but 630 00:40:01,480 --> 00:40:05,560 Speaker 1: to me, it sounds like the Video Authenticator is working 631 00:40:05,560 --> 00:40:09,720 Speaker 1: in a way that's not too dissimilar to a discriminator 632 00:40:10,040 --> 00:40:14,240 Speaker 1: in a generative adversarial network. I mean, the whole purpose 633 00:40:14,560 --> 00:40:18,000 Speaker 1: of the discriminator is to discriminate, or to tell the 634 00:40:18,040 --> 00:40:23,319 Speaker 1: difference between genuine, unaltered videos and computer-generated ones. So 635 00:40:23,520 --> 00:40:27,200 Speaker 1: the Video Authenticator is looking for telltale signs that a 636 00:40:27,320 --> 00:40:32,720 Speaker 1: video was not produced through traditional means but was computer generated. However, 637 00:40:32,760 --> 00:40:36,040 Speaker 1: that's the very thing that the generators in GAN 638 00:40:36,239 --> 00:40:39,080 Speaker 1: systems are looking out for.
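Since this whole back-and-forth rests on the generator-versus-discriminator idea, here is a minimal, illustrative sketch of one GAN training step, assuming PyTorch and two deliberately tiny stand-in networks. It is not Microsoft's tool or the architecture of any real deepfake system; it only shows the bare adversarial pattern: the discriminator learns to flag generated samples as fake, and the generator is then updated using the discriminator's verdict so its next fakes are harder to flag.

import torch
import torch.nn as nn

# Tiny stand-ins: a real deepfake system would use large image or video networks.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())  # generator
D = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1))              # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

def training_step(real_batch):
    n = real_batch.size(0)

    # 1) Discriminator update: real frames should score "real" (1),
    #    generated frames should score "fake" (0).
    fake = G(torch.randn(n, 64)).detach()  # detach so this pass doesn't update the generator
    d_loss = loss_fn(D(real_batch), torch.ones(n, 1)) + loss_fn(D(fake), torch.zeros(n, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator update: the generator is rewarded when the discriminator
    #    labels its output "real", so its weights shift toward whatever
    #    slips past the current detector.
    fake = G(torch.randn(n, 64))
    g_loss = loss_fn(D(fake), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy usage: random tensors stand in for flattened "real" video frames.
real_batch = torch.rand(32, 784) * 2 - 1
print(training_step(real_batch))

Keep that second, generator-side update in mind, because it is exactly the feedback loop described next.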
So when a generator 639 00:40:39,120 --> 00:40:43,760 Speaker 1: receives feedback that a video it generated did not slip 640 00:40:43,800 --> 00:40:47,960 Speaker 1: past the discriminator, it then tweaks those input weights and 641 00:40:48,080 --> 00:40:51,800 Speaker 1: starts to shift its approach in order to bypass whatever 642 00:40:51,840 --> 00:40:54,840 Speaker 1: it was that gave away its last attempt, and it 643 00:40:54,920 --> 00:40:59,440 Speaker 1: does this again and again. So the Video Authenticator might 644 00:40:59,480 --> 00:41:02,719 Speaker 1: work well for a given amount of time, but I 645 00:41:02,719 --> 00:41:05,319 Speaker 1: would suspect that in the long run, the deep fake 646 00:41:05,400 --> 00:41:10,440 Speaker 1: systems will become sophisticated enough to fool the authenticator. Of course, 647 00:41:10,960 --> 00:41:14,960 Speaker 1: Microsoft will continue to tweak the authenticator as well, and 648 00:41:15,040 --> 00:41:17,919 Speaker 1: it will become something of a seesaw battle as one 649 00:41:18,000 --> 00:41:22,040 Speaker 1: side outperforms the other temporarily, and then the balance will shift. 650 00:41:22,440 --> 00:41:24,719 Speaker 1: Though there may come a time where either the deep 651 00:41:24,760 --> 00:41:27,680 Speaker 1: fakes are too good and they don't set off any 652 00:41:27,719 --> 00:41:34,239 Speaker 1: alarms from the discriminator, or the discriminator gets so sensitive 653 00:41:34,640 --> 00:41:37,759 Speaker 1: that it starts to flag real videos and hits a 654 00:41:37,840 --> 00:41:41,640 Speaker 1: lot of false positives and calls them generated videos instead. 655 00:41:42,040 --> 00:41:44,719 Speaker 1: Either way, you reach a point where a tool like 656 00:41:44,760 --> 00:41:47,600 Speaker 1: this no longer really serves a useful purpose, and the 657 00:41:47,680 --> 00:41:51,239 Speaker 1: Video Authenticator will be obsolete. Now, this is something we 658 00:41:51,280 --> 00:41:54,680 Speaker 1: see in artificial intelligence all the time. If you remember 659 00:41:54,719 --> 00:41:57,760 Speaker 1: the good old days of CAPTCHA, you know, the proving 660 00:41:57,840 --> 00:42:00,399 Speaker 1: you're not a robot stuff. The stuff we were 661 00:42:00,400 --> 00:42:03,759 Speaker 1: told to do was typically to type in a series of 662 00:42:03,920 --> 00:42:06,680 Speaker 1: letters and numbers, and it wasn't that hard, 663 00:42:06,760 --> 00:42:10,320 Speaker 1: at least not at first. That's because the 664 00:42:10,560 --> 00:42:14,600 Speaker 1: text recognition algorithms of the time weren't very good. They 665 00:42:14,640 --> 00:42:19,480 Speaker 1: couldn't decipher mildly deformed text because the shapes of the 666 00:42:19,520 --> 00:42:22,920 Speaker 1: text fell too far outside the parameters of what the 667 00:42:22,960 --> 00:42:26,759 Speaker 1: system could recognize as a legitimate letter or number. You 668 00:42:26,800 --> 00:42:30,120 Speaker 1: make the number a little, you know, deformed, and then 669 00:42:30,160 --> 00:42:32,279 Speaker 1: suddenly the system's like, well, that doesn't look like a 670 00:42:32,360 --> 00:42:34,920 Speaker 1: three to me, because it's not in the shape of 671 00:42:34,920 --> 00:42:39,319 Speaker 1: a three.
But over time, people developed better text recognition 672 00:42:39,400 --> 00:42:42,600 Speaker 1: programs that could recognize these shapes even if they weren't 673 00:42:42,600 --> 00:42:46,480 Speaker 1: in a standard three orientation, and those systems began to 674 00:42:46,520 --> 00:42:51,560 Speaker 1: defeat those simple early CAPTCHAs, which required CAPTCHA designers to 675 00:42:51,640 --> 00:42:55,359 Speaker 1: make tougher versions. And eventually the machines got good enough 676 00:42:55,400 --> 00:42:58,920 Speaker 1: that they could match or even outperform humans, and at 677 00:42:58,960 --> 00:43:01,920 Speaker 1: that point those text-based CAPTCHAs proved to be more 678 00:43:02,000 --> 00:43:05,680 Speaker 1: challenging for people than for machines, which meant if you 679 00:43:05,800 --> 00:43:08,440 Speaker 1: used them, you defeated the whole purpose in the first place. 680 00:43:08,600 --> 00:43:11,640 Speaker 1: So while this escalation proved to be a challenge for security, 681 00:43:12,280 --> 00:43:15,680 Speaker 1: it was a boon for artificial intelligence. And while I 682 00:43:15,719 --> 00:43:19,680 Speaker 1: focused almost exclusively on the imagery of video here, the 683 00:43:19,760 --> 00:43:22,400 Speaker 1: same sort of stuff is going on with generated speech, 684 00:43:22,560 --> 00:43:28,040 Speaker 1: including generated speech that imitates specific voices. Like deep fake videos, 685 00:43:28,280 --> 00:43:31,080 Speaker 1: this approach works best if you have a really big 686 00:43:31,160 --> 00:43:35,600 Speaker 1: data set of recorded audio, so people like movie and 687 00:43:35,680 --> 00:43:41,640 Speaker 1: TV stars, news reporters, politicians, and, um, you know, podcasters, 688 00:43:42,400 --> 00:43:45,480 Speaker 1: we're great targets for this stuff. There might be hundreds 689 00:43:45,560 --> 00:43:48,880 Speaker 1: or, you know, in my case, thousands of hours of 690 00:43:48,920 --> 00:43:52,680 Speaker 1: recording material to work from. Training a model to use 691 00:43:52,760 --> 00:43:59,040 Speaker 1: the frequencies, timbre, intonation, pronunciation, pauses, and other mannerisms of 692 00:43:59,040 --> 00:44:02,560 Speaker 1: speech can result in a system that can generate vocals 693 00:44:02,640 --> 00:44:06,680 Speaker 1: that sound like the target, sometimes to a fairly convincing degree. 694 00:44:07,360 --> 00:44:10,160 Speaker 1: And for a while, to peek behind the curtain here, 695 00:44:10,760 --> 00:44:12,880 Speaker 1: we at tech Stuff were working with a company that 696 00:44:12,960 --> 00:44:15,080 Speaker 1: I'm not going to name, but they were going to 697 00:44:15,120 --> 00:44:17,680 Speaker 1: do something like this as an experiment. I was going 698 00:44:17,719 --> 00:44:20,200 Speaker 1: to do a whole episode on it, and I had 699 00:44:20,280 --> 00:44:25,640 Speaker 1: planned on crafting a segment of that episode only through text. 700 00:44:25,800 --> 00:44:28,520 Speaker 1: I was not going to actually record it myself, but 701 00:44:28,520 --> 00:44:32,240 Speaker 1: instead use a system that was trained on my voice 702 00:44:32,680 --> 00:44:37,320 Speaker 1: to replicate my voice and deliver that segment on its own. 703 00:44:37,680 --> 00:44:40,080 Speaker 1: I was curious if it could nail not just the 704 00:44:40,120 --> 00:44:44,239 Speaker 1: audio quality of my voice, which, let's be honest, is amazing. 705 00:44:44,920 --> 00:44:48,560 Speaker 1: That's sarcasm.
I can't stand listening to myself, but it 706 00:44:48,600 --> 00:44:53,000 Speaker 1: would also have to replicate how I actually make certain sounds. 707 00:44:53,080 --> 00:44:55,160 Speaker 1: Like, would it get the bit of the Southern accent 708 00:44:55,440 --> 00:44:59,200 Speaker 1: that's in my voice, or the way I emphasize certain words? 709 00:44:59,480 --> 00:45:01,960 Speaker 1: Would it pause for effect at all? Or would it 710 00:45:02,040 --> 00:45:05,759 Speaker 1: just robotically say one word after the next and only 711 00:45:05,840 --> 00:45:09,400 Speaker 1: pause when there was some helpful punctuation that told it 712 00:45:09,480 --> 00:45:12,880 Speaker 1: to do so? Would it indicate a question by raising 713 00:45:12,920 --> 00:45:16,040 Speaker 1: the pitch at the end of a sentence? Sadly, we 714 00:45:16,560 --> 00:45:20,600 Speaker 1: never got far with that particular project, so I don't 715 00:45:20,600 --> 00:45:22,440 Speaker 1: have any answers for you. I don't know how it 716 00:45:22,480 --> 00:45:25,040 Speaker 1: would have turned out. But clearly one of the things 717 00:45:25,080 --> 00:45:27,799 Speaker 1: I thought of was that it's a bit of a 718 00:45:27,840 --> 00:45:30,360 Speaker 1: red flag. If you can train a computer to sound 719 00:45:30,400 --> 00:45:33,839 Speaker 1: exactly like a specific person, that means you could make 720 00:45:33,920 --> 00:45:38,279 Speaker 1: that person say anything you like, and obviously, like deep 721 00:45:38,320 --> 00:45:41,839 Speaker 1: fake videos, that could have some pretty devastating consequences if 722 00:45:41,840 --> 00:45:47,120 Speaker 1: it were at all, you know, believable or seemed realistic. Now, 723 00:45:47,160 --> 00:45:50,120 Speaker 1: the company we were working with was working hard to 724 00:45:50,120 --> 00:45:52,360 Speaker 1: make sure that the only person to have access to 725 00:45:52,600 --> 00:45:55,520 Speaker 1: a specific voice would be the owner of that voice, 726 00:45:55,640 --> 00:45:59,600 Speaker 1: or presumably the company employing that person, though that does 727 00:45:59,640 --> 00:46:02,239 Speaker 1: bring up a whole bunch of other potential problems. Like, 728 00:46:02,280 --> 00:46:06,560 Speaker 1: can you imagine eliminating voice actors from a job because 729 00:46:06,600 --> 00:46:08,400 Speaker 1: you've got enough of their voice and you can just 730 00:46:08,560 --> 00:46:11,960 Speaker 1: replicate it? That wouldn't be great. But even so, it 731 00:46:12,080 --> 00:46:14,920 Speaker 1: was something I felt was both fascinating from a technology 732 00:46:14,960 --> 00:46:19,160 Speaker 1: standpoint and potentially problematic when it comes to an application 733 00:46:19,440 --> 00:46:22,880 Speaker 1: of that technology. One other thing I should mention is 734 00:46:22,960 --> 00:46:26,239 Speaker 1: that the Internet at large has been pretty active in 735 00:46:26,400 --> 00:46:29,799 Speaker 1: fighting deep fakes, not necessarily by detecting them, but by removing 736 00:46:29,840 --> 00:46:33,560 Speaker 1: the platforms from which they were being shared, Reddit being 737 00:46:33,600 --> 00:46:36,160 Speaker 1: a big one. The subreddit that was dedicated to deep 738 00:46:36,160 --> 00:46:39,640 Speaker 1: fakes has been shut down. So there have been 739 00:46:39,719 --> 00:46:41,600 Speaker 1: some of those moves as well.
Now, this is not 740 00:46:41,960 --> 00:46:46,080 Speaker 1: directly against the technology; it's more against the proliferation of 741 00:46:46,120 --> 00:46:51,120 Speaker 1: the output of that technology. As for detecting 742 00:46:51,160 --> 00:46:53,919 Speaker 1: deep fakes, it's interesting to me that people are even 743 00:46:54,000 --> 00:46:57,319 Speaker 1: developing tools to detect them, because to me, the best 744 00:46:57,360 --> 00:47:00,839 Speaker 1: tool so far seems to be human perception. It's not 745 00:47:01,080 --> 00:47:06,160 Speaker 1: that the images aren't really convincing, or that we can 746 00:47:06,200 --> 00:47:09,799 Speaker 1: suddenly detect these, you know, blending lines like the Video 747 00:47:09,840 --> 00:47:13,719 Speaker 1: Authenticator tool does. It's rather that it's just not hard for 748 00:47:13,800 --> 00:47:16,640 Speaker 1: us to spot a deep fake. Stuff just doesn't quite 749 00:47:16,960 --> 00:47:21,400 Speaker 1: look right in the way that people behave in these videos. 750 00:47:21,400 --> 00:47:25,960 Speaker 1: The vocals and animation often don't quite match. The expressions 751 00:47:26,320 --> 00:47:31,200 Speaker 1: aren't really natural, the progression of mannerisms feels synthetic and 752 00:47:31,280 --> 00:47:36,120 Speaker 1: not genuine. It just looks off. It's that uncanny 753 00:47:36,200 --> 00:47:39,760 Speaker 1: valley thing, and so just paying attention and thinking critically 754 00:47:39,760 --> 00:47:41,880 Speaker 1: can really help you suss out the fakes from the 755 00:47:41,920 --> 00:47:45,200 Speaker 1: real thing. Even if we reach a point where machines 756 00:47:45,320 --> 00:47:49,080 Speaker 1: can create a convincing enough fake to pass for reality, 757 00:47:49,360 --> 00:47:53,120 Speaker 1: we can still apply critical thinking, and we always should. Heck, 758 00:47:53,440 --> 00:47:55,960 Speaker 1: we should be applying critical thinking even when there's no 759 00:47:56,080 --> 00:47:59,399 Speaker 1: doubt as to the validity of the video, because there 760 00:47:59,400 --> 00:48:03,960 Speaker 1: may be reason enough to doubt the content of the video itself. 761 00:48:04,360 --> 00:48:07,600 Speaker 1: If I listen to a genuine scam artist in a 762 00:48:07,680 --> 00:48:12,200 Speaker 1: genuine video, that doesn't make the scam more legitimate. We 763 00:48:12,239 --> 00:48:15,080 Speaker 1: always need to use critical thinking. What I think is 764 00:48:15,120 --> 00:48:18,600 Speaker 1: most important is that we acknowledge the very real fact 765 00:48:18,880 --> 00:48:23,880 Speaker 1: that there are numerous organizations, agencies, governments, and other groups 766 00:48:23,920 --> 00:48:29,520 Speaker 1: that are actively attempting to spread misinformation and disinformation. There 767 00:48:29,560 --> 00:48:34,799 Speaker 1: are entire intelligence agencies dedicated to this endeavor, and then 768 00:48:35,200 --> 00:48:38,440 Speaker 1: there are more independent groups that are doing it for 769 00:48:38,520 --> 00:48:41,960 Speaker 1: one reason or another, typically either to advance a particular 770 00:48:42,160 --> 00:48:45,879 Speaker 1: political agenda or just to make as much money as 771 00:48:46,000 --> 00:48:50,080 Speaker 1: quickly as possible. This is beyond doubt or question.
There 772 00:48:50,120 --> 00:48:54,279 Speaker 1: are numerous misinformation campaigns that are actively going on out 773 00:48:54,320 --> 00:48:57,560 Speaker 1: there in the real world right now. Most of them 774 00:48:57,840 --> 00:49:01,920 Speaker 1: are not depending on deep fakes, because one, deep fakes 775 00:49:01,960 --> 00:49:05,200 Speaker 1: aren't really good enough to fool most people right now, 776 00:49:05,640 --> 00:49:08,840 Speaker 1: and two, they don't need the deep fakes in the 777 00:49:08,880 --> 00:49:11,640 Speaker 1: first place. There are other, simpler methods that 778 00:49:11,760 --> 00:49:15,600 Speaker 1: don't need nearly the processing power and that work just fine. 779 00:49:15,880 --> 00:49:18,440 Speaker 1: Why would you go through the trouble of synthesizing a 780 00:49:18,560 --> 00:49:21,080 Speaker 1: video if you can get a better response with a 781 00:49:21,120 --> 00:49:25,160 Speaker 1: blog post filled with lies or half truths? It's just 782 00:49:25,280 --> 00:49:28,759 Speaker 1: not a great return on investment. So bottom line, be 783 00:49:28,960 --> 00:49:33,799 Speaker 1: vigilant out there, particularly on social media. Be aware that 784 00:49:33,840 --> 00:49:36,520 Speaker 1: there are plenty of people who will not hesitate to 785 00:49:36,640 --> 00:49:40,000 Speaker 1: mislead others in order to get what they want. Use 786 00:49:40,000 --> 00:49:45,279 Speaker 1: a critical eye to evaluate the information you encounter. Ask questions, 787 00:49:45,719 --> 00:49:50,440 Speaker 1: check sources, look for corroborating reports. It's a lot of work, 788 00:49:50,480 --> 00:49:53,359 Speaker 1: but trust me, it's way better that we do our 789 00:49:53,400 --> 00:49:56,400 Speaker 1: best to make sure the stuff we're depending on is 790 00:49:56,480 --> 00:50:00,600 Speaker 1: actually dependable. It'll turn out better for us in the long run. 791 00:50:00,880 --> 00:50:04,319 Speaker 1: Well, that wraps up this episode of tech Stuff, which, yeah, 792 00:50:04,600 --> 00:50:07,640 Speaker 1: I used as a backdoor to argue for critical thinking again. 793 00:50:07,719 --> 00:50:12,040 Speaker 1: Sue me. Don't, don't really sue me. But I think 794 00:50:12,040 --> 00:50:16,360 Speaker 1: that's another really clear example of a case 795 00:50:16,400 --> 00:50:18,520 Speaker 1: where we have to use that kind of stuff. So 796 00:50:18,680 --> 00:50:22,680 Speaker 1: I'm gonna keep on stressing it. And you guys are awesome. 797 00:50:22,960 --> 00:50:25,840 Speaker 1: I believe in you. I think that when we start 798 00:50:25,920 --> 00:50:29,400 Speaker 1: using these tools at our disposal, which everybody can develop 799 00:50:29,840 --> 00:50:33,919 Speaker 1: just with some practice, things will be better. We'll 800 00:50:33,960 --> 00:50:37,720 Speaker 1: be able to suss out the nonsense from the real stuff, 801 00:50:38,400 --> 00:50:40,960 Speaker 1: and we're all better off in the long run if 802 00:50:41,000 --> 00:50:43,719 Speaker 1: we can do that. If you guys have suggestions for 803 00:50:43,840 --> 00:50:46,600 Speaker 1: future topics I should cover in episodes of tech Stuff, 804 00:50:46,719 --> 00:50:50,360 Speaker 1: let me know via Twitter. The handle is tech Stuff 805 00:50:50,719 --> 00:50:55,000 Speaker 1: H S W and I'll talk to you again really soon. 806 00:51:01,200 --> 00:51:04,239 Speaker 1: Tech Stuff is an I Heart Radio production.
For more 807 00:51:04,320 --> 00:51:07,720 Speaker 1: podcasts from I Heart Radio, visit the I Heart Radio app, 808 00:51:07,840 --> 00:51:11,000 Speaker 1: Apple Podcasts, or wherever you listen to your favorite shows.