Speaker 1: Welcome to Tech Stuff, a production from iHeartRadio.

Speaker 1: Hey there, and welcome to Tech Stuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio and I love all things tech. Now, before I get into today's episode, I want to give a little listener warning here. The topic at hand involves some adult content, including the use of technology to do stuff that can be unethical, illegal, hurtful, and just plain awful. Now, I think this is an important topic, but I wanted to give a bit of a heads up at the start of the episode, just in case any of you guys are listening to a podcast on, like, a family road trip or something. I think this is an important topic and I think everyone should know about it and think about it. But I also respect that for some people this subject might get a bit taboo. So let's go on with the episode. Back in nineteen ninety-three, a movie called Rising Sun, directed by Philip Kaufman, based on a Michael Crichton novel and starring Wesley Snipes and Sean Connery, came out in theaters. Now, I didn't see it in theaters, but I did catch it when it came on, you know, HBO or Cinemax or something later on. The movie included a sequence that I found to be totally unbelievable. And I'm not talking about buying into Sean Connery being an expert on Japanese culture and business practices. Actually, side note, Sean Connery has an interesting history of playing unlikely characters, such as in Highlander, where he played an immortal who was supposedly Egyptian, who then lived in feudal Japan and ended up in Spain, where he became known as Ramirez. And all the while he's talking to a Scottish Highlander who's played by a Belgian actor. But I'm getting way off track here. Besides, I've heard Crichton actually wrote the character while thinking of Connery, so, you know, what the heck do I know?
Speaker 1: In the film, Snipes and Connery are investigators, and they're looking into a homicide that happened at a Japanese business but on American soil. The security system in the building captured video of the homicide, and the identity of the killer appears to make it a pretty open and shut case. But that's not how it all turns out. The investigators talk to a security expert played by Tia Carrere, and she demonstrates in real time how video footage can be altered. She records a short video of Connery and Snipes, loads that onto a computer, freezes a frame of the video, and essentially performs a cut and paste job, swapping the heads of our two lead characters. Then she resumes the video and the head swap remains in place. And that head swap stuff is possible. I mean, clearly it has to be possible, because you actually do see that effect in the film itself. But it takes a bit more than a quick cut and paste job. But we'll leave off of that for now. The whole point of that sequence, apart from showing off some cinema magic, is to demonstrate to the investigators that video, like photographs, can be altered. The expert has detected a blue halo around the face of the supposed murderer in the footage, indicating that some sort of trickery has happened. She also reveals that she cannot magically restore the video to its previous unaltered state, which I think was actually a nice change of pace for a movie. By the way, I think this movie is really, you know, not good, like not worth your time, but that's my opinion anyway. For years, this kind of video sorcery was pretty much limited to the film and TV industries. It usually required a lot of pre-planning, so it wasn't as simple as just taking footage that was already shot and changing it in post on a whim with a couple of clicks of a button. If it were, we would see a lot fewer mistakes left in movies and television, because you could catch it later and just fix it.
Speaker 1: But the tricks were possible, they were just difficult to pull off. It just wasn't something you or I would ever encounter in our day to day lives. But today we live in a different world, a world that has examples of synthetic media, commonly referred to as deep fakes. These are videos that have been altered or generated so that the subject of the video is doing something that they probably would or could never do. They've brought into question whether or not video evidence is even reliable, much as the film Rising Sun was talking about. We already know that eyewitness testimony is terribly unreliable. Our perception and memory play tricks on us, and we can quote unquote remember stuff that just didn't happen the way things actually unfolded in reality. But now we're looking at video evidence in potentially the same light. I mean, it's scary. So today we're going to learn about synthetic media, how it can be generated, the implications that follow with that sort of reality, and ways that people are trying to counteract a potentially dangerous threat. You know, fun stuff. Now, first, the term synthetic media has a particular meaning. It refers to art created through some sort of automated process, so it's a largely hands off approach to creating the final art piece. Now, under that definition, the example of Rising Sun would not apply here, because we see in the film (and presumably this happens in the book as well, but I haven't read the book) that a human being actually makes the change. People have used tools to alter the video footage. This would be more like using Photoshop to touch up a still image, with the computer system presumably doing some of the work in the background to keep things matched up. Either that, or you would need to alter each image in the footage frame by frame, or use some sort of matte approach. To learn more about mattes, you can listen to my episode about how blue and green screens work. Synthetic media as a general practice has been around for centuries.
Speaker 1: Artists have set up various contraptions to create works with little or no human guidance. In the twentieth century we started to see a movement called generative art take form. This type of art is all about creating a system that then creates or generates the finished art piece. That would mean that the finished work, such as a painting, wouldn't reflect the feelings or thoughts of the artist who created the system. In fact, it starts to raise the question: what is the art? Is it the painting that came about due to a machine following a program of some sort, or is the art the program itself? Is the art the process by which the painting was made? Now, I'm not here to answer that question. I just think it is an interesting question to ask. Sometimes people ask much less polite questions, such as, is it art at all? Some art critics went out of their way to dismiss generative art in the early days. They found it insulting. But hey, that's kind of the history of art in general. Each new movement in art inevitably finds both supporters and critics as it emerges. If anything, you might argue that such a response legitimizes the movement in, you know, a weird way. If people hate it, it must be something. In two thousand eighteen, an artist collective called Obvious, based out of Paris, France, submitted portrait style paintings that were created not by an actual human painter, but by an artificially intelligent system. Now, they looked a lot like typical eighteenth century style portraits. There was no attempt to pass off the portrait as if it were actually made by a human artist. In fact, the appeal of the piece was largely due to it being synthetically generated. It went to auction at Christie's, and the AI-created painting fetched more than four hundred thousand dollars. And the way the group trained their AI is relevant to our discussion about deep fakes.
Speaker 1: The collective relied on a type of machine learning called generative adversarial networks, or GAN, which in turn depends on deep learning. So it looks like we've got a few things we're going to have to define here. Now, I'm going to keep things fairly high level, because as it turns out there are a few different ways to create machine learning models, and to go through all of them in exhaustive detail would represent a university level course in machine learning. I have neither the time for that nor the expertise. I would do a terrible job. So we'll go with a high level perspective here. First, a generative adversarial network uses two systems: you have a generator and you have a discriminator. Both of these systems are a type of neural network. A neural network is a computing model that is inspired by the way our brains work. Our brains contain billions of neurons, and these neurons work together, communicating through electrical and chemical signals, controlling and coordinating pretty much everything in our bodies. With computers, the neurons are nodes. The job of a node is, you know, supposed to be kind of like a neuron cell in the brain. It's to take in multiple weighted input values and then generate a single output value. Now, the word weighted, W-E-I-G-H-T-E-D, weighted, is really important here, because the larger an input's weight, the more that input will have an effect on whatever the output is. So it kind of comes down to which inputs are the most important for that node's particular function. Now, if I were to make an analogy, I would say your boss hands you three tasks to do. One of those tasks has the label extremely important, and the second task has the label critically important, and the third task has a label saying you should have finished that one before it was handed to you. Okay, so that's just some sort of snarky office humor that I need to get off my chest. But more seriously, imagine a node accepting three inputs.
Speaker 1: In this example, input one has a fifty percent weight, input two has a forty percent weight, and input three has a ten percent weight. That adds up to one hundred percent, and that would tell you that the output that node generates will be most affected by input one, followed by input two, and then input three would have a smaller effect on whatever the output is. Each node applies a nonlinear transformation on the input values, again affected by each input's weight value, and that generates the output value. The details of that really are not important for our episode. It involves performing changes on variables that in turn change the correlation between variables, and it gets a bit math-y, and we would get lost in the weeds pretty quickly. The important thing to remember is that a node within a neural network takes in a weighted sum of inputs, then performs a process on those inputs before passing the result on as an output. Then some other node a layer down will accept that output, along with outputs from a couple of other nodes one layer up, and will then perform an operation based on those weighted inputs and pass that on to the next layer, and so on. So these nodes are in layers, like, you know, a cake. One layer of nodes processes some inputs, they send it on to the next layer of nodes, and then that one passes it on to the next one, and the next one, and so on. This isn't a new idea. Computer scientists began theorizing and experimenting with neural network approaches as far back as the nineteen fifties with the perceptron, which was a hypothetical system that was described by Frank Rosenblatt of Cornell University. But it wasn't until the last decade that computing power and our ability to handle a lot of data reached a point where these sorts of learning models could really take off. The goal of this system is to train it to perform a particular task within a certain level of precision.
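To make that concrete, here is a minimal Python sketch of a single node, assuming the fifty, forty, and ten percent weights from the example and using a sigmoid as the nonlinear transformation. The input values are invented for illustration; this is not pulled from any particular library.

```python
import math

def node_output(inputs, weights):
    """One artificial node: take a weighted sum of the inputs, then squash it."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # sigmoid, one common nonlinear transform

# Three inputs with fifty, forty, and ten percent weights, as in the example above.
weights = [0.5, 0.4, 0.1]
inputs = [0.9, 0.2, 0.7]

print(node_output(inputs, weights))
# Input one sways the output the most because it carries the largest weight.
```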
Speaker 1: The weights I mentioned are adjustable, so you can think of it as teaching a system which bits are the most important for doing whatever it is the system is supposed to do. In order to achieve your task, these are the bits that are the most important and therefore should matter the most when you weigh a decision. This is a bit easier if we talk about a similar system, the version of IBM's Watson that played on Jeopardy. That system famously was not connected to the Internet. It had to rely on all the information that was stored within itself. When the system encountered a clue in Jeopardy, it would analyze the clue, and then it would reference its database to look for possible answers to whatever that clue was. The system would weigh those possible answers and attempt to determine which, if any, were the most likely to be correct. If the certainty was over a certain threshold, the system would buzz in with its answer. If no response rose above that threshold, the system would not buzz in. So you could say that Watson was playing the game with a best guess sort of approach. Neural networks do essentially that sort of processing. With this particular type of approach, we know what we want the outcome to be, so we can judge whether or not the system was successful. After each attempt, we can adjust the weight on the inputs between nodes to refine the decision making process to get more accurate results. If the system succeeds in its task, we can increase the weights that contributed to the system picking the correct answer, and decrease the weights of the inputs that did not contribute to the successful response. If the system done messed up and gave the wrong answer, then we do the opposite. We look at the inputs that contributed to the wrong answer, we diminish their weights, and we increase the weights of the other inputs. And then we run the test again. A lot.
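Here is a toy sketch of those two ideas in Python, assuming made-up numbers: answer only when confidence clears a threshold, and after each attempt nudge the weights up or down depending on whether they helped. This is an illustration of the description above, not how Watson or any production system is actually implemented.

```python
CONFIDENCE_THRESHOLD = 0.8  # hypothetical value: only answer when at least this certain

def maybe_buzz_in(confidence, answer):
    """Buzz in only if certainty clears the threshold, like the best-guess approach above."""
    return answer if confidence >= CONFIDENCE_THRESHOLD else None

def adjust_weights(weights, contributed, was_correct, step=0.05):
    """Naive update: reward inputs that helped a right answer, penalize them after a miss."""
    updated = []
    for weight, helped in zip(weights, contributed):
        if helped == was_correct:
            updated.append(weight + step)   # this input pushed in the right direction
        else:
            updated.append(weight - step)   # this input pushed toward the mistake
    return updated

print(maybe_buzz_in(0.91, "What is Mercury?"))   # confident enough, so it answers
print(maybe_buzz_in(0.42, "What is Venus?"))     # not confident enough, stays quiet

weights = [0.5, 0.4, 0.1]
# Inputs one and three pushed toward the chosen answer, and the answer was wrong.
print(adjust_weights(weights, contributed=[True, False, True], was_correct=False))
```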
Speaker 1: I'll explain a bit more about this process when we come back, but first let's take a quick break.

Speaker 1: Early in the history of neural networks, computer scientists were hitting some pretty hard stops due to the limitations of computing power at the time. Early networks were only a couple of layers deep, which really meant they weren't terribly powerful, and they could only tackle rudimentary tasks, like figuring out whether or not a square is drawn on a piece of paper. That isn't terribly sophisticated. In nineteen eighty-six, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper titled "Learning Representations by Back-Propagating Errors." This was a big breakthrough for deep learning. This all has to do with a deep learning system improving its ability to complete a specific task. And basically the algorithm's job is to go from the output layer, you know, where the system has made a decision, and then work backward through the neural network, adjusting the weights that led to an incorrect decision. So let's say it's a system that is looking to figure out whether or not a cat is in a photograph, and it says there's a cat in this picture, and you look at the picture and there is no cat there. Then you would look at the inputs one level back, just before the system said here's a picture of a cat, and you'd say, all right, which of these inputs led the system to believe this was a picture of a cat? And then you would adjust those. Then you would go back one more layer, so you're working your way up the model, and say which inputs here led to it giving the outputs that led to the mistake. And you do this all the way up until you get to the input level at the top of the computer model. You are back propagating, and then you run the test again to see if you've got improvement. It's exhaustive, but it also drastically improved neural network performance, much faster than just throwing more brute force at it.
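Here is a very small sketch of that backward pass in Python. It assumes a made-up two-layer network with one output and uses textbook sigmoid gradients; it is a toy illustration of the idea, not the algorithm from the paper verbatim or any library's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny network: 2 inputs -> 2 hidden nodes -> 1 output ("cat" score). Weights are invented.
w_hidden = [[0.3, -0.2], [0.8, 0.5]]   # one weight list per hidden node
w_output = [0.6, -0.4]                  # weights from the hidden nodes to the output
learning_rate = 0.5

def train_step(x, target):
    global w_hidden, w_output
    # Forward pass: each layer takes weighted sums and squashes them.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_output, hidden)))

    # Backward pass: start at the output and push the error back, layer by layer.
    out_delta = (out - target) * out * (1 - out)                 # error signal at the output
    hidden_deltas = [out_delta * w_output[j] * hidden[j] * (1 - hidden[j])
                     for j in range(len(hidden))]                 # error signal per hidden node

    # Adjust the weights that contributed to the mistake.
    w_output = [w - learning_rate * out_delta * hidden[j] for j, w in enumerate(w_output)]
    w_hidden = [[w - learning_rate * hidden_deltas[j] * x[i] for i, w in enumerate(ws)]
                for j, ws in enumerate(w_hidden)]
    return out

# The picture has no cat (target 0), but the network starts out saying "maybe cat".
for _ in range(100):
    score = train_step([0.9, 0.1], target=0.0)
print(score)  # the "cat" score drifts toward zero as the weights get corrected
```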
Speaker 1: The algorithm essentially is checking to see if a small change in each input value received by a layer of nodes would have led to a more accurate result. So it's all about going from that output and working your way backward. In two thousand twelve, Alex Krizhevsky published a paper that gave us the next big breakthrough. He argued that a really deep neural network with a lot of layers could give really great results if you paired it with enough data to train the system. So you needed to throw lots of data at these models, and it needed to be an enormous amount of data. However, once trained, the system would produce lower error rates. So yeah, it would take a long time, but you would get better results. Now, at the time, a good classification error rate for such a system was roughly twenty-five percent. That means one out of four conclusions the system would come to would be wrong. If you ran it across a long enough number of decisions, you would find that one out of every four wasn't right. The system that Alex's team worked on produced results that had an error rate of around sixteen percent, so much lower. And then in just five years, with more improvements to this process, the classification error rate had dropped down to two point three percent for deep learning systems. So, from around twenty-five percent down to two point three. It was really powerful stuff. Okay, so you've got your artificial neural network. You've got your layers and layers of nodes. You've adjusted the weights of the inputs into each node to see if your system can identify, you know, pictures of cats, and you start feeding images to this system, lots of them. This is the domain that you are feeding to your system. The more images you can feed to it, the better.
Speaker 1: And you want a wide variety of images of all sorts of stuff, not just of different types of cats, but stuff that most certainly is not a cat, like dogs, or cars, or chartered public accountants, you name it. And you look to see which images the system identifies correctly and which ones it screws up, both images it says have cats in them that actually don't have cats in them, and images the system has identified as saying there is no cat here, but there is a cat there. This guides you into adjusting the weights again and again, and you start over and you do it again. And that's your basic deep learning system, and it gets better over time as you train it. It learns. Now, let's transition over to the adversarial systems I mentioned earlier, because they take this and twist it a little bit. So you've got two artificial neural networks, and they are using this general approach to deep learning, and you're setting them up so that they feed into each other. One network, the generator, has the task of learning how to do something, such as create an eighteenth century style portrait, based off lots and lots of examples of the real thing, the domain, the problem domain. The second network, the discriminator, has a different job. It has to tell the difference between authentic portraits that came from the problem domain and computer generated portraits that came from the generator itself. So essentially the discriminator is like the model I mentioned earlier that was identifying pictures of cats. It's doing the same sort of thing, except instead of saying cat or no cat, it's saying real portrait or computer generated portrait. So there are essentially two outcomes the discriminator could reach, and that's whether an image is computer generated or it isn't. So do you see where this is going? You train up both models. You have the generator attempt to make its own version of something, such as that eighteenth century portrait.
Speaker 1: It does so by designing the portrait based on what the model believes are the key elements of a portrait, so things like colors, shapes, the ratio of sizes, like, you know, how large the head should be in relation to the body. All of these factors and many more come into play. The generator creates its own idea of what a portrait is supposed to look like, and chances are the early rounds of this will not be terribly convincing. The results are then fed to the discriminator, which tries to suss out which of the images fed to it are computer generated and which ones aren't. After that round, both models are tweaked. The generator adjusts input weights to get closer to the genuine article, and the discriminator adjusts weights to reduce false positives or to catch computer generated images. And then you go again and again and again and again, and they both get better over time. So, assuming everything is working properly, over time the adjustment of input weights will lead to more convincing results, and given enough time and enough repetition, you'll end up with a computer generated painting that you can auction off for nearly half a million dollars. Though keep in mind that huge price tag dates back to the novelty of it being an early AI generated painting. It would be shocking to me if we saw that actually become a trend. Also, the painting, while interesting, isn't exactly so astounding as to make you think there's no way a machine did that. You'd look at it and go, yeah, I can imagine a machine did that one. A group of computer scientists first described the generative adversarial network architecture in a paper in two thousand fourteen, and like other neural networks, these models require a lot of data. The more the better. In fact, smaller data sets mean the models have to make some pretty big assumptions, and you tend to get pretty lousy results.
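As a rough illustration of that back-and-forth, here is a toy version of the loop in Python. Real GANs use two neural networks trained with gradients; in this stand-in each side is reduced to a single number, and every value is invented, so treat it as a sketch of the structure rather than an actual GAN.

```python
import random

REAL_MEAN = 10.0          # pretend real portraits share some measurable statistic near 10
gen_mean = 0.0            # the generator's current guess at that statistic
disc_center = 0.0         # the discriminator's current idea of what "real" looks like

def realness(x):
    """Discriminator's score: the closer x is to its idea of real, the higher."""
    return -abs(x - disc_center)

for _ in range(300):
    real = [random.gauss(REAL_MEAN, 1.0) for _ in range(16)]
    fake = [random.gauss(gen_mean, 1.0) for _ in range(16)]

    # Discriminator step: refine its picture of the genuine article from the real examples.
    disc_center += 0.1 * (sum(real) / len(real) - disc_center)

    # Generator step: follow the discriminator's feedback, moving toward whichever of its
    # own fakes scored as most "real" this round (a stand-in for training through the
    # discriminator's gradients).
    best_fake = max(fake, key=realness)
    gen_mean += 0.1 * (best_fake - gen_mean)

print(round(gen_mean, 2), round(disc_center, 2))
# Both numbers end up near 10: the generator's fakes now resemble the real statistic,
# and the discriminator can no longer separate the two piles by this measure alone.
```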
Speaker 1: More data, as in more examples, teaches the models more about the parameters of the domain, whatever it is they are trying to generate. It refines the approach. So if you have a sophisticated enough pair of models and you have enough data to fill up a domain, you can generate some convincing material. And that includes video, and this brings us around to deep fakes. And in addition to generative adversarial networks, a couple of other things really converged to create the techniques and trends and technology that would allow for deep fakes proper. In nineteen ninety-seven, Malcolm Slaney, Michele Covell, and Christoph Bregler wrote some software that they called the Video Rewrite program. The software would analyze faces and then create or synthesize lip animation which could be matched to pre-recorded audio. So you could take some film footage of a person and then reanimate their lips so that they could appear to say all sorts of things, which in some ways set the stage for deep fakes. In this case, it was really just focusing on the lips and the general area around the lips, so you weren't changing the rest of the expression of the face, and you would have to, you know, keep your recording to about the same length as whatever the film clip was, or you would have to loop the film clip over and over, which would make it, you know, far more obvious that this was a fake. In addition, motion tracking technology was advancing over time too, and this also became an important tool in computer animation. This tool would also be used by deep fake algorithms to create facial expressions, manipulating the digital image just as it would if it were a video game character or a Pixar animated character. Typically, you need to start with some existing video in order to manipulate it.
Speaker 1: You're not actually computer generating the animation. Like, you're not creating a computer generated version of whomever it is you're doing the fake of. You're using existing imagery in order to do that, and then manipulating that existing imagery, so it's a little different from computer animation. In two thousand sixteen, students and faculty at the Technical University of Munich created the Face2Face project, that would be "face," the numeral two, and then "face," and this was particularly jaw dropping to me at the time. When I first saw these videos back then, I was floored. They created a system that had a target actor. This would be the video of the person that you want to manipulate. In the example they used, it was former US President George W. Bush. Their process also had a source actor. This was the source of the expressions and facial movements you would see in the target, so kind of like a digital puppeteer in a way. But the way they did it was really cool. They had a camera trained on the source actor, and it would track specific points of movement on the source actor's face, and then the system would manipulate the same points of movement on the target actor's face in the video. So if the source actor smiled, then the target smiled. The source actor would smile, and then you would see George W. Bush in the video smile in real time. It was really strange. They used this looping video of George W. Bush wearing a neutral expression. They had to start with that as their sort of zero point. And I gotta tell you, it really does look like the former president George W. Bush is having a bit of a freak out on a looping video, because he keeps on opening his mouth, closing his mouth, grimacing, raising his eyebrows. You need to watch this video. It is still available online to check out.
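The core trick, copying tracked points of movement from one face onto another, can be sketched in a few lines of Python. This is only a conceptual outline with made-up landmark coordinates; the actual Face2Face system fits full 3D face models and re-renders the target video.

```python
# Conceptual sketch of expression transfer via tracked facial landmarks.
# Each face is reduced to a handful of (x, y) points; real systems track many more
# points per frame and fit a 3D model, so treat this purely as an outline.

def transfer_expression(source_neutral, source_current, target_neutral):
    """Apply the source actor's movement (current minus neutral) to the target's points."""
    deltas = [(cx - nx, cy - ny)
              for (nx, ny), (cx, cy) in zip(source_neutral, source_current)]
    return [(tx + dx, ty + dy)
            for (tx, ty), (dx, dy) in zip(target_neutral, deltas)]

# Hypothetical landmarks: two mouth corners and the chin, in pixel coordinates.
source_neutral = [(100, 200), (140, 200), (120, 230)]
source_smiling = [(96, 192), (144, 192), (120, 231)]   # corners pulled up and out
target_neutral = [(300, 400), (342, 400), (321, 432)]  # e.g. the looping neutral video

print(transfer_expression(source_neutral, source_smiling, target_neutral))
# The target's mouth corners move the same way the source actor's did, frame by frame;
# a renderer then warps the target video to match the new point positions.
```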
Speaker 1: In twenty seventeen, students and faculty over at the University of Washington created the Synthesizing Obama project, in which they trained a computer model to generate a synthetic video of former US President Barack Obama, and they made it lip sync to a pre-recorded audio clip from one of Obama's addresses to the nation. They actually had the original video of that address for comparison, so they could look back at that and see how their generated one compared to the real thing. And their approach used a model that analyzed hundreds of hours of video footage of Obama speaking, and it mapped specific mouth shapes to specific sounds. It would also include some of Obama's mannerisms, such as how he moves his head when he talks or uses facial expressions to emphasize words. And watching the videos, you know, the real one next to the generated one, is pretty strange. You can tell the generated one isn't quite right. It's not matching the audio exactly, at least not in the early versions, but it's fairly close, and it might even pass casual inspection for a lot of people who weren't, like, you know, actually paying attention. Authors Maras and Alexandrou defined deep fakes as, quote, "the product of artificial intelligence applications that merge, combine, replace, and superimpose images and video clips to create fake videos that appear authentic," end quote. Deep fakes first emerged in two thousand seventeen, and so this is a pretty darn young application of technology. One thing that is worrisome is that once someone has access to the tools, it's not that difficult to create a deep fake video. You pretty much just need a decent computer, the tools, a bit of know-how on how to do it, and some time. You also need some reference material, as in, like, videos and images of the person that you are replicating, and like the machine learning systems I've mentioned, the more reference material you have, the better.
Speaker 1: That's why the deep fakes you encounter these days tend to be of notable famous people, like celebrities and politicians. Mainly, there's no shortage of reference material for those types of individuals, and so they are easier to replicate with deep fakes than someone who maintains a much lower profile. Not to say that that will always be the case, or that there aren't systems out there that can accept smaller amounts of reference material. It's just harder to make a convincing version with fewer samples. But in order to make a convincing fake, the system really has to learn how a person moves. All those facial expressions matter. It also has to learn how a person sounds. We'll get into sound a little bit later. But mannerisms, inflection, accent, emphasis, cadence, quirks and tics, all of these things have to be analyzed and replicated to make a convincing fake, and it has to be done just right, or else it comes off as creepy or unrealistic. Think about how impressionists will take a celebrity's manner of speech and then heighten some of it for comedic effect. You'll hear it all the time with folks who do impressions of people like Jack Nicholson or Christopher Walken or Barbra Streisand, people who have a very particular way of speaking. Impressionists will take those as markers and they really punch in on them. Well, a deep fake can't really do that too much, or else it won't come across as genuine. It'll feel like you're watching a famous person impersonating themselves, which is weird. Now, the earliest mention of deep fakes I can find dates to a two thousand seventeen Reddit forum in which a user shared deep faked videos that appeared to show female celebrities in sexual situations. Heads and faces had been replaced, and the actors in pornographic movies had their heads or faces swapped out for those of these various celebrities.
Speaker 1: Now, the fakes can look fairly convincing, extremely convincing in some cases, which can lead to some people assuming that the videos are genuine and that the folks that they saw in the videos are really the ones who are in them. And obviously that's a real problem, right? I mean, with this technology, given enough reference data to feed a system, someone could fabricate a video that appears to put a person in a compromising position, whether it's a sexual act or making damaging statements or committing a crime or whatever. And there are tools right now that allow you to do pretty much what the Face2Face tool was doing back in two thousand sixteen. A program called Avatarify, which is not that easy to say. Anyway, it can run on top of live streaming conference services like Zoom and Skype, and you can swap out your face for a celebrity's face. Your facial expressions map to the computer manipulated celebrity face. It just looks at you through your webcam, and then if you smile, the celebrity image smiles, etcetera. It's like that old Face2Face program. It does need a pretty beefy PC to manage doing all this, because you're also running that live streaming service underneath it. It's also not exactly user friendly. You need some programming experience to really get it to work. But it is widely accessible, as the source code is open source and it's on GitHub, so anyone can get it. Samantha Cole, who writes for Vice, has covered the topic of deep fakes pretty extensively, and the potential harm they can cause, and I recommend you check out her work if you're interested in learning more about that. Do be warned that Cole covers some pretty adult themed topics. I think she does great work and very important work, but as a guy who grew up in the Deep South, it's also the kind of stuff that occasionally makes me clutch my pearls. But that's more of a statement about me than her work. She does great work.
Speaker 1: I think most of us can imagine plenty of scenarios in which this sort of technology could cause mischief on a good day and catastrophe on a bad day, whether it's spreading misinformation, creating fear, uncertainty, and doubt (FUD), making people seem to say things they never actually said, or contributing to an ugly subculture in which people try to make their more base fantasies a reality by putting one person's head on another person's body. You know, it's not great. There are legitimate uses of the technology too, of course. You know, tech itself is rarely good or bad. It's all in how we use it. But this particular technology has a lot of potentially harmful uses, and Samantha Cole has done a great job explaining them. When we come back, I'll talk a bit more about the war against deep fakes and how people are trying to prepare for a world that is increasingly filled with media we can't really trust. But first, let's take a quick break.

Speaker 1: Before the break, I mentioned Samantha Cole, who has written extensively about deep fakes, and one point she makes that I think is important for us to note is that the vast majority of instances of deep fake videos haven't been some manufactured video of a political leader saying inflammatory things. That continues to be a big concern. There's a genuine fear that someone is going to manufacture a video in which a politician appears to say or do something truly terrible in an effort to either discredit the politician or perhaps instigate a conflict with some other group. There are literal doomsday scenarios in which such a video would prompt a massive military response, though it does seem like it might be a little far-fetched. Though heck, I don't know, considering the world we live in, maybe it's not that big of a stretch. Anyway, Cole's point is that so far, that has not happened. She points out that the most frequent use for the tech tends to be either people goofing around or, disturbingly, using it to,
in her words, 573 00:36:21,000 --> 00:36:25,160 Speaker 1: quote, take ownership of women's bodies in non-consensual porn, 574 00:36:25,440 --> 00:36:28,920 Speaker 1: end quote. Cole argues that the reason we haven't really 575 00:36:28,920 --> 00:36:32,240 Speaker 1: seen deep fakes used much outside of these realms, apart 576 00:36:32,280 --> 00:36:36,400 Speaker 1: from a few advertising campaigns, is that people are pretty 577 00:36:36,440 --> 00:36:39,719 Speaker 1: good at spotting deep fakes. They aren't quite at a 578 00:36:39,840 --> 00:36:42,759 Speaker 1: level where they can easily pass for the real thing. 579 00:36:43,320 --> 00:36:46,400 Speaker 1: There's still something slightly off about them. They tend to 580 00:36:46,560 --> 00:36:49,880 Speaker 1: butt up against the uncanny valley. Now, for those of 581 00:36:49,880 --> 00:36:53,560 Speaker 1: you not familiar with that term, the uncanny valley describes 582 00:36:53,600 --> 00:36:57,320 Speaker 1: the feeling we humans get when we encounter a robot 583 00:36:57,520 --> 00:37:02,520 Speaker 1: or a computer-generated figure that closely resembles a human 584 00:37:02,880 --> 00:37:06,600 Speaker 1: or human behavior, but you can still tell it's not 585 00:37:07,040 --> 00:37:10,400 Speaker 1: actually a person, and it's not a good feeling. It 586 00:37:10,440 --> 00:37:13,960 Speaker 1: tends to be described as repulsive and disturbing, or at 587 00:37:14,160 --> 00:37:18,720 Speaker 1: the very best, off-putting. See also the animated film 588 00:37:18,760 --> 00:37:22,960 Speaker 1: The Polar Express. There's a reason that when that film came out, 589 00:37:23,120 --> 00:37:27,839 Speaker 1: people kind of reacted negatively to the animation, and it's 590 00:37:27,840 --> 00:37:30,640 Speaker 1: also a reason why Pixar tends to prefer to 591 00:37:30,680 --> 00:37:34,479 Speaker 1: go with stylized human characters who are different enough from 592 00:37:34,600 --> 00:37:38,320 Speaker 1: the way real humans look to kind of bypass the uncanny valley. 593 00:37:38,520 --> 00:37:40,880 Speaker 1: We just think of that as a cartoon, not something 594 00:37:40,920 --> 00:37:44,120 Speaker 1: that's trying to pass itself off as being human. But 595 00:37:44,200 --> 00:37:46,800 Speaker 1: while there hasn't really been a flood of fake videos 596 00:37:46,840 --> 00:37:50,319 Speaker 1: hitting the Internet with the intent to discredit politicians or 597 00:37:50,400 --> 00:37:54,280 Speaker 1: infuriate specific people or whatever, there remains a general sense 598 00:37:54,320 --> 00:37:58,040 Speaker 1: that this is coming. It's just not here yet. The 599 00:37:58,120 --> 00:38:01,600 Speaker 1: sense I get is that people feel it's an inevitability, 600 00:38:01,680 --> 00:38:04,080 Speaker 1: and there are already folks working on tools that will 601 00:38:04,080 --> 00:38:07,160 Speaker 1: help us sort out the real stuff from the fakes. 602 00:38:07,719 --> 00:38:12,440 Speaker 1: Take Microsoft, for example. Their R and D division, fittingly 603 00:38:12,640 --> 00:38:17,680 Speaker 1: called Microsoft Research, developed a tool they call the Video Authenticator. 604 00:38:18,120 --> 00:38:21,960 Speaker 1: This tool analyzes video samples and looks for signs of 605 00:38:22,320 --> 00:38:25,440 Speaker 1: deep fakery.
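To give a rough sense of what one "sign of deep fakery" can look like in practice, here is a tiny, hypothetical Python sketch. It builds a crude cut-and-paste composite out of random numbers and then measures gradient energy along the pasted seam, the kind of compositing boundary the Microsoft blog post quoted next describes. Every number here is an arbitrary assumption for illustration; real detectors are far more sophisticated.

```python
# Toy, hypothetical illustration of seam ("blending boundary") detection:
# paste a foreign patch into an image and compare gradient energy along
# the seam with gradient energy elsewhere. Illustration only.
import numpy as np

rng = np.random.default_rng(0)
frame = rng.normal(0.5, 0.02, size=(128, 128))    # smooth-ish "background" frame
patch = rng.normal(0.7, 0.02, size=(48, 48))      # pasted "face" with different statistics
frame[40:88, 40:88] = patch                       # crude cut-and-paste composite

# Gradients: a pasted seam shows up as a line of unusually large jumps.
gy, gx = np.gradient(frame)
grad_mag = np.hypot(gx, gy)

seam_energy = grad_mag[39:41, 40:88].mean()       # along the top edge of the paste
background_energy = grad_mag[10:30, 10:30].mean() # a region far from the paste

print("gradient energy on the seam:", round(float(seam_energy), 4))
print("gradient energy elsewhere:  ", round(float(background_energy), 4))
# The seam's energy is much higher, which is the kind of telltale a detector hunts for.
```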
In a blog post written by Tom Burt 605 00:38:25,520 --> 00:38:30,040 Speaker 1: and Eric Horvitz, two Microsoft executives, they say, quote, it 606 00:38:30,080 --> 00:38:33,600 Speaker 1: works by detecting the blending boundary of the deep fake 607 00:38:33,760 --> 00:38:36,840 Speaker 1: and subtle fading or grayscale elements that might not 608 00:38:36,960 --> 00:38:40,759 Speaker 1: be detectable by the human eye. End quote. Now I'm 609 00:38:40,800 --> 00:38:44,360 Speaker 1: no expert, but to me, it sounds like the Video 610 00:38:44,440 --> 00:38:48,600 Speaker 1: Authenticator is working in a way that's not too dissimilar 611 00:38:48,880 --> 00:38:53,719 Speaker 1: to a discriminator in a generative adversarial network. I mean, 612 00:38:54,040 --> 00:38:58,080 Speaker 1: the whole purpose of the discriminator is to discriminate, or 613 00:38:58,160 --> 00:39:01,960 Speaker 1: to tell the difference between genuine, unaltered videos and 614 00:39:02,080 --> 00:39:06,440 Speaker 1: computer-generated ones. So the Video Authenticator is looking for 615 00:39:06,520 --> 00:39:10,400 Speaker 1: telltale signs that a video was not produced through traditional 616 00:39:10,480 --> 00:39:14,560 Speaker 1: means but was computer generated. However, that's the very thing 617 00:39:14,840 --> 00:39:18,200 Speaker 1: that the generators in GAN systems are looking 618 00:39:18,239 --> 00:39:21,960 Speaker 1: out for. So when a generator receives feedback that a 619 00:39:22,080 --> 00:39:26,360 Speaker 1: video it generated did not slip past the discriminator, it 620 00:39:26,440 --> 00:39:30,000 Speaker 1: then tweaks its input weights and starts to shift its 621 00:39:30,040 --> 00:39:33,680 Speaker 1: approach in order to bypass whatever it was that gave 622 00:39:33,719 --> 00:39:37,600 Speaker 1: away its last attempt, and it does this again and again. 623 00:39:38,120 --> 00:39:41,880 Speaker 1: So the Video Authenticator might work well for a given 624 00:39:41,920 --> 00:39:44,759 Speaker 1: amount of time, but I would suspect that in the 625 00:39:44,880 --> 00:39:48,120 Speaker 1: long run, the deep fake systems will become sophisticated enough 626 00:39:48,440 --> 00:39:53,319 Speaker 1: to fool the authenticator. Of course, Microsoft will continue to 627 00:39:53,400 --> 00:39:56,720 Speaker 1: tweak the authenticator as well, and it will become something 628 00:39:56,760 --> 00:40:00,920 Speaker 1: of a seesaw battle as one side outperforms the other temporarily, 629 00:40:01,280 --> 00:40:04,000 Speaker 1: and then the balance will shift. Though there may come 630 00:40:04,000 --> 00:40:06,760 Speaker 1: a time where either the deep fakes are too good 631 00:40:07,120 --> 00:40:10,240 Speaker 1: and they don't set off any alarms from the discriminator, 632 00:40:11,080 --> 00:40:16,040 Speaker 1: or the discriminator gets so sensitive that it starts to 633 00:40:16,080 --> 00:40:19,200 Speaker 1: flag real videos and it hits a lot of false 634 00:40:19,280 --> 00:40:23,680 Speaker 1: positives and calls them generated videos instead. Either way, you 635 00:40:23,760 --> 00:40:26,720 Speaker 1: reach a point where a tool like this no longer 636 00:40:26,760 --> 00:40:29,839 Speaker 1: really serves a useful purpose, and the Video Authenticator will 637 00:40:29,840 --> 00:40:32,920 Speaker 1: be obsolete. Now, this is something we see in artificial 638 00:40:32,960 --> 00:40:36,080 Speaker 1: intelligence all the time.
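To make that generator-versus-discriminator feedback loop concrete, here is a minimal, hypothetical sketch in Python with PyTorch. The toy data, layer sizes, and training settings are all invented for illustration; this is not the Video Authenticator or any real deepfake system, just the basic GAN pattern in which the discriminator tries to flag fakes and the generator updates its weights based on that feedback, again and again.

```python
# Minimal, hypothetical GAN training loop (illustration only, not a real
# deepfake system). The "videos" here are just 64-dimensional toy vectors.
import torch
import torch.nn as nn

DATA_DIM, NOISE_DIM = 64, 16

# Generator: turns random noise into a fake "video" vector.
generator = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(), nn.Linear(128, DATA_DIM))
# Discriminator: outputs a probability that its input is genuine.
discriminator = nn.Sequential(nn.Linear(DATA_DIM, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, DATA_DIM) + 2.0        # stand-in for genuine footage
    fake = generator(torch.randn(32, NOISE_DIM))  # the generator's latest attempt

    # 1. Discriminator learns to score real footage as 1 and fakes as 0.
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(32, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(32, 1))
    d_loss.backward()
    d_opt.step()

    # 2. Generator gets its feedback: it is rewarded when its fakes are
    #    scored as "real," so it shifts its weights to slip past whatever
    #    gave away its last attempt. Repeat, step after step.
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_loss.backward()
    g_opt.step()
```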
If you remember the good old 639 00:40:36,120 --> 00:40:39,000 Speaker 1: days of CAPTCHA, you know, the proving-you're-not-a 640 00:40:39,120 --> 00:40:42,480 Speaker 1: robot stuff. The stuff we were told to do was 641 00:40:42,840 --> 00:40:45,960 Speaker 1: typically type in a series of letters and numbers, and 642 00:40:46,000 --> 00:40:48,960 Speaker 1: it wasn't that hard, at least 643 00:40:49,000 --> 00:40:53,000 Speaker 1: not at first. That's because the text recognition algorithms of 644 00:40:53,040 --> 00:40:58,160 Speaker 1: the time weren't very good. They couldn't decipher mildly deformed 645 00:40:58,280 --> 00:41:01,439 Speaker 1: text because the shape of the text fell too far 646 00:41:01,560 --> 00:41:05,399 Speaker 1: outside the parameters of what the system could recognize as 647 00:41:05,440 --> 00:41:08,480 Speaker 1: a legitimate letter or number. You make the number a little, 648 00:41:09,040 --> 00:41:12,239 Speaker 1: you know, deformed, and then suddenly the system's like, well, 649 00:41:12,239 --> 00:41:14,839 Speaker 1: that doesn't look like a three to me because it's 650 00:41:14,880 --> 00:41:17,400 Speaker 1: not in the shape of a three. But over time 651 00:41:17,560 --> 00:41:22,239 Speaker 1: people developed better text recognition programs that could recognize these 652 00:41:22,239 --> 00:41:25,360 Speaker 1: shapes even if they weren't in the standard orientation for a three, 653 00:41:25,960 --> 00:41:30,040 Speaker 1: and those systems began to defeat those simple early CAPTCHAs, 654 00:41:30,600 --> 00:41:34,800 Speaker 1: which required CAPTCHA designers to make tougher versions, and eventually 655 00:41:34,840 --> 00:41:37,239 Speaker 1: the machines got good enough that they could match or 656 00:41:37,320 --> 00:41:41,280 Speaker 1: even outperform humans. And at that point, those text-based 657 00:41:41,360 --> 00:41:45,240 Speaker 1: CAPTCHAs proved to be more challenging for people than for machines, 658 00:41:45,280 --> 00:41:47,839 Speaker 1: which meant if you used them, you defeated the whole 659 00:41:47,880 --> 00:41:50,959 Speaker 1: purpose in the first place. So while this escalation proved 660 00:41:51,000 --> 00:41:53,800 Speaker 1: to be a challenge for security, it was a boon 661 00:41:54,120 --> 00:41:58,360 Speaker 1: for artificial intelligence. And while I focused almost exclusively on 662 00:41:58,440 --> 00:42:01,320 Speaker 1: the imagery of video here, the same sort of stuff 663 00:42:01,400 --> 00:42:04,880 Speaker 1: is going on with generated speech, including generated speech that 664 00:42:04,960 --> 00:42:09,920 Speaker 1: imitates specific voices. Like deep fake videos, this approach works 665 00:42:09,960 --> 00:42:12,680 Speaker 1: best if you have a really big data set of 666 00:42:12,760 --> 00:42:19,680 Speaker 1: recorded audio, so people like movie and TV stars, news reporters, politicians, 667 00:42:19,760 --> 00:42:24,880 Speaker 1: and, you know, podcasters are great targets for this stuff. 668 00:42:25,120 --> 00:42:27,280 Speaker 1: There might be hundreds or, you know, in my case, 669 00:42:27,680 --> 00:42:32,440 Speaker 1: thousands of hours of recorded material to work from.
Training 670 00:42:32,440 --> 00:42:38,439 Speaker 1: a model to use the frequencies, timbre, intonation, pronunciation, pauses, 671 00:42:38,520 --> 00:42:41,560 Speaker 1: and other mannerisms of speech can result in a system 672 00:42:41,640 --> 00:42:45,160 Speaker 1: that can generate vocals that sound like the target, sometimes 673 00:42:45,160 --> 00:42:49,640 Speaker 1: to a fairly convincing degree. And for a while, to 674 00:42:49,640 --> 00:42:52,560 Speaker 1: peek behind the curtain here, we at Tech Stuff were 675 00:42:52,600 --> 00:42:54,520 Speaker 1: working with a company that I'm not going to name, 676 00:42:54,800 --> 00:42:57,399 Speaker 1: but they were going to do something like this as 677 00:42:57,480 --> 00:42:59,960 Speaker 1: an experiment. I was gonna do a whole episode on it, 678 00:43:00,520 --> 00:43:03,680 Speaker 1: and I had planned on crafting a segment of that 679 00:43:03,800 --> 00:43:07,800 Speaker 1: episode only through text. I was not going to actually 680 00:43:07,800 --> 00:43:10,880 Speaker 1: record it myself, but instead use a system that was 681 00:43:10,960 --> 00:43:16,120 Speaker 1: trained on my voice to replicate my voice and deliver 682 00:43:16,280 --> 00:43:19,520 Speaker 1: that segment on its own. I was curious if it 683 00:43:19,520 --> 00:43:22,479 Speaker 1: could nail not just the audio quality of my voice, which, 684 00:43:22,840 --> 00:43:27,200 Speaker 1: let's be honest, is amazing. That's sarcasm. I can't stand 685 00:43:27,200 --> 00:43:30,600 Speaker 1: listening to myself. But it would also have to replicate 686 00:43:30,640 --> 00:43:34,480 Speaker 1: how I actually make certain sounds. Like, would it get 687 00:43:34,480 --> 00:43:37,160 Speaker 1: the bit of the Southern accent that's in my voice, 688 00:43:37,800 --> 00:43:40,960 Speaker 1: or the way I emphasize certain words? Would it pause 689 00:43:41,040 --> 00:43:44,399 Speaker 1: for effect at all, or would it just robotically say 690 00:43:44,560 --> 00:43:47,279 Speaker 1: one word after the next and only pause when there 691 00:43:47,360 --> 00:43:50,759 Speaker 1: was some helpful punctuation that told it to do so? 692 00:43:51,280 --> 00:43:54,080 Speaker 1: Would it indicate a question by raising the pitch at 693 00:43:54,080 --> 00:43:58,239 Speaker 1: the end of a sentence? Sadly, we never got far 694 00:43:58,760 --> 00:44:01,640 Speaker 1: with that particular project, so I don't have any 695 00:44:01,680 --> 00:44:03,520 Speaker 1: answers for you. I don't know how it would have 696 00:44:03,600 --> 00:44:06,240 Speaker 1: turned out, but clearly one of the things I thought 697 00:44:06,280 --> 00:44:09,200 Speaker 1: of was that it's a bit of a red flag. 698 00:44:09,239 --> 00:44:11,839 Speaker 1: If you can train a computer to sound exactly like 699 00:44:11,960 --> 00:44:15,240 Speaker 1: a specific person, that means you can make that person 700 00:44:15,760 --> 00:44:19,840 Speaker 1: say anything you like, and obviously, like deep fake videos, 701 00:44:19,880 --> 00:44:22,919 Speaker 1: that could have some pretty devastating consequences if it were 702 00:44:23,000 --> 00:44:27,960 Speaker 1: at all, you know, believable or seemed realistic.
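As a hedged illustration of the speech properties just listed, here is a small, hypothetical Python sketch using the librosa audio library: MFCCs as a rough proxy for timbre, pitch tracking for intonation, and low-energy frames as a crude stand-in for pauses. The file name and thresholds are invented, and this is only the analysis step, not a voice-cloning model.

```python
# Hypothetical sketch: extract the kinds of speech features a voice-cloning
# system might train on. "my_episode.wav" is a made-up file name.
import librosa
import numpy as np

audio, sr = librosa.load("my_episode.wav", sr=22050)

# Timbre: mel-frequency cepstral coefficients summarize the "color" of the voice.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)

# Intonation: fundamental frequency (pitch) over time, e.g. rising at questions.
f0, voiced_flag, _ = librosa.pyin(audio, fmin=65, fmax=300, sr=sr)

# Pauses: frames whose energy falls below a simple (arbitrary) threshold.
rms = librosa.feature.rms(y=audio)[0]
pause_ratio = float(np.mean(rms < 0.01))

print("MFCC shape:", mfcc.shape)
print("Median pitch (Hz):", float(np.nanmedian(f0)))
print("Fraction of low-energy (pause-like) frames:", pause_ratio)
```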
Now, the 703 00:44:27,960 --> 00:44:31,000 Speaker 1: company we were working with was working hard to make 704 00:44:31,000 --> 00:44:33,440 Speaker 1: sure that the only person to have access to a 705 00:44:33,480 --> 00:44:36,600 Speaker 1: specific voice would be the owner of that voice, or 706 00:44:37,160 --> 00:44:40,600 Speaker 1: presumably the company employing that person. Though that does bring 707 00:44:40,680 --> 00:44:43,160 Speaker 1: up a whole bunch of other potential problems, like can 708 00:44:43,200 --> 00:44:47,480 Speaker 1: you imagine eliminating voice actors from a job because you've 709 00:44:47,480 --> 00:44:50,000 Speaker 1: got enough recordings of their voice and you can just replicate it? 710 00:44:50,080 --> 00:44:53,160 Speaker 1: That wouldn't be great. But even so, it was something 711 00:44:53,200 --> 00:44:56,480 Speaker 1: I felt was both fascinating from a technology standpoint and 712 00:44:56,520 --> 00:45:01,319 Speaker 1: potentially problematic when it comes to an application of that technology. 713 00:45:01,719 --> 00:45:05,080 Speaker 1: One other thing I should mention is that the Internet 714 00:45:05,200 --> 00:45:08,240 Speaker 1: at large has been pretty active in fighting deep fakes, 715 00:45:08,280 --> 00:45:11,640 Speaker 1: not necessarily in detecting them, but in removing the platforms from 716 00:45:12,040 --> 00:45:14,839 Speaker 1: which they were being shared, Reddit being a big one. 717 00:45:14,960 --> 00:45:17,560 Speaker 1: The subreddit that was dedicated to deep fakes has 718 00:45:17,560 --> 00:45:20,960 Speaker 1: been shut down. So there have been some of those 719 00:45:21,000 --> 00:45:24,160 Speaker 1: moves as well. Now this is not directly against the technology, 720 00:45:24,160 --> 00:45:28,840 Speaker 1: it's more against the proliferation of the output 721 00:45:29,280 --> 00:45:33,040 Speaker 1: of that technology. As for detecting deep fakes, it's interesting 722 00:45:33,080 --> 00:45:36,800 Speaker 1: to me that people are even developing tools to detect them, 723 00:45:36,840 --> 00:45:39,719 Speaker 1: because to me, the best tool so far seems to 724 00:45:39,760 --> 00:45:45,839 Speaker 1: be human perception. It's not that the images aren't really convincing, 725 00:45:46,000 --> 00:45:49,120 Speaker 1: or that we can suddenly detect these, you know, blending 726 00:45:49,239 --> 00:45:53,440 Speaker 1: lines like the Video Authenticator tool can. It's rather that it's 727 00:45:53,480 --> 00:45:56,160 Speaker 1: just not hard for us to spot a deep fake. Now, 728 00:45:56,200 --> 00:46:00,040 Speaker 1: stuff just doesn't quite look right in the way that 729 00:46:00,200 --> 00:46:04,360 Speaker 1: people behave in these videos. The vocals and animation often 730 00:46:04,440 --> 00:46:09,280 Speaker 1: don't quite match. The expressions aren't really natural, the progression 731 00:46:09,320 --> 00:46:14,319 Speaker 1: of mannerisms feels synthetic and not genuine. It just 732 00:46:14,360 --> 00:46:18,360 Speaker 1: looks off. It's that uncanny valley thing, and so just 733 00:46:18,440 --> 00:46:21,640 Speaker 1: paying attention and thinking critically can really help us suss 734 00:46:21,640 --> 00:46:24,319 Speaker 1: out the fakes from the real thing. Even if we 735 00:46:24,400 --> 00:46:27,759 Speaker 1: reach a point where machines can create a convincing enough 736 00:46:27,800 --> 00:46:32,000 Speaker 1: fake to pass for reality,
we can still apply critical thinking, 737 00:46:32,360 --> 00:46:35,440 Speaker 1: and we always should. Heck, we should be applying critical 738 00:46:35,480 --> 00:46:38,480 Speaker 1: thinking even when there's no doubt as to the validity 739 00:46:38,520 --> 00:46:42,200 Speaker 1: of the video, because there may be plenty to doubt 740 00:46:42,280 --> 00:46:45,920 Speaker 1: in the content of the video itself. If I listen to 741 00:46:46,000 --> 00:46:50,360 Speaker 1: a genuine scam artist in a genuine video, that doesn't 742 00:46:50,400 --> 00:46:53,799 Speaker 1: make the scam more legitimate. We always need to use 743 00:46:53,840 --> 00:46:57,200 Speaker 1: critical thinking. What I think is most important is that 744 00:46:57,239 --> 00:47:03,560 Speaker 1: we acknowledge the very real fact that there are numerous organizations, agencies, governments, 745 00:47:03,840 --> 00:47:08,160 Speaker 1: and other groups that are actively attempting to spread misinformation 746 00:47:08,400 --> 00:47:14,719 Speaker 1: and disinformation. There are entire intelligence agencies dedicated to this endeavor, 747 00:47:15,160 --> 00:47:18,640 Speaker 1: and then there are more independent groups that are doing 748 00:47:18,680 --> 00:47:22,000 Speaker 1: it for one reason or another, typically either to advance 749 00:47:22,040 --> 00:47:25,839 Speaker 1: a particular political agenda or just to make as much 750 00:47:25,920 --> 00:47:30,560 Speaker 1: money as quickly as possible. This is beyond doubt or question. 751 00:47:30,640 --> 00:47:34,600 Speaker 1: There are numerous misinformation campaigns that are actively going on 752 00:47:34,760 --> 00:47:38,080 Speaker 1: out there in the real world right now. Most of 753 00:47:38,120 --> 00:47:42,279 Speaker 1: them are not depending on deep fakes, because one, deep 754 00:47:42,320 --> 00:47:45,920 Speaker 1: fakes aren't really good enough to fool most people right now, 755 00:47:46,400 --> 00:47:49,600 Speaker 1: and two, they don't need deep fakes in the 756 00:47:49,640 --> 00:47:52,400 Speaker 1: first place. There are other methods that are simpler, that 757 00:47:52,520 --> 00:47:56,280 Speaker 1: don't need nearly the processing power, and that work just fine. 758 00:47:56,600 --> 00:47:59,160 Speaker 1: Why would you go through the trouble of synthesizing a 759 00:47:59,280 --> 00:48:01,839 Speaker 1: video if you can get a better response with a 760 00:48:01,840 --> 00:48:05,920 Speaker 1: blog post filled with lies or half-truths? It's just 761 00:48:06,000 --> 00:48:09,520 Speaker 1: not a great return on investment. So bottom line, be 762 00:48:09,680 --> 00:48:14,520 Speaker 1: vigilant out there, particularly on social media. Be aware that 763 00:48:14,560 --> 00:48:17,239 Speaker 1: there are plenty of people who will not hesitate to 764 00:48:17,360 --> 00:48:20,719 Speaker 1: mislead others in order to get what they want. Use 765 00:48:20,760 --> 00:48:26,000 Speaker 1: a critical eye to evaluate the information you encounter. Ask questions, 766 00:48:26,440 --> 00:48:31,160 Speaker 1: check sources, look for corroborating reports. It's a lot of work, 767 00:48:31,200 --> 00:48:34,080 Speaker 1: but trust me, it's way better that we do our 768 00:48:34,120 --> 00:48:37,120 Speaker 1: best to make sure the stuff we're depending on is 769 00:48:37,200 --> 00:48:40,759 Speaker 1: actually dependable. It'll turn out better for us in the 770 00:48:40,800 --> 00:48:43,919 Speaker 1: long run.
Well, that wraps up this episode of Tech Stuff, 771 00:48:43,960 --> 00:48:47,399 Speaker 1: which, yeah, I used as a backdoor to argue about 772 00:48:47,440 --> 00:48:51,000 Speaker 1: critical thinking. Again, sue me. Don't, don't really sue me. 773 00:48:51,520 --> 00:48:55,560 Speaker 1: But I think this is another instance where it's a 774 00:48:55,640 --> 00:48:58,680 Speaker 1: really clear example of where we have to use that kind 775 00:48:58,680 --> 00:49:01,000 Speaker 1: of thinking. So I'm gonna keep on stressing it. 776 00:49:01,480 --> 00:49:05,080 Speaker 1: And you guys are awesome. I believe in you. I 777 00:49:05,120 --> 00:49:08,080 Speaker 1: think that when we start using these tools at our 778 00:49:08,080 --> 00:49:12,560 Speaker 1: disposal, which everybody can develop just with some practice, 779 00:49:13,040 --> 00:49:16,120 Speaker 1: things will be better. We'll be able to suss out 780 00:49:16,200 --> 00:49:20,719 Speaker 1: the nonsense from the real stuff, and we're all better 781 00:49:20,760 --> 00:49:22,439 Speaker 1: off in the long run if we can do that. 782 00:49:23,000 --> 00:49:25,680 Speaker 1: If you guys have suggestions for future topics I should 783 00:49:25,719 --> 00:49:28,960 Speaker 1: cover in episodes of Tech Stuff, let me know via Twitter. 784 00:49:29,280 --> 00:49:33,279 Speaker 1: The handle is TechStuffHSW, and I'll 785 00:49:33,280 --> 00:49:41,480 Speaker 1: talk to you again really soon. Tech Stuff is an 786 00:49:41,480 --> 00:49:45,200 Speaker 1: I Heart Radio production. For more podcasts from I Heart Radio, 787 00:49:45,520 --> 00:49:48,680 Speaker 1: visit the I Heart Radio app, Apple Podcasts, or wherever 788 00:49:48,760 --> 00:49:50,280 Speaker 1: you listen to your favorite shows.