Speaker 1: Get in touch with technology with TechStuff from howstuffworks dot com. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer and I love all things tech. And here's a fun fact about getting older: as you age, stuff that you once thought was impossible will become not only possible but the norm, and future generations won't even think about what it must have been like before the impossible was commonplace. Now, this is true for every generation. It's not like this is, you know, a brand new, groundbreaking observation. Plenty of people have made it before me, but I want to talk about a specific example.

With photography, for instance, it used to be pretty difficult to manipulate pictures convincingly. There have been photo editing tools for decades, but generally it took a great deal of skill and training, plus access to specialized equipment, to pull it off, especially in the old film days. Gradually, tools like Photoshop made it easier to manipulate digital images. It still requires a certain level of skill to pull off a really convincing job, and it's easy to make really badly manipulated images. But as these tools became widely available, people began to learn how to use them, and we had to come to the realization that we cannot necessarily believe our own eyes when we're looking at a digital image.

Now, the same thing is happening with video footage. It's quite possible to fake video footage, though again, if you want to do it really well, it requires some skill, some specialized tools, and some real expertise to do it in a way that's really convincing. It's a pretty recent technological capability. In fiction, however, it's been around for a long time. I remember seeing the movie Rising Sun back in the early nineteen nineties.
Now, in that film, Wesley Snipes plays a detective, and Sean Connery, in his most convincing role since he played an Egyptian immortal posing as a Spaniard with a Scottish accent in Highlander, plays an expert on Japanese customs and culture. Sean Connery. Anyway, the film is a mystery thriller, and while the two are investigating a murder, they come into possession of some video footage, and they find out that the footage has actually been manipulated; it was planted for them to find so that it would put them on the wrong trail. One person's face was replaced with someone else's, and in a somewhat comedic scene, the video editor who is explaining this casually swaps the heads of Snipes and Connery in real time in a video feed to show off the capability. That might have been a tad unrealistic, but today we are in a world in which video manipulation of increasingly convincing quality is achievable in real time. In fact, these days it's possible to use sophisticated computer algorithms to manipulate captured video almost as if the video were a computer-generated cartoon reacting to real-time inputs like a video game controller, only instead of it being a video game, it's a real person on video.

There are, of course, lots of ways this technology could be used unethically, and one of the best known has been the focus of this whole conversation around video manipulation. It comes from a former Reddit user who went by the handle deepfakes. That handle has become the shorthand for the general practice, which frequently, but not exclusively, involves replacing the face of an actor in a pornographic scene with someone else's face, like that of a celebrity, which is pretty darn unethical and creepy. The name itself was a reference to the technology used in the approach: it relies on a process called deep learning. Deep learning is a type of machine learning, a subtype that utilizes artificial neural networks.
And I've talked an awful lot about those kinds of networks recently, so I'm not going to go over the whole thing again. I'll give just a really quick rundown: you have nodes, artificial neurons, in these networks that receive input from potentially multiple other nodes. Then, based on that input, the artificial neuron you're looking at performs some sort of weighted operation and produces a single output. That single output can move on to become one of many inputs for a different artificial neuron in the network, and so on and so forth.

Deep learning networks are very, very large artificial neural networks, and they can accept a large amount of training data. This is a scalable approach, which means the larger the network and the more data you can feed it, the better it performs. That's different from many other machine learning models, which tend to hit a performance plateau once they reach a certain size: if you were to add more nodes to the network, you wouldn't necessarily see a comparable increase in performance. You would kind of flatten out over time.

In 2014, a deep learning expert named Andrew Ng gave a talk at Stanford about the best use cases for deep learning, and he mentioned that it was particularly good at supervised learning tasks. These are the types of computer problems in which we humans already know the answer, such as: is there a cat in this photograph? Humans can pick up on that right away, assuming someone has not carefully hidden a cat in a very busy image. But for a computer, this is a much more difficult problem. Even if the picture has a cat center stage, it can be tough for a computer to figure that out. Using a supervised learning approach with a deep learning network, you can train a system to recognize cats in images with a high degree of success, if you have a large amount of training data to train the network with.
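To make that "weighted operation" idea concrete, here is a tiny Python sketch of a few artificial neurons wired together. It isn't from any particular library, and the weights and inputs are made-up numbers; it just shows inputs flowing through weighted sums and activations to produce a single output.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of the incoming signals, squashed by a sigmoid activation.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Two neurons in a first layer feed one neuron in a second layer.
x = [0.5, 0.8, 0.1]                     # hypothetical input features
h1 = neuron(x, [0.4, -0.2, 0.9], 0.1)   # hidden neuron 1
h2 = neuron(x, [-0.7, 0.3, 0.5], -0.3)  # hidden neuron 2
y = neuron([h1, h2], [1.2, -0.6], 0.0)  # output neuron
print(y)  # a single value between 0 and 1, e.g. "how likely there is a cat"
```

A deep learning network is, loosely speaking, many layers of these stacked on top of each other, with the weights learned from training data rather than typed in by hand.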
Now, in that other case I was talking about: the Reddit user who went by deepfakes started posting on Reddit in late 2017. The user made an open-source version of a deep learning algorithm available for the purposes of video manipulation, and anyone could take advantage of it. Specifically, this algorithm was designed for face swapping. It would allow you to put the face of one person onto the body of another in video form, and it wasn't always convincing. In fact, it could often be easily detectable as fake if someone had not trained the model properly before creating the video. But it did open up a can of worms once the practice started getting media coverage.

However, the actual technology to pull this off was already a couple of years old when deepfakes shared it. Back in 2016, a group of researchers from Stanford, the University of Erlangen-Nuremberg, and the Max Planck Institute for Informatics collectively published a paper titled "Face2Face: Real-Time Face Capture and Reenactment of RGB Videos," and that's "face," the number two, and "face." The paper details the methodology the group used to create a pretty incredible effect. The algorithm could take the facial expressions from one person and transfer them in real time to a video target. It was like turning the video into a digital puppet.

So you might have a video loop of a celebrity running, preferably a loop that's easily repeatable without the repeat being terribly noticeable, and that's your target video. If you just let it run, you would see video of someone sitting down, maybe looking around a little bit, but that's it, nothing special. Then you would have a source subject sitting in view of a consumer-quality webcam, no special equipment here, and that person could make different expressions, including opening and closing their mouth, and the video target would match them move for move, like a digital puppet. Moreover, the source subject didn't have to wear any special gear.
They didn't have to have any special markers, none of those dots that you would see with motion capture. None of that was necessary. All the algorithm needed was a video feed from a monocular camera, so you didn't even need depth perception for this. There's a video of their work on YouTube that shows off this process and includes a loop of George W. Bush sitting for an interview. The source subject can manipulate the face that Bush makes just by making faces of his own, and the algorithm maps those movements to the target video. It's pretty wild to see a moving image of George W. Bush responding in real time to all of these different facial expressions this guy is making.

So how did the team do this? It's one thing to say a deep learning algorithm gave them this capability, but that's not really an explanation. The paper definitely spells this out in real technical detail. It starts off the explanation by saying, quote, "In our method, we first reconstruct the shape identity of the target actor using a new global non-rigid model-based bundling approach based on a prerecorded training sequence. As this preprocess is performed globally on a set of training frames, we can resolve geometric ambiguities common to monocular reconstruction. At runtime, we track both the expressions of the source and target actors' video by a dense analysis-by-synthesis approach based on a statistical facial prior," end quote. And it goes on in that vein throughout the paper, which means it gets pretty dense. But I think we can suss out what's going on from a high level if we just take a moment. But first, I'm going to take a moment of my own to thank my sponsor.

So how did the Face2Face team build this tool? Well, for each target video, they would collect a large sample of footage and images and feed it to this deep learning algorithm.
This would be necessary to identify all the points on the face that would move with various expressions, as well as to capture images of the inside of the target's mouth when he or she spoke. That's because the video loop they used to create the manipulated video would feature the target subject typically with his or her mouth closed; it might be a section in which the subject was sitting down for an interview and listening to an interviewer's questions but not responding yet, just listening. The additional video would provide information about the inside of the target subject's mouth, which could be rendered onto the target video when it came time to do that.

Their approach improved the scanning technique to build face templates for both the source subject, who provides all the expressions, and the target subject, who mimics all the expressions. As the source subject makes different facial expressions, the computer face template detects how the subject's face changes, or deforms, over time. The computer model then takes that information, saying, all right, the lips moved in this way, there was a grimace here or a smile there, and transfers those motions to the target's face template, which is matched to the target's actual face. This process transfers the expressions over to the target, so when the source subject grimaces, the target grimaces; if the source subject just sits still, the target video will continue to loop and the target's face won't change. The more video footage you can get of your target and your source, the better the computer algorithms are at creating those face templates, and the more natural the manipulation will appear in the finished video. You also want a really good amount of footage just to get all that extra information you need, like the inside of the mouth, so that all of that can be extrapolated properly.
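One rough way to picture those face templates is as a small set of numbers describing a face: some that capture identity, which stay fixed for a given person, and some that capture expression, which change every frame. The sketch below only illustrates that transfer step under those assumptions; the real Face2Face system fits a full parametric 3D face model with photometric optimization, and every number here is made up.

```python
import numpy as np

# Toy "face template": a face is a mean shape plus identity and expression
# deformations, loosely in the style of a blendshape model.
def render_face(identity, expression, identity_basis, expression_basis, mean_face):
    return mean_face + identity_basis @ identity + expression_basis @ expression

rng = np.random.default_rng(0)
num_points = 30                                  # pretend we track 30 face points
mean_face = rng.normal(size=num_points * 3)      # x, y, z for each point
identity_basis = rng.normal(size=(num_points * 3, 5))
expression_basis = rng.normal(size=(num_points * 3, 4))

source_identity = rng.normal(size=5)  # fitted once from the source's footage
target_identity = rng.normal(size=5)  # fitted once from the target's footage

# Each frame: estimate the source's expression coefficients, then reenact the
# target by combining the TARGET's identity with the SOURCE's expression.
source_expression = np.array([0.8, 0.0, 0.3, 0.1])   # e.g. an open-mouth smile
reenacted = render_face(target_identity, source_expression,
                        identity_basis, expression_basis, mean_face)
# "reenacted" keeps the target's identity but mimics the source's expression;
# source_identity would only be used while tracking the source's own face.
```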
You have to design a tool that can encode an image, called the training image, and then decode that data to reconstruct the image. So imagine you've got a picture. The encoder essentially creates data based on that image; it's like a description of that image. The decoder takes the description and tries to rebuild the image based on it. I think of this like that scene in Willy Wonka where Mike Teavee gets broken up into a million little pieces and then gets reconstructed on the television screen. The second image is not a copy. It's not like you made a copy of the first one; it's like you built a new image based on the first one.

By the way, when you start off with these algorithms, those reconstructions tend to look pretty bad. You have to train and train and train the model so that it gets better and better at producing a close representation of the original image when it does its reconstruction. And you would have, essentially, decoders for both your source subject and your target subject. You use the same encoder for both, but two different decoders: one dedicated to your source, one dedicated to your target. Then you feed images through the system again and again, and the network's weights get adjusted each time through a process called backpropagation. You typically do this millions of times to improve the process, and then you're ready to really switch things up.
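Here is a minimal sketch of that setup: one shared encoder, two decoders, trained to reconstruct each person's faces. It assumes PyTorch, tiny made-up layer sizes, and 64-by-64 input images; it's meant to illustrate the idea, not reproduce the actual deepfakes code.

```python
import torch
import torch.nn as nn

def make_encoder():
    # Turns a 64x64 RGB face image into a 256-number "description" of it.
    return nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU())

def make_decoder():
    # Rebuilds a (flattened) 64x64 RGB image from that description.
    return nn.Sequential(nn.Linear(256, 64 * 64 * 3), nn.Sigmoid())

encoder = make_encoder()        # shared by both people
decoder_one = make_decoder()    # learns to rebuild person one's face
decoder_two = make_decoder()    # learns to rebuild person two's face

params = (list(encoder.parameters()) + list(decoder_one.parameters())
          + list(decoder_two.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(faces_one, faces_two):
    # faces_one, faces_two: batches of shape (N, 3, 64, 64), values in [0, 1].
    # Each person's images go through the SAME encoder but their OWN decoder,
    # and backpropagation nudges the weights toward faithful reconstructions.
    recon_one = decoder_one(encoder(faces_one))
    recon_two = decoder_two(encoder(faces_two))
    loss = (loss_fn(recon_one, faces_one.flatten(1))
            + loss_fn(recon_two, faces_two.flatten(1)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The swap described next: encode a frame of person one, but decode it with
# person two's decoder.
# swapped = decoder_two(encoder(frame_of_person_one))
```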
So let's say we've got two people, person one and person two, and you've been feeding images of both of these people through the same encoder, but of course you have dedicated decoders to produce the reconstructions: person one has decoder one and person two has decoder two. Now let's say you're ready to put person two's face on person one's body. Well, you would feed an image of person one into the encoder, but you use the decoder for person two to reconstruct the image, and what you get is person two's face, but mimicking the expression from person one. You, or rather the computer algorithm, do this frame by frame on video, and you end up with a video appearing to feature one person when in fact it's just their face on top of someone else, making the exact same expressions as whoever was originally in that video.

Now, back over to deepfakes. Not long after the Reddit user initially posted this code, folks over at Reddit were taking the open-source code and making more advanced software based off of it. Soon there were desktop apps that would take over all the hard parts of this process, all the codey bits, if you will, of training a model. Some of them would guide users through creating the data that would be used to train the model and go all the way through the process of creating the final fake videos. Even with some of the more sophisticated versions, there were telltale signs of tampering, typically some blurring in the images, particularly near chins and mouths. If there was any flicker, that was a sign you didn't take enough time to train the model. Typically you would want to do several days of training at least; if you didn't take that time, you might see some really nasty blurring and flickering, and it would be a dead giveaway that this was tampered video.

Writer, director, and comedian Jordan Peele demonstrated the power of this technology. He showed how, with his impersonation of Barack Obama and some manipulation software, he could create a fake public service announcement in which the president would appear to say things that he normally would never say.
The technology behind this made use of what is called a long short-term memory network, or LSTM. To go into the mechanics of that would require another podcast, but using an approach similar to what I've already described, a team was able to make a video of Obama apparently lip-syncing Peele's satirical message. The goal of the PSA was to act as an alert, because fakes are getting harder to spot. The University of Washington showed this off in their Synthesizing Obama project, in which they took the audio from one of President Obama's speeches and then used it to animate his face in video from a different address that he gave during his presidency. So in that example, the person in the target video is the same person as the source of the audio, but the point was pretty clear: the tech would soon make it possible to fake someone saying or doing just about anything. It just takes the right algorithms, the right amount of training data, and the right amount of time to get the model trained up enough to do it smoothly.

Now, this technology could be used to do stuff that isn't related to malicious deception or pornography or anything along those lines. It could be used in television and film for lots of purposes, including potentially adding actors who have passed away into a film. Paired with similar work that's going on in voice synthesis, you could end up with a convincing replacement, which means we could make movies with dead actors taking on new parts, because we can synthesize their speech and we can synthesize their appearance. You would still have someone else acting out the part physically, but you would replace their image with this actor's image.

Or maybe you would want to use this kind of technology just to make everyone think you can cut a rug. This brings me to the University of California, Berkeley, and the subject of a paper titled Everybody Dance Now. The goal is a simple concept that's actually really hard to pull off.
What if you were to take the movements of a professional dancer and then map those movements onto the body of someone who wasn't a dancer? What if you could create a video in which literally anyone would appear to move like a skilled, trained dancer? And how the heck would that be possible? Well, at the heart of the team's efforts was something I talked about in a recent episode of TechStuff about an AI-generated portrait, and that would be generative adversarial networks, or GANs. These use a pair of artificial neural networks in competition against each other. Since I covered this recently, I'll just give a super quick, high-level summary.

You've got one network that has a specific job, such as trying to create an original image of a cat. We'll go back to the cat pictures; that's one of my favorite examples because it was one of the early use cases of neural networks that I remember encountering when I was doing research. Now, let's say you've got your second network. Your second network has the specific job of evaluating pictures of cats to determine if they are valid, meaning: is this a real picture of a cat that's part of the training material I'm accepting, or is this in fact a fake that was created by a computer program, the other neural network? So you've got one network trying to fool the other network. And these networks get better at what they do over time; they improve. Your counterfeiting network is getting better and better at making fake pictures of cats, and your detector network is getting better and better at detecting fake images of cats. Now, typically this requires humans to give feedback or tweak weight values along the networks, but they do get better over time. So if the network trying to create a picture of a cat gets the feedback of "sorry, buddy, but they're onto you," then it can try again and adjust its approach slightly in an effort to fool the second network.
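Here's a stripped-down sketch of that generator-versus-discriminator loop (the back-and-forth the host is describing continues below). It assumes PyTorch and toy layer sizes, and the cat images are just flattened 64-by-64 vectors; it's an illustration of the training idea, not any specific system from the episode.

```python
import torch
import torch.nn as nn

# One network invents images, the other judges real versus fake.
generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                          nn.Linear(256, 64 * 64), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_cats):
    # real_cats: a batch of real cat images, flattened to shape (N, 64*64).
    n = real_cats.size(0)
    fakes = generator(torch.randn(n, 100))

    # Detector network: learn to call the real images real, the fakes fake.
    d_loss = (bce(discriminator(real_cats), torch.ones(n, 1))
              + bce(discriminator(fakes.detach()), torch.zeros(n, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Counterfeiting network: adjust so its fakes get judged as "real."
    g_loss = bce(discriminator(fakes), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```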
If the second network gets the feedback "you let this one slip by and it's a fake," then it will adjust, or it will be adjusted, to look out for any telltale signs that it had missed in that earlier evaluation. Over time, the two networks working against each other produce the ultimate result: better and better computer-generated content, whether it's an image of a cat, or a sonnet, or a song, or a video. Now, that doesn't mean these computer-generated things are at the same level as human-generated stuff, especially when it comes to text. I've seen a lot of computer-generated song lyrics that were inscrutable even by my old-man standards. So I think we're a long way away from getting to a point where they can fool us in every case, but with video they're getting pretty darn good.

Now, this team had two groups of subjects: your source subjects and your target subjects. The sources in this case were the people who could dance, like ballet dancers, hip-hop dancers, and that sort of thing. They legit know how to move, and they would demonstrate various dances on video. The second group, your target subjects, were not trained dancers. They were to go through a series of moves and poses, essentially aping as best they could the movements of trained dancers, and the goal of this pair of networks was to smooth the movements out and adjust the timing so that these untrained dancers would appear to move more like their groovy source-subject counterparts. I'll explain more in just a moment, but first let's take another quick break to thank our sponsor.

According to the Everybody Dance Now paper, the team would transfer motion from the sources to the targets through an end-to-end pixel-based pipeline. So here's how that's done, because if you're like me, that phrase meant next to nothing to you. Specifically, the group used three stages to take the movements of one person and transpose them onto a target person.
Those three were pose detection, global pose normalization, and mapping from normalized pose stick figures to the target subject. Pose detection involves teaching machines, in other words computers, how to interpret images to determine where key body points are: elbows, knees, hips, shoulders, the head, that kind of stuff. That first requires that you teach the machine to recognize those points in the first place, so you have to train a machine to recognize those points and identify them with a target level of accuracy. It's pretty typical to represent these joints as points in a stick figure, so each point represents a joint or point of articulation, and the lines represent the trunk of the body, the limbs, and the head. You end up with a stick figure. If your machine learning mechanism was a good one, the machine should be able to overlay a stick figure on top of any image of a person posing, and the stick figure should more or less conform to that image, including where the actual joints are. So if you have someone standing there in the classic Peter Pan pose, fists on their hips and arms akimbo, then it should draw a stick figure that's essentially aping the same thing and be able to overlay it on top of the original image.

Now, these days this can be done in real time. For example, there's a team at Google Creative Lab that took a machine learning model called PoseNet and created a JavaScript version with TensorFlow, which is an open-source software library often used for machine learning. With this tool you can do real-time pose estimation through a browser and a webcam. The application doesn't have any technology related to identifying the person in the image; it's just, quote unquote, interested in what the person is doing, not who the person is.
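To make the stick figure idea concrete, here's a tiny sketch of that representation: a handful of named keypoints plus the "bones" connecting them. Real systems get these coordinates from a pretrained pose detector such as PoseNet; the numbers below are simply made up for illustration.

```python
# Names of the tracked body points and the segments ("bones") joining them.
KEYPOINTS = ["head", "neck", "r_shoulder", "r_elbow", "r_wrist",
             "l_shoulder", "l_elbow", "l_wrist", "r_hip", "r_knee",
             "r_ankle", "l_hip", "l_knee", "l_ankle"]
BONES = [("head", "neck"), ("neck", "r_shoulder"), ("r_shoulder", "r_elbow"),
         ("r_elbow", "r_wrist"), ("neck", "l_shoulder"), ("l_shoulder", "l_elbow"),
         ("l_elbow", "l_wrist"), ("neck", "r_hip"), ("r_hip", "r_knee"),
         ("r_knee", "r_ankle"), ("neck", "l_hip"), ("l_hip", "l_knee"),
         ("l_knee", "l_ankle")]

# One detected pose: (x, y) pixel coordinates for each keypoint (made up here).
pose = dict(zip(KEYPOINTS, [(320, 80), (320, 140), (280, 150), (260, 210),
                            (250, 270), (360, 150), (380, 210), (390, 270),
                            (300, 300), (295, 380), (290, 460), (340, 300),
                            (345, 380), (350, 460)]))

def stick_figure_segments(pose):
    # The line segments you would draw over the image to show the stick figure.
    return [(*pose[a], *pose[b]) for a, b in BONES]

for x1, y1, x2, y2 in stick_figure_segments(pose):
    print(x1, y1, x2, y2)
```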
So you can actually run this on your own machine in a browser: you can pose in front of a webcam and you'll see the little stick figure painted on top of your image on the computer. Every time you move, every time you bend a joint, you'll see the stick figure doing the same thing, mapped on top of you. The Berkeley team made use of a pretrained pose detector, meaning they didn't build a new one, which helped save a lot of time and expense on their project.

Now, people come in all shapes and sizes. In the video the team released, they showed off subjects who included a woman who appeared to be around average height and a man who appeared to be pretty darn tall. A motion transfer method that only worked between a subject and a target of similar shape and size would be pretty limited. So the purpose of the global pose normalization stage is to account for all the differences between the source and target subjects and their locations within the frame of the camera. Without this step, the motion transfer might appear ghoulish. We don't all have the same proportions, so a mismatch might mean a target's limbs would appear to bend in places that were clearly not natural joints. All you need to do is see an arm bend where an arm isn't supposed to bend, and that's going to skeeve you out quite a bit. It makes for an effective horror movie experience, but not one that produces convincing motion transfer.

Now, there are a lot of ways the team could have gone about normalizing the poses, but their choice seems particularly clever to me. They measured the heights and ankle positions of the various subjects and used a linear mapping between the closest and farthest ankle positions in both videos to normalize the stick figure for the target subjects. The program would calculate the scale of the figure as well as the scale of motion from frame to frame.
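Here's a rough Python sketch of that ankle-based normalization idea: figure out where the source's ankles sit between the "closest" and "farthest" positions observed in the source video, map that to the corresponding spot in the target video, and rescale the whole stick figure to the target's height at that depth. The exact formulation in the paper is more involved, and all of the inputs here are assumptions for illustration.

```python
import numpy as np

def normalize_pose(source_pose, src_close, src_far, tgt_close, tgt_far,
                   src_height_close, src_height_far,
                   tgt_height_close, tgt_height_far):
    """source_pose: (num_keypoints, 2) array of (x, y) image coordinates.
    *_close / *_far: ankle y-positions when each person is nearest to and
    farthest from the camera; *_height_*: body height in pixels there."""
    ankle_y = source_pose[:, 1].max()   # lowest point of the figure ~ the ankles

    # Where does this frame's ankle position fall between "far" and "close"?
    # Use the same fraction to pick an ankle position in the target video.
    t = (ankle_y - src_far) / (src_close - src_far)
    target_ankle_y = tgt_far + t * (tgt_close - tgt_far)

    # Scale the figure so the source's height at that depth becomes the
    # target's height at that depth, then drop the ankles onto the target spot.
    src_height = src_height_far + t * (src_height_close - src_height_far)
    tgt_height = tgt_height_far + t * (tgt_height_close - tgt_height_far)
    scaled = source_pose * (tgt_height / src_height)
    scaled[:, 1] += target_ankle_y - scaled[:, 1].max()
    return scaled
```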
And I think that's pretty darn cool, because it wasn't just accounting for the size of the subjects to get all the joints right, but also making sure the scale of the movements with respect to body size and proportions would remain the same. If a tall person with really long limbs was moving their arms in really big, bold gestures and you tried to transfer that motion to someone of smaller stature, it could really look disturbing. But by using this scaling approach, the movements on the smaller person would be proportionate in size to the movements of the larger person.

The team used two generative adversarial network setups to work on making a convincing final video. The first was dedicated to image-to-image translation, attempting to manipulate the image of the target subjects to follow the motions that came from the pose detection process. Like all GAN setups, this included the generator, which would attempt to create a convincing sequence of images, and the discriminator, which tried to weed out the quote unquote fake sequences from the generator from the ground-truth data being fed to it. The second GAN setup was specifically dedicated to adding detail and realism to the faces of the target subjects. In some frames this appears to have worked pretty well; in others there's a bit of an uncanny valley thing, or maybe even a horror movie element, going on, similar to how some of the AI-generated portraits I talked about in the previous episode introduced unrealistic qualities to the various images.

When shooting video of the target subjects, the team captured images at one hundred and twenty frames per second to get enough data for each subject, and the sessions lasted for about twenty minutes. They used smartphone cameras to do it, since many smartphones allow you to shoot video at that kind of frame rate these days.
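Just to put a rough number on how much target footage that works out to, and how it would typically be turned into training material, here's a quick sketch. The file name is hypothetical, and the pose detection step is left as a comment rather than a real call.

```python
import cv2  # OpenCV, used here only to step through a captured clip

# 120 frames per second for roughly 20 minutes per target subject:
print(120 * 60 * 20)  # about 144,000 frames of the target to learn from

cap = cv2.VideoCapture("target_subject.mp4")   # hypothetical file name
frame_count = 0
ok, frame = cap.read()
while ok:
    # In the real pipeline, each frame would be paired with its detected stick
    # figure, so the networks can learn the mapping stick figure -> target frame.
    frame_count += 1
    ok, frame = cap.read()
cap.release()
print(frame_count)
```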
They had their target subjects wear close-fitting clothing that wasn't prone to wrinkling, because the pose recognition tool they were using wasn't designed to encode information about clothing. As for the source videos, the ones that would actually provide the motions to be transferred to the targets, the team didn't have to worry about capturing images at such a high frame rate. They could use videos of just reasonable quality, meaning decent resolution and frame rate, and their pose detection tool would do its work and create the stick figure that would serve as the guide for the target's motions later on. Because of that, the team could really use any online video of sufficient quality to act as the source for motion transfer; it doesn't have to be a video shot specifically for that purpose. In fact, one of the example videos the team used in their demonstration was from the Bruno Mars music video for That's What I Like. Before applying the motion transfer, the team smoothed the pose keypoints to reduce jitter in the final output, and then the team applied the motion transfer. The stick figure motions were transferred to the target subjects, and the result is pretty interesting. It is not seamless; you can definitely tell something odd is going on. But it is an indication of where things are going, and using adversarial networks could lead to more convincing motion transfers in the future.

Now, this could lead to all sorts of stuff, nefarious and otherwise. You could imagine using it to transform an average actor into a martial arts master, or it might allow directors more freedom in casting, knowing that if the actors they choose don't possess certain physical skills, they can use this kind of technology to fake it. But it could also be used to fake footage to make it look like specific people are doing stuff that they are not doing.
It could be used to spread misinformation, and it likely will be, which means we'll need to be on the lookout for signs of fakes, which are going to get harder and harder to detect as time goes on. And hey, you guys remember DARPA, right? Because I just did a whole series of episodes about them. Well, that agency has funded programs dedicated to automating various forensic tools, including tools that could be used to detect AI-created forgeries in video and audio.

Often the secret is in the eyes. Most of these neural networks are trained on still images, so you feed in thousands or tens of thousands of images, if you have them, of your various subjects, your target and your source. But most published still images don't show people with their eyes closed, so eye movements and blinking tend to be a little wonky in these fake videos. You might watch one for a while and think, huh, that's weird, this guy hasn't blinked for like ten minutes, or when they blink it looks really strange. Well, that's an indication that it's a fake video. There are other tells as well, but DARPA is understandably keeping those quiet, because, you know, if they publish how they figure out that AI-created videos are in fact faked, that gives the fakers enough information to go back and improve their models. So we're likely to see something akin to what happened with CAPTCHAs: specialists will develop new tools to detect AI-generated media, AI developers will then create more sophisticated models, and it becomes kind of an arms race, a seesaw. One benefit is that AI as a whole will improve, but we may not be able to believe it when we see it.
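To make that blinking tell a little more concrete, here's a toy sketch of the kind of check a forensic tool might run: count blinks over a stretch of video and flag clips where the rate is implausibly low. The per-frame "eye openness" values are assumed to come from some facial landmark detector, and the thresholds are made up; real detectors are far more sophisticated than this.

```python
def count_blinks(eye_openness, closed_threshold=0.2):
    # eye_openness: one value per frame; low values mean the eyes look closed.
    blinks, eyes_were_open = 0, True
    for value in eye_openness:
        if eyes_were_open and value < closed_threshold:
            blinks += 1            # the eyes just closed: count one blink
            eyes_were_open = False
        elif value >= closed_threshold:
            eyes_were_open = True
    return blinks

def looks_suspicious(eye_openness, fps, min_blinks_per_minute=2.0):
    minutes = len(eye_openness) / (fps * 60)
    return count_blinks(eye_openness) / minutes < min_blinks_per_minute

# Example: 30 seconds of video at 30 fps where the eyes never close.
print(looks_suspicious([0.9] * 900, fps=30))  # True -> worth a closer look
```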
Well, that wraps up this episode on a fascinating, somewhat disturbing topic, and I'm sure we're going to hear a lot more about this in the years to come. We've already seen a lot of sites banning deepfakes outright because of the misinformation they can spread, so we're already seeing a reaction to this in various online communities, and that's very interesting to me. But we're definitely going to keep seeing this continue; it's a valid area of AI research, so we will have to wait and see how it all plays out.

If you guys have any suggestions for future episodes of TechStuff, why not send me a message? You can go over to our website, that's tech stuff podcast dot com, and you'll find all the different ways to contact me. I look forward to hearing from you. Make sure you check out our store over at teepublic dot com slash tech stuff and buy some merchandise. You can make sure that you get all the really cool T-shirts, like "Prove to Me You're Not a Robot." That one's pretty appropriate for this particular episode. And remember, every single purchase you make goes to help the show, so we greatly appreciate it.

Also, if you haven't heard, we have been nominated for an iHeartRadio Podcast Award. It's the first year iHeartRadio is giving out podcast awards, and we are nominated in the Science and Technology category. You can go online and visit the iHeartRadio Podcast Awards page and vote up to five times a day for your favorite podcasts. If you wanted to, you could dedicate all five of those votes every single day to us. I would not complain if you did that; it would be really cool to win that award. But make sure you check it out. There may be lots of shows there that you truly love and want to throw your support behind, and that would be really cool too. And I'll talk to you again really soon.

For more on this and thousands of other topics, visit howstuffworks dot com.