Speaker 1: Welcome to TechStuff, a production from iHeartRadio. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio. And how the tech are you? In April twenty twenty-three, the lawyers for Tesla CEO Elon Musk argued that submitted recordings of their client from twenty sixteen might have been deep fakes. The ongoing case is an emotionally charged one. In twenty eighteen, a man named Walter Huang died in a car accident. The Tesla he was in was a Tesla Model X, and it was engaged in Autopilot mode at the time of the crash. His family contends that the Tesla's safety systems failed and the vehicle steered itself into a concrete median, and the family's lawyer submitted a recording of Elon Musk as evidence that Huang was led to believe his vehicle had greater capabilities than it actually possessed. In this recording, Elon Musk said of Tesla vehicles, quote, "A Model S and Model X at this point can drive autonomously with greater safety than a person right now," end quote. In response, Musk's lawyers said the recording could be faked. Now, not to waffle about this, but if we're speaking solely on a technical level, the recording could be faked. And by that I mean there are technologies that are sophisticated enough to create a fake recording. But just because something could be faked doesn't mean it actually was faked. And the judge in the Tesla case, Evette Pennypacker, who has an amazing name, said that this argument is a truly dangerous one. The judge said that it implies, quote, "that because Mr. Musk is famous and might be more of a target for deep fakes, his public statements are immune," end quote. In other words, if you're notable enough or notorious enough, you have a carte blanche excuse for anything that you are recorded as saying, because maybe someone just targeted you and created a fake version to discredit you. In twenty eighteen, Danielle Citron and Robert Chesney wrote a paper in which they predicted this sort of situation.
Speaker 1: They dubbed it the liar's dividend: when there is a proliferation of technology that can create misinformation or outright disinformation, the liars out there reap the benefits, because what is the truth anyway? When you can't trust the evidence, everything falls apart. This is just one of the many challenges deep fake technology presents. There are potentially harmless or perhaps even beneficial uses of this technology, but it doesn't take much imagination to come up with ways to cause harm. Let's talk a second about the entertainment industry. With deep fake technology, it becomes possible to create videos and audio recordings that simulate celebrities, which potentially allows a director to cast a film with people who otherwise would be very much unavailable. Using sufficiently sophisticated deep fakes, you could create a movie that combines a cast of modern and classic film stars. Maybe you want the Marx Brothers running around with Will Ferrell. Maybe you want Lon Chaney Jr. to show up in your modern werewolf movie. Or maybe you're doing something slightly less extreme; maybe you're using the technology to generate a younger version of your current star, a la Harrison Ford in the upcoming Indiana Jones and the Dial of Destiny film. So it doesn't have to be immoral or sinister, but it does bring into question concepts like the right to personality, or the right to identity, or the right to publicity. Presumably filmmakers wouldn't want to move forward on any project with a computer-generated simulation of a real film star without permission from that person or their family. But it's possible to do it, and depending on the movie, maybe they do go ahead without securing permission first. Maybe it's an edgy parody film, and the buzz around their decision to do this could end up being a boost to marketing. People would say, "How dare they do this," and then go buy tickets to see the fallout of it.
Speaker 1: For actors, there's a real concern that this technology could rob them of work; that if they turned down a role, the filmmaker could just get a computer-generated version of them in there; or that they could, you know, appear in projects that they don't actually agree with; and, perhaps most importantly for many actors out there, that this could all happen without compensation for the original actor. I know that it could be tough to feel sympathetic toward big Hollywood stars, but keep in mind the vast majority of working actors out there are not raking in huge movie deals. They're just as worried about AI biting into their work as the rest of us are. Then there's the world of audio performance. Earlier this year, a TikTok user with the handle ghostwriter977 wrote and produced a song called Heart on My Sleeve. But ghostwriter977 didn't provide the vocals for this track. Instead, they used AI-generated deep fake vocal simulations of the artists Drake and The Weeknd. The songwriter then posted the release on multiple platforms, and it quickly went viral. Universal Music Group sprang into action right away and claimed copyright infringement. And I am no legal expert, but in my mind that's a weak argument. After all, the song itself was an original. It was not a cover. It had not been stolen from someone's discography. You cannot copyright the sound of a voice. Universal Music Group doesn't own the vocal quality of Drake or The Weeknd, and I'm sure those artists would be concerned to learn otherwise. And even if the agreement between the label and the artists did go all Ursula from The Little Mermaid and claim ownership of the voices themselves, there's not really a legal foundation to use that as a deterrent against deep fakes. Universal Music Group did argue that the deep fake voices used tons of recorded material to train themselves to sound like those artists. That is most certainly the case.
Speaker 1: We'll dive into deep fake techniques a bit later in this episode, but it often boils down to machine learning and using a lot of training material to educate a model about what it is you want it to do. The more material you can submit in training, the better. And Universal Music Group said, quote, "the training of generative AI using our artists' music, which represents both a breach of our agreements and a violation of copyright law," end quote, before going on to suggest that allowing Heart on My Sleeve to exist is akin to powering up Skynet so that the Terminators will become real. I'm exaggerating only a little bit. And again, I am not a copyright expert, but it's hard for me to imagine how training an AI model on music is in itself a violation of copyright law. After all, every musician, every artist, heck, every person who has been around other people has been influenced by the work of other people. Sometimes you can actually hear the influences in music. You might hear an artist play and say, oh, that reminds me of Johnny Cash or something like that. The history of art is one in which succeeding generations iterate on the works of those who came before them. Sometimes they make drastic departures from the generations that came before them, but even that is in response to the influence of the earlier art. So if you make the argument that training AI on specific works is wrong, how do you differentiate that from someone who gets their start playing song covers, or maybe writing their own stuff but with musical influences from identifiable artists? Because art is not created in a vacuum. Obviously, using AI is different. It can lead to the creation of a near-perfect simulation of the original artist. But the method of training the AI isn't really that different from a budding musician voraciously devouring the entire discography of their favorite artists before emulating those artists in their own work.
Speaker 1: It is a sticky wicket, no question about it, and we're in the early stages of figuring out how to handle it, which is particularly unfortunate since the technology is already here. But how did we get here? Well, an exhaustive history of deep fake technology would require a full series of episodes about the history of artificial intelligence and machine learning in general, and computer vision in particular, as well as text-to-speech and lots of other related technologies. But for our purposes, we'll simply acknowledge that countless computer scientists and programmers have spent endless hours advancing computer technology with the goal of finding ways to make machines, quote unquote, understand data. This is easier said than done, so let's take images as an example, as that will factor heavily in our discussion today. We humans can glance at a photo and we can immediately identify what is an object versus just a background. So if you have a red mug placed in front of a white cinder block wall, we can see what is a mug and what is a wall. But we have to teach computers how to do that, and when you're talking about technologies that generate moving images, it becomes even more complicated.
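To make that concrete, here is a deliberately naive sketch of what the computer actually works with in that red-mug-on-a-white-wall example: just a grid of numbers. Everything in it, the synthetic image and the color thresholds, is invented for illustration; modern vision systems learn features rather than relying on hand-picked rules like this.

```python
import numpy as np

# A synthetic 100x100 RGB "photo": white cinder block wall, reddish mug.
image = np.full((100, 100, 3), 255, dtype=np.uint8)  # white background
image[40:90, 30:60] = (200, 30, 30)                  # red rectangle as the "mug"

r = image[:, :, 0].astype(int)
g = image[:, :, 1].astype(int)
b = image[:, :, 2].astype(int)

# Call a pixel "mug" if it is strongly red and weakly green/blue.
# These thresholds are arbitrary and fragile; shift the lighting and they fail.
mask = (r > 150) & (g < 100) & (b < 100)

print(f"mug pixels: {mask.sum()}, wall pixels: {(~mask).sum()}")
```

A person sees "mug" and "wall" instantly; the computer only gets numbers, which is exactly why decades of research went into teaching it to find objects at all.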
Speaker 1: So, for lack of a clear beginning, I am somewhat arbitrarily going to start in nineteen ninety-seven. Now, a couple of things happened that year that would be important for us to talk about, and one was not quite deep fake technology, but it did illustrate some potential ethical issues we had to think about. And that was a commercial that aired during a big old American football game. You know, the one that happens every year. You know, the one I can't call by name for, you know, legal reasons. Anyway, one famous feature of this big old American football game is that brands will shell out huge amounts of money to air commercials during it. And one brand to do that in nineteen ninety-seven was the Dirt Devil vacuum cleaner company. Now, those of you across the pond would call it a hoover, not a vacuum cleaner, but Hoover is a different brand altogether, so stop confusing me. In the commercial, famous actor and dancer Fred Astaire is shown dancing with Dirt Devil vacuum cleaners. But here's the thing: Fred Astaire had died a decade earlier. The footage was taken from his films, with Dirt Devil inserting the imagery of its products into the footage to make it seem as if Astaire had actually shot commercials this way and really danced with vacuum cleaners. So in this case, the footage of Astaire was legitimate. It was the appearance of the vacuum cleaners that had been inserted into it. But the use of footage of performers who have passed away prompted a debate about the ethics of that practice, and people began to speculate about what might happen once technology reached a point where a computer simulation of a person would be indistinguishable from the real thing. Meanwhile, also in nineteen ninety-seven, a group of computer scientists published an important work. The scientists were Christoph, or Chris, Bregler, Michele Covell, and Malcolm Slaney. The paper's title is Video Rewrite: Driving Visual Speech with Audio. This work built on top of a lot of other previous work. For example, face interpretation was already a discipline in computer science; it traces its history all the way back to the nineteen sixties. Ditto for technology that could generate speech from text; that, too, dates back to the nineteen sixties. Computer animation had been around for a while by nineteen ninety-seven, so creating a 3D model of lips, one that you could subsequently animate, was also already a thing. But what these researchers did was bring all these elements together. It was a convergence of technologies that resulted in a new application, one which would allow for computer-generated synthetic video of real people.
Speaker 1: The team created the Video Rewrite software, and they also showed what it was capable of doing in some very, very short video clips. The results are primitive by today's standards, but nonetheless impressive. In one two-second clip, President JFK appears to say, "I never met Forrest Gump." It's a cheeky reference to the nineteen ninety-four film, which included a segment in which the titular character Forrest Gump appears to meet JFK and then informs him that he needs to rush off to the restroom. Video Rewrite served as a foundation for technologies that we could refer to as deep fake tech. So just a few years later, in two thousand and one, Christopher J. Taylor, Gareth J. Edwards, and Timothy F. Cootes, whose middle initial being F and not J actually upsets Jonathan because of the lack of consistency, published a paper that was titled Active Appearance Models. The abstract for this paper reads, in part, quote, "We describe a new method of matching statistical models of appearance to images," end quote. Now, in plain English, this paper describes a method in which computer vision relies on statistical models to more accurately identify elements within an image. So let's consider facial recognition technology. As I mentioned earlier, computers do not inherently understand images. If presented with a picture of a face, a computer cannot naturally determine what the various features of that face are. Only through proper programming and machine learning can you start to do this and train a computer to recognize features like a nose, a mouth, eyebrows, eyes, et cetera. And by training machines on millions of faces, you can reach a point where the machine can examine a new face, one that has never before been submitted to the machine, and attempt to identify those features. This is a necessary step with a lot of deep fake technology.
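For a feel of what "statistical models of appearance" means, here is a minimal sketch of the shape half of that idea: run PCA over many example landmark layouts, so any new face can be summarized as the average layout plus a few learned modes of variation. The landmark positions and training data here are synthetic placeholders; a real active appearance model also models texture and fits itself to an image iteratively.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 toy "faces", each with 5 landmarks (x, y): eyes, nose, mouth corners.
mean_shape = np.array([[30, 30], [70, 30], [50, 50], [35, 75], [65, 75]], float)
shapes = mean_shape + rng.normal(scale=3.0, size=(200, 5, 2))
X = shapes.reshape(200, -1)            # flatten each face to a 10-D shape vector

# PCA: learn the principal ways the landmarks move together across faces.
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
modes = Vt[:3]                         # keep the 3 strongest variation modes

# Any face is now summarized by 3 numbers instead of 10 raw coordinates.
new_face = X[0]
params = modes @ (new_face - mu)
reconstructed = mu + params @ modes
print("reconstruction error:", np.abs(reconstructed - new_face).max())
```

The payoff is that a face becomes a handful of parameters the computer can search over, rather than a pile of pixels.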
Speaker 1: See, to call all deep fakes computer-generated is a little misleading. Often what is happening is a computer is replacing an existing person or face in a video with someone else's features. In order to do that, you first have to be able to map and identify the original person that was in the video; you need to be able to match the synthesized face with the movements of the original face. To do that, the computer first has to encode the original face, essentially to break it down into lots of smaller shapes. Then it has to be able to match the synthesized face to the original one with a similar encoded approach, and then decode that into the synthesized face that replaces the original one and then follows the various motions of the original face. So you're replacing one person with another through the use of a computer, and as part of that, the computer has to break down the original person into points of data that the computer can handle.
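Here is a stripped-down sketch of that encode-and-decode idea, in the shared-encoder, two-decoder form popularized by open-source face-swap tools: one encoder learns a compact code for pose and expression, each identity gets its own decoder, and the swap is encoding person A's frame and decoding it with person B's decoder. Every size and the one training step on random tensors are placeholders meant to show the structure, not a working deepfake.

```python
import torch
import torch.nn as nn

LATENT = 128

def make_encoder():
    return nn.Sequential(
        nn.Flatten(),                     # 3x64x64 face crop -> flat vector
        nn.Linear(3 * 64 * 64, 512), nn.ReLU(),
        nn.Linear(512, LATENT),           # shared "pose/expression" code
    )

def make_decoder():
    return nn.Sequential(
        nn.Linear(LATENT, 512), nn.ReLU(),
        nn.Linear(512, 3 * 64 * 64), nn.Sigmoid(),
        nn.Unflatten(1, (3, 64, 64)),     # back to an image
    )

encoder = make_encoder()
decoder_a, decoder_b = make_decoder(), make_decoder()

# Training (sketched): each decoder reconstructs its own person from the
# shared code, which keeps the code itself identity-agnostic.
opt = torch.optim.Adam(
    list(encoder.parameters())
    + list(decoder_a.parameters())
    + list(decoder_b.parameters()), lr=1e-4)
faces_a = torch.rand(8, 3, 64, 64)        # stand-ins for real face crops
faces_b = torch.rand(8, 3, 64, 64)
loss = nn.functional.mse_loss(decoder_a(encoder(faces_a)), faces_a) \
     + nn.functional.mse_loss(decoder_b(encoder(faces_b)), faces_b)
opt.zero_grad(); loss.backward(); opt.step()

# The swap: person A's pose and expression, rendered with person B's face.
swapped = decoder_b(encoder(faces_a))
print(swapped.shape)                      # torch.Size([8, 3, 64, 64])
```

The design choice worth noticing is the asymmetry: the encoder is shared so the latent code describes the performance, while the decoders carry the identities.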
Speaker 1: So with this technology, I could stand facing a camera and deliver a speech and then, using software designed to follow the steps I just laid out, replace my image with that of someone else. If I also used a program designed to create a vocal impersonation of that someone else, well, I could create a video where some celebrity says things that they would never say. Like maybe I could create a video of Keanu Reeves saying, "TechStuff is my favorite podcast. Jonathan is such a cool host. I wish I could hang out with him." For the record, Mr. Reeves, I would never actually do that. I'm just saying I could do it. Of course, creating a video image of Keanu Reeves would just be one part of the equation. Another would be replicating his voice. Now, I could try and do my own impersonation, but this would so clearly be fake that I would never achieve my goal of trying to make it appear as though Keanu Reeves knows who I am and wants to hang out with me. I can't even say "whoa" the way he does. To achieve my dreams, I would need a voice synthesis program that I could train on Keanu's voice and then produce a computer-generated impersonation. The history of voice synthesis is crazy long. I mean, if we really wanted to dive into it, we could go all the way back to the late seventeen hundreds. But we won't, because I can't keep you here that long. Text-to-speech technology brings us a bit closer to modern day, but then we're still talking about the nineteen sixties or thereabouts, as I mentioned earlier in this episode. To get to a point where computers are capable of producing an imitation of a specific person's voice, then we're getting up to like the last decade or so. Researchers built tools that train on how a specific person produces different sounds, phonemes, if we want to think of it in terms of language and the sounds of language. After that training, we have applications that can take text, interpret that text as a series of sounds, pull upon the computer's knowledge of how a particular person makes those specific sounds, and then, voila, we have ourselves a copy.
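As a toy illustration of that "look up how this speaker makes each sound, then string the sounds together" idea, the sketch below fakes a voice with one sine tone per phoneme standing in for a recorded snippet of the target speaker. The phoneme labels, pitches, and durations are all invented for the example, and the output has exactly the flat, inflection-free quality the early systems were known for.

```python
import numpy as np

SR = 16000  # sample rate in Hz

def tone(freq, dur=0.12):
    # Stand-in for a recorded unit of the target speaker's voice.
    t = np.linspace(0, dur, int(SR * dur), endpoint=False)
    return 0.3 * np.sin(2 * np.pi * freq * t)

# A fake "voice model": one audio unit per phoneme. A real concatenative
# system stores many recorded units per phoneme and picks the best fit.
voice = {"HH": tone(200), "EH": tone(320), "L": tone(250), "OW": tone(280)}

def synthesize(phonemes):
    # Text would first be converted to phonemes; here we start from them.
    return np.concatenate([voice[p] for p in phonemes])

audio = synthesize(["HH", "EH", "L", "OW"])   # "hello," crudely
print(len(audio) / SR, "seconds of synthetic speech")
```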
Speaker 1: Now, early versions of this technology were understandably a bit limited. You would end up with speech that on a surface level sounded like the person in question, the synthesized person, but it would typically come across as flat, or it would use incorrect inflection to emphasize a point. So think of that kind of robotic sound you would get with early personal assistants. Right, like if you were using a GPS system, which, I realize, is a repetition, like "ATM machine." But let's say you're using a GPS and it has a voice associated with it. Older ones were very robotic, and they could also say things that were hilariously wrong. I'll never forget the time I was riding in a car and the GPS told us to turn right on "Oak Doctor" instead of Oak Drive. But over time the models improved and things started to sound a bit more natural. So those early ones, not so good. You would not mistake them for a real person; it would sound like a robot making an impersonation of that person. But models would grow in sophistication, and training sessions would include examples where the target's expressions would be associated with specific emotions, like anger or happiness or sadness. You can actually use a voice synthesizer yourself and train it, and as part of that, you're typically told to read out sentences with different emotional weight to them. So, using a bit of appropriate text, then maybe some metadata to indicate what emotion should be used to read out that text, it then becomes possible to craft vocal performances that were, and are, difficult to distinguish from the real thing. We're going to take a quick break to thank our sponsor, and then I'll be back to talk more about the history and impact of deep fake technology. Back to our history of video deep fakes. We left off at two thousand and one, and for nearly two decades computer scientists continued to work on systems that would push forward the capabilities of synthesized video content. By the time we get up to twenty seventeen, a pair of papers explained that the advancements in consumer computers had reached a point where it was actually possible to achieve synthesized video using off-the-shelf computer systems, and that would be a huge game changer. No longer would you need access to incredibly powerful systems with specialized software. Now you could potentially create or access an application on an off-the-shelf computer to do the same thing. So the tools to generate computer-synthesized video were now within the grasp of the average computer user. With cloud-based services that could augment these efforts, it became possible for a creative person to make videos that appear to show people doing and saying things that they never actually did. And again, there are multiple uses for such technology.
Speaker 1: Not all of them are sinister, but it doesn't take much imagination to come up with scenarios where things get grim. And indeed, many early uses of this tech, once it became accessible, were bad. One big one was using face-swapping technology to make it appear as though someone, famous or otherwise, was appearing in an adult video. And I think it goes without saying that this is a total violation of the victim. It robs them of agency, and they may end up suffering consequences despite not being remotely responsible for the content. So imagine facing judgment for something that not only you did not do, but you had no way of preventing. Honestly, it's impossible for me to communicate how devastating this can be. There are several accounts online written by people who have been the victim of this sort of activity, and they are worth your time. They are harrowing to read, but it is important; their words will far more effectively explain how traumatizing this experience can be. And just as a reminder, the rise of social networks means that we've all been sharing a lot of images of ourselves, videos of ourselves. There's a lot of content out there that could be used to train various machine models. So it's something to keep in mind: even if you aren't concerned right now, there's nothing to say that you couldn't become a victim tomorrow. Deep fakes also pose a risk to organizations, not just individuals. So imagine for a moment that you see you have a voicemail at work, and you pull it up and you listen to the voicemail, and it sounds like your boss, and your boss is telling you that you need to transfer company funds from the company account into a different one. And perhaps they say that it's in order for you to pay off some third-party vendor for a project that you're not really familiar with. But then maybe it turns out that voicemail wasn't from your boss after all. Maybe it was the result of spear phishing.
Speaker 1: Maybe a nefarious thief has identified you as a possible key to stealing money from your organization and has used tech to impersonate your boss and direct you toward facilitating a crime. You unknowingly have become an accomplice. There's actually been a case where this sort of thing was alleged to have happened. Now, I have to say alleged, because there were questions about whether or not it really was a case of a synthesized voice, or if maybe this was more of a straightforward embezzlement issue and the deep fake defense, aka the liar's dividend, came into play. Deep fakes have come a long way in a few short years. However, they are not perfect. There can be telltale signs that a video is fake, though they can sometimes be too subtle for the human eye to detect. Sometimes there's a dead giveaway. You're watching a video and you think, this person is blinking too frequently or not frequently enough, or maybe their eyes don't look quite right, or the movements you're seeing don't line up; a person is turning their head one way while their eyes are shifting another in a way that just doesn't seem natural. There are those sorts of things that people can pick up on, and there are some that are far more subtle, and deep fake detection tools are growing in importance as a result of this. There are tools that are trained to spot signs of fakery, sometimes ones that are far too subtle for us to notice. So it may be things like inconsistencies in lighting and the quality of reflections within the frame. Things like that may end up being an indication that a video was manufactured artificially rather than an actual recording, and they're becoming more and more important for people and for organizations.
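Here is a toy version of that blink cue. The eye aspect ratio, a measure from Soukupova and Cech's blink-detection work, compares eye height to eye width using six landmarks and dips sharply when the eye closes. The landmark track below is synthetic; a real checker would pull landmarks from a face tracker frame by frame and flag videos whose blink rate looks unnatural.

```python
import numpy as np

def eye_aspect_ratio(eye):
    # eye: six (x, y) landmarks around one eye, in outline order.
    v1 = np.linalg.norm(eye[1] - eye[5])  # vertical gap, first pair
    v2 = np.linalg.norm(eye[2] - eye[4])  # vertical gap, second pair
    h = np.linalg.norm(eye[0] - eye[3])   # horizontal eye width
    return (v1 + v2) / (2.0 * h)

def count_blinks(ear_series, closed=0.2):
    # A blink is the ratio dropping below the threshold, then recovering.
    below = ear_series < closed
    return int(np.sum(below[1:] & ~below[:-1]))

# Synthetic 10-second track at 30 fps: eyes open, with two brief blinks.
open_eye = np.array([[0, 0], [2, 1.2], [4, 1.2], [6, 0], [4, -1.2], [2, -1.2]], float)
closed_eye = open_eye * [1.0, 0.15]       # squash the eye shut vertically
frames = [closed_eye if 50 <= f < 55 or 200 <= f < 205 else open_eye
          for f in range(300)]
ears = np.array([eye_aspect_ratio(e) for e in frames])

blinks = count_blinks(ears)
print(f"{blinks} blinks in 10 s, about {blinks * 6} per minute")
# People blink roughly 15 to 20 times a minute; early deepfakes often blinked
# far less, because training photos rarely capture closed eyes.
```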
Speaker 1: So in addition to those tools, organization leaders should really prepare employees for the possibility of encountering deep fakes. Critical thinking is a big part of uncovering deception, as is preparation. Heck, depending on the organization, you might go so far as to set up a phrase or question as an authentication process at the top of an official phone call or video meeting, so that the person on the other end of the line can verify that things are legit. I know it sounds like you're going a bit far, but as this technology gets more sophisticated, as people deploy it in ways that are potentially harmful, you have to start to think about these things. What we do not want to do is to enter into an era where we can no longer reliably determine the real from the fake. But there is no putting the cat back in the bag, or the genie in the bottle, or Baby in the corner. The technology isn't going away. It will not disappear. It will continue to evolve and to improve, and so it falls upon us to educate ourselves as best we can in preparation for encountering it, and to think about how we can address the flagrant misuses of the technology to attempt to dissuade people from using it in that way. Because, again, the victimization element of this can be really severe and really traumatizing and incredibly disruptive to a person's life. We should not forget that either. So in conclusion, I will say that this technology is truly impressive, and again, it can have some really incredible uses. I don't want to paint it as just being a bad thing. It is not good or bad. It is how we use it that determines whether the end result is a positive one or a negative one. But only by learning about it can we prepare for what is to come. So I hope that you found this episode informative, that you have a deeper appreciation for what this technology does and what it is capable of, and I will speak to you again really soon. TechStuff is an iHeartRadio production. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.