1
00:00:04,360 --> 00:00:06,080
Speaker 1: There Are No Girls on the Internet is a production

2
00:00:06,160 --> 00:00:13,800
Speaker 1: of iHeartRadio and Unbossed Creative. I'm Bridget, and this is

3
00:00:13,840 --> 00:00:19,080
Speaker 1: There Are No Girls on the Internet. I'm hosting a

4
00:00:19,079 --> 00:00:23,840
Speaker 1: new season of Mozilla's podcast, IRL: Online Life Is Real Life.

5
00:00:24,120 --> 00:00:27,760
Speaker 1: You might actually know Mozilla. They make the web browser Firefox.

6
00:00:28,400 --> 00:00:32,440
Speaker 1: This season of IRL is all about AI, specifically the

7
00:00:32,520 --> 00:00:35,600
Speaker 1: people who make AI, and how important it is to

8
00:00:35,600 --> 00:00:38,960
Speaker 1: put people above profit when it comes to AI. Now,

9
00:00:39,000 --> 00:00:41,680
Speaker 1: it's really easy to think of AI as just computer

10
00:00:41,800 --> 00:00:45,360
Speaker 1: brains and robots, but it's built and trained by people.

11
00:00:46,120 --> 00:00:48,440
Speaker 1: And as much as we talk about making sure AI

12
00:00:48,600 --> 00:00:51,440
Speaker 1: is ethical and equitable after it's been built and it's

13
00:00:51,440 --> 00:00:54,160
Speaker 1: out in the world, we should also remember the people

14
00:00:54,240 --> 00:00:57,000
Speaker 1: who build it from the very beginning too. It's something

15
00:00:57,240 --> 00:01:00,560
Speaker 1: really important that I think is overlooked in conversations about AI,

16
00:01:01,040 --> 00:01:04,040
Speaker 1: turning the people responsible for making it into a kind

17
00:01:04,040 --> 00:01:08,640
Speaker 1: of invisible human workforce. But they shouldn't be invisible. We

18
00:01:08,640 --> 00:01:10,680
Speaker 1: should listen to them when they speak up about this

19
00:01:10,720 --> 00:01:13,479
Speaker 1: technology and how it's going to shape all of our lives.

20
00:01:14,120 --> 00:01:16,960
Speaker 1: So I wanted to share the very first episode of

21
00:01:16,959 --> 00:01:20,280
Speaker 1: this new season of IRL with you all here. This

22
00:01:20,319 --> 00:01:22,920
Speaker 1: one is all about the risks and rewards of AI

23
00:01:23,040 --> 00:01:27,440
Speaker 1: technology like ChatGPT being open source, that is, built in

24
00:01:27,480 --> 00:01:30,600
Speaker 1: a way that allows anyone to inspect, modify, and enhance

25
00:01:30,640 --> 00:01:32,920
Speaker 1: its code on their own. So let me know what

26
00:01:32,959 --> 00:01:35,520
Speaker 1: you think and if you enjoy it, please subscribe to

27
00:01:35,600 --> 00:01:43,120
Speaker 1: IRL: Online Life Is Real Life. So the first thing

28
00:01:43,160 --> 00:01:47,160
Speaker 1: I ever asked ChatGPT wasn't work related at all. It

29
00:01:47,240 --> 00:01:50,600
Speaker 1: was actually for help drafting kind of a tough personal

30
00:01:50,640 --> 00:01:53,160
Speaker 1: email I had to send. I was having trouble finding

31
00:01:53,160 --> 00:01:56,080
Speaker 1: the right words, the right tone, so I asked

32
00:01:56,120 --> 00:02:00,080
Speaker 1: ChatGPT, and I was amazed. It actually produced something that

33
00:02:00,160 --> 00:02:03,240
Speaker 1: I might say. That was about a year ago. Fast

34
00:02:03,240 --> 00:02:05,720
Speaker 1: forward to today, and OpenAI is said to be

35
00:02:05,760 --> 00:02:08,639
Speaker 1: on track to earn one billion dollars of revenue in

36
00:02:08,680 --> 00:02:12,040
Speaker 1: the next year. Even though large language models aren't new,

37
00:02:12,600 --> 00:02:16,400
Speaker 1: suddenly more people can see the potential through that simple

38
00:02:16,440 --> 00:02:24,840
Speaker 1: interface for good, for bad, and for making money. This

39
00:02:25,000 --> 00:02:28,440
Speaker 1: is IRL, an original podcast from Mozilla, the nonprofit

40
00:02:28,520 --> 00:02:32,400
Speaker 1: behind Firefox. This season we meet people who are building

41
00:02:32,480 --> 00:02:37,680
Speaker 1: artificial intelligence that puts people over profit. I'm Bridget Todd.

42
00:02:38,360 --> 00:02:41,560
Speaker 1: In this episode, we get into the risks and rewards

43
00:02:41,600 --> 00:02:45,359
Speaker 1: of the tech that makes ChatGPT talk. We're talking

44
00:02:45,360 --> 00:02:49,600
Speaker 1: about large language models, LLMs for short, and the controversy

45
00:02:49,680 --> 00:02:53,600
Speaker 1: over suddenly giving the whole world access to build with them.

46
00:02:54,240 --> 00:02:57,840
Speaker 1: But chatbots are only one example of what powerful LLMs

47
00:02:57,919 --> 00:03:01,960
Speaker 1: can do. Imagine games where characters can chat with you more,

48
00:03:02,360 --> 00:03:06,240
Speaker 1: or virtual assistants that can draft emails for you at work. Banks,

49
00:03:06,280 --> 00:03:09,920
Speaker 1: insurance companies, travel agencies, everyone is thinking about how to

50
00:03:10,000 --> 00:03:14,200
Speaker 1: use this technology to increase productivity and more. But there's

51
00:03:14,240 --> 00:03:17,880
Speaker 1: also a lot of talk about the risks.

52
00:03:19,120 --> 00:03:22,120
Speaker 2: I think a lot of people don't understand the detailed

53
00:03:22,120 --> 00:03:25,120
Speaker 2: capabilities of large language models, so you could use them

54
00:03:25,160 --> 00:03:28,000
Speaker 2: to really tear apart the civic fabric of a country.

55
00:03:29,040 --> 00:03:33,160
Speaker 1: That's David Evan Harris. Over five years he managed teams

56
00:03:33,160 --> 00:03:37,440
Speaker 1: that kept harmful content off Facebook and later also researched

57
00:03:37,480 --> 00:03:42,440
Speaker 1: responsible AI for Meta. Today, he's worried that LLMs can

58
00:03:42,480 --> 00:03:45,200
Speaker 1: be used to generate disinformation and hate speech on a

59
00:03:45,200 --> 00:03:48,680
Speaker 1: greater scale than ever. Like other big tech companies, Meta

60
00:03:48,720 --> 00:03:52,400
Speaker 1: develops its own LLMs, and now they're urging people to

61
00:03:52,480 --> 00:03:58,080
Speaker 1: use them and tweak them, with few strings attached. Meta's LLMs

62
00:03:58,080 --> 00:04:03,080
Speaker 1: are called Llama. They might have a cute name, but

63
00:04:03,200 --> 00:04:06,880
Speaker 1: David says there's a potentially ugly side to Meta's open LLM.

64
00:04:07,800 --> 00:04:09,800
Speaker 2: I have a long history with open source and a

65
00:04:09,840 --> 00:04:13,920
Speaker 2: big passion for it, but thinking about large language models

66
00:04:13,920 --> 00:04:17,440
Speaker 2: and Llama and whether or not these things are safe

67
00:04:17,480 --> 00:04:20,279
Speaker 2: to be open source has been a real turning point

68
00:04:20,279 --> 00:04:24,440
Speaker 2: for me. I remember more than a decade ago having

69
00:04:24,480 --> 00:04:29,240
Speaker 2: some conversations with a friend at MIT about the possibility

70
00:04:29,360 --> 00:04:33,760
Speaker 2: of open source licenses that don't allow for military use.

71
00:04:34,279 --> 00:04:37,160
Speaker 2: We love making open source software, but what if our

72
00:04:37,240 --> 00:04:40,000
Speaker 2: open source software is being used to make bombs and

73
00:04:40,080 --> 00:04:42,600
Speaker 2: kill people? We don't want to do that. Now, that

74
00:04:42,680 --> 00:04:47,640
Speaker 2: connects to this question of what's the threshold for something

75
00:04:47,720 --> 00:04:50,440
Speaker 2: that we're not comfortable having open source. I just think

76
00:04:50,480 --> 00:04:53,120
Speaker 2: the bigger danger that I keep coming back to, and

77
00:04:53,200 --> 00:04:58,279
Speaker 2: I maybe not bigger, but the very important danger is misinformation,

78
00:04:58,440 --> 00:05:01,320
Speaker 2: and is the idea that a system like Llama 2

79
00:05:01,760 --> 00:05:06,919
Speaker 2: could be really effectively abused in a large influence operation

80
00:05:07,080 --> 00:05:11,080
Speaker 2: campaign by what we call in the industry a sophisticated

81
00:05:11,160 --> 00:05:14,760
Speaker 2: threat actor, and that basically means like an intelligence agency

82
00:05:14,839 --> 00:05:19,719
Speaker 2: that probably has great hardware and big budgets and well

83
00:05:19,720 --> 00:05:20,640
Speaker 2: trained engineers.

84
00:05:21,279 --> 00:05:24,440
Speaker 1: David's argument, echoed by many in the industry, is that

85
00:05:24,480 --> 00:05:27,719
Speaker 1: we don't really know how LLMs of today or tomorrow

86
00:05:27,880 --> 00:05:30,919
Speaker 1: could be harmful in the long term. But he's also

87
00:05:31,000 --> 00:05:33,400
Speaker 1: focused on the harms of the here and now and

88
00:05:33,440 --> 00:05:37,120
Speaker 1: how these disproportionately affect people who are already at risk

89
00:05:37,200 --> 00:05:42,160
Speaker 1: of exclusion and discrimination. So here's how I think about LLMs.

90
00:05:43,560 --> 00:05:45,839
Speaker 1: Put on your chef's hat for a moment and imagine

91
00:05:45,839 --> 00:05:49,680
Speaker 1: you're baking a delicious cake, a layer cake. The foundation

92
00:05:50,120 --> 00:05:53,200
Speaker 1: or bottom layer of that cake is a large language model.

93
00:05:53,320 --> 00:05:56,120
Speaker 1: It's made out of lots of Internet data. Now, some

94
00:05:56,160 --> 00:06:00,440
Speaker 1: of these ingredients aren't the best quality, but with additional layers, coloring,

95
00:06:00,760 --> 00:06:03,960
Speaker 1: icing, and sprinkles, you can fine-tune your system.

96
00:06:04,560 --> 00:06:07,200
Speaker 1: To make a chatbot, you fine-tune an LLM with

97
00:06:07,320 --> 00:06:10,719
Speaker 1: data of people chatting. To make a safer chatbot, you

98
00:06:10,800 --> 00:06:12,800
Speaker 1: train it with data that shows what prompts should trigger

99
00:06:13,000 --> 00:06:17,400
Speaker 1: safety replies. Whenever you're building software with LLMs like Llama,

100
00:06:17,680 --> 00:06:20,719
Speaker 1: GPT-4, or Falcon, that's just part of what goes

101
00:06:20,760 --> 00:06:23,080
Speaker 1: into the cake. So there are a lot of options

102
00:06:23,080 --> 00:06:25,680
Speaker 1: that go into creating an AI system, even when the

103
00:06:25,760 --> 00:06:28,120
Speaker 1: so-called foundational models are the same.

104
00:06:29,360 --> 00:06:32,160
Speaker 2: When you're using AI in a hiring system or in

105
00:06:32,160 --> 00:06:35,800
Speaker 2: an applicant tracking system that's sorting through thousands and thousands

106
00:06:35,839 --> 00:06:38,560
Speaker 2: of resumes. You don't need an LLM for that, but

107
00:06:39,160 --> 00:06:41,839
Speaker 2: you could use LLMs for that kind of thing. You

108
00:06:41,880 --> 00:06:45,040
Speaker 2: could use LLMs to give you analysis of different candidates.

109
00:06:45,480 --> 00:06:51,159
Speaker 2: And there may be situations where LLMs demonstrate bias. I

110
00:06:51,240 --> 00:06:55,520
Speaker 2: say this because banks are using LLMs too. If a

111
00:06:55,560 --> 00:06:58,080
Speaker 2: bank is using an LLM as part of their processes

112
00:06:58,240 --> 00:07:02,760
Speaker 2: to evaluate loans and nobody has noticed yet because

113
00:07:02,760 --> 00:07:07,360
Speaker 2: that LLM has never been systematically tested for bias, maybe

114
00:07:07,400 --> 00:07:11,160
Speaker 2: that's introducing bias into that bank system. So I think

115
00:07:11,160 --> 00:07:13,000
Speaker 2: there's some danger there. And a lot of people think,

116
00:07:13,240 --> 00:07:17,560
Speaker 2: oh danger, that's not danger. And you know, if you're

117
00:07:17,920 --> 00:07:23,120
Speaker 2: getting denied a mortgage because of your race, that's danger

118
00:07:23,400 --> 00:07:23,680
Speaker 2: to me.

119
00:07:25,120 --> 00:07:27,960
Speaker 1: David feels the industry as a whole is rushing development.

120
00:07:28,360 --> 00:07:31,920
Speaker 1: At the same time, responsible AI teams have been downsized

121
00:07:31,960 --> 00:07:34,920
Speaker 1: at several companies. David himself was laid off from Meta's

122
00:07:34,960 --> 00:07:37,080
Speaker 1: responsible AI team in twenty twenty two.

123
00:07:37,880 --> 00:07:40,400
Speaker 2: As a company that's using AI, or even as a

124
00:07:40,400 --> 00:07:44,280
Speaker 2: government that's using AI, or a nonprofit organization that's using AI,

125
00:07:44,760 --> 00:07:48,240
Speaker 2: you need to create robust processes to figure out how

126
00:07:48,280 --> 00:07:52,240
Speaker 2: and when it's appropriate to use AI systems, and you

127
00:07:52,320 --> 00:07:55,920
Speaker 2: need to have people who are not interested parties. And

128
00:07:55,960 --> 00:07:58,560
Speaker 2: in the case of a company, an interested party might

129
00:07:58,640 --> 00:08:01,400
Speaker 2: be just the engineer who wants to ship the damn

130
00:08:01,440 --> 00:08:05,160
Speaker 2: thing and get the feature running with the AI. And

131
00:08:05,720 --> 00:08:08,080
Speaker 2: you need to have someone who does not have an

132
00:08:08,080 --> 00:08:11,520
Speaker 2: incentive to ship products in the loop there who can say,

133
00:08:11,760 --> 00:08:15,560
Speaker 2: hold on, we might need another month of testing of this.

134
00:08:15,880 --> 00:08:18,760
Speaker 2: Hold on, we might need to find a way to

135
00:08:18,760 --> 00:08:21,920
Speaker 2: get someone from outside the company to really give

136
00:08:22,000 --> 00:08:24,280
Speaker 2: us an opinion about if this is a fair AI

137
00:08:24,400 --> 00:08:27,000
Speaker 2: system or if this is safe.

138
00:08:27,200 --> 00:08:29,720
Speaker 1: The reason so many LLMs are at our fingertips now

139
00:08:30,000 --> 00:08:34,600
Speaker 1: is that investors with deep pockets, Google, Microsoft, Meta, Elon

140
00:08:34,679 --> 00:08:38,040
Speaker 1: Musk, and others, have been pouring money into AI research

141
00:08:38,160 --> 00:08:42,720
Speaker 1: and powerful supercomputers. Some companies will bake LLMs into their

142
00:08:42,720 --> 00:08:46,680
Speaker 1: own products, others will make money by licensing access to them.

143
00:08:47,080 --> 00:08:50,520
Speaker 1: Everyone is competing for influence and for engineering talent that

144
00:08:50,559 --> 00:08:54,160
Speaker 1: can help them go faster. Openness can be a strategic

145
00:08:54,200 --> 00:08:57,600
Speaker 1: move to get ahead by attracting more developers, but often

146
00:08:57,720 --> 00:09:00,800
Speaker 1: companies also exaggerate how open they are, since it's not

147
00:09:00,800 --> 00:09:03,240
Speaker 1: always possible to see their data or methods.

148
00:09:07,480 --> 00:09:10,760
Speaker 3: So I've followed these models very closely, and I know

149
00:09:10,960 --> 00:09:15,000
Speaker 3: every time they're released, I know there is some element

150
00:09:15,120 --> 00:09:16,439
Speaker 3: of deception.

151
00:09:17,960 --> 00:09:21,600
Speaker 1: That's Abeba Birhane. Time magazine just named her one

152
00:09:21,640 --> 00:09:24,880
Speaker 1: of the one hundred most influential people in AI. She's

153
00:09:24,920 --> 00:09:28,960
Speaker 1: a Mozilla advisor and a cognitive scientist from Ethiopia working

154
00:09:28,960 --> 00:09:30,880
Speaker 1: at Trinity College in Dublin, Ireland.

155
00:09:32,040 --> 00:09:35,640
Speaker 3: I mean, Llama, for example, was introduced as, oh, an

156
00:09:35,720 --> 00:09:39,320
Speaker 3: open sourced large language model, and I went into the

157
00:09:39,400 --> 00:09:43,640
Speaker 3: paper hoping to find information, detailed information, because I work

158
00:09:43,679 --> 00:09:47,400
Speaker 3: with data sets. I went immediately into the data sets

159
00:09:47,440 --> 00:09:51,000
Speaker 3: section and it was just one tiny, small paragraph in

160
00:09:51,040 --> 00:09:52,160
Speaker 3: that giant paper.

161
00:09:52,880 --> 00:09:55,360
Speaker 1: Abeba wants to know what's inside the data sets

162
00:09:55,360 --> 00:09:59,000
Speaker 1: for AI because systems trained on them mimic their biases.

163
00:09:59,640 --> 00:10:02,160
Speaker 1: Just a handful of data sets get used repeatedly

164
00:10:02,200 --> 00:10:06,680
Speaker 1: across most LLMs, and these usually include massive amounts of

165
00:10:06,679 --> 00:10:10,000
Speaker 1: Internet content from an open data set called Common Crawl.

166
00:10:10,760 --> 00:10:14,360
Speaker 3: The Internet can be a really toxic place. It holds,

167
00:10:14,480 --> 00:10:18,480
Speaker 3: you know, everything from the world's beauty to its ugliness

168
00:10:18,600 --> 00:10:23,680
Speaker 3: and everything in between. For example, during our audits we've

169
00:10:23,720 --> 00:10:30,040
Speaker 3: found content such as child abuse or genocide, or a

170
00:10:30,080 --> 00:10:34,559
Speaker 3: lot of explicit pornographic images. You also have to make

171
00:10:34,600 --> 00:10:38,440
Speaker 3: sure that personal sensitive information that could be used to

172
00:10:38,640 --> 00:10:42,320
Speaker 3: identify individuals. You have to make sure things like this

173
00:10:42,640 --> 00:10:46,240
Speaker 3: are not included in data sets. That's one of the

174
00:10:46,280 --> 00:10:49,800
Speaker 3: reasons why we need to audit the data sets we

175
00:10:49,840 --> 00:10:51,680
Speaker 3: are using to train models.

176
00:10:53,360 --> 00:10:56,400
Speaker 1: Decades of research show the Internet has never been representative

177
00:10:56,400 --> 00:10:59,680
Speaker 1: of all the world's people or languages, but in generative

178
00:10:59,679 --> 00:11:03,840
Speaker 1: AI it becomes the ground truth. Abeba and her colleagues

179
00:11:03,880 --> 00:11:07,720
Speaker 1: have coined a term to highlight the problem they see. Abeba,

180
00:11:07,800 --> 00:11:09,840
Speaker 1: I noticed in one of your papers that y'all actually

181
00:11:09,920 --> 00:11:12,840
Speaker 1: use the term data swamps, not data sets. Where did

182
00:11:12,840 --> 00:11:15,080
Speaker 1: that term come from? Like why data swamps?

183
00:11:15,720 --> 00:11:21,160
Speaker 3: Data swamp is an attempt to kind of express how

184
00:11:21,440 --> 00:11:24,480
Speaker 3: such a huge dump like the Common Crawl or even

185
00:11:24,559 --> 00:11:28,880
Speaker 3: large scale data sets, now how they represent not only

186
00:11:29,320 --> 00:11:33,240
Speaker 3: the good and the healthy of humanity, but also the

187
00:11:33,440 --> 00:11:37,319
Speaker 3: nasty and ugly of humanity, because you find all kinds

188
00:11:37,360 --> 00:11:43,560
Speaker 3: of horrible, hateful, degrading texts, especially towards minoritized communities, and

189
00:11:43,640 --> 00:11:47,520
Speaker 3: you find all kinds of images that is really disturbing

190
00:11:47,559 --> 00:11:48,520
Speaker 3: to the human eye.

191
00:11:49,679 --> 00:11:52,719
Speaker 1: Even when these enormous data sets are open, it can

192
00:11:52,760 --> 00:11:55,920
Speaker 1: be too difficult and costly for independent researchers to audit

193
00:11:56,040 --> 00:12:00,400
Speaker 1: because they're too big. But even using smaller samples of data,

194
00:12:00,080 --> 00:12:03,280
Speaker 1: Abeba and her colleagues have uncovered a ton of problems.

195
00:12:03,960 --> 00:12:06,319
Speaker 1: In the past, their audits of a leading image

196
00:12:06,400 --> 00:12:10,079
Speaker 1: data set for AI documented so much racism and sexism

197
00:12:10,320 --> 00:12:14,920
Speaker 1: that it was decommissioned after decades of use. So, Abeba,

198
00:12:15,120 --> 00:12:17,480
Speaker 1: is it personal for you, the motivation to keep going?

199
00:12:18,360 --> 00:12:21,120
Speaker 3: Yeah, it is a bit personal. When I go into

200
00:12:21,240 --> 00:12:23,480
Speaker 3: data sets, for example, you know the first thing I

201
00:12:23,559 --> 00:12:27,439
Speaker 3: query is around you know, how black women are represented,

202
00:12:27,720 --> 00:12:31,320
Speaker 3: how Africa as a continent is represented, and so on.

203
00:12:31,440 --> 00:12:37,439
Speaker 3: So when I see all the negative images or extreme negative,

204
00:12:37,679 --> 00:12:44,760
Speaker 3: stereotypical caricatures, or you know, completely inaccurate, false, misleading informations,

205
00:12:45,000 --> 00:12:47,640
Speaker 3: you feel like if you don't say anything, if you

206
00:12:47,640 --> 00:12:51,520
Speaker 3: don't do anything about it, nobody else is gonna.

207
00:12:54,400 --> 00:12:58,160
Speaker 1: Abeba says we need regulation to make companies more transparent

208
00:12:58,200 --> 00:12:59,920
Speaker 1: about the data they use and where it came from.

209
00:13:01,000 --> 00:13:04,080
Speaker 1: She says, if companies can hide this information, they can

210
00:13:04,120 --> 00:13:06,920
Speaker 1: include data they don't actually have permission to use.

211
00:13:07,800 --> 00:13:11,680
Speaker 3: These artifacts are not something that just remain in the

212
00:13:11,800 --> 00:13:17,080
Speaker 3: labs of big corporations. These are tools that infiltrate into

213
00:13:17,160 --> 00:13:20,880
Speaker 3: every social sphere. What information goes into them, what kind

214
00:13:20,920 --> 00:13:23,840
Speaker 3: of data set that is used to train them, where

215
00:13:23,880 --> 00:13:26,719
Speaker 3: the data set is sourced, and the quality of the

216
00:13:26,800 --> 00:13:30,319
Speaker 3: data set itself, and how the models were built, and

217
00:13:30,640 --> 00:13:35,040
Speaker 3: any other important information should be open for auditing and

218
00:13:35,080 --> 00:13:38,840
Speaker 3: for scrutiny, given that they are almost treated as social

219
00:13:38,880 --> 00:13:41,960
Speaker 3: goods that are supposed to serve everybody. So some level

220
00:13:42,000 --> 00:13:46,880
Speaker 3: of openness is really important. In terms of making them

221
00:13:47,080 --> 00:13:51,160
Speaker 3: entirely open, some people have raised the issue of if

222
00:13:51,240 --> 00:13:54,760
Speaker 3: they can be accessed by everybody, bad actors can download

223
00:13:54,800 --> 00:13:59,440
Speaker 3: them and use them for problematic applications. There is always

224
00:13:59,480 --> 00:14:03,280
Speaker 3: a balance that we have to keep working around. We

225
00:14:03,400 --> 00:14:06,440
Speaker 3: have to always try and find that is between open

226
00:14:06,480 --> 00:14:07,000
Speaker 3: and closed.

227
00:14:08,480 --> 00:14:12,000
Speaker 1: It's because LLMs and their data sets can be problematic

228
00:14:12,200 --> 00:14:16,440
Speaker 1: that we need independent scrutiny of them. Could regulation empower

229
00:14:16,480 --> 00:14:18,840
Speaker 1: people to work together to improve these systems?

230
00:14:23,880 --> 00:14:25,880
Speaker 4: Currently, there's been a lot of kind of like polarizing

231
00:14:25,960 --> 00:14:29,480
Speaker 4: discourse about open versus closed source, as if those were

232
00:14:29,480 --> 00:14:32,560
Speaker 4: the only two choices, but they aren't the only two choices.

233
00:14:32,720 --> 00:14:36,960
Speaker 4: It's kind of like more productive, more forward thinking to

234
00:14:37,000 --> 00:14:39,920
Speaker 4: acknowledge the fact that it's a gradient, it's a spectrum.

235
00:14:40,160 --> 00:14:43,520
Speaker 1: That's Sasha Luccioni, a leading researcher at a startup

236
00:14:43,560 --> 00:14:47,000
Speaker 1: called Hugging Face. They run an online platform for testing

237
00:14:47,040 --> 00:14:50,680
Speaker 1: and developing AI. It's so popular that they've been valued

238
00:14:50,680 --> 00:14:54,320
Speaker 1: at four point five billion dollars. Sasha and her colleagues

239
00:14:54,360 --> 00:14:56,480
Speaker 1: have a fresh take on the open source debate.

240
00:14:57,240 --> 00:14:59,320
Speaker 4: What point in the spectrum can I pick for this

241
00:14:59,360 --> 00:15:02,200
Speaker 4: in this model? And I think it's important, especially for

242
00:15:02,280 --> 00:15:05,840
Speaker 4: policymakers, to understand that it's not an us versus them.

243
00:15:05,920 --> 00:15:08,560
Speaker 4: It's not like a two camp situation. It's really like,

244
00:15:09,080 --> 00:15:11,720
Speaker 4: let's pick what works for each model. And also there's

245
00:15:11,760 --> 00:15:14,280
Speaker 4: no one size fits all solution. Depending on the model,

246
00:15:14,520 --> 00:15:18,320
Speaker 4: depending on the data, depending on the usage, some point

247
00:15:18,480 --> 00:15:20,840
Speaker 4: in that gradient is more or less fitting.

248
00:15:21,640 --> 00:15:24,560
Speaker 1: The spectrum of openness Sasha talks about isn't just

249
00:15:24,600 --> 00:15:27,000
Speaker 1: for a model's code or the data sets. It can

250
00:15:27,040 --> 00:15:29,760
Speaker 1: be for a lot more, like the documentation and the

251
00:15:29,800 --> 00:15:33,120
Speaker 1: so-called weights that determine how it works. These are

252
00:15:33,240 --> 00:15:36,320
Speaker 1: all decision points on openness, along with the usage terms.

253
00:15:37,200 --> 00:15:41,200
Speaker 1: Sasha's research at Hugging Face depends on openness. That's because

254
00:15:41,240 --> 00:15:44,040
Speaker 1: it's all about how to measure and lower the environmental

255
00:15:44,040 --> 00:15:48,640
Speaker 1: impact of language models. She says training the LLM

256
00:15:48,720 --> 00:15:52,960
Speaker 1: GPT-3 emitted as much carbon as five hundred transatlantic flights,

257
00:15:53,280 --> 00:15:57,200
Speaker 1: and she says open source technology helps with sustainability in

258
00:15:57,280 --> 00:15:58,000
Speaker 1: other ways too.

259
00:15:59,280 --> 00:16:03,280
Speaker 4: Definitely, the reason I joined Hugging Face was because I

260
00:16:03,360 --> 00:16:08,240
Speaker 4: truly believe that by helping open source AI research, we

261
00:16:08,280 --> 00:16:10,920
Speaker 4: can help the sustainability, the energy side of things, but

262
00:16:11,080 --> 00:16:14,920
Speaker 4: also in terms of democratization, like giving more people access

263
00:16:14,960 --> 00:16:17,160
Speaker 4: to models that they can both use out of the

264
00:16:17,200 --> 00:16:20,080
Speaker 4: box or they can fine tune them in order to

265
00:16:20,440 --> 00:16:23,640
Speaker 4: fit their context better. I think that's like a net

266
00:16:23,680 --> 00:16:25,920
Speaker 4: positive for everyone. And for me, it's kind of like

267
00:16:26,000 --> 00:16:30,600
Speaker 4: recycling or thrifting or, you know, buying something used

268
00:16:30,640 --> 00:16:33,080
Speaker 4: and then you know patching it up or changing it

269
00:16:33,120 --> 00:16:35,480
Speaker 4: a little bit to work with what you need it for.

270
00:16:35,560 --> 00:16:37,600
Speaker 4: And I mean I thrift like ninety five percent of

271
00:16:37,600 --> 00:16:40,360
Speaker 4: my clothes, So that's definitely a philosophy I'm really on

272
00:16:40,400 --> 00:16:43,200
Speaker 4: board with. And for me, open source is definitely

273
00:16:44,040 --> 00:16:46,840
Speaker 4: much more sustainable in the long run because you're not

274
00:16:46,920 --> 00:16:50,320
Speaker 4: constantly starting from scratch, and also people can work together

275
00:16:50,360 --> 00:16:52,200
Speaker 4: and so you have less wasted effort.

276
00:16:53,440 --> 00:16:56,720
Speaker 1: Sasha says a community initiative called Big Science is an

277
00:16:56,760 --> 00:17:00,760
Speaker 1: example of this. About two years ago, Hugging Face backed one

278
00:17:00,760 --> 00:17:04,200
Speaker 1: thousand people from sixty countries in a collaboration to develop

279
00:17:04,240 --> 00:17:06,240
Speaker 1: an open LLM called BLOOM.

280
00:17:07,119 --> 00:17:10,400
Speaker 4: It was literally a thousand researchers and volunteers from all over

281
00:17:10,400 --> 00:17:12,240
Speaker 4: the world who were like, hey, let's train a large

282
00:17:12,280 --> 00:17:15,159
Speaker 4: language model together because we don't have the resources to

283
00:17:15,160 --> 00:17:17,919
Speaker 4: do it like each one of us separately. And it

284
00:17:17,960 --> 00:17:20,040
Speaker 4: was great because we had people who were lawyers. We

285
00:17:20,080 --> 00:17:23,879
Speaker 4: had people who were like specialists in archival studies to

286
00:17:23,920 --> 00:17:25,679
Speaker 4: help get data from different places, Like I mean, we

287
00:17:25,680 --> 00:17:27,200
Speaker 4: had all sorts of people from all over the world,

288
00:17:27,200 --> 00:17:30,920
Speaker 4: and people who don't necessarily have like a supercomputer on premise,

289
00:17:31,119 --> 00:17:32,919
Speaker 4: who don't work in a big tech company that can

290
00:17:32,960 --> 00:17:35,320
Speaker 4: give them access to some kind of compute to train

291
00:17:35,400 --> 00:17:36,000
Speaker 4: these models.

292
00:17:38,760 --> 00:17:42,160
Speaker 1: Open communities like this one could be directly affected by

293
00:17:42,160 --> 00:17:46,160
Speaker 1: policies that either limit or encourage important research for alternatives.

294
00:17:46,760 --> 00:17:48,760
Speaker 4: During the Big Science project, I joined Hugging Face because

295
00:17:48,760 --> 00:17:49,840
Speaker 4: I was like, Yeah, this is the kind of work

296
00:17:49,880 --> 00:17:52,040
Speaker 4: I want to do. I don't want to have to

297
00:17:52,080 --> 00:17:53,760
Speaker 4: be secretive about what I'm doing. I want to do

298
00:17:53,760 --> 00:17:55,240
Speaker 4: it in an open source way, and I want to

299
00:17:55,440 --> 00:17:58,760
Speaker 4: help other people who don't necessarily have the means to

300
00:17:58,960 --> 00:18:01,000
Speaker 4: train these kinds of models. I want to help them

301
00:18:01,119 --> 00:18:04,520
Speaker 4: also benefit from this technology. The fact that we had

302
00:18:04,520 --> 00:18:06,760
Speaker 4: all these people involved in Big Science made the whole

303
00:18:06,800 --> 00:18:12,359
Speaker 4: project and the ensuing model much more representative of society,

304
00:18:12,400 --> 00:18:14,840
Speaker 4: I feel. And that's important because when these models get

305
00:18:14,920 --> 00:18:19,719
Speaker 4: used in downstream models or downstream tools or systems, then

306
00:18:19,800 --> 00:18:22,879
Speaker 4: any kind of information that's implicitly encoded in the model

307
00:18:22,920 --> 00:18:24,640
Speaker 4: will bubble up to the surface.

308
00:18:25,520 --> 00:18:28,359
Speaker 1: So with all these gradients of openness, it's not only

309
00:18:28,440 --> 00:18:32,000
Speaker 1: the biggest AI companies developing LLMs, and that can be

310
00:18:32,000 --> 00:18:37,360
Speaker 1: a good thing. There's an open source alternative to

311
00:18:37,359 --> 00:18:42,280
Speaker 1: ChatGPT called GPT4All. Amazingly, it works without an

312
00:18:42,320 --> 00:18:45,600
Speaker 1: Internet connection, and the LLMs are compressed so much that

313
00:18:45,640 --> 00:18:49,520
Speaker 1: you can download them to any regular personal computer. GPT4All

314
00:18:49,680 --> 00:18:51,720
Speaker 1: was launched by a New York startup called

315
00:18:51,760 --> 00:18:55,119
Speaker 1: Nomic earlier this year as a privacy-preserving alternative to

316
00:18:55,200 --> 00:18:58,560
Speaker 1: ChatGPT. Tens of thousands of people flocked to it.

317
00:19:00,200 --> 00:19:02,040
Speaker 1: Nomic co-founder Andriy Mulyar.

318
00:19:02,720 --> 00:19:05,000
Speaker 5: One of the biggest focuses that we have around GPT4All

319
00:19:05,080 --> 00:19:08,320
Speaker 5: is making sure that privacy is the first thing

320
00:19:08,400 --> 00:19:10,760
Speaker 5: we think about in some sense. One of the core

321
00:19:10,800 --> 00:19:13,720
Speaker 5: reasons behind why we even built GPT4All and the

322
00:19:13,760 --> 00:19:16,359
Speaker 5: ecosystem of models that came in with it, was because

323
00:19:16,359 --> 00:19:19,360
Speaker 5: of all these large sort of like issues and concerns

324
00:19:19,359 --> 00:19:22,439
Speaker 5: about privacy with people using OpenAI's models.

325
00:19:23,840 --> 00:19:26,199
Speaker 1: You may not know this, but when you type prompts

326
00:19:26,200 --> 00:19:29,560
Speaker 1: into ChatGPT, OpenAI can use whatever you type

327
00:19:29,600 --> 00:19:32,560
Speaker 1: to further train their models. There have even been numerous

328
00:19:32,560 --> 00:19:35,520
Speaker 1: privacy leaks because of it, both corporate and personal.

329
00:19:36,240 --> 00:19:38,639
Speaker 5: The privacy angle that we focus on specifically is making

330
00:19:38,680 --> 00:19:41,520
Speaker 5: sure that the application in its open source form, you

331
00:19:41,520 --> 00:19:42,920
Speaker 5: can see all of the code, so we start out

332
00:19:42,920 --> 00:19:44,720
Speaker 5: from that. That makes it safe. We make sure that

333
00:19:44,760 --> 00:19:47,280
Speaker 5: everything's audited by the community. And the next thing is

334
00:19:47,320 --> 00:19:49,639
Speaker 5: that we make sure we abide by all laws and regulations

335
00:19:49,640 --> 00:19:52,800
Speaker 5: across Europe and across the US. We don't gather user

336
00:19:52,840 --> 00:19:55,920
Speaker 5: specific data whenever they use, for instance, the models, and

337
00:19:56,000 --> 00:19:58,560
Speaker 5: we make sure that the models can run without access

338
00:19:58,600 --> 00:20:00,720
Speaker 5: to any internet, so you can go once you download

339
00:20:00,720 --> 00:20:03,040
Speaker 5: the models to your computer, you can turn off your Internet.

340
00:20:03,119 --> 00:20:05,199
Speaker 5: If you're stuck in the jungle and you don't have

341
00:20:05,240 --> 00:20:07,160
Speaker 5: access to Internet, you can ask it for help.

342
00:20:07,840 --> 00:20:11,160
Speaker 1: Nomic's mission is to improve the explainability and accessibility

343
00:20:11,200 --> 00:20:14,320
Speaker 1: of AI. Their main software product is a data exploration

344
00:20:14,480 --> 00:20:18,040
Speaker 1: tool for massive data sets called Atlas, but Andriy believes

345
00:20:18,200 --> 00:20:21,119
Speaker 1: GPT4All is important for them to devote resources

346
00:20:21,160 --> 00:20:22,040
Speaker 1: to as a company.

347
00:20:22,760 --> 00:20:25,480
Speaker 5: When you run a business, there are certain things you

348
00:20:25,480 --> 00:20:27,360
Speaker 5: get the opportunity to do that you wouldn't be able

349
00:20:27,440 --> 00:20:29,080
Speaker 5: to do if you weren't running a business. One of

350
00:20:29,119 --> 00:20:31,200
Speaker 5: those is you have access to capital to be able

351
00:20:31,200 --> 00:20:34,199
Speaker 5: to work on risky projects like GPT4All purely

352
00:20:34,240 --> 00:20:36,960
Speaker 5: because you want to, not because you know there's some

353
00:20:37,040 --> 00:20:39,440
Speaker 5: direct revenue driving source of it.

354
00:20:40,480 --> 00:20:43,400
Speaker 1: Mainly, Andriy says he's motivated by a wish to see

355
00:20:43,440 --> 00:20:46,080
Speaker 1: AI developed by more than just a handful of companies.

356
00:20:46,359 --> 00:20:49,280
Speaker 1: But he also raises a question of values and who

357
00:20:49,320 --> 00:20:51,080
Speaker 1: decides how LLMs behave.

358
00:20:51,720 --> 00:20:54,240
Speaker 5: So biases aren't always bad. So an example of a

359
00:20:54,240 --> 00:20:57,800
Speaker 5: bias could be the model always, you know, prefers to

360
00:20:58,640 --> 00:21:01,439
Speaker 5: greet you with a salutation before giving you a response. Right,

361
00:21:01,480 --> 00:21:03,240
Speaker 5: that's a bias that might not be bad. But obviously

362
00:21:03,280 --> 00:21:05,600
Speaker 5: there's biases that could be bad. Right, And one of

363
00:21:05,600 --> 00:21:07,959
Speaker 5: these sort of important things with large language models is

364
00:21:08,280 --> 00:21:10,480
Speaker 5: the fact that you can actually go in and customize this.

365
00:21:10,560 --> 00:21:12,479
Speaker 5: So if you have your own examples of data that

366
00:21:12,480 --> 00:21:14,400
Speaker 5: you would like your model to be able to output,

367
00:21:14,480 --> 00:21:16,240
Speaker 5: you can actually change that by training the model.

368
00:21:17,359 --> 00:21:20,840
Speaker 1: Andriy offers the example of OpenAI training ChatGPT

369
00:21:21,200 --> 00:21:25,399
Speaker 1: not to output hateful statements. Today, GPT4All gives

370
00:21:25,440 --> 00:21:28,440
Speaker 1: access to models fine-tuned not to offend, as well

371
00:21:28,480 --> 00:21:31,760
Speaker 1: as some that aren't. Andriy says they've had some backlash

372
00:21:31,840 --> 00:21:34,840
Speaker 1: from people criticizing them for giving more people access to

373
00:21:35,119 --> 00:21:37,160
Speaker 1: LLMs that could be used for harm.

374
00:21:37,840 --> 00:21:40,119
Speaker 5: The reality is, like this technology isn't going away. The

375
00:21:40,119 --> 00:21:41,440
Speaker 5: biggest thing is we need to learn how to live

376
00:21:41,440 --> 00:21:43,480
Speaker 5: with it and how to be able to cope with

377
00:21:43,560 --> 00:21:45,439
Speaker 5: the side effects that emerge from it. A lot of

378
00:21:45,440 --> 00:21:47,119
Speaker 5: them will be positive, some of them are going to

379
00:21:47,119 --> 00:21:49,159
Speaker 5: be negative. Like one of the things that I guess

380
00:21:49,480 --> 00:21:51,320
Speaker 5: I think about quite a bit is like what happens

381
00:21:51,920 --> 00:21:54,119
Speaker 5: in the 2024 election in the United States.

382
00:21:54,480 --> 00:21:56,800
Speaker 5: You can go in and pick ten thousand people, get

383
00:21:56,800 --> 00:22:00,240
Speaker 5: their Facebook profile, and customize a chatbot that appears to

384
00:22:00,240 --> 00:22:01,920
Speaker 5: be a human to convince them to think one way

385
00:22:02,000 --> 00:22:03,760
Speaker 5: or the other, and you can do that for no

386
00:22:03,880 --> 00:22:06,240
Speaker 5: cost at all. I guess the thing that keeps me

387
00:22:06,280 --> 00:22:08,879
Speaker 5: awake at night is if we're going to live in

388
00:22:08,880 --> 00:22:12,360
Speaker 5: this inevitable world where we're surrounded by machines that can

389
00:22:12,680 --> 00:22:17,119
Speaker 5: generate synthesized versions of information, and all that information is

390
00:22:17,160 --> 00:22:20,840
Speaker 5: being piped from one or two company servers. If there's

391
00:22:20,880 --> 00:22:24,040
Speaker 5: a world where someone like OpenAI owns all the

392
00:22:24,160 --> 00:22:26,520
Speaker 5: pipes for the information flow, and then they get the

393
00:22:26,600 --> 00:22:27,960
Speaker 5: chance to manipulate.

394
00:22:27,480 --> 00:22:28,280
Speaker 5: that however they want.

395
00:22:29,040 --> 00:22:31,439
Speaker 5: This is like why we do what we do. We

396
00:22:31,480 --> 00:22:33,840
Speaker 5: want to make sure that these generative AI models that

397
00:22:34,440 --> 00:22:39,040
Speaker 5: persist through the world are built with everyone's view into

398
00:22:39,080 --> 00:22:41,520
Speaker 5: how the models are being created, not just a couple

399
00:22:41,520 --> 00:22:48,160
Speaker 5: of organizations behind closed doors with unlimited resources.

400
00:22:50,400 --> 00:22:54,720
Speaker 1: LLMs are here. Open source communities that do put people

401
00:22:54,760 --> 00:22:58,240
Speaker 1: ahead of profits are crucial to unlocking the positive potential

402
00:22:58,320 --> 00:23:01,760
Speaker 1: of generative AI. The challenge for builders and regulators is

403
00:23:01,760 --> 00:23:05,000
Speaker 1: to find that balance on the one hand, so generative

404
00:23:05,040 --> 00:23:07,919
Speaker 1: AI isn't developed or deployed in harmful ways, and on

405
00:23:08,000 --> 00:23:11,520
Speaker 1: the other to empower independent researchers to contribute to how

406
00:23:11,560 --> 00:23:18,560
Speaker 1: systems work. I'm Bridget Todd. Thanks for listening to IRL: Online

407
00:23:18,600 --> 00:23:21,840
Speaker 1: Life is Real Life, an original podcast from Mozilla, the

408
00:23:21,920 --> 00:23:25,640
Speaker 1: nonprofit behind Firefox. For more about our guests, check

409
00:23:25,680 --> 00:23:29,479
Speaker 1: out our show notes or visit irlpodcast.org. This season,

410
00:23:29,760 --> 00:23:37,280
Speaker 1: we're talking about people over profit in AI. Mozilla. Reclaim

411
00:23:37,400 --> 00:23:39,159
Speaker 1: the Internet