1 00:00:04,360 --> 00:00:06,080 Speaker 1: There Are No Girls on the Internet, as a production 2 00:00:06,160 --> 00:00:13,800 Speaker 1: of iHeartRadio and Unbossed Creative. I'm Bridget and this is 3 00:00:13,840 --> 00:00:19,080 Speaker 1: There Are No Girls on the Internet. I'm hosting a 4 00:00:19,079 --> 00:00:23,840 Speaker 1: new season of Mozilla's podcast IRL: Online Life Is Real Life. 5 00:00:24,120 --> 00:00:27,760 Speaker 1: You might actually know Mozilla. They make the web browser Firefox. 6 00:00:28,400 --> 00:00:32,440 Speaker 1: This season of IRL is all about AI, specifically the 7 00:00:32,520 --> 00:00:35,600 Speaker 1: people who make AI, and how important it is to 8 00:00:35,600 --> 00:00:38,960 Speaker 1: put people above profit when it comes to AI. Now, 9 00:00:39,000 --> 00:00:41,680 Speaker 1: it's really easy to think of AI as just computer 10 00:00:41,800 --> 00:00:45,360 Speaker 1: brains and robots, but it's built and trained by people. 11 00:00:46,120 --> 00:00:48,440 Speaker 1: And as much as we talk about making sure AI 12 00:00:48,600 --> 00:00:51,440 Speaker 1: is ethical and equitable after it's been built and it's 13 00:00:51,440 --> 00:00:54,160 Speaker 1: out in the world, we should also remember the people 14 00:00:54,240 --> 00:00:57,000 Speaker 1: who build it from the very beginning too. It's something 15 00:00:57,240 --> 00:01:00,560 Speaker 1: really important that I think is overlooked in conversations about AI, 16 00:01:01,040 --> 00:01:04,040 Speaker 1: turning the people responsible for making it into a kind 17 00:01:04,040 --> 00:01:08,640 Speaker 1: of invisible human workforce. But they shouldn't be invisible. We 18 00:01:08,640 --> 00:01:10,680 Speaker 1: should listen to them when they speak up about this 19 00:01:10,720 --> 00:01:13,479 Speaker 1: technology and how it's going to shape all of our lives. 
20 00:01:14,120 --> 00:01:16,960 Speaker 1: So I wanted to share the very first episode of 21 00:01:16,959 --> 00:01:20,280 Speaker 1: this new season of IRL with you all here. This 22 00:01:20,319 --> 00:01:22,920 Speaker 1: one is all about the risks and rewards of AI 23 00:01:23,040 --> 00:01:27,440 Speaker 1: technology like ChatGPT being open source, that is, built in 24 00:01:27,480 --> 00:01:30,600 Speaker 1: a way that allows anyone to inspect, modify, and enhance 25 00:01:30,640 --> 00:01:32,920 Speaker 1: its code on their own. So let me know what 26 00:01:32,959 --> 00:01:35,520 Speaker 1: you think and if you enjoy it, please subscribe to 27 00:01:35,600 --> 00:01:43,120 Speaker 1: IRL: Online Life Is Real Life. So the first thing 28 00:01:43,160 --> 00:01:47,160 Speaker 1: I ever asked ChatGPT wasn't work related at all. It 29 00:01:47,240 --> 00:01:50,600 Speaker 1: was actually for help drafting kind of a tough personal 30 00:01:50,640 --> 00:01:53,160 Speaker 1: email I had to send. I was having trouble finding 31 00:01:53,160 --> 00:01:56,080 Speaker 1: the right words, the right tone, so I asked 32 00:01:56,120 --> 00:02:00,080 Speaker 1: ChatGPT and I was amazed. It actually produced something that 33 00:02:00,160 --> 00:02:03,240 Speaker 1: I might say. That was about a year ago. Fast 34 00:02:03,240 --> 00:02:05,720 Speaker 1: forward to today and OpenAI is said to be 35 00:02:05,760 --> 00:02:08,639 Speaker 1: on track to earn one billion dollars of revenue in 36 00:02:08,680 --> 00:02:12,040 Speaker 1: the next year. Even though large language models aren't new, 37 00:02:12,600 --> 00:02:16,400 Speaker 1: suddenly more people can see the potential through that simple 38 00:02:16,440 --> 00:02:24,840 Speaker 1: interface, for good, for bad, and for making money. This 39 00:02:25,000 --> 00:02:28,440 Speaker 1: is IRL, an original podcast from Mozilla, the nonprofit 40 00:02:28,520 --> 00:02:32,400 Speaker 1: behind Firefox. 
This season we meet people who are building 41 00:02:32,480 --> 00:02:37,680 Speaker 1: artificial intelligence that puts people over profit. I'm Bridget Todd. 42 00:02:38,360 --> 00:02:41,560 Speaker 1: In this episode, we get into the risks and rewards 43 00:02:41,600 --> 00:02:45,359 Speaker 1: of the tech that makes ChatGPT talk. We're talking 44 00:02:45,360 --> 00:02:49,600 Speaker 1: about large language models, LLMs for short, and the controversy 45 00:02:49,680 --> 00:02:53,600 Speaker 1: over suddenly giving the whole world access to build with them. 46 00:02:54,240 --> 00:02:57,840 Speaker 1: But chatbots are only one example of what powerful LLMs 47 00:02:57,919 --> 00:03:01,960 Speaker 1: can do. Imagine games where characters can chat with you more, 48 00:03:02,360 --> 00:03:06,240 Speaker 1: or virtual assistants that can draft emails for you at work. Banks, 49 00:03:06,280 --> 00:03:09,920 Speaker 1: insurance companies, travel agencies: everyone is thinking about how to 50 00:03:10,000 --> 00:03:14,200 Speaker 1: use this technology to increase productivity and more. But there's 51 00:03:14,240 --> 00:03:17,880 Speaker 1: also a lot of talk about the risks. 52 00:03:19,120 --> 00:03:22,120 Speaker 2: I think a lot of people don't understand the detailed 53 00:03:22,120 --> 00:03:25,120 Speaker 2: capabilities of large language models, so you could use them 54 00:03:25,160 --> 00:03:28,000 Speaker 2: to really tear apart the civic fabric of a country. 55 00:03:29,040 --> 00:03:33,160 Speaker 1: That's David Evan Harris. Over five years he managed teams 56 00:03:33,160 --> 00:03:37,440 Speaker 1: that kept harmful content off Facebook and later also researched 57 00:03:37,480 --> 00:03:42,440 Speaker 1: responsible AI for Meta. Today, he's worried that LLMs can 58 00:03:42,480 --> 00:03:45,200 Speaker 1: be used to generate disinformation and hate speech on a 59 00:03:45,200 --> 00:03:48,680 Speaker 1: greater scale than ever. 
Like other big tech companies, Meta 60 00:03:48,720 --> 00:03:52,400 Speaker 1: develops its own LLMs, and now they're urging people to 61 00:03:52,480 --> 00:03:58,080 Speaker 1: use them and tweak them, with few strings attached. Meta's LLMs 62 00:03:58,080 --> 00:04:03,080 Speaker 1: are called Llama. They might have a cute name, but 63 00:04:03,200 --> 00:04:06,880 Speaker 1: David says there's a potentially ugly side to Meta's open LLM. 64 00:04:07,800 --> 00:04:09,800 Speaker 2: I have a long history with open source and a 65 00:04:09,840 --> 00:04:13,920 Speaker 2: big passion for it, but thinking about large language models 66 00:04:13,920 --> 00:04:17,440 Speaker 2: and Llama and whether or not these things are safe 67 00:04:17,480 --> 00:04:20,279 Speaker 2: to be open source has been a real turning point 68 00:04:20,279 --> 00:04:24,440 Speaker 2: for me. I remember more than a decade ago having 69 00:04:24,480 --> 00:04:29,240 Speaker 2: some conversations with a friend at MIT about the possibility 70 00:04:29,360 --> 00:04:33,760 Speaker 2: of open source licenses that don't allow for military use. 71 00:04:34,279 --> 00:04:37,160 Speaker 2: We love making open source software, but what if our 72 00:04:37,240 --> 00:04:40,000 Speaker 2: open source software is being used to make bombs and 73 00:04:40,080 --> 00:04:42,600 Speaker 2: kill people? We don't want to do that. Now, that 74 00:04:42,680 --> 00:04:47,640 Speaker 2: connects to this question of what's the threshold for something 75 00:04:47,720 --> 00:04:50,440 Speaker 2: that we're not comfortable having open source. 
I just think 76 00:04:50,480 --> 00:04:53,120 Speaker 2: the bigger danger that I keep coming back to, and 77 00:04:53,200 --> 00:04:58,279 Speaker 2: maybe not bigger, but the very important danger, is misinformation, 78 00:04:58,440 --> 00:05:01,320 Speaker 2: and it's the idea that a system like Llama 2 79 00:05:01,760 --> 00:05:06,919 Speaker 2: could be really effectively abused in a large influence operation 80 00:05:07,080 --> 00:05:11,080 Speaker 2: campaign by what we call in the industry a sophisticated 81 00:05:11,160 --> 00:05:14,760 Speaker 2: threat actor, and that basically means like an intelligence agency 82 00:05:14,839 --> 00:05:19,719 Speaker 2: that probably has great hardware and big budgets and well 83 00:05:19,720 --> 00:05:20,640 Speaker 2: trained engineers. 84 00:05:21,279 --> 00:05:24,440 Speaker 1: David's argument, echoed by many in the industry, is that 85 00:05:24,480 --> 00:05:27,719 Speaker 1: we don't really know how LLMs of today or tomorrow 86 00:05:27,880 --> 00:05:30,919 Speaker 1: could be harmful in the long term. But he's also 87 00:05:31,000 --> 00:05:33,400 Speaker 1: focused on the harms of the here and now and 88 00:05:33,440 --> 00:05:37,120 Speaker 1: how these disproportionately affect people who are already at risk 89 00:05:37,200 --> 00:05:42,160 Speaker 1: of exclusion and discrimination. So here's how I think about LLMs. 90 00:05:43,560 --> 00:05:45,839 Speaker 1: Put on your chef's hat for a moment and imagine 91 00:05:45,839 --> 00:05:49,680 Speaker 1: you're baking a delicious cake, a layer cake. The foundation, 92 00:05:50,120 --> 00:05:53,200 Speaker 1: or bottom layer, of that cake is a large language model. 93 00:05:53,320 --> 00:05:56,120 Speaker 1: It's made out of lots of Internet data. 
Now, some 94 00:05:56,160 --> 00:06:00,440 Speaker 1: of these ingredients aren't the best quality, but with additional layers, coloring, 95 00:06:00,760 --> 00:06:03,960 Speaker 1: icing, and sprinkles, you can fine-tune your system. 96 00:06:04,560 --> 00:06:07,200 Speaker 1: To make a chatbot, you fine-tune an LLM with 97 00:06:07,320 --> 00:06:10,719 Speaker 1: data of people chatting. To make a safer chatbot, you 98 00:06:10,800 --> 00:06:12,800 Speaker 1: train it with data that shows what prompts should 99 00:06:13,000 --> 00:06:17,400 Speaker 1: trigger safety replies. Whenever you're building software with LLMs like Llama, 100 00:06:17,680 --> 00:06:20,719 Speaker 1: GPT-4, or Falcon, that's just part of what goes 101 00:06:20,760 --> 00:06:23,080 Speaker 1: into the cake. So there are a lot of options 102 00:06:23,080 --> 00:06:25,680 Speaker 1: that go into creating an AI system, even when the 103 00:06:25,760 --> 00:06:28,120 Speaker 1: so-called foundational models are the same. 104 00:06:29,360 --> 00:06:32,160 Speaker 2: When you're using AI in a hiring system or in 105 00:06:32,160 --> 00:06:35,800 Speaker 2: an applicant tracking system that's sorting through thousands and thousands 106 00:06:35,839 --> 00:06:38,560 Speaker 2: of resumes, you don't need an LLM for that, but 107 00:06:39,160 --> 00:06:41,839 Speaker 2: you could use LLMs for that kind of thing. You 108 00:06:41,880 --> 00:06:45,040 Speaker 2: could use LLMs to give you analysis of different candidates. 109 00:06:45,480 --> 00:06:51,159 Speaker 2: And there may be situations where LLMs demonstrate bias. I 110 00:06:51,240 --> 00:06:55,520 Speaker 2: say this because banks are using LLMs too. 
If a 111 00:06:55,560 --> 00:06:58,080 Speaker 2: bank is using an LLM as part of their processes 112 00:06:58,240 --> 00:07:02,760 Speaker 2: to evaluate loans, and nobody has noticed yet because 113 00:07:02,760 --> 00:07:07,360 Speaker 2: that LLM has never been systematically tested for bias, maybe 114 00:07:07,400 --> 00:07:11,160 Speaker 2: that's introducing bias into that bank's system. So I think 115 00:07:11,160 --> 00:07:13,000 Speaker 2: there's some danger there. And a lot of people think, 116 00:07:13,240 --> 00:07:17,560 Speaker 2: oh, danger, that's not danger. And you know, if you're 117 00:07:17,920 --> 00:07:23,120 Speaker 2: getting denied a mortgage because of your race, that's danger 118 00:07:23,400 --> 00:07:23,680 Speaker 2: to me. 119 00:07:25,120 --> 00:07:27,960 Speaker 1: David feels the industry as a whole is rushing development. 120 00:07:28,360 --> 00:07:31,920 Speaker 1: At the same time, responsible AI teams have been downsized 121 00:07:31,960 --> 00:07:34,920 Speaker 1: at several companies. David himself was laid off from Meta's 122 00:07:34,960 --> 00:07:37,080 Speaker 1: responsible AI team in twenty twenty-two. 123 00:07:37,880 --> 00:07:40,400 Speaker 2: As a company that's using AI, or even as a 124 00:07:40,400 --> 00:07:44,280 Speaker 2: government that's using AI, or a nonprofit organization that's using AI, 125 00:07:44,760 --> 00:07:48,240 Speaker 2: you need to create robust processes to figure out how 126 00:07:48,280 --> 00:07:52,240 Speaker 2: and when it's appropriate to use AI systems, and you 127 00:07:52,320 --> 00:07:55,920 Speaker 2: need to have people who are not interested parties. And 128 00:07:55,960 --> 00:07:58,560 Speaker 2: in the case of a company, an interested party might 129 00:07:58,640 --> 00:08:01,400 Speaker 2: be just the engineer who wants to ship the damn 130 00:08:01,440 --> 00:08:05,160 Speaker 2: thing and get the feature running with the AI. 
And 131 00:08:05,720 --> 00:08:08,080 Speaker 2: you need to have someone who does not have an 132 00:08:08,080 --> 00:08:11,520 Speaker 2: incentive to ship products in the loop there who can say, 133 00:08:11,760 --> 00:08:15,560 Speaker 2: hold on, we might need another month of testing of this. 134 00:08:15,880 --> 00:08:18,760 Speaker 2: Hold on, we might need to find a way to 135 00:08:18,760 --> 00:08:21,920 Speaker 2: get someone from outside the company to really give 136 00:08:22,000 --> 00:08:24,280 Speaker 2: us an opinion about if this is a fair AI 137 00:08:24,400 --> 00:08:27,000 Speaker 2: system or if this is safe. 138 00:08:27,200 --> 00:08:29,720 Speaker 1: The reason so many LLMs are at our fingertips now 139 00:08:30,000 --> 00:08:34,600 Speaker 1: is that investors with deep pockets, Google, Microsoft, Meta, Elon 140 00:08:34,679 --> 00:08:38,040 Speaker 1: Musk, and others, have been pouring money into AI research 141 00:08:38,160 --> 00:08:42,720 Speaker 1: and powerful supercomputers. Some companies will bake LLMs into their 142 00:08:42,720 --> 00:08:46,680 Speaker 1: own products, others will make money by licensing access to them. 143 00:08:47,080 --> 00:08:50,520 Speaker 1: Everyone is competing for influence and for engineering talent that 144 00:08:50,559 --> 00:08:54,160 Speaker 1: can help them go faster. Openness can be a strategic 145 00:08:54,200 --> 00:08:57,600 Speaker 1: move to get ahead by attracting more developers, but often 146 00:08:57,720 --> 00:09:00,800 Speaker 1: companies also exaggerate how open they are, since it's not 147 00:09:00,800 --> 00:09:03,240 Speaker 1: always possible to see their data or methods. 148 00:09:07,480 --> 00:09:10,760 Speaker 3: So I've followed these models very closely, and I know 149 00:09:10,960 --> 00:09:15,000 Speaker 3: every time they're released, I know there is some element 150 00:09:15,120 --> 00:09:16,439 Speaker 3: of deception. 
151 00:09:17,960 --> 00:09:21,600 Speaker 1: That's Abeba Birhane. Time magazine just named her one 152 00:09:21,640 --> 00:09:24,880 Speaker 1: of the one hundred most influential people in AI. She's 153 00:09:24,920 --> 00:09:28,960 Speaker 1: a Mozilla advisor and a cognitive scientist from Ethiopia working 154 00:09:28,960 --> 00:09:30,880 Speaker 1: at Trinity College in Dublin, Ireland. 155 00:09:32,040 --> 00:09:35,640 Speaker 3: I mean Llama, for example, was introduced as, oh, an 156 00:09:35,720 --> 00:09:39,320 Speaker 3: open-source large language model, and I went into the 157 00:09:39,400 --> 00:09:43,640 Speaker 3: paper hoping to find information, detailed information, because I work 158 00:09:43,679 --> 00:09:47,400 Speaker 3: with data sets. I went immediately into the data sets 159 00:09:47,440 --> 00:09:51,000 Speaker 3: section and it was just one tiny, small paragraph in 160 00:09:51,040 --> 00:09:52,160 Speaker 3: that giant paper. 161 00:09:52,880 --> 00:09:55,360 Speaker 1: Abeba wants to know what's inside the data sets 162 00:09:55,360 --> 00:09:59,000 Speaker 1: for AI because systems trained on them mimic their biases. 163 00:09:59,640 --> 00:10:02,160 Speaker 1: Just a handful of data sets get used repeatedly 164 00:10:02,200 --> 00:10:06,680 Speaker 1: across most LLMs, and these usually include massive amounts of 165 00:10:06,679 --> 00:10:10,000 Speaker 1: Internet content from an open data set called Common Crawl. 166 00:10:10,760 --> 00:10:14,360 Speaker 3: The Internet can be a really toxic place. It holds, 167 00:10:14,480 --> 00:10:18,480 Speaker 3: you know, everything from the world's beauty to its ugliness 168 00:10:18,600 --> 00:10:23,680 Speaker 3: and everything in between. For example, during our audits we've 169 00:10:23,720 --> 00:10:30,040 Speaker 3: found content such as child abuse or genocide, or a 170 00:10:30,080 --> 00:10:34,559 Speaker 3: lot of explicit pornographic images. 
You also have to make 171 00:10:34,600 --> 00:10:38,440 Speaker 3: sure that personal, sensitive information that could be used to 172 00:10:38,640 --> 00:10:42,320 Speaker 3: identify individuals, you have to make sure things like this 173 00:10:42,640 --> 00:10:46,240 Speaker 3: are not included in data sets. That's one of the 174 00:10:46,280 --> 00:10:49,800 Speaker 3: reasons why we need to audit the data sets we 175 00:10:49,840 --> 00:10:51,680 Speaker 3: are using to train models. 176 00:10:53,360 --> 00:10:56,400 Speaker 1: Decades of research show the Internet has never been representative 177 00:10:56,400 --> 00:10:59,680 Speaker 1: of all the world's people or languages, but in generative 178 00:10:59,679 --> 00:11:03,840 Speaker 1: AI it becomes the ground truth. Abeba and her colleagues 179 00:11:03,880 --> 00:11:07,720 Speaker 1: have coined a term to highlight the problem they see. Abeba, 180 00:11:07,800 --> 00:11:09,840 Speaker 1: I noticed in one of your papers that y'all actually 181 00:11:09,920 --> 00:11:12,840 Speaker 1: use the term data swamps, not data sets. Where did 182 00:11:12,840 --> 00:11:15,080 Speaker 1: that term come from? Like why data swamps? 183 00:11:15,720 --> 00:11:21,160 Speaker 3: Data swamp is an attempt to kind of express how 184 00:11:21,440 --> 00:11:24,480 Speaker 3: such a huge dump like the Common Crawl or even 185 00:11:24,559 --> 00:11:28,880 Speaker 3: large scale data sets now, how they represent not only 186 00:11:29,320 --> 00:11:33,240 Speaker 3: the good and the healthy of humanity, but also the 187 00:11:33,440 --> 00:11:37,319 Speaker 3: nasty and ugly of humanity, because you find all kinds 188 00:11:37,360 --> 00:11:43,560 Speaker 3: of horrible, hateful, degrading texts, especially towards minoritized communities, and 189 00:11:43,640 --> 00:11:47,520 Speaker 3: you find all kinds of images that are really disturbing 190 00:11:47,559 --> 00:11:48,520 Speaker 3: to the human eye. 
191 00:11:49,679 --> 00:11:52,719 Speaker 1: Even when these enormous data sets are open, it can 192 00:11:52,760 --> 00:11:55,920 Speaker 1: be too difficult and costly for independent researchers to audit 193 00:11:56,040 --> 00:12:00,400 Speaker 1: because they're too big. But even using smaller samples of data, 194 00:12:00,080 --> 00:12:03,280 Speaker 1: Abeba and her colleagues have uncovered a ton of problems. 195 00:12:03,960 --> 00:12:06,319 Speaker 1: In the past, their audits of a leading image 196 00:12:06,400 --> 00:12:10,079 Speaker 1: data set for AI documented so much racism and sexism 197 00:12:10,320 --> 00:12:14,920 Speaker 1: that it was decommissioned after decades of use. So, Abeba, 198 00:12:15,120 --> 00:12:17,480 Speaker 1: is it personal for you, the motivation to keep going? 199 00:12:18,360 --> 00:12:21,120 Speaker 3: Yeah, it is a bit personal. When I go into 200 00:12:21,240 --> 00:12:23,480 Speaker 3: data sets, for example, you know the first thing I 201 00:12:23,559 --> 00:12:27,439 Speaker 3: query is around, you know, how black women are represented, 202 00:12:27,720 --> 00:12:31,320 Speaker 3: how Africa as a continent is represented, and so on. 203 00:12:31,440 --> 00:12:37,439 Speaker 3: So when I see all the negative images or extreme negative, 204 00:12:37,679 --> 00:12:44,760 Speaker 3: stereotypical caricatures, or you know, completely inaccurate, false, misleading information, 205 00:12:45,000 --> 00:12:47,640 Speaker 3: you feel like if you don't say anything, if you 206 00:12:47,640 --> 00:12:51,520 Speaker 3: don't do anything about it, nobody else is gonna. 207 00:12:54,400 --> 00:12:58,160 Speaker 1: Abeba says we need regulation to make companies more transparent 208 00:12:58,200 --> 00:12:59,920 Speaker 1: about the data they use and where it came from. 
209 00:13:01,000 --> 00:13:04,080 Speaker 1: She says, if companies can hide this information, they can 210 00:13:04,120 --> 00:13:06,920 Speaker 1: include data they don't actually have permission to use. 211 00:13:07,800 --> 00:13:11,680 Speaker 3: These artifacts are not something that just remain in the 212 00:13:11,800 --> 00:13:17,080 Speaker 3: labs of big corporations. These are tools that infiltrate into 213 00:13:17,160 --> 00:13:20,880 Speaker 3: every social sphere. What information goes into them, what kind 214 00:13:20,920 --> 00:13:23,840 Speaker 3: of data set is used to train them, where 215 00:13:23,880 --> 00:13:26,719 Speaker 3: the data set is sourced, and the quality of the 216 00:13:26,800 --> 00:13:30,319 Speaker 3: data set itself, and how the models were built, and 217 00:13:30,640 --> 00:13:35,040 Speaker 3: any other important information should be open for auditing and 218 00:13:35,080 --> 00:13:38,840 Speaker 3: for scrutiny, given that they are almost treated as social 219 00:13:38,880 --> 00:13:41,960 Speaker 3: goods that are supposed to serve everybody. So some level 220 00:13:42,000 --> 00:13:46,880 Speaker 3: of openness is really important. In terms of making them 221 00:13:47,080 --> 00:13:51,160 Speaker 3: entirely open, some people have raised the issue that if 222 00:13:51,240 --> 00:13:54,760 Speaker 3: they can be accessed by everybody, bad actors can download 223 00:13:54,800 --> 00:13:59,440 Speaker 3: them and use them for problematic applications. There is always 224 00:13:59,480 --> 00:14:03,280 Speaker 3: a balance that we have to keep working around. We 225 00:14:03,400 --> 00:14:06,440 Speaker 3: have to always try and find that balance between open 226 00:14:06,480 --> 00:14:07,000 Speaker 3: and closed. 227 00:14:08,480 --> 00:14:12,000 Speaker 1: It's because LLMs and their data sets can be problematic 228 00:14:12,200 --> 00:14:16,440 Speaker 1: that we need independent scrutiny of them. 
Could regulation empower 229 00:14:16,480 --> 00:14:18,840 Speaker 1: people to work together to improve these systems? 230 00:14:23,880 --> 00:14:25,880 Speaker 4: Currently, there's been a lot of kind of like polarizing 231 00:14:25,960 --> 00:14:29,480 Speaker 4: discourse about open versus closed source, as if those were 232 00:14:29,480 --> 00:14:32,560 Speaker 4: the only two choices, but they aren't the only two choices. 233 00:14:32,720 --> 00:14:36,960 Speaker 4: It's kind of like more productive, more forward thinking to 234 00:14:37,000 --> 00:14:39,920 Speaker 4: acknowledge the fact that it's a gradient, it's a spectrum. 235 00:14:40,160 --> 00:14:43,520 Speaker 1: That's Sasha Luccioni, a leading researcher at a startup 236 00:14:43,560 --> 00:14:47,000 Speaker 1: called Hugging Face. They run an online platform for testing 237 00:14:47,040 --> 00:14:50,680 Speaker 1: and developing AI. It's so popular that they've been valued 238 00:14:50,680 --> 00:14:54,320 Speaker 1: at four point five billion dollars. Sasha and her colleagues 239 00:14:54,360 --> 00:14:56,480 Speaker 1: have a fresh take on the open source debate. 240 00:14:57,240 --> 00:14:59,320 Speaker 4: What point in the spectrum can I pick for this 241 00:14:59,360 --> 00:15:02,200 Speaker 4: model? And I think it's important, especially for 242 00:15:02,280 --> 00:15:05,840 Speaker 4: policymakers, to understand that it's not an us versus them. 243 00:15:05,920 --> 00:15:08,560 Speaker 4: It's not like a two camp situation. It's really like, 244 00:15:09,080 --> 00:15:11,720 Speaker 4: let's pick what works for each model. And also there's 245 00:15:11,760 --> 00:15:14,280 Speaker 4: no one size fits all solution. Depending on the model, 246 00:15:14,520 --> 00:15:18,320 Speaker 4: depending on the data, depending on the usage, some point 247 00:15:18,480 --> 00:15:20,840 Speaker 4: in that gradient is more or less fitting. 
248 00:15:21,640 --> 00:15:24,560 Speaker 1: The spectrum of openness Sasha talks about, it's not just 249 00:15:24,600 --> 00:15:27,000 Speaker 1: for a model's code or the data sets. It can 250 00:15:27,040 --> 00:15:29,760 Speaker 1: be for a lot more, like the documentation and the 251 00:15:29,800 --> 00:15:33,120 Speaker 1: so-called weights that determine how it works. These are 252 00:15:33,240 --> 00:15:36,320 Speaker 1: all decision points on openness, along with the usage terms. 253 00:15:37,200 --> 00:15:41,200 Speaker 1: Sasha's research at Hugging Face depends on openness. That's because 254 00:15:41,240 --> 00:15:44,040 Speaker 1: it's all about how to measure and lower the environmental 255 00:15:44,040 --> 00:15:48,640 Speaker 1: impact of language models. She says training the LLM GPT-3 256 00:15:48,720 --> 00:15:52,960 Speaker 1: emitted as much carbon as five hundred transatlantic flights, 257 00:15:53,280 --> 00:15:57,200 Speaker 1: and she says open source technology helps with sustainability in 258 00:15:57,280 --> 00:15:58,000 Speaker 1: other ways too. 259 00:15:59,280 --> 00:16:03,280 Speaker 4: Definitely, the reason I joined Hugging Face was because I 260 00:16:03,360 --> 00:16:08,240 Speaker 4: truly believe that by helping open source AI research, we 261 00:16:08,280 --> 00:16:10,920 Speaker 4: can help the sustainability, the energy side of things, but 262 00:16:11,080 --> 00:16:14,920 Speaker 4: also in terms of democratization, like giving more people access 263 00:16:14,960 --> 00:16:17,160 Speaker 4: to models that they can both use out of the 264 00:16:17,200 --> 00:16:20,080 Speaker 4: box or they can fine-tune them in order to 265 00:16:20,440 --> 00:16:23,640 Speaker 4: fit their context better. I think that's like a net 266 00:16:23,680 --> 00:16:25,920 Speaker 4: positive for everyone. 
And for me, it's kind of like 267 00:16:26,000 --> 00:16:30,600 Speaker 4: recycling or thrifting or, you know, buying something used 268 00:16:30,640 --> 00:16:33,080 Speaker 4: and then, you know, patching it up or changing it 269 00:16:33,120 --> 00:16:35,480 Speaker 4: a little bit to work with what you need it for. 270 00:16:35,560 --> 00:16:37,600 Speaker 4: And I mean I thrift like ninety five percent of 271 00:16:37,600 --> 00:16:40,360 Speaker 4: my clothes, so that's definitely a philosophy I'm really on 272 00:16:40,400 --> 00:16:43,200 Speaker 4: board with. And for me, open source is definitely 273 00:16:44,040 --> 00:16:46,840 Speaker 4: much more sustainable in the long run because you're not 274 00:16:46,920 --> 00:16:50,320 Speaker 4: constantly starting from scratch, and also people can work together 275 00:16:50,360 --> 00:16:52,200 Speaker 4: and so you have less wasted effort. 276 00:16:53,440 --> 00:16:56,720 Speaker 1: Sasha says a community initiative called BigScience is an 277 00:16:56,760 --> 00:17:00,760 Speaker 1: example of this. About two years ago, Hugging Face backed one 278 00:17:00,760 --> 00:17:04,200 Speaker 1: thousand people from sixty countries in a collaboration to develop 279 00:17:04,240 --> 00:17:06,240 Speaker 1: an open LLM called BLOOM. 280 00:17:07,119 --> 00:17:10,400 Speaker 4: It was literally a thousand researchers and volunteers from all over 281 00:17:10,400 --> 00:17:12,240 Speaker 4: the world who were like, hey, let's train a large 282 00:17:12,280 --> 00:17:15,159 Speaker 4: language model together because we don't have the resources to 283 00:17:15,160 --> 00:17:17,919 Speaker 4: do it like each one of us separately. 
And it 284 00:17:17,960 --> 00:17:20,040 Speaker 4: was great because we had people who are lawyers. We 285 00:17:20,080 --> 00:17:23,879 Speaker 4: had people who were like specialists in archival studies to 286 00:17:23,920 --> 00:17:25,679 Speaker 4: help get data from different places. Like I mean, we 287 00:17:25,680 --> 00:17:27,200 Speaker 4: had all sorts of people from all over the world, 288 00:17:27,200 --> 00:17:30,920 Speaker 4: and people who don't necessarily have like a supercomputer on premise, 289 00:17:31,119 --> 00:17:32,919 Speaker 4: who don't work in a big tech company that can 290 00:17:32,960 --> 00:17:35,320 Speaker 4: give them access to some kind of compute to train 291 00:17:35,400 --> 00:17:36,000 Speaker 4: these models. 292 00:17:38,760 --> 00:17:42,160 Speaker 1: Open communities like this one could be directly affected by 293 00:17:42,160 --> 00:17:46,160 Speaker 1: policies that either limit or encourage important research for alternatives. 294 00:17:46,760 --> 00:17:48,760 Speaker 4: During the BigScience project, I joined Hugging Face because 295 00:17:48,760 --> 00:17:49,840 Speaker 4: I was like, yeah, this is the kind of work 296 00:17:49,880 --> 00:17:52,040 Speaker 4: I want to do. I don't want to have to 297 00:17:52,080 --> 00:17:53,760 Speaker 4: be secretive about what I'm doing. I want to do 298 00:17:53,760 --> 00:17:55,240 Speaker 4: it in an open source way, and I want to 299 00:17:55,440 --> 00:17:58,760 Speaker 4: help other people who don't necessarily have the means to 300 00:17:58,960 --> 00:18:01,000 Speaker 4: train these kinds of models. I want to help them 301 00:18:01,119 --> 00:18:04,520 Speaker 4: also benefit from this technology. 
The fact that we had 302 00:18:04,520 --> 00:18:06,760 Speaker 4: all these people involved in BigScience made the whole 303 00:18:06,800 --> 00:18:12,359 Speaker 4: project and the ensuing model much more representative of society, 304 00:18:12,400 --> 00:18:14,840 Speaker 4: I feel. And that's important because when these models get 305 00:18:14,920 --> 00:18:19,719 Speaker 4: used in downstream models or downstream tools or systems, then 306 00:18:19,800 --> 00:18:22,879 Speaker 4: any kind of information that's implicitly encoded in the model 307 00:18:22,920 --> 00:18:24,640 Speaker 4: will bubble up to the surface. 308 00:18:25,520 --> 00:18:28,359 Speaker 1: So with all these gradients of openness, it's not only 309 00:18:28,440 --> 00:18:32,000 Speaker 1: the biggest AI companies developing LLMs, and that can be 310 00:18:32,000 --> 00:18:37,360 Speaker 1: a good thing. There's an open source alternative to 311 00:18:37,359 --> 00:18:42,280 Speaker 1: ChatGPT called GPT4All. Amazingly, it works without an 312 00:18:42,320 --> 00:18:45,600 Speaker 1: Internet connection, and the LLMs are compressed so much that 313 00:18:45,640 --> 00:18:49,520 Speaker 1: you can download them to any regular personal computer. GPT4All 314 00:18:49,680 --> 00:18:51,720 Speaker 1: was launched by a New York startup called 315 00:18:51,760 --> 00:18:55,119 Speaker 1: Nomic earlier this year as a privacy preserving alternative to 316 00:18:55,200 --> 00:18:58,560 Speaker 1: ChatGPT. Tens of thousands of people flocked to it. 317 00:19:00,200 --> 00:19:02,040 Speaker 1: Nomic co-founder Andriy Mulyar. 318 00:19:02,720 --> 00:19:05,000 Speaker 5: One of the biggest focuses that we have around 319 00:19:05,080 --> 00:19:08,320 Speaker 5: GPT4All is making sure that privacy is the first thing 320 00:19:08,400 --> 00:19:10,760 Speaker 5: we think about in some sense. 
One of the core 321 00:19:10,800 --> 00:19:13,720 Speaker 5: reasons behind why we even built GPT4All and the 322 00:19:13,760 --> 00:19:16,359 Speaker 5: ecosystem of models that came in with it, was because 323 00:19:16,359 --> 00:19:19,360 Speaker 5: of all these large sort of like issues and concerns 324 00:19:19,359 --> 00:19:22,439 Speaker 5: about privacy with people using OpenAI's models. 325 00:19:23,840 --> 00:19:26,199 Speaker 1: You may not know this, but when you type prompts 326 00:19:26,200 --> 00:19:29,560 Speaker 1: into ChatGPT, OpenAI can use whatever you type 327 00:19:29,600 --> 00:19:32,560 Speaker 1: to further train their models. There have even been numerous 328 00:19:32,560 --> 00:19:35,520 Speaker 1: privacy leaks because of it, both corporate and personal. 329 00:19:36,240 --> 00:19:38,639 Speaker 5: The privacy angle that we focus on specifically is making 330 00:19:38,680 --> 00:19:41,520 Speaker 5: sure that the application is in its open source form. You 331 00:19:41,520 --> 00:19:42,920 Speaker 5: can see all of the code, so we start out 332 00:19:42,920 --> 00:19:44,720 Speaker 5: from that. That makes it safe. We make sure that 333 00:19:44,760 --> 00:19:47,280 Speaker 5: everything's audited by the community. And the next thing is 334 00:19:47,320 --> 00:19:49,639 Speaker 5: that we make sure we abide by all laws and regulations 335 00:19:49,640 --> 00:19:52,800 Speaker 5: across Europe and across the US. We don't gather user 336 00:19:52,840 --> 00:19:55,920 Speaker 5: specific data whenever they use, for instance, the models, and 337 00:19:56,000 --> 00:19:58,560 Speaker 5: we make sure that the models can run without access 338 00:19:58,600 --> 00:20:00,720 Speaker 5: to any internet, so once you download 339 00:20:00,720 --> 00:20:03,040 Speaker 5: the models to your computer, you can turn off your Internet. 
340 00:20:03,119 --> 00:20:05,199 Speaker 5: If you're stuck in the jungle and you don't have 341 00:20:05,240 --> 00:20:07,160 Speaker 5: access to Internet, you can ask it for help. 342 00:20:07,840 --> 00:20:11,160 Speaker 1: Nomic's mission is to improve the explainability and accessibility 343 00:20:11,200 --> 00:20:14,320 Speaker 1: of AI. Their main software product is a data exploration 344 00:20:14,480 --> 00:20:18,040 Speaker 1: tool for massive data sets called Atlas, but Andriy believes 345 00:20:18,200 --> 00:20:21,119 Speaker 1: GPT4All is important for them to devote resources 346 00:20:21,160 --> 00:20:22,040 Speaker 1: to as a company. 347 00:20:22,760 --> 00:20:25,480 Speaker 5: When you run a business, there are certain things you 348 00:20:25,480 --> 00:20:27,360 Speaker 5: get the opportunity to do that you wouldn't be able 349 00:20:27,440 --> 00:20:29,080 Speaker 5: to do if you weren't running a business. One of 350 00:20:29,119 --> 00:20:31,200 Speaker 5: those is you have access to capital to be able 351 00:20:31,200 --> 00:20:34,199 Speaker 5: to work on risky projects like GPT4All purely 352 00:20:34,240 --> 00:20:36,960 Speaker 5: because you want to, not because you know there's some 353 00:20:37,040 --> 00:20:39,440 Speaker 5: direct revenue driving source of it. 354 00:20:40,480 --> 00:20:43,400 Speaker 1: Mainly, Andriy says he's motivated by a wish to see 355 00:20:43,440 --> 00:20:46,080 Speaker 1: AI developed by more than just a handful of companies. 356 00:20:46,359 --> 00:20:49,280 Speaker 1: But he also raises a question of values and who 357 00:20:49,320 --> 00:20:51,080 Speaker 1: decides how LLMs behave. 358 00:20:51,720 --> 00:20:54,240 Speaker 5: So biases aren't always bad. So an example of a 359 00:20:54,240 --> 00:20:57,800 Speaker 5: bias could be the model always, you know, prefers to 360 00:20:58,640 --> 00:21:01,439 Speaker 5: greet you with a salutation before giving you a response.
Right, 361 00:21:01,480 --> 00:21:03,240 Speaker 5: that's a bias that might not be bad. But obviously 362 00:21:03,280 --> 00:21:05,600 Speaker 5: there's biases that could be bad. Right, And one of 363 00:21:05,600 --> 00:21:07,959 Speaker 5: these sort of important things with large language models is 364 00:21:08,280 --> 00:21:10,480 Speaker 5: the fact that you can actually go in and customize this. 365 00:21:10,560 --> 00:21:12,479 Speaker 5: So if you have your own examples of data that 366 00:21:12,480 --> 00:21:14,400 Speaker 5: you would like your model to be able to output, 367 00:21:14,480 --> 00:21:16,240 Speaker 5: you can actually change that by training the model. 368 00:21:17,359 --> 00:21:20,840 Speaker 1: Andriy offers the example of OpenAI training ChatGPT 369 00:21:21,200 --> 00:21:25,399 Speaker 1: not to output hateful statements. Today, GPT4All gives 370 00:21:25,440 --> 00:21:28,440 Speaker 1: access to models fine-tuned not to offend, as well 371 00:21:28,480 --> 00:21:31,760 Speaker 1: as some that aren't. Andriy says they've had some backlash 372 00:21:31,840 --> 00:21:34,840 Speaker 1: from people criticizing them for giving more people access to 373 00:21:35,119 --> 00:21:37,160 Speaker 1: LLMs that could be used for harm. 374 00:21:37,840 --> 00:21:40,119 Speaker 5: The reality is, like this technology isn't going away. The 375 00:21:40,119 --> 00:21:41,440 Speaker 5: biggest thing is we need to learn how to live 376 00:21:41,440 --> 00:21:43,480 Speaker 5: with it and how to be able to cope with 377 00:21:43,560 --> 00:21:45,439 Speaker 5: the side effects that emerge from it. A lot of 378 00:21:45,440 --> 00:21:47,119 Speaker 5: them will be positive, some of them are going to 379 00:21:47,119 --> 00:21:49,159 Speaker 5: be negative.
Like one of the things that I guess 380 00:21:49,480 --> 00:21:51,320 Speaker 5: I think about quite a bit is like what happens 381 00:21:51,920 --> 00:21:54,119 Speaker 5: in the twenty twenty four election in the United States. 382 00:21:54,480 --> 00:21:56,800 Speaker 5: You can go in and pick ten thousand people, get 383 00:21:56,800 --> 00:22:00,240 Speaker 5: their Facebook profile, and customize a chatbot that appears to 384 00:22:00,240 --> 00:22:01,920 Speaker 5: be a human to convince them to think one way 385 00:22:02,000 --> 00:22:03,760 Speaker 5: or the other, and you can do that for no 386 00:22:03,880 --> 00:22:06,240 Speaker 5: cost at all. I guess the thing that keeps me 387 00:22:06,280 --> 00:22:08,879 Speaker 5: awake at night is if we're going to live in 388 00:22:08,880 --> 00:22:12,360 Speaker 5: this inevitable world where we're surrounded by machines that can 389 00:22:12,680 --> 00:22:17,119 Speaker 5: generate synthesized versions of information, and all that information is 390 00:22:17,160 --> 00:22:20,840 Speaker 5: being piped from one or two company servers. If there's 391 00:22:20,880 --> 00:22:24,040 Speaker 5: a world where someone like OpenAI owns all the 392 00:22:24,160 --> 00:22:26,520 Speaker 5: pipes for the information flow, and then they get the 393 00:22:26,600 --> 00:22:27,960 Speaker 5: chance to manipulate 394 00:22:27,480 --> 00:22:28,280 Speaker 5: that however they want. 395 00:22:29,040 --> 00:22:31,439 Speaker 5: This is like why we do what we do. We 396 00:22:31,480 --> 00:22:33,840 Speaker 5: want to make sure that these generative AI models that 397 00:22:34,440 --> 00:22:39,040 Speaker 5: persist through the world are built with everyone's view into 398 00:22:39,080 --> 00:22:41,520 Speaker 5: how the models are being created, not just a couple 399 00:22:41,520 --> 00:22:48,160 Speaker 5: of organizations behind closed doors with unlimited resources. 400 00:22:50,400 --> 00:22:54,720 Speaker 1: LLMs are here.
Open source communities that do put people 401 00:22:54,760 --> 00:22:58,240 Speaker 1: ahead of profits are crucial to unlocking the positive potential 402 00:22:58,320 --> 00:23:01,760 Speaker 1: of generative AI. The challenge for builders and regulators is 403 00:23:01,760 --> 00:23:05,000 Speaker 1: to find that balance: on the one hand, so generative 404 00:23:05,040 --> 00:23:07,919 Speaker 1: AI isn't developed or deployed in harmful ways, and on 405 00:23:08,000 --> 00:23:11,520 Speaker 1: the other, to empower independent researchers to contribute to how 406 00:23:11,560 --> 00:23:18,560 Speaker 1: systems work. I'm Bridget Todd. Thanks for listening to IRL: Online 407 00:23:18,600 --> 00:23:21,840 Speaker 1: Life Is Real Life, an original podcast from Mozilla, the 408 00:23:21,920 --> 00:23:25,640 Speaker 1: nonprofit behind Firefox. For more about our guests, check 409 00:23:25,680 --> 00:23:29,479 Speaker 1: out our show notes or visit irlpodcast dot org. This season, 410 00:23:29,760 --> 00:23:37,280 Speaker 1: we're talking about people over profit in AI. Mozilla. Reclaim 411 00:23:37,400 --> 00:23:39,159 Speaker 1: the Internet.