1 00:00:04,480 --> 00:00:12,639 Speaker 1: Welcome to TechStuff, a production from iHeartRadio. Hey there, 2 00:00:12,640 --> 00:00:16,000 Speaker 1: and welcome to TechStuff. I'm your host, Jonathan Strickland. 3 00:00:16,040 --> 00:00:19,040 Speaker 1: I'm an executive producer with iHeart Podcasts. And how the 4 00:00:19,079 --> 00:00:23,280 Speaker 1: tech are you? So let's take a little literary trip. 5 00:00:23,600 --> 00:00:29,200 Speaker 1: In Anthony Burgess's A Clockwork Orange, the extremely wicked protagonist, 6 00:00:29,680 --> 00:00:32,920 Speaker 1: and that's putting it lightly, at one point early in 7 00:00:32,920 --> 00:00:36,760 Speaker 1: the novel reflects on the nature of permanence. He thinks 8 00:00:36,800 --> 00:00:40,680 Speaker 1: the reader might not remember what milk bars were like 9 00:00:41,159 --> 00:00:45,360 Speaker 1: due to, quote, things changing so skorry these days and 10 00:00:45,479 --> 00:00:49,600 Speaker 1: everybody very quick to forget, newspapers not being read much 11 00:00:49,760 --> 00:00:54,120 Speaker 1: neither, end quote. Alex in this case is saying that 12 00:00:54,200 --> 00:00:58,040 Speaker 1: the combination of the world changing very quickly (skorry is 13 00:00:58,080 --> 00:01:01,880 Speaker 1: derived from a Slavic word meaning swiftly or quickly) and 14 00:01:02,000 --> 00:01:05,720 Speaker 1: people having short memories means that referencing something that happened 15 00:01:05,760 --> 00:01:08,680 Speaker 1: even just a few years ago might mean you're met 16 00:01:08,680 --> 00:01:12,360 Speaker 1: with blank stares because the world has moved on. Now 17 00:01:12,520 --> 00:01:15,759 Speaker 1: take that same sentiment and crank it up to eleven 18 00:01:16,040 --> 00:01:18,840 Speaker 1: when you talk about the Internet in general and the 19 00:01:18,840 --> 00:01:21,600 Speaker 1: Web in particular. 
So, on the one hand, we know 20 00:01:22,000 --> 00:01:24,240 Speaker 1: that the rule of thumb is that once something gets 21 00:01:24,280 --> 00:01:27,920 Speaker 1: posted online, that's kind of it, right, it's sort of 22 00:01:27,959 --> 00:01:31,240 Speaker 1: perpetually online. Like that's kind of the joke. Like once 23 00:01:31,280 --> 00:01:33,520 Speaker 1: it's up, it's up, and you can take it down, 24 00:01:33,520 --> 00:01:35,280 Speaker 1: but there's going to be a copy of it somewhere. 25 00:01:35,720 --> 00:01:39,319 Speaker 1: So even if the originator tries to take down whatever 26 00:01:39,400 --> 00:01:43,440 Speaker 1: the stuff was, somebody's got it. But on the other hand, 27 00:01:43,440 --> 00:01:46,200 Speaker 1: we also know that so much stuff gets added every 28 00:01:46,240 --> 00:01:49,400 Speaker 1: single day to the Internet. There's actually a colossal mountain 29 00:01:49,400 --> 00:01:53,120 Speaker 1: of content out there that just keeps getting bigger moment 30 00:01:53,160 --> 00:01:55,960 Speaker 1: by moment, and everything that came before it can end 31 00:01:56,040 --> 00:01:59,480 Speaker 1: up getting buried in the process. And sometimes stuff can 32 00:01:59,560 --> 00:02:03,760 Speaker 1: be added and taken down without anyone being the wiser. Now, 33 00:02:03,800 --> 00:02:06,640 Speaker 1: on top of that, web pages obviously can change. A 34 00:02:06,720 --> 00:02:10,760 Speaker 1: website might adopt a new format or style, might incorporate 35 00:02:10,840 --> 00:02:15,000 Speaker 1: new technologies and interfaces that are added to web browsers, 36 00:02:15,360 --> 00:02:18,680 Speaker 1: or it might choose to remove sections that once might 37 00:02:18,720 --> 00:02:21,960 Speaker 1: have been relevant but maybe now not so much. 
Or 38 00:02:22,080 --> 00:02:27,079 Speaker 1: entire websites could disappear as servers go offline or companies 39 00:02:27,320 --> 00:02:32,040 Speaker 1: go bankrupt, or, you know, web administrators just lose interest. 40 00:02:32,520 --> 00:02:36,520 Speaker 1: The entire spectrum of human output can be found on 41 00:02:36,560 --> 00:02:39,400 Speaker 1: the web. Not every instance of human output, but an 42 00:02:39,440 --> 00:02:44,440 Speaker 1: example of everything is out there. Everything from deep philosophical 43 00:02:44,520 --> 00:02:48,040 Speaker 1: musings to the most banal posts, you know, which often 44 00:02:48,520 --> 00:02:51,320 Speaker 1: revolve around what someone is having for lunch. All of 45 00:02:51,320 --> 00:02:53,760 Speaker 1: that finds its way to the Internet. And while you 46 00:02:53,840 --> 00:02:56,600 Speaker 1: might argue that a lot of it, or perhaps even 47 00:02:56,680 --> 00:02:59,040 Speaker 1: most of it, isn't really worth the time it 48 00:02:59,080 --> 00:03:02,920 Speaker 1: takes to consume, let alone keep it around, there is 49 00:03:03,080 --> 00:03:06,160 Speaker 1: undeniably a huge amount of valuable data out there too, 50 00:03:06,639 --> 00:03:09,800 Speaker 1: but there's no guarantee that it will stay there or 51 00:03:09,880 --> 00:03:13,880 Speaker 1: remain easily findable. And that's where today's topic comes in. 52 00:03:13,960 --> 00:03:16,480 Speaker 1: I wanted to talk about a project that began back 53 00:03:16,520 --> 00:03:19,320 Speaker 1: in nineteen ninety six. It's a project that aims to 54 00:03:19,360 --> 00:03:22,520 Speaker 1: preserve as much of the Internet as possible in little 55 00:03:22,720 --> 00:03:26,600 Speaker 1: slices of time, little snapshots. 
Not only does that mean 56 00:03:26,639 --> 00:03:29,200 Speaker 1: you can potentially dig up something that hasn't been online 57 00:03:29,240 --> 00:03:31,919 Speaker 1: for years, but also you can get a look at 58 00:03:32,000 --> 00:03:35,080 Speaker 1: what different sites were like in various eras of the Web. 59 00:03:35,320 --> 00:03:37,600 Speaker 1: It could be a really eye-opening experience to see 60 00:03:37,640 --> 00:03:40,480 Speaker 1: something like Amazon and what it looked like, you know, 61 00:03:40,520 --> 00:03:43,960 Speaker 1: shortly after it launched, compared to what it looks like today. 62 00:03:44,400 --> 00:03:48,960 Speaker 1: So we are going to talk about the Internet Archive. Now, 63 00:03:48,960 --> 00:03:51,240 Speaker 1: to do that, we need to talk a little bit 64 00:03:51,240 --> 00:03:54,040 Speaker 1: about the people who founded the ding dang darn thing, 65 00:03:54,320 --> 00:03:58,520 Speaker 1: and that would be Brewster Kahle and Bruce Gilliat. So 66 00:03:58,680 --> 00:04:02,040 Speaker 1: Kahle graduated from MIT with a degree in computer science 67 00:04:02,040 --> 00:04:06,280 Speaker 1: and engineering. After he graduated, he joined fellow MIT graduate 68 00:04:06,400 --> 00:04:10,080 Speaker 1: Danny Hillis, who had created a company called Thinking Machines. 69 00:04:10,320 --> 00:04:13,960 Speaker 1: So this was a supercomputer company. His team specialized 70 00:04:13,960 --> 00:04:17,920 Speaker 1: in building massively parallel computer systems, mostly with the aim 71 00:04:17,960 --> 00:04:21,120 Speaker 1: of building machines for AI research and development. So yeah, 72 00:04:21,240 --> 00:04:24,480 Speaker 1: Kahle was working on the challenges of providing AI researchers 73 00:04:24,520 --> 00:04:28,040 Speaker 1: with the compute power they need, decades before our current 74 00:04:28,120 --> 00:04:33,040 Speaker 1: AI explosion. 
Bruce Gilliat is also a computer scientist, and 75 00:04:33,080 --> 00:04:35,160 Speaker 1: that's just about all I know about him. I mean, 76 00:04:35,320 --> 00:04:38,040 Speaker 1: I know he is, or at least was, married, and 77 00:04:38,120 --> 00:04:40,600 Speaker 1: I also know he owned a series of very impressive 78 00:04:40,600 --> 00:04:43,960 Speaker 1: houses in the San Francisco and San Jose areas because 79 00:04:44,000 --> 00:04:46,600 Speaker 1: it made the news whenever he sold one or bought 80 00:04:46,600 --> 00:04:49,679 Speaker 1: a new one. But other than that, there's precious little 81 00:04:49,680 --> 00:04:53,000 Speaker 1: information about him that I could find, which is somewhat ironic 82 00:04:53,040 --> 00:04:55,440 Speaker 1: when you consider that he has dedicated a lot of 83 00:04:55,440 --> 00:04:58,520 Speaker 1: time and effort to preserving information on the Internet. He 84 00:04:58,520 --> 00:05:00,839 Speaker 1: would go on to co-found a company called Alexa 85 00:05:00,920 --> 00:05:03,960 Speaker 1: Internet with Brewster Kahle, but that's getting ahead of ourselves. 86 00:05:04,080 --> 00:05:07,839 Speaker 1: So most of my story will center around Kahle simply 87 00:05:07,880 --> 00:05:10,520 Speaker 1: because out of the two co-founders, he's the one 88 00:05:10,520 --> 00:05:13,839 Speaker 1: who acted more as the face of the efforts, and Gilliat, 89 00:05:13,839 --> 00:05:15,880 Speaker 1: from what I can tell, has just been really good 90 00:05:15,880 --> 00:05:20,120 Speaker 1: about kind of maintaining a very private personal life. So 91 00:05:20,880 --> 00:05:24,960 Speaker 1: I don't mean to diminish Gilliat's contributions, but at the 92 00:05:24,960 --> 00:05:27,640 Speaker 1: same time, you know, I can only cover what I 93 00:05:27,640 --> 00:05:31,240 Speaker 1: can find. 
So in nineteen eighty nine, Kahle, along with 94 00:05:31,320 --> 00:05:35,080 Speaker 1: a colleague named Harry Morris, created an innovative tool for 95 00:05:35,200 --> 00:05:38,760 Speaker 1: the blossoming Internet. Now remember, this is the Internet. It's 96 00:05:38,839 --> 00:05:42,119 Speaker 1: not the World Wide Web. The Web didn't exist yet; 97 00:05:42,240 --> 00:05:45,159 Speaker 1: the Internet did, and the tool they created was called 98 00:05:45,160 --> 00:05:51,960 Speaker 1: the Wide Area Information Server, or WAIS (pronounced ways). So people 99 00:05:52,000 --> 00:05:55,040 Speaker 1: could set up an Internet server. They could host documents on 100 00:05:55,080 --> 00:05:59,960 Speaker 1: their servers. But finding these documents was really hard 101 00:06:00,720 --> 00:06:04,680 Speaker 1: because you didn't necessarily have hyperlinks connecting one document to 102 00:06:04,760 --> 00:06:07,920 Speaker 1: others and vice versa. You didn't have an easy way 103 00:06:07,960 --> 00:06:12,680 Speaker 1: of even navigating through different documents from one to the next. 104 00:06:13,160 --> 00:06:15,320 Speaker 1: So it was almost a case that you needed to 105 00:06:15,360 --> 00:06:19,080 Speaker 1: know where something was and what it was called first, 106 00:06:19,240 --> 00:06:22,440 Speaker 1: and then you could go to the relevant server and 107 00:06:22,480 --> 00:06:26,599 Speaker 1: retrieve that document. Otherwise the document would just remain quietly 108 00:06:26,680 --> 00:06:30,359 Speaker 1: sitting on some server somewhere and no one would know 109 00:06:30,400 --> 00:06:34,080 Speaker 1: about it. Now, that is antithetical to the entire purpose 110 00:06:34,160 --> 00:06:37,840 Speaker 1: of a wide area information sharing system, because, I mean, 111 00:06:37,880 --> 00:06:40,800 Speaker 1: the name tells us the whole purpose of this technology 112 00:06:40,839 --> 00:06:45,360 Speaker 1: is to allow information to be widely shared. 
Jeremy Norman's 113 00:06:45,400 --> 00:06:50,000 Speaker 1: History of Information lists WAIS as, quote, the first Internet 114 00:06:50,080 --> 00:06:54,120 Speaker 1: publishing system, just predating Gopher and the World Wide Web, 115 00:06:54,320 --> 00:06:58,839 Speaker 1: end quote. In a recorded presentation to some Xerox employees, 116 00:06:59,000 --> 00:07:03,120 Speaker 1: Kahle laid out his personal perspective on what he wants from 117 00:07:03,279 --> 00:07:06,159 Speaker 1: his experience on the Internet. So first up, he said 118 00:07:06,360 --> 00:07:09,520 Speaker 1: he wanted his own personal information to be easily accessible 119 00:07:09,960 --> 00:07:13,240 Speaker 1: by him. Specifically, not that it should be easily accessible 120 00:07:13,280 --> 00:07:16,880 Speaker 1: to everybody, but specifically to him. He wanted the ability 121 00:07:16,920 --> 00:07:19,760 Speaker 1: to get access to all the different stuff he generates, 122 00:07:19,800 --> 00:07:22,280 Speaker 1: like articles and such, and to make it really easy 123 00:07:22,320 --> 00:07:25,080 Speaker 1: to do that. He also wanted the ability for publishers 124 00:07:25,120 --> 00:07:27,960 Speaker 1: to get their work to him. So in Kahle's mind, 125 00:07:28,280 --> 00:07:30,720 Speaker 1: the best approach would be for published works that are 126 00:07:30,760 --> 00:07:33,360 Speaker 1: relevant to his interests to find their way to him, 127 00:07:33,560 --> 00:07:36,120 Speaker 1: as opposed to Kahle having to go out and hunt 128 00:07:36,200 --> 00:07:39,480 Speaker 1: down these published works himself. And he pointed out this 129 00:07:39,600 --> 00:07:42,480 Speaker 1: is what publishers want too, because you wouldn't publish something 130 00:07:42,560 --> 00:07:45,239 Speaker 1: unless you wanted folks to actually read it. He also 131 00:07:45,320 --> 00:07:48,160 Speaker 1: said that he wanted this technology to be usable anywhere. 
132 00:07:48,600 --> 00:07:51,200 Speaker 1: He wanted people to be able to access it no 133 00:07:51,240 --> 00:07:53,080 Speaker 1: matter what kind of device they were relying on. Now, 134 00:07:53,160 --> 00:07:56,160 Speaker 1: he was specifically referencing laptops at the time, but he 135 00:07:56,280 --> 00:08:00,120 Speaker 1: was also saying that portable computer systems, essentially things that 136 00:08:00,120 --> 00:08:03,400 Speaker 1: would become smartphones and tablets, were on the horizon and 137 00:08:03,440 --> 00:08:05,880 Speaker 1: that these needed to be able to access that stuff too. 138 00:08:06,280 --> 00:08:09,080 Speaker 1: And he said that he wanted people to be able 139 00:08:09,080 --> 00:08:11,880 Speaker 1: to use what he had learned should he choose to 140 00:08:11,880 --> 00:08:15,440 Speaker 1: share the information, that if he had come up with 141 00:08:15,480 --> 00:08:17,600 Speaker 1: something that was useful and he wanted to share that, 142 00:08:17,640 --> 00:08:19,760 Speaker 1: he wanted other people to be able to access that. 143 00:08:20,160 --> 00:08:23,120 Speaker 1: Kahle didn't say that people should be compelled to share, 144 00:08:23,560 --> 00:08:26,000 Speaker 1: but if they wanted to, it should be possible to 145 00:08:26,040 --> 00:08:30,560 Speaker 1: do so. WAIS was Kahle's attempt to bring these ideas 146 00:08:30,640 --> 00:08:34,199 Speaker 1: to life. In that presentation to the Xerox employees, he 147 00:08:34,320 --> 00:08:38,320 Speaker 1: defined WAIS as electronic publishing. He further defined that term 148 00:08:38,400 --> 00:08:41,880 Speaker 1: to mean the distribution of information. So whether the end 149 00:08:41,960 --> 00:08:45,080 Speaker 1: user was to look at this information on a computer 150 00:08:45,120 --> 00:08:48,280 Speaker 1: screen or they just chose to print out the information 151 00:08:48,640 --> 00:08:50,880 Speaker 1: and then read it that way, that was beside the point. 
152 00:08:51,120 --> 00:08:55,559 Speaker 1: Electronic publishing was all about how information got from the 153 00:08:55,600 --> 00:08:58,760 Speaker 1: originator to the end user. That's what made it 154 00:08:58,920 --> 00:09:02,880 Speaker 1: e-publishing: it was publishing over wires. Now, one thing 155 00:09:03,000 --> 00:09:06,800 Speaker 1: Kahle introduced in this presentation to Xerox was this concept 156 00:09:06,800 --> 00:09:10,760 Speaker 1: of conducting searches using natural language. This concept is one 157 00:09:10,800 --> 00:09:13,640 Speaker 1: that we're really familiar with today. You enter a query 158 00:09:13,800 --> 00:09:16,200 Speaker 1: into a search bar. You describe what it is that 159 00:09:16,240 --> 00:09:19,760 Speaker 1: you want to know or learn about, or have access to, 160 00:09:20,080 --> 00:09:23,400 Speaker 1: or retrieve or whatever. The search engine brings back search 161 00:09:23,440 --> 00:09:26,600 Speaker 1: results that are ordered by some kind of relevance depending 162 00:09:26,679 --> 00:09:29,960 Speaker 1: upon the search engine's, you know, various algorithms. How the 163 00:09:30,000 --> 00:09:33,760 Speaker 1: search engine determines relevance really depends upon the system itself, 164 00:09:33,880 --> 00:09:36,160 Speaker 1: of course. Like, you could run the same search across 165 00:09:36,400 --> 00:09:39,760 Speaker 1: different search engines and get very different results based upon 166 00:09:40,080 --> 00:09:45,280 Speaker 1: that methodology of determining relevance. If the system believes it's relevant, 167 00:09:45,480 --> 00:09:47,240 Speaker 1: it may or may not be relevant to what you 168 00:09:47,320 --> 00:09:50,520 Speaker 1: actually want. Like, hopefully the two are aligned. If it's 169 00:09:50,520 --> 00:09:53,400 Speaker 1: a really good search engine, then you're going to get 170 00:09:53,480 --> 00:09:57,600 Speaker 1: something that is actually meaningful to you. 
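To make the relevance point concrete, here is a minimal sketch in Python. It is not how WAIS or any real search engine scores documents. The two documents and both scoring functions (`score_by_count`, `score_by_density`) are hypothetical, chosen only to show that the same query over the same documents can come back in a different order depending on how relevance is defined.

```python
import re

# Hypothetical mini-collection; stand-ins for documents sitting on servers.
DOCS = {
    "long_report": "archive " * 3 + "filler " * 27,
    "short_note": "archive notes on the internet archive",
}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_by_count(query, text):
    """Relevance = raw number of times any query word appears."""
    words = tokenize(text)
    return sum(words.count(q) for q in tokenize(query))

def score_by_density(query, text):
    """Relevance = query-word hits divided by document length."""
    words = tokenize(text)
    return score_by_count(query, text) / len(words) if words else 0.0

def search(query, scorer):
    """Return document names ordered from most to least relevant."""
    return sorted(DOCS, key=lambda name: scorer(query, DOCS[name]), reverse=True)

print(search("archive", score_by_count))    # favors the long document
print(search("archive", score_by_density))  # favors the short, focused one
```

Neither ordering is "right" on its own; which one is useful depends on what the person searching actually wanted, which is the alignment problem described above.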
Anyway, WAIS was 171 00:09:57,720 --> 00:10:01,720 Speaker 1: kind of following that approach back before there was 172 00:10:01,760 --> 00:10:04,280 Speaker 1: a World Wide Web, you know, when you just needed 173 00:10:04,280 --> 00:10:08,200 Speaker 1: a way to find stuff that was being stored on 174 00:10:08,280 --> 00:10:11,880 Speaker 1: these Internet servers and to be able to retrieve these 175 00:10:11,920 --> 00:10:14,600 Speaker 1: documents to make use of them. Otherwise you had this 176 00:10:14,679 --> 00:10:19,360 Speaker 1: incredibly powerful communications tool, but it was so challenging to 177 00:10:19,480 --> 00:10:22,600 Speaker 1: use in a meaningful way that the information stored there 178 00:10:23,000 --> 00:10:26,560 Speaker 1: would be not that useful. I think of it as akin 179 00:10:26,679 --> 00:10:31,720 Speaker 1: to this: imagine that there's this one remote library and it's tiny, 180 00:10:32,080 --> 00:10:36,440 Speaker 1: but it has the world's only copy of some text. 181 00:10:36,840 --> 00:10:39,280 Speaker 1: But this library is in the middle of nowhere. It's really 182 00:10:39,360 --> 00:10:42,160 Speaker 1: hard to get to. The fact that that library has 183 00:10:42,280 --> 00:10:45,800 Speaker 1: that document would not be terribly useful to most people, 184 00:10:45,920 --> 00:10:47,840 Speaker 1: and so it might as well not have the document 185 00:10:47,880 --> 00:10:50,120 Speaker 1: at all. That's kind of what WAIS was trying to 186 00:10:50,160 --> 00:10:52,920 Speaker 1: do: solve this problem of making it easier to 187 00:10:52,960 --> 00:10:57,400 Speaker 1: get access to this wealth of information that Kahle saw 188 00:10:57,720 --> 00:11:01,880 Speaker 1: was only going to get more complex and more full 189 00:11:01,960 --> 00:11:05,600 Speaker 1: of data. Well, we'll move away from WAIS, because we 190 00:11:05,600 --> 00:11:08,280 Speaker 1: could do a full episode about that. 
I will say 191 00:11:08,280 --> 00:11:11,960 Speaker 1: that Kahle and Morris, the founders of WAIS, the guys 192 00:11:11,960 --> 00:11:17,120 Speaker 1: who created the WAIS technologies, would actually leave Thinking Machines 193 00:11:17,320 --> 00:11:20,680 Speaker 1: and they would found a spinoff company just called WAIS Incorporated. 194 00:11:20,920 --> 00:11:23,439 Speaker 1: And it was around this point when the mysterious Bruce 195 00:11:23,480 --> 00:11:26,840 Speaker 1: Gilliat joined the team. And while the World Wide Web would 196 00:11:26,880 --> 00:11:29,840 Speaker 1: debut in the early nineties, which really opened up accessibility 197 00:11:29,840 --> 00:11:32,040 Speaker 1: to information on the Internet for a lot of people, 198 00:11:32,480 --> 00:11:35,840 Speaker 1: most of them for the first time, WAIS would continue 199 00:11:35,880 --> 00:11:38,920 Speaker 1: to remain relevant. In fact, it was relevant enough that 200 00:11:39,040 --> 00:11:42,480 Speaker 1: in nineteen ninety five AOL would come calling with an 201 00:11:42,480 --> 00:11:45,959 Speaker 1: offer to purchase the company for a cool fifteen million dollars. 202 00:11:46,000 --> 00:11:48,840 Speaker 1: If we adjust that for inflation to today's money, that would 203 00:11:48,880 --> 00:11:53,640 Speaker 1: be around thirty million bucks, around that ballpark. 
Now, a 204 00:11:53,640 --> 00:11:56,680 Speaker 1: lot of the folks at WAIS Incorporated would split off 205 00:11:56,760 --> 00:12:00,679 Speaker 1: to create new companies after this acquisition, and within a 206 00:12:00,800 --> 00:12:04,400 Speaker 1: year that included Kahle and Gilliat, who went on to 207 00:12:04,559 --> 00:12:10,000 Speaker 1: found a new company called Alexa Internet. And you might think, huh, Alexa, 208 00:12:10,120 --> 00:12:13,280 Speaker 1: you mean like the same name as the Amazon digital assistant? 209 00:12:13,679 --> 00:12:16,559 Speaker 1: And yes, exactly that, because, as it would turn out, 210 00:12:16,600 --> 00:12:21,840 Speaker 1: Amazon would ultimately acquire Alexa Internet just a few years 211 00:12:21,880 --> 00:12:25,080 Speaker 1: after it was founded. But the name derived from the 212 00:12:25,120 --> 00:12:29,800 Speaker 1: Library at Alexandria, the ancient library of Egypt that at 213 00:12:29,880 --> 00:12:33,240 Speaker 1: one point housed one of the world's largest collections of 214 00:12:33,320 --> 00:12:39,400 Speaker 1: accumulated knowledge. Now, around forty eight BCE, Julius Caesar, Julie 215 00:12:39,400 --> 00:12:42,960 Speaker 1: Baby, and his boys, they barged into Alexandria, and as 216 00:12:43,000 --> 00:12:46,840 Speaker 1: a consequence of their rowdy invasion, the library caught fire 217 00:12:47,200 --> 00:12:49,920 Speaker 1: and much of the collection burned. Sadly, that was not 218 00:12:49,960 --> 00:12:52,880 Speaker 1: the only indignity. In fact, it wasn't the first indignity 219 00:12:53,200 --> 00:12:57,120 Speaker 1: that the library suffered that would impact its relevance. 
Further 220 00:12:57,240 --> 00:13:00,000 Speaker 1: conflicts a couple of centuries later pretty much wiped out 221 00:13:00,160 --> 00:13:03,560 Speaker 1: whatever had been left from the previous calamities, and the 222 00:13:03,600 --> 00:13:07,079 Speaker 1: Library of Alexandria became kind of a touchstone for folks 223 00:13:07,080 --> 00:13:10,160 Speaker 1: who have stressed the importance of access to knowledge and 224 00:13:10,240 --> 00:13:13,240 Speaker 1: the protection of that knowledge, and that the consequences that 225 00:13:13,360 --> 00:13:15,920 Speaker 1: could follow from the loss of such knowledge can be 226 00:13:15,960 --> 00:13:20,200 Speaker 1: really dire. See also, like, the Middle Ages, the Dark Ages, 227 00:13:20,200 --> 00:13:24,120 Speaker 1: for example. That loss of knowledge is a really terrible thing. 228 00:13:24,520 --> 00:13:28,000 Speaker 1: So the impetus for Alexa Internet was that Kahle and 229 00:13:28,080 --> 00:13:31,760 Speaker 1: Gilliat wanted, in the words of the Web Design Museum, quote, 230 00:13:31,840 --> 00:13:35,960 Speaker 1: to develop advanced web navigation that would continually improve itself 231 00:13:36,080 --> 00:13:39,520 Speaker 1: on the basis of user generated data, end quote, which 232 00:13:39,559 --> 00:13:42,679 Speaker 1: is a pretty advanced idea for nineteen ninety six, when 233 00:13:42,720 --> 00:13:45,600 Speaker 1: the Web was still very young and the general public 234 00:13:45,679 --> 00:13:47,439 Speaker 1: was still just trying to get a grip on exactly 235 00:13:47,480 --> 00:13:51,320 Speaker 1: what the Web and, by extension, the Internet were. One 236 00:13:51,360 --> 00:13:54,679 Speaker 1: of the first tools that Alexa Internet developed was a 237 00:13:54,720 --> 00:13:58,000 Speaker 1: browser toolbar. 
So installing this toolbar into a browser would 238 00:13:58,000 --> 00:14:01,120 Speaker 1: give users access to a sort of crowd powered 239 00:14:01,200 --> 00:14:04,640 Speaker 1: recommendation engine. In some ways, it's not that different from 240 00:14:04,840 --> 00:14:08,360 Speaker 1: sites like Digg and Reddit that would later rely on 241 00:14:08,440 --> 00:14:11,880 Speaker 1: the user community to actually work and to recommend links 242 00:14:11,920 --> 00:14:17,120 Speaker 1: to really interesting sites. This toolbar would recommend sites 243 00:14:17,120 --> 00:14:20,760 Speaker 1: to users based upon how the overall community was browsing. 244 00:14:20,920 --> 00:14:24,160 Speaker 1: So the more people who were using this toolbar, the 245 00:14:24,200 --> 00:14:27,480 Speaker 1: more information was going into where they were going, and 246 00:14:27,520 --> 00:14:29,720 Speaker 1: thus you would get different recommendations. So if a lot 247 00:14:29,720 --> 00:14:32,440 Speaker 1: of people were navigating to a specific site for whatever reason, 248 00:14:32,680 --> 00:14:35,320 Speaker 1: you might get a recommendation to do the same. It 249 00:14:35,360 --> 00:14:38,160 Speaker 1: was an attempt at an organic way for folks to 250 00:14:38,240 --> 00:14:41,560 Speaker 1: suggest websites, kind of like a word of mouth campaign, 251 00:14:41,920 --> 00:14:45,920 Speaker 1: and Alexa Internet would also provide meta information about websites 252 00:14:45,960 --> 00:14:48,840 Speaker 1: to users if they wanted it. Meta information is information 253 00:14:48,920 --> 00:14:52,240 Speaker 1: about information, so this would include stuff like how many 254 00:14:52,440 --> 00:14:55,400 Speaker 1: web pages were part of an overall website, or how 255 00:14:55,440 --> 00:14:58,600 Speaker 1: many other websites were pointing back to the one you 256 00:14:58,640 --> 00:15:01,200 Speaker 1: were on, and so forth. 
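As a rough illustration of the crowd-powered idea described above, here is a small Python sketch. It is not Alexa Internet's actual algorithm, which the episode does not describe; the `recommend` function, the visit log, and the `.example` site names are all hypothetical. It simply counts where the community goes and suggests the most visited sites that a given user has not seen yet.

```python
from collections import Counter

def recommend(community_visits, my_history, top_n=3):
    """Suggest the sites the community visits most that this user has not seen.

    community_visits: iterable of site names, one entry per logged visit
                      (hypothetical toolbar data).
    my_history: set of sites this user has already visited.
    """
    counts = Counter(site for site in community_visits if site not in my_history)
    return [site for site, _ in counts.most_common(top_n)]

# Hypothetical visit log gathered from everyone running the toolbar.
visits = ["news.example", "shop.example", "news.example",
          "wiki.example", "news.example", "shop.example"]

print(recommend(visits, my_history={"shop.example"}))
```

The more visits that flow into the log, the more the counts shift, which is the "more users, different recommendations" effect mentioned above.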
A lot of the stuff 257 00:15:01,360 --> 00:15:04,840 Speaker 1: that Alexa Internet could tell you would reflect a specific 258 00:15:04,880 --> 00:15:07,640 Speaker 1: web page's relevance. It's the same sort of information that 259 00:15:07,640 --> 00:15:10,600 Speaker 1: search engines like Google would take into account when deciding 260 00:15:10,640 --> 00:15:14,480 Speaker 1: relevance for search results. And that meant that it didn't 261 00:15:14,480 --> 00:15:16,520 Speaker 1: take very long for Amazon to come around with an 262 00:15:16,560 --> 00:15:20,000 Speaker 1: offer to purchase Alexa Internet. I'll talk about that more, 263 00:15:20,120 --> 00:15:22,920 Speaker 1: as well as the development of the Internet Archive, after 264 00:15:22,960 --> 00:15:26,360 Speaker 1: we come back from this quick break to thank our sponsors. 265 00:15:35,600 --> 00:15:40,000 Speaker 1: So Amazon in nineteen ninety nine takes a look at 266 00:15:40,080 --> 00:15:44,200 Speaker 1: Alexa Internet and says, wow, this is pretty incredible. This 267 00:15:44,600 --> 00:15:49,480 Speaker 1: little company has created some means of checking for stuff 268 00:15:49,480 --> 00:15:53,840 Speaker 1: like relevance and metadata that could be really, really useful 269 00:15:53,880 --> 00:15:57,280 Speaker 1: for us. And so Amazon made an offer that Alexa 270 00:15:57,320 --> 00:16:00,160 Speaker 1: Internet couldn't refuse, to acquire the company for the 271 00:16:00,240 --> 00:16:03,160 Speaker 1: princely sum of two hundred and fifty million dollars in 272 00:16:03,280 --> 00:16:07,680 Speaker 1: Amazon stock in May of ninety nine. 
So this is 273 00:16:07,880 --> 00:16:10,880 Speaker 1: a little different than the earlier deal we talked about 274 00:16:10,880 --> 00:16:14,840 Speaker 1: where AOL bought, you know, WAIS Incorporated, because Amazon 275 00:16:14,840 --> 00:16:17,120 Speaker 1: bought Alexa Internet with two hundred and fifty million dollars' 276 00:16:17,200 --> 00:16:19,920 Speaker 1: worth of stock. If we just treated that like it was 277 00:16:19,960 --> 00:16:25,040 Speaker 1: a cash exchange, and if we adjust for inflation, 278 00:16:25,120 --> 00:16:28,240 Speaker 1: that's like around four hundred and sixty nine million dollars' 279 00:16:28,240 --> 00:16:31,480 Speaker 1: worth of stock. But that's not really how you deal 280 00:16:31,520 --> 00:16:33,920 Speaker 1: with the value here, right? You have to think about 281 00:16:33,920 --> 00:16:36,680 Speaker 1: how much was the stock worth back in nineteen ninety 282 00:16:36,800 --> 00:16:39,600 Speaker 1: nine versus how much is the stock worth today. I 283 00:16:39,800 --> 00:16:43,480 Speaker 1: checked and I saw that in May of nineteen ninety nine, 284 00:16:43,560 --> 00:16:46,520 Speaker 1: Amazon stock was trading for around two dollars eighty nine 285 00:16:46,560 --> 00:16:49,400 Speaker 1: cents per share. These days, it's closer to one hundred 286 00:16:49,400 --> 00:16:53,840 Speaker 1: and eighty dollars per share. Plus, since that time, Amazon 287 00:16:53,920 --> 00:16:56,760 Speaker 1: had two different stock splits. There was a two to 288 00:16:56,760 --> 00:16:59,520 Speaker 1: one split in late ninety nine, and there was a 289 00:16:59,560 --> 00:17:03,240 Speaker 1: twenty to one stock split in twenty twenty two. When 290 00:17:03,240 --> 00:17:06,080 Speaker 1: you factor all that in, that two hundred and fifty 291 00:17:06,080 --> 00:17:10,840 Speaker 1: million dollars in stock ends up being a ton of wealth. 292 00:17:11,240 --> 00:17:13,760 Speaker 1: Like, it's just a huge amount. 
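For anyone who wants to try the arithmetic, here is a back-of-the-envelope sketch in Python using only the figures quoted above (250 million dollars in stock, about $2.89 per share then, about $180 now, and a 2-for-1 plus a 20-for-1 split). The function name `stake_value_today` is made up for this example, and there is a big assumption baked in: the episode does not say whether the $2.89 figure is already split-adjusted, so the sketch runs both cases. It also assumes every share was held the whole time and ignores taxes, fees, and sales.

```python
def stake_value_today(deal_value, price_then, price_now, split_factor=1.0):
    """Value today of stock received in a deal, if every share had been held.

    split_factor is the cumulative share multiplier from later splits.
    Leave it at 1.0 when price_then is already split-adjusted.
    """
    shares_then = deal_value / price_then
    return shares_then * split_factor * price_now

DEAL = 250_000_000   # dollars of stock, as quoted in the episode
PRICE_1999 = 2.89    # per-share price quoted in the episode
PRICE_NOW = 180.0    # rough current price quoted in the episode
SPLITS = 2 * 20      # 2-for-1 and 20-for-1, as quoted in the episode

# Scenario A (assumption): 2.89 is already a split-adjusted price.
adjusted = stake_value_today(DEAL, PRICE_1999, PRICE_NOW)
# Scenario B (assumption): 2.89 was the raw 1999 price, so splits still apply.
raw = stake_value_today(DEAL, PRICE_1999, PRICE_NOW, SPLITS)

print(f"Scenario A: ${adjusted / 1e9:.1f} billion")
print(f"Scenario B: ${raw / 1e9:.1f} billion")
```

The two scenarios differ by a factor of forty, which is exactly why the answer depends on how the historical price was quoted. Treat either number as a rough illustration, not a valuation.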
It would take a 293 00:17:13,800 --> 00:17:17,040 Speaker 1: lot of calculating to get an estimate, and even then 294 00:17:17,359 --> 00:17:21,520 Speaker 1: it wouldn't really be accurate. Let's just say that deal is 295 00:17:21,560 --> 00:17:25,399 Speaker 1: worth a lot. So anyway, the important thing with the 296 00:17:25,400 --> 00:17:29,119 Speaker 1: Internet Archive is that Kahle and Gilliat, through their work 297 00:17:29,160 --> 00:17:32,359 Speaker 1: in creating tools for Alexa Internet, found themselves able to 298 00:17:32,400 --> 00:17:36,920 Speaker 1: create snapshots of the Web. So they were using Alexa 299 00:17:37,000 --> 00:17:40,560 Speaker 1: Internet to have a commercial business, and they established the 300 00:17:40,560 --> 00:17:45,480 Speaker 1: Internet Archive as a way of preserving information that had, 301 00:17:45,560 --> 00:17:48,680 Speaker 1: at some point or another, found its home on the Internet. 302 00:17:48,960 --> 00:17:52,480 Speaker 1: So they were using Alexa Internet tech to crawl the 303 00:17:52,560 --> 00:17:55,080 Speaker 1: young Web in order to index everything, which is a 304 00:17:55,200 --> 00:17:58,040 Speaker 1: necessary step if you want to give people access to 305 00:17:58,119 --> 00:18:00,399 Speaker 1: the various documents posted on the web. You first have 306 00:18:00,440 --> 00:18:02,639 Speaker 1: to know what is there and where it is. To 307 00:18:02,720 --> 00:18:07,320 Speaker 1: do that, you've got to index everything. And then they said, well, 308 00:18:07,600 --> 00:18:09,760 Speaker 1: now that we are able to index this, we could 309 00:18:09,800 --> 00:18:14,000 Speaker 1: actually download these little snapshots and keep them. 
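Here is a toy Python sketch of that crawl-then-index-then-keep-a-copy idea. It is an illustration only, not Alexa Internet's or the Internet Archive's crawler: the `WEB` dictionary is a hypothetical four-page stand-in for the real Web, so nothing touches the network, and the `crawl` function is invented for this example. It follows links breadth-first, builds a word index of what is where, and saves a copy of each page as it goes.

```python
from collections import deque

# Hypothetical tiny "web": each page has some text and outgoing links.
WEB = {
    "a.example": {"text": "welcome to the home page", "links": ["b.example", "c.example"]},
    "b.example": {"text": "news about the web", "links": ["a.example"]},
    "c.example": {"text": "archive of old news", "links": ["d.example", "gone.example"]},
    "d.example": {"text": "contact page", "links": []},
}

def crawl(start):
    """Breadth-first walk over links, returning a word index and saved copies."""
    index = {}       # word -> set of URLs whose text contains that word
    snapshots = {}   # URL -> copy of the page text as seen during the crawl
    queue, seen = deque([start]), {start}
    while queue:
        url = queue.popleft()
        page = WEB.get(url)
        if page is None:          # dead link: nothing to index or save
            continue
        snapshots[url] = page["text"]
        for word in page["text"].split():
            index.setdefault(word, set()).add(url)
        for link in page["links"]:
            if link not in seen:  # avoid revisiting pages and looping forever
                seen.add(link)
                queue.append(link)
    return index, snapshots

index, snapshots = crawl("a.example")
print(sorted(index["news"]))   # which pages mention "news"
print(len(snapshots))          # how many pages were saved
```

The index answers "what is there and where is it", and the `snapshots` dictionary is the part that turns an indexing crawl into an archive.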
And according 310 00:18:14,000 --> 00:18:18,560 Speaker 1: to the Internet Archive, that would be important because the 311 00:18:18,640 --> 00:18:23,119 Speaker 1: average lifespan for a new web page was not very long. 312 00:18:23,400 --> 00:18:27,320 Speaker 1: So contrary to our belief that once something is posted 313 00:18:27,359 --> 00:18:30,480 Speaker 1: to the Internet, it's there forever, the archive found that 314 00:18:30,520 --> 00:18:34,560 Speaker 1: on average, new web pages stuck around for about seventy 315 00:18:34,680 --> 00:18:38,679 Speaker 1: seven days, which means it's less than three months, and 316 00:18:38,720 --> 00:18:42,639 Speaker 1: then poof, they would disappear. Like, maybe they would change drastically, 317 00:18:42,680 --> 00:18:46,679 Speaker 1: maybe they would just go away. Now, imagine that you 318 00:18:46,720 --> 00:18:49,800 Speaker 1: were to walk into a brick and mortar library, but 319 00:18:49,880 --> 00:18:52,000 Speaker 1: then you found out that on average the books in 320 00:18:52,040 --> 00:18:54,639 Speaker 1: that library would only stick around for three months before 321 00:18:54,680 --> 00:18:57,720 Speaker 1: being lost forever. And think of all the knowledge that 322 00:18:57,760 --> 00:19:01,200 Speaker 1: would disappear on a regular, ongoing basis. It 323 00:19:01,200 --> 00:19:03,840 Speaker 1: would be impossible to calculate the impact of that kind 324 00:19:03,840 --> 00:19:06,200 Speaker 1: of reality. It would be like losing the Library of 325 00:19:06,240 --> 00:19:10,679 Speaker 1: Alexandria regularly, every three months. So Kahle had come to 326 00:19:10,720 --> 00:19:14,160 Speaker 1: the conclusion that knowledge should be preserved and made available 327 00:19:14,200 --> 00:19:17,399 Speaker 1: for posterity. This is similar to an idea that was 328 00:19:17,440 --> 00:19:20,880 Speaker 1: proposed by Stewart Brand back in the nineteen eighties. 
It's 329 00:19:20,920 --> 00:19:24,560 Speaker 1: a complicated idea that typically gets boiled down to the 330 00:19:24,600 --> 00:19:29,679 Speaker 1: saying information wants to be free. That's actually an oversimplification 331 00:19:29,720 --> 00:19:33,800 Speaker 1: of what Brand was really communicating. But his point was 332 00:19:33,800 --> 00:19:37,040 Speaker 1: that information's value is kind of like a paradox. The 333 00:19:37,119 --> 00:19:41,440 Speaker 1: information could be incredibly valuable, right, it could be absolutely critical, 334 00:19:41,480 --> 00:19:45,439 Speaker 1: and therefore it could be expensive, but the cost of 335 00:19:45,480 --> 00:19:50,040 Speaker 1: distributing information was consistently declining. It was getting easier and 336 00:19:50,200 --> 00:19:54,120 Speaker 1: cheaper to share information, and the benefits of making information 337 00:19:54,240 --> 00:19:59,560 Speaker 1: accessible are typically pretty tremendous. But information is only accessible 338 00:20:00,119 --> 00:20:03,560 Speaker 1: if someone is able to hold onto that info. Otherwise 339 00:20:03,560 --> 00:20:06,520 Speaker 1: it's lost. Right, The Internet was such a volatile thing 340 00:20:06,560 --> 00:20:09,119 Speaker 1: that there was no guarantee that what you saw today 341 00:20:09,520 --> 00:20:13,000 Speaker 1: would be available tomorrow. In the days before the dynamic web, 342 00:20:13,680 --> 00:20:16,639 Speaker 1: it wasn't really unusual for someone to establish a web page, 343 00:20:16,880 --> 00:20:20,159 Speaker 1: to publish that page, and then later on to wipe 344 00:20:20,160 --> 00:20:24,480 Speaker 1: the slate clean or you know, otherwise alter vast portions 345 00:20:24,480 --> 00:20:27,040 Speaker 1: of that page in order to use that same web 346 00:20:27,400 --> 00:20:31,400 Speaker 1: landscape to host a totally different document. So the old 347 00:20:31,440 --> 00:20:34,720 Speaker 1: stuff would just disappear. 
And so Kahle and Gilliat created 348 00:20:35,000 --> 00:20:40,119 Speaker 1: the Internet Archive, a nonprofit organization dedicated to the archival 349 00:20:40,440 --> 00:20:44,399 Speaker 1: of information across the Internet. And I think most people 350 00:20:44,800 --> 00:20:49,040 Speaker 1: are familiar with it from the Wayback Machine, but 351 00:20:49,080 --> 00:20:52,240 Speaker 1: that's just one part of what the Internet Archive does. 352 00:20:52,600 --> 00:20:55,199 Speaker 1: As stated in the Library of Congress, the mission of 353 00:20:55,240 --> 00:20:59,480 Speaker 1: the Internet Archive was quote offering permanent access for researchers, 354 00:20:59,520 --> 00:21:03,040 Speaker 1: historians, and scholars to historical collections that exist in 355 00:21:03,119 --> 00:21:07,040 Speaker 1: digital format end quote. Kahle and Gilliat founded the Internet 356 00:21:07,119 --> 00:21:09,600 Speaker 1: Archive the same year they founded Alexa Internet. So that's 357 00:21:09,720 --> 00:21:14,440 Speaker 1: nineteen ninety six. And it wasn't easy. And why is that? Well, 358 00:21:14,880 --> 00:21:17,280 Speaker 1: you've got to think about the challenge you face if 359 00:21:17,320 --> 00:21:20,919 Speaker 1: you want to archive everything on the Internet, or at 360 00:21:21,000 --> 00:21:24,480 Speaker 1: least everything that you're allowed to archive on the Internet. 361 00:21:24,600 --> 00:21:26,600 Speaker 1: We'll come back to that a couple of times. So, 362 00:21:26,640 --> 00:21:28,240 Speaker 1: for one thing, you need to create a way to 363 00:21:28,320 --> 00:21:31,920 Speaker 1: capture the content of a web page and to preserve 364 00:21:31,960 --> 00:21:35,119 Speaker 1: that for posterity. And you need a way for people 365 00:21:35,280 --> 00:21:39,560 Speaker 1: to access those archived web pages and to navigate them.
366 00:21:39,800 --> 00:21:43,639 Speaker 1: So Alexa Internet would end up developing these technologies and 367 00:21:43,680 --> 00:21:47,320 Speaker 1: commercializing them in various ways, and the Internet Archive was 368 00:21:47,359 --> 00:21:51,119 Speaker 1: made possible through these tools. So you could think of 369 00:21:51,160 --> 00:21:56,000 Speaker 1: Alexa Internet as being the funding machine for the Internet Archive 370 00:21:56,119 --> 00:21:58,600 Speaker 1: in the beginning, at least as far as the tools 371 00:21:58,680 --> 00:22:02,080 Speaker 1: the Internet Archive would use in order to achieve its mission. Now, 372 00:22:02,119 --> 00:22:05,720 Speaker 1: on the capturing front, Alexa Internet created a web crawler. 373 00:22:06,000 --> 00:22:10,760 Speaker 1: So for applications like web search engines, primarily web search engines, 374 00:22:11,040 --> 00:22:14,919 Speaker 1: web crawlers are the soldiers that they send out. A 375 00:22:14,960 --> 00:22:19,080 Speaker 1: web crawler's job is to index content across the Internet 376 00:22:19,160 --> 00:22:22,119 Speaker 1: and to capture information about what the various web pages 377 00:22:22,160 --> 00:22:26,199 Speaker 1: on the Internet are actually about. It's complicated, right? You 378 00:22:26,240 --> 00:22:29,520 Speaker 1: could just have a directory of web pages that's based 379 00:22:29,520 --> 00:22:32,119 Speaker 1: off the title of the web pages, but title and 380 00:22:32,240 --> 00:22:36,280 Speaker 1: content are not always in alignment. So web crawlers are 381 00:22:36,320 --> 00:22:40,399 Speaker 1: all about following the various branching pathways across the web. 382 00:22:40,480 --> 00:22:43,520 Speaker 1: They crawl through the web, in other words, indexing every 383 00:22:43,640 --> 00:22:47,080 Speaker 1: page as they do so. Not everyone, however, wants their 384 00:22:47,080 --> 00:22:50,760 Speaker 1: web page indexed.
So you can actually include some HTML 385 00:22:50,880 --> 00:22:54,840 Speaker 1: language in your web page that indicates that it's off 386 00:22:54,880 --> 00:22:58,760 Speaker 1: limits for indexing, and polite web crawlers, such as the 387 00:22:58,760 --> 00:23:03,000 Speaker 1: ones that Alexa Internet was using, will honor those instructions 388 00:23:03,040 --> 00:23:06,480 Speaker 1: and will not index that page. But other pages 389 00:23:06,760 --> 00:23:11,639 Speaker 1: that lack this specific instruction of hey, don't index this, 390 00:23:12,359 --> 00:23:15,920 Speaker 1: they're fair game. I like to think of web crawlers 391 00:23:16,000 --> 00:23:18,440 Speaker 1: kind of like Doctor Strange from the Marvel Universe, the 392 00:23:18,560 --> 00:23:21,399 Speaker 1: Cinematic Universe in particular. At one point, he uses his 393 00:23:21,520 --> 00:23:25,760 Speaker 1: time manipulation abilities to see where all the different possible 394 00:23:26,000 --> 00:23:29,800 Speaker 1: pathways can lead. The web crawlers do that across 395 00:23:29,880 --> 00:23:32,440 Speaker 1: the web. They explore all the nooks and crannies. They 396 00:23:32,480 --> 00:23:35,560 Speaker 1: follow each link, even the ones that no one 397 00:23:35,640 --> 00:23:38,520 Speaker 1: ever clicks on, they follow those too. And you know, 398 00:23:38,640 --> 00:23:41,359 Speaker 1: hats off to web crawlers for doing that to build 399 00:23:41,359 --> 00:23:44,240 Speaker 1: out these indices, because without it, web search wouldn't work, 400 00:23:44,560 --> 00:23:49,919 Speaker 1: and Alexa Internet wouldn't have been a thing. Anyway, Alexa 401 00:23:49,960 --> 00:23:53,520 Speaker 1: Internet and, by extension, the Internet Archive used several different 402 00:23:53,520 --> 00:23:56,240 Speaker 1: web crawlers over the years, but they all basically do 403 00:23:56,359 --> 00:23:59,119 Speaker 1: the same thing, or they, you know, more accurately.
404 00:23:59,160 --> 00:24:02,800 Speaker 1: They all aimed to achieve the same results. So the 405 00:24:02,840 --> 00:24:06,280 Speaker 1: crawler starts with seed URLs. This is like the starting 406 00:24:06,320 --> 00:24:08,879 Speaker 1: point where you let them go, and then they follow 407 00:24:08,880 --> 00:24:11,920 Speaker 1: each link and they download documents to the archive's servers. 408 00:24:12,119 --> 00:24:15,640 Speaker 1: The crawlers also cross-reference the links to ensure that they're 409 00:24:15,640 --> 00:24:20,119 Speaker 1: not double dipping on a specific crawl. So if you 410 00:24:20,160 --> 00:24:22,600 Speaker 1: have a ton of different sites that are all linking 411 00:24:22,680 --> 00:24:25,240 Speaker 1: to the same document, like let's say that someone has 412 00:24:25,440 --> 00:24:30,160 Speaker 1: published something, and hundreds of other resources on the internet 413 00:24:30,840 --> 00:24:34,960 Speaker 1: reference that published document, well, that means there's all these 414 00:24:34,960 --> 00:24:38,360 Speaker 1: different pathways that lead to the same destination, right, and 415 00:24:38,720 --> 00:24:42,680 Speaker 1: it would be somewhat wasteful to capture this exact same 416 00:24:42,760 --> 00:24:48,160 Speaker 1: document multiple times during the same crawl, so there's cross 417 00:24:48,280 --> 00:24:51,400 Speaker 1: referencing that happens in order to prevent that from occurring. 418 00:24:52,000 --> 00:24:55,159 Speaker 1: This process does work, but it also has limitations. So 419 00:24:55,240 --> 00:24:58,600 Speaker 1: for one thing, these crawls, they do create snapshots of 420 00:24:58,600 --> 00:25:01,640 Speaker 1: the web in intervals. So if you use the Wayback Machine, 421 00:25:02,000 --> 00:25:04,359 Speaker 1: we'll talk more about that in a second.
You'll see 422 00:25:04,400 --> 00:25:06,879 Speaker 1: that the history of a web page consists of a 423 00:25:07,040 --> 00:25:10,919 Speaker 1: series of dates, from when the Internet Archive first received 424 00:25:10,960 --> 00:25:13,720 Speaker 1: a snapshot of that page, and it leads all the 425 00:25:13,760 --> 00:25:17,000 Speaker 1: way up to the most recent reference of that page, 426 00:25:17,040 --> 00:25:20,560 Speaker 1: the most recent snapshot. The various dates in the Wayback 427 00:25:20,640 --> 00:25:24,359 Speaker 1: Machine are not necessarily relevant to any major changes that 428 00:25:24,480 --> 00:25:27,159 Speaker 1: happened on the web page itself. This is just when 429 00:25:27,640 --> 00:25:31,280 Speaker 1: the web crawlers went to that particular web page. So 430 00:25:31,880 --> 00:25:35,480 Speaker 1: it may be immediately after a massive change has been implemented, 431 00:25:35,520 --> 00:25:38,119 Speaker 1: it may be well after. In fact, there might be 432 00:25:38,240 --> 00:25:42,600 Speaker 1: a point where between web crawler visits a web page has 433 00:25:42,720 --> 00:25:45,520 Speaker 1: changed a couple of times. Well, that means that the 434 00:25:45,520 --> 00:25:48,320 Speaker 1: versions that existed in between those visits aren't going 435 00:25:48,359 --> 00:25:51,200 Speaker 1: to be captured. It's just whatever was there the first 436 00:25:51,200 --> 00:25:53,760 Speaker 1: time the web crawler came through, and whatever was there 437 00:25:53,800 --> 00:25:57,200 Speaker 1: the next time the web crawler came through.
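The crawl just described (start from seed URLs, follow every link, cross-reference so nothing is fetched twice in one crawl, and skip pages that ask not to be indexed) can be sketched roughly as follows. This is a minimal illustration, not the crawler Alexa Internet or the Internet Archive actually ran. It assumes the "don't index this" instruction is the robots `noindex` meta tag, and the four-page `TOY_WEB`, the `toy_fetch` helper, and the `example` URLs are all made up for the demo.

```python
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects outgoing links and notes a robots 'noindex' meta tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True


def crawl(seed_urls, fetch):
    """Breadth-first crawl. Returns {url: html} for every page captured."""
    queue = deque(dict.fromkeys(seed_urls))  # seeds, de-duplicated, in order
    seen = set(queue)  # cross-reference: each URL is queued at most once
    captured = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        parser = PageParser()
        parser.feed(html)
        if not parser.noindex:  # a polite crawler honors the opt-out
            captured[url] = html
        for link in parser.links:  # links are still followed in this sketch
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return captured


# Hypothetical in-memory "web" so the sketch runs without network access.
TOY_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a>'
                         '<a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://c.example/">c</a>'
                         '<a href="http://d.example/">d</a>',
    "http://c.example/": '<a href="http://a.example/">a</a>',
    "http://d.example/": '<meta name="robots" content="noindex"><p>private</p>',
}

fetch_log = []


def toy_fetch(url):
    """Stand-in for an HTTP request; records every fetch for inspection."""
    fetch_log.append(url)
    return TOY_WEB.get(url)


snapshots = crawl(["http://a.example/"], toy_fetch)
print(sorted(snapshots))  # pages that were captured
print(fetch_log)          # every URL appears once, however many pages link to it
```

Even though two pages link to `http://c.example/`, it is fetched a single time, and the page carrying the `noindex` tag is visited but never stored.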
So an interesting 438 00:25:57,240 --> 00:25:59,359 Speaker 1: thing is that if a particular page does have a 439 00:25:59,480 --> 00:26:02,960 Speaker 1: ton of other links pointing to it, that page is 440 00:26:03,000 --> 00:26:06,880 Speaker 1: more likely to have very frequent snapshots throughout its history, 441 00:26:07,280 --> 00:26:12,280 Speaker 1: because again, through subsequent crawls, there are various routes that 442 00:26:12,359 --> 00:26:15,320 Speaker 1: take web crawlers through that web page, so they're more 443 00:26:15,480 --> 00:26:18,919 Speaker 1: likely to capture a snapshot of it. For pages that 444 00:26:18,960 --> 00:26:21,639 Speaker 1: have fewer links pointing to them, maybe there aren't that 445 00:26:21,720 --> 00:26:25,520 Speaker 1: many other web pages out there that cite this particular page, 446 00:26:25,720 --> 00:26:28,919 Speaker 1: they're more likely to have sporadic updates throughout their history. 447 00:26:28,960 --> 00:26:31,679 Speaker 1: You might pull up a page in the Wayback Machine 448 00:26:31,680 --> 00:26:36,000 Speaker 1: and see that there's only maybe half a dozen captures 449 00:26:36,160 --> 00:26:39,840 Speaker 1: of that particular page, and that means that there could 450 00:26:39,840 --> 00:26:42,800 Speaker 1: be a lot of changes that were missed in between visits. 451 00:26:43,160 --> 00:26:47,040 Speaker 1: So not everything gets captured in the Internet Archive. I 452 00:26:47,080 --> 00:26:51,080 Speaker 1: think that some people work under the mistaken presumption that 453 00:26:51,720 --> 00:26:55,200 Speaker 1: anything that was ever published to the web is captured 454 00:26:55,280 --> 00:26:58,439 Speaker 1: and archived there. That's not the case. It's whatever was 455 00:26:58,480 --> 00:27:00,920 Speaker 1: there when the web crawlers came through.
So 456 00:27:00,960 --> 00:27:03,359 Speaker 1: even the Internet Archive is not a perfect record of 457 00:27:03,440 --> 00:27:07,000 Speaker 1: everything that's ever happened on the web. Other elements, like 458 00:27:07,040 --> 00:27:09,639 Speaker 1: I said, could also be lost to time due to 459 00:27:09,680 --> 00:27:13,200 Speaker 1: the complexity of web navigation, for example. So when web 460 00:27:13,240 --> 00:27:18,280 Speaker 1: designers started to incorporate things like Flash, which really is 461 00:27:18,320 --> 00:27:20,600 Speaker 1: no longer a thing but it was for a while, 462 00:27:20,880 --> 00:27:24,240 Speaker 1: or JavaScript, then the web crawlers that were being used 463 00:27:24,359 --> 00:27:26,880 Speaker 1: to index the web, a lot of them just couldn't 464 00:27:27,359 --> 00:27:30,879 Speaker 1: navigate these types of tools that were made through Flash 465 00:27:30,920 --> 00:27:34,840 Speaker 1: or JavaScript. So while human users could, and they could, 466 00:27:35,160 --> 00:27:39,680 Speaker 1: you know, interact with interfaces that had these tools created 467 00:27:39,720 --> 00:27:43,320 Speaker 1: through these various methods, web crawlers couldn't. And that meant 468 00:27:43,320 --> 00:27:46,680 Speaker 1: that if a website used, like, tools that were made 469 00:27:46,720 --> 00:27:50,800 Speaker 1: in JavaScript to act as the interface, the web crawler 470 00:27:50,880 --> 00:27:54,000 Speaker 1: might only be able to index the homepage, but not 471 00:27:54,080 --> 00:27:57,320 Speaker 1: any of the other links branching off from the homepage 472 00:27:57,359 --> 00:28:01,280 Speaker 1: because it couldn't navigate that same interface.
So there's a 473 00:28:01,280 --> 00:28:04,199 Speaker 1: lot of stuff from that era that's lost to the 474 00:28:04,240 --> 00:28:07,320 Speaker 1: Internet Archive as well, simply because the crawlers just could 475 00:28:07,359 --> 00:28:11,560 Speaker 1: not navigate those pages. They were never captured. And like 476 00:28:11,600 --> 00:28:15,080 Speaker 1: I said, if you happen to have the instruction, the 477 00:28:15,200 --> 00:28:18,840 Speaker 1: HTML instruction not to index the site, well then that's 478 00:28:18,880 --> 00:28:21,119 Speaker 1: not going to be there either. Now let's move on 479 00:28:21,240 --> 00:28:25,160 Speaker 1: to another challenge, which is the storing of these files. 480 00:28:25,520 --> 00:28:29,960 Speaker 1: Indexing everything was one thing. How do you store everything 481 00:28:30,000 --> 00:28:32,960 Speaker 1: that can be indexed on the web in an archive? 482 00:28:33,880 --> 00:28:36,800 Speaker 1: That's what we're going to come back and explore after 483 00:28:36,840 --> 00:28:49,840 Speaker 1: we take another quick break to thank our sponsors. Okay, 484 00:28:50,360 --> 00:28:54,160 Speaker 1: so the Internet Archive, how do you store all the 485 00:28:54,200 --> 00:28:57,040 Speaker 1: information that you find across the web? Well, the big 486 00:28:57,080 --> 00:29:00,600 Speaker 1: one for web pages was that you had to figure 487 00:29:00,600 --> 00:29:03,840 Speaker 1: out where do you store and how do you organize 488 00:29:03,840 --> 00:29:06,640 Speaker 1: snapshots of the web so that, one, you have a 489 00:29:06,680 --> 00:29:09,320 Speaker 1: record of them, and two, you can find what you're 490 00:29:09,360 --> 00:29:12,720 Speaker 1: looking for. You can navigate to the specific instance that 491 00:29:12,760 --> 00:29:16,000 Speaker 1: you're looking for. Keep in mind again, the archive's not 492 00:29:16,040 --> 00:29:18,800 Speaker 1: capturing everything.
As I said before the break, there's a 493 00:29:18,840 --> 00:29:21,440 Speaker 1: lot of stuff that web crawlers could not access for 494 00:29:21,480 --> 00:29:25,000 Speaker 1: one reason or another. Those things would be either off 495 00:29:25,040 --> 00:29:28,080 Speaker 1: limits or inaccessible and thus would not be in the archive. 496 00:29:28,400 --> 00:29:31,880 Speaker 1: But everything else was still fair game. So to store 497 00:29:31,920 --> 00:29:35,880 Speaker 1: and organize everything, Alexa Internet created a new file format 498 00:29:36,000 --> 00:29:41,680 Speaker 1: called an ARC file, A-R-C. ARC files contain information about 499 00:29:41,720 --> 00:29:45,840 Speaker 1: all the stuff that's inside them, the metadata of the Internet. 500 00:29:46,000 --> 00:29:50,240 Speaker 1: So again, metadata is data about data. It makes the 501 00:29:50,280 --> 00:29:53,880 Speaker 1: small files inside the larger ARC files all self identifying, 502 00:29:54,000 --> 00:29:56,480 Speaker 1: so there's no need to actually build out an index. 503 00:29:56,760 --> 00:30:00,480 Speaker 1: The self identifying information includes stuff like the URL for 504 00:30:00,600 --> 00:30:03,640 Speaker 1: the file, like what the URL for that particular document is, 505 00:30:03,880 --> 00:30:06,680 Speaker 1: how big the document is, when it was retrieved, and 506 00:30:06,880 --> 00:30:10,160 Speaker 1: other stuff like that. Each ARC file would have a 507 00:30:10,200 --> 00:30:13,120 Speaker 1: capacity of around one hundred megabytes, and it was possible 508 00:30:13,160 --> 00:30:15,840 Speaker 1: for a single website to span multiple ARC files. I mean, 509 00:30:15,880 --> 00:30:18,120 Speaker 1: there's some big websites out there that have been around 510 00:30:18,120 --> 00:30:22,400 Speaker 1: for a long time, so yeah, sometimes a single ARC 511 00:30:22,480 --> 00:30:26,040 Speaker 1: file would just be a portion of that website.
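To make the "self identifying" idea a bit more concrete, here is a rough sketch of a container in which every stored document is preceded by a one-line header carrying its URL, retrieval time, and size, so a reader can walk the file from start to finish without any separate index. The layout below is a simplified assumption for illustration only and is not the actual ARC specification. The function names, the sample URLs, and the timestamp strings are hypothetical.

```python
import io


def write_record(stream, url, timestamp, payload):
    """Append one record: a header line, the raw bytes, then a newline."""
    header = f"{url} {timestamp} {len(payload)}\n".encode("utf-8")
    stream.write(header)
    stream.write(payload)
    stream.write(b"\n")  # separator between records


def read_records(stream):
    """Yield (url, timestamp, payload) by reading headers sequentially."""
    while True:
        header = stream.readline()
        if not header:  # end of container
            break
        # The last two fields are timestamp and length; the rest is the URL.
        url, timestamp, length = header.decode("utf-8").rstrip("\n").rsplit(" ", 2)
        payload = stream.read(int(length))  # the header says how much to read
        stream.read(1)  # consume the separator newline
        yield url, timestamp, payload


# Build a tiny in-memory container holding two made-up documents.
container = io.BytesIO()
write_record(container, "http://a.example/", "20070219000000", b"<html>home</html>")
write_record(container, "http://a.example/about", "20070222000000", b"<html>about\nus</html>")

container.seek(0)
records = list(read_records(container))
for url, timestamp, payload in records:
    print(url, timestamp, len(payload))
```

Because each header states the payload length, a document can contain newlines or any other bytes without confusing the reader, which is the property that lets the records describe themselves.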
At first, 512 00:30:26,320 --> 00:30:30,160 Speaker 1: the Internet Archive stored all this information on magnetic tape. 513 00:30:30,440 --> 00:30:34,360 Speaker 1: So you would do this indexing of the web, all 514 00:30:34,400 --> 00:30:37,320 Speaker 1: these snapshots, and you would save it to magnetic tape. 515 00:30:37,400 --> 00:30:40,200 Speaker 1: I remember I used to work for a company, a 516 00:30:40,280 --> 00:30:44,120 Speaker 1: consulting firm that had magnetic tape backups. So it was 517 00:30:44,200 --> 00:30:48,040 Speaker 1: my job, one of my jobs, to occasionally back up 518 00:30:48,120 --> 00:30:51,520 Speaker 1: all the data on our network to tape, and I 519 00:30:51,560 --> 00:30:54,720 Speaker 1: would have to swap tapes out and label them and 520 00:30:54,760 --> 00:30:58,240 Speaker 1: everything and archive them properly. The Internet Archive worked under 521 00:30:58,280 --> 00:31:01,560 Speaker 1: the same idea. It would capture a snapshot of all 522 00:31:01,680 --> 00:31:05,720 Speaker 1: the files across the web, save them to tape, and 523 00:31:06,280 --> 00:31:09,160 Speaker 1: that was how the Internet Archive kept track of things 524 00:31:09,200 --> 00:31:14,440 Speaker 1: for about three years. But eventually activity on the Internet 525 00:31:14,800 --> 00:31:16,760 Speaker 1: was such that that was not going to do it. 526 00:31:16,840 --> 00:31:19,640 Speaker 1: There were too many users who wanted to be able 527 00:31:19,880 --> 00:31:23,720 Speaker 1: to access things that were stored or saved within the 528 00:31:23,760 --> 00:31:27,680 Speaker 1: Internet Archive, and this method just couldn't keep up with 529 00:31:27,800 --> 00:31:30,560 Speaker 1: demand. And necessity, as we all know, is the mother 530 00:31:30,640 --> 00:31:33,800 Speaker 1: of invention. So the Internet Archive needed an alternative way 531 00:31:33,840 --> 00:31:37,080 Speaker 1: to store these snapshots.
And of course, the Web was 532 00:31:37,640 --> 00:31:41,080 Speaker 1: really growing dramatically, which is putting it lightly, and there 533 00:31:41,120 --> 00:31:43,320 Speaker 1: was a real need to step things up considerably. So 534 00:31:43,360 --> 00:31:46,600 Speaker 1: to that end, the staff at the Internet Archive developed a 535 00:31:46,640 --> 00:31:52,080 Speaker 1: storage system they called the PetaBox, and it was 536 00:31:52,120 --> 00:31:55,600 Speaker 1: called the PetaBox because it could house a petabyte of information. 537 00:31:55,960 --> 00:32:00,120 Speaker 1: A petabyte, in case you're curious, is a million gigabytes. Now, 538 00:32:00,160 --> 00:32:02,719 Speaker 1: the most recent data I have about the PetaBox storage 539 00:32:02,720 --> 00:32:05,920 Speaker 1: system actually comes from December twenty twenty one, so it's 540 00:32:05,960 --> 00:32:08,400 Speaker 1: a few years out of date. But at that time, 541 00:32:08,600 --> 00:32:11,760 Speaker 1: the Internet Archive was using two hundred and twelve petabytes 542 00:32:11,760 --> 00:32:15,160 Speaker 1: of storage, which is a lot. That wasn't all the 543 00:32:15,200 --> 00:32:20,000 Speaker 1: Wayback Machine, however. Only around fifty seven petabytes of that 544 00:32:20,440 --> 00:32:23,600 Speaker 1: was for the Wayback Machine. The rest was for other 545 00:32:23,680 --> 00:32:27,640 Speaker 1: things like archiving various forms of digital media as well 546 00:32:27,680 --> 00:32:32,920 Speaker 1: as what the Internet Archive references as quote unquote unique data. Anyway, 547 00:32:33,320 --> 00:32:36,640 Speaker 1: the page on the Internet Archive's site says that the data 548 00:32:36,680 --> 00:32:40,240 Speaker 1: centers, there are four of them, that house the PetaBox 549 00:32:40,280 --> 00:32:44,280 Speaker 1: storage system don't use air conditioning, which helps keep electric 550 00:32:44,320 --> 00:32:48,440 Speaker 1: bills down.
They actually let the heat from the data 551 00:32:48,480 --> 00:32:52,440 Speaker 1: storage devices provide heating for the buildings that they're stored 552 00:32:52,480 --> 00:32:56,200 Speaker 1: in. And, you know, this is all part of 553 00:32:56,240 --> 00:33:00,960 Speaker 1: a strategy to keep things at low cost but high 554 00:33:01,040 --> 00:33:05,440 Speaker 1: usability and high efficiency. So those are really the big requirements 555 00:33:05,480 --> 00:33:08,480 Speaker 1: for the PetaBox system. It has to be efficient. It 556 00:33:08,520 --> 00:33:12,520 Speaker 1: cannot require too much power to operate any single PetaBox. 557 00:33:12,760 --> 00:33:17,040 Speaker 1: Another requirement is that each rack of hard drive storage 558 00:33:17,080 --> 00:33:19,320 Speaker 1: has to hold a ton of hard drives. We're talking 559 00:33:19,440 --> 00:33:23,160 Speaker 1: like one hundred plus terabytes worth of hard drive space. 560 00:33:23,600 --> 00:33:26,920 Speaker 1: Another requirement is that serving as an administrator 561 00:33:27,000 --> 00:33:30,640 Speaker 1: needs to be easy. Like, it can't be complicated to 562 00:33:30,880 --> 00:33:37,440 Speaker 1: administrate this storage system. And according to the Internet Archive, the 563 00:33:37,480 --> 00:33:40,840 Speaker 1: structure of this is such that you need about one 564 00:33:41,000 --> 00:33:44,640 Speaker 1: administrator for every petabyte worth of data, so you know, 565 00:33:44,720 --> 00:33:47,840 Speaker 1: that's like two hundred administrators. Essentially, the whole goal was 566 00:33:47,880 --> 00:33:52,480 Speaker 1: to create systems that were relatively inexpensive, relatively efficient, and 567 00:33:52,640 --> 00:33:56,160 Speaker 1: relatively easy to use, at least from an administrative perspective. 568 00:33:56,480 --> 00:33:59,640 Speaker 1: That's a really tall order.
It's hard to meet all those 569 00:34:00,560 --> 00:34:03,360 Speaker 1: but the folks at the Internet Archive made it happen, and 570 00:34:03,480 --> 00:34:07,480 Speaker 1: it was such a useful approach to storage and to 571 00:34:07,680 --> 00:34:10,719 Speaker 1: being able to organize the files within storage so that 572 00:34:10,800 --> 00:34:14,160 Speaker 1: you didn't have to build out indices that ultimately the Internet 573 00:34:14,280 --> 00:34:21,160 Speaker 1: Archive would deploy this same strategy for other organizations and institutions. Okay, 574 00:34:21,239 --> 00:34:26,040 Speaker 1: but that's all about, you know, collecting and storing all 575 00:34:26,080 --> 00:34:30,640 Speaker 1: the information across the Internet. How do you access it? 576 00:34:30,920 --> 00:34:33,440 Speaker 1: How, as a user, how, as a researcher, are you 577 00:34:33,520 --> 00:34:39,320 Speaker 1: able to tap into this? Because again, unless accessibility is easy, 578 00:34:39,960 --> 00:34:42,440 Speaker 1: then there's not much point to doing this. You're just 579 00:34:42,480 --> 00:34:46,279 Speaker 1: making a record that nobody can reference. Well, I would 580 00:34:46,360 --> 00:34:51,680 Speaker 1: argue the most famous of the ways to access information 581 00:34:51,800 --> 00:34:54,879 Speaker 1: contained within the Internet Archive is the Wayback Machine, which 582 00:34:54,920 --> 00:34:58,960 Speaker 1: is specifically for web pages. The Internet Archive first introduced 583 00:34:59,000 --> 00:35:02,279 Speaker 1: the Wayback Machine in two thousand and one, and the 584 00:35:02,320 --> 00:35:05,160 Speaker 1: way it works is pretty simple. There's a little, it's 585 00:35:05,239 --> 00:35:07,520 Speaker 1: kind of like a search bar, but it's a URL bar.
586 00:35:07,680 --> 00:35:10,520 Speaker 1: You put in a URL for the web page that 587 00:35:10,560 --> 00:35:13,799 Speaker 1: you're interested in, and the Wayback Machine pulls up the 588 00:35:13,840 --> 00:35:17,040 Speaker 1: snapshots that are contained within the archive if there are 589 00:35:17,080 --> 00:35:20,120 Speaker 1: any snapshots. As I mentioned earlier, not everything is in there, 590 00:35:20,200 --> 00:35:22,440 Speaker 1: but if it is in there, you will see options 591 00:35:22,440 --> 00:35:25,600 Speaker 1: available to you to look at the page at different 592 00:35:25,640 --> 00:35:28,239 Speaker 1: points in history. One thing I like to do is 593 00:35:28,320 --> 00:35:31,920 Speaker 1: look back at how famous web pages have changed in 594 00:35:32,000 --> 00:35:34,680 Speaker 1: their design over the years. If you put in something 595 00:35:34,719 --> 00:35:38,360 Speaker 1: like really big, like CNN dot com, you can see 596 00:35:38,360 --> 00:35:41,359 Speaker 1: how the look and interface of that site has transitioned 597 00:35:41,640 --> 00:35:44,920 Speaker 1: during different eras across the web. I also used to 598 00:35:44,960 --> 00:35:47,920 Speaker 1: do this with the old website I worked for, HowStuffWorks 599 00:35:47,960 --> 00:35:50,560 Speaker 1: dot com. I mean, that's where tech stuff gets the 600 00:35:50,640 --> 00:35:53,880 Speaker 1: stuff in its name, from HowStuffWorks dot com. I 601 00:35:54,000 --> 00:35:57,160 Speaker 1: like using the Wayback Machine to look at what the 602 00:35:57,200 --> 00:35:59,719 Speaker 1: site looked like when I first joined, which was 603 00:36:00,040 --> 00:36:02,400 Speaker 1: in February two thousand and seven, in case you're curious.
604 00:36:02,680 --> 00:36:07,000 Speaker 1: It looks entirely different now than how it looked back then, 605 00:36:07,200 --> 00:36:09,360 Speaker 1: and through the Wayback Machine you can see what it 606 00:36:09,360 --> 00:36:12,400 Speaker 1: looked like back then. Also, these days, the Wayback Machine 607 00:36:12,440 --> 00:36:13,920 Speaker 1: is the only way I can see some of the 608 00:36:14,000 --> 00:36:17,840 Speaker 1: articles I wrote for that site, because the articles have 609 00:36:17,960 --> 00:36:23,040 Speaker 1: been either deleted or, more likely, rewritten over time. Now, 610 00:36:23,040 --> 00:36:24,839 Speaker 1: to be fair to HowStuffWorks, a lot of 611 00:36:24,840 --> 00:36:28,000 Speaker 1: my writing was in the computers and electronics sections, and 612 00:36:28,120 --> 00:36:32,520 Speaker 1: obviously things change in those fields very quickly, and something 613 00:36:32,560 --> 00:36:37,320 Speaker 1: that was relevant fifteen years ago is definitely not relevant today. 614 00:36:37,880 --> 00:36:41,040 Speaker 1: So you have to replace old stuff on a regular basis. 615 00:36:41,160 --> 00:36:43,319 Speaker 1: But it is kind of sad that a lot of 616 00:36:43,320 --> 00:36:45,360 Speaker 1: my work, a lot of my work for the first, 617 00:36:45,719 --> 00:36:49,040 Speaker 1: you know, ten years of my career doing this kind 618 00:36:49,040 --> 00:36:52,560 Speaker 1: of stuff, is not accessible unless you use something like 619 00:36:52,600 --> 00:36:55,440 Speaker 1: the Wayback Machine.
Now, one super neat thing about the 620 00:36:55,440 --> 00:36:58,680 Speaker 1: Wayback Machine is that you can still follow links that 621 00:36:58,719 --> 00:37:02,600 Speaker 1: are on pages. Like, if the archive has those linked 622 00:37:02,640 --> 00:37:05,600 Speaker 1: assets also in the archive, then you're going to be 623 00:37:05,600 --> 00:37:08,120 Speaker 1: shown a record, and the record will be one that 624 00:37:08,160 --> 00:37:11,719 Speaker 1: was captured closest in time to the first page that 625 00:37:11,800 --> 00:37:15,319 Speaker 1: you were originally on. This sounds complicated. Let me give 626 00:37:15,320 --> 00:37:17,880 Speaker 1: an example; it makes it way easier. So let's say 627 00:37:18,080 --> 00:37:22,440 Speaker 1: that I visit the web capture, the snapshot, for HowStuffWorks 628 00:37:22,520 --> 00:37:26,680 Speaker 1: dot com's homepage on February nineteenth, two thousand and seven. 629 00:37:27,160 --> 00:37:30,440 Speaker 1: By the way, this snapshot on February nineteenth, two thousand 630 00:37:30,440 --> 00:37:33,400 Speaker 1: and seven is the closest date to when I started 631 00:37:33,480 --> 00:37:37,600 Speaker 1: working at that company that's in the archive. On the actual 632 00:37:37,680 --> 00:37:40,960 Speaker 1: date when I started, the website was not captured 633 00:37:41,000 --> 00:37:45,840 Speaker 1: that day. Anyway, by clicking around on this homepage, I 634 00:37:45,840 --> 00:37:49,399 Speaker 1: can actually follow links and it'll pull up archived links 635 00:37:49,440 --> 00:37:52,840 Speaker 1: of archived articles, which is really neat. And when I 636 00:37:52,880 --> 00:37:56,120 Speaker 1: did that, at one point, I clicked on a link 637 00:37:56,239 --> 00:38:01,320 Speaker 1: for more information or related articles to how helicopters work.
638 00:38:01,719 --> 00:38:06,320 Speaker 1: That page, the related page was actually archived on February 639 00:38:06,360 --> 00:38:09,319 Speaker 1: twenty second, two thousand and seven. So one was on 640 00:38:09,360 --> 00:38:12,360 Speaker 1: February nineteenth, the other was February twenty second, but the 641 00:38:12,440 --> 00:38:16,800 Speaker 1: link still worked. Right. Yes, these were two different pages 642 00:38:16,840 --> 00:38:20,560 Speaker 1: that were archived on two different days, but the nature 643 00:38:20,760 --> 00:38:26,120 Speaker 1: of the archive allows those links to still work between 644 00:38:26,160 --> 00:38:28,680 Speaker 1: the two, which is neat because I'm not just popping 645 00:38:28,719 --> 00:38:31,480 Speaker 1: around through a web of links. I'm also kind of 646 00:38:31,520 --> 00:38:36,040 Speaker 1: time traveling, right, I'm looking at a timeline of snapshots 647 00:38:36,280 --> 00:38:39,279 Speaker 1: that are all still interlinked together, even if they were 648 00:38:39,280 --> 00:38:42,839 Speaker 1: captured on different days. I think that's really cool. Now 649 00:38:42,880 --> 00:38:45,040 Speaker 1: it gets even more cool when you think about the 650 00:38:45,080 --> 00:38:48,440 Speaker 1: scale of this project. 
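The "closest in time" behavior described here can be illustrated with a small lookup: given the dates on which a linked page was captured, pick the one nearest to the date of the page you came from. This is only a sketch of the idea, not how the Wayback Machine is actually implemented. The function name `closest_capture` is hypothetical, and of the capture dates below only February 22, 2007 comes from the episode; the other two are invented to make the example meaningful.

```python
from bisect import bisect_left
from datetime import date


def closest_capture(captures, target):
    """Return the capture date nearest to target, or None if there are none.

    `captures` must be sorted in ascending order.
    """
    if not captures:
        return None
    i = bisect_left(captures, target)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = captures[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda d: abs(d - target))


# Captures of the linked article (only Feb 22, 2007 is from the episode).
article_captures = [date(2007, 1, 30), date(2007, 2, 22), date(2007, 3, 15)]

# The homepage snapshot being browsed was taken on Feb 19, 2007.
homepage_snapshot = date(2007, 2, 19)

chosen = closest_capture(article_captures, homepage_snapshot)
print(chosen)
```

With these dates, the February 22 capture is three days from the homepage snapshot while the January 30 capture is twenty days away, so the link resolves to February 22.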
So, according to the Internet Archive itself, 651 00:38:48,640 --> 00:38:52,120 Speaker 1: the archive contains eight hundred and thirty-five billion, with 652 00:38:52,200 --> 00:38:55,439 Speaker 1: a B, web pages. And as I mentioned earlier, that 653 00:38:55,680 --> 00:38:58,360 Speaker 1: just makes up part of all the data that's stored 654 00:38:58,400 --> 00:39:02,040 Speaker 1: on Internet Archive servers, because the organization is also home 655 00:39:02,080 --> 00:39:05,640 Speaker 1: to more than forty-four million books and other texts, 656 00:39:06,000 --> 00:39:11,040 Speaker 1: fifteen million audio recordings, more than ten million videos, and 657 00:39:11,120 --> 00:39:15,040 Speaker 1: more than a million different pieces of software. Again, some 658 00:39:15,120 --> 00:39:19,040 Speaker 1: of this stuff might not be recorded anywhere else. There 659 00:39:19,080 --> 00:39:22,160 Speaker 1: may not be duplicates or copies of some of this 660 00:39:22,200 --> 00:39:26,799 Speaker 1: stuff anywhere else. While you might have things like 661 00:39:26,920 --> 00:39:31,160 Speaker 1: Blu-ray, DVDs, or whatever of some of those videos, others 662 00:39:31,239 --> 00:39:36,080 Speaker 1: might not have anything. And history is filled with instances 663 00:39:36,120 --> 00:39:40,759 Speaker 1: of media companies generating stuff or others, you know, independent 664 00:39:40,920 --> 00:39:45,359 Speaker 1: people too, generating stuff but not keeping a copy for posterity, 665 00:39:45,440 --> 00:39:48,880 Speaker 1: and then it's here and it's gone. Sometimes that's on purpose. 666 00:39:49,280 --> 00:39:52,719 Speaker 1: Sometimes it's a statement, like you make something ephemeral for 667 00:39:52,760 --> 00:39:56,400 Speaker 1: that very reason. 
Other times it's out of convenience, like 668 00:39:56,719 --> 00:40:00,719 Speaker 1: there are stories about how the BBC would regularly reuse 669 00:40:00,880 --> 00:40:05,719 Speaker 1: tapes and tape over previous programming because there was no 670 00:40:05,800 --> 00:40:13,359 Speaker 1: thought about preservation or a home theater industry. So there 671 00:40:13,400 --> 00:40:17,200 Speaker 1: are entire eras of stuff like Doctor Who that are 672 00:40:17,239 --> 00:40:21,160 Speaker 1: just gone or believed to be gone because the BBC 673 00:40:21,280 --> 00:40:25,000 Speaker 1: would just tape over old tapes and so you lost 674 00:40:25,040 --> 00:40:29,440 Speaker 1: whatever was on there originally. That's why things like the 675 00:40:29,440 --> 00:40:32,880 Speaker 1: Internet Archive exist: to avoid that in the case 676 00:40:32,920 --> 00:40:35,680 Speaker 1: of stuff that's stored across the Internet, to make sure 677 00:40:35,800 --> 00:40:39,239 Speaker 1: that there is an accessible record of those things and 678 00:40:39,239 --> 00:40:41,920 Speaker 1: that they don't just disappear. In two thousand and seven, 679 00:40:42,120 --> 00:40:45,640 Speaker 1: the state of California recognized the Internet Archive as an 680 00:40:45,680 --> 00:40:49,480 Speaker 1: official library, which was important. It's not just an honorarium. 681 00:40:49,760 --> 00:40:53,520 Speaker 1: It would allow the nonprofit organization to receive federal funding, 682 00:40:53,600 --> 00:40:56,360 Speaker 1: which is a pretty important development for the longevity of 683 00:40:56,400 --> 00:40:59,440 Speaker 1: the program. But while the usefulness of the organization is 684 00:40:59,440 --> 00:41:02,480 Speaker 1: beyond question, the methods that the Archive has used 685 00:41:02,680 --> 00:41:06,680 Speaker 1: have not always been met with universal approval. 
For example, recently, 686 00:41:06,920 --> 00:41:10,800 Speaker 1: the Internet Archive has been embroiled in a pretty nasty lawsuit. 687 00:41:10,960 --> 00:41:14,719 Speaker 1: It's called the Hachette versus Internet Archive suit, and it 688 00:41:14,760 --> 00:41:17,839 Speaker 1: revolves around a group of publishers that object to how 689 00:41:17,880 --> 00:41:21,880 Speaker 1: the Internet Archive scans physical books for the purposes of 690 00:41:22,000 --> 00:41:26,160 Speaker 1: lending them out as digital copies. Publishers are in the 691 00:41:26,160 --> 00:41:29,680 Speaker 1: business of publishing and selling copies of books, but for years, 692 00:41:29,680 --> 00:41:32,520 Speaker 1: libraries have existed in order to get copies of various 693 00:41:32,560 --> 00:41:35,200 Speaker 1: books and to make them available for lending. So libraries 694 00:41:35,239 --> 00:41:38,640 Speaker 1: have to purchase the books or have them donated to 695 00:41:38,760 --> 00:41:42,080 Speaker 1: the library, and then make those books available to lend 696 00:41:42,120 --> 00:41:46,399 Speaker 1: out to members of the library. The Internet Archive has 697 00:41:46,440 --> 00:41:49,879 Speaker 1: a controlled digital lending program to handle this sort of thing, 698 00:41:50,040 --> 00:41:54,800 Speaker 1: only we're talking about digital formats, not a physical copy 699 00:41:54,840 --> 00:41:58,600 Speaker 1: of a book. This is where things get tricky because obviously, 700 00:41:58,760 --> 00:42:02,000 Speaker 1: if you, as an American citizen at least, if you 701 00:42:02,040 --> 00:42:04,319 Speaker 1: go out and buy a copy of a book, you 702 00:42:04,360 --> 00:42:07,800 Speaker 1: can do whatever you like with your copy of that book, 703 00:42:08,040 --> 00:42:10,359 Speaker 1: apart from making your own copies of it and then 704 00:42:10,480 --> 00:42:14,120 Speaker 1: selling those. You can't do that. That's copyright infringement. 
But 705 00:42:14,200 --> 00:42:16,760 Speaker 1: if you own a physical copy of a book, you can. 706 00:42:16,920 --> 00:42:19,560 Speaker 1: You can keep it for yourself. You could lend it 707 00:42:19,600 --> 00:42:22,040 Speaker 1: to a friend and let them read it, they return 708 00:42:22,080 --> 00:42:24,440 Speaker 1: it to you later. You could give the book away. 709 00:42:24,880 --> 00:42:28,120 Speaker 1: You could resell your copy to someone else, even if 710 00:42:28,160 --> 00:42:30,520 Speaker 1: you're selling it for a fraction of what the book 711 00:42:30,600 --> 00:42:33,160 Speaker 1: is going for in bookstores. You could do that. You 712 00:42:33,160 --> 00:42:35,560 Speaker 1: could even burn the darn thing if you're so inclined. 713 00:42:35,960 --> 00:42:38,719 Speaker 1: Just don't do that. Don't burn books. But all of 714 00:42:38,760 --> 00:42:42,920 Speaker 1: those things are permitted with your personal copy of the book. However, 715 00:42:43,160 --> 00:42:46,520 Speaker 1: a digital copy, well, now we're starting to talk about 716 00:42:46,600 --> 00:42:49,840 Speaker 1: different rules. So yes, you can lend out a physical 717 00:42:49,840 --> 00:42:52,920 Speaker 1: copy of a book. That's allowed. That's fair use. But 718 00:42:53,280 --> 00:42:57,400 Speaker 1: actually it's not even fair use. That's under laws of property. 719 00:42:57,719 --> 00:43:00,640 Speaker 1: But we won't get into all that. A digital copy 720 00:43:00,760 --> 00:43:04,200 Speaker 1: is a lot trickier because it's easy to replicate, much 721 00:43:04,239 --> 00:43:07,520 Speaker 1: easier than replicating a physical copy of a book, and 722 00:43:07,560 --> 00:43:11,160 Speaker 1: so different rules have developed to handle digital information compared 723 00:43:11,200 --> 00:43:14,880 Speaker 1: to stuff that's in our physical meat space. 
So this 724 00:43:15,000 --> 00:43:18,600 Speaker 1: lawsuit argues that the Internet Archive first digitized physical books 725 00:43:18,640 --> 00:43:21,800 Speaker 1: without permission from the publishers, and that that was problem 726 00:43:21,880 --> 00:43:26,759 Speaker 1: number one. There's been some different arguments about that, like 727 00:43:27,000 --> 00:43:30,200 Speaker 1: if there was no ebook equivalent of the copy of 728 00:43:30,239 --> 00:43:33,799 Speaker 1: the book, if the publishers had not digitized that, that's 729 00:43:33,840 --> 00:43:37,560 Speaker 1: slightly different than if the publishers also offer an electronic 730 00:43:37,680 --> 00:43:40,560 Speaker 1: version of the physical books they sell. But the other 731 00:43:40,600 --> 00:43:43,880 Speaker 1: problem is that the Internet Archive received donations and funding 732 00:43:43,920 --> 00:43:46,480 Speaker 1: that in part stemmed from the practice of lending out 733 00:43:46,520 --> 00:43:49,640 Speaker 1: digitized books, so the publishers said that made the Internet 734 00:43:50,040 --> 00:43:54,200 Speaker 1: Archive's activities a commercial enterprise. In twenty twenty three, a 735 00:43:54,400 --> 00:43:57,400 Speaker 1: judge found in favor of the publishers, saying that the 736 00:43:57,400 --> 00:44:00,319 Speaker 1: Internet Archive failed to argue that their work fell under 737 00:44:00,360 --> 00:44:03,200 Speaker 1: the principles of fair use. Again, getting into fair use, 738 00:44:03,239 --> 00:44:06,000 Speaker 1: that's a whole thing, but generally speaking, fair use covers 739 00:44:06,040 --> 00:44:08,880 Speaker 1: a relatively narrow set of use cases in which the 740 00:44:09,000 --> 00:44:12,680 Speaker 1: copying or the use or the distribution of a copyrighted 741 00:44:12,719 --> 00:44:16,080 Speaker 1: work does not count as copyright infringement. 
But it has 742 00:44:16,160 --> 00:44:20,200 Speaker 1: to meet certain criteria, and it's only ever decided in 743 00:44:20,239 --> 00:44:23,279 Speaker 1: a court of law. It's not something that you 744 00:44:23,280 --> 00:44:27,920 Speaker 1: can just apply proactively. It's something that you use in 745 00:44:28,000 --> 00:44:31,200 Speaker 1: a defense if you're brought up on charges of copyright infringement. 746 00:44:31,400 --> 00:44:34,560 Speaker 1: So by the time you're actually talking fair use, it's 747 00:44:34,600 --> 00:44:38,480 Speaker 1: already pretty late in the game. But anyway, this particular 748 00:44:38,600 --> 00:44:43,000 Speaker 1: lawsuit is under appeal. The Internet Archive recently made final 749 00:44:43,120 --> 00:44:46,920 Speaker 1: arguments in the case. I have not seen anything about 750 00:44:46,960 --> 00:44:49,239 Speaker 1: the case being decided one way or the other since then, 751 00:44:49,560 --> 00:44:53,520 Speaker 1: so I'm not really sure which way it's going. Again, 752 00:44:54,200 --> 00:44:57,600 Speaker 1: I didn't see anything about a decision made, but then 753 00:44:57,760 --> 00:45:01,280 Speaker 1: most of the articles about this are about the initial trial 754 00:45:01,360 --> 00:45:04,439 Speaker 1: that happened in twenty twenty three, so hopefully I will 755 00:45:04,480 --> 00:45:08,000 Speaker 1: find some follow-up on this at some point. But 756 00:45:08,480 --> 00:45:11,239 Speaker 1: there's no denying the Internet Archive has done a tremendous 757 00:45:11,280 --> 00:45:14,359 Speaker 1: amount of work in the field of knowledge preservation and 758 00:45:14,440 --> 00:45:18,160 Speaker 1: knowledge accessibility. Without the Internet Archive, there's no way of 759 00:45:18,200 --> 00:45:21,400 Speaker 1: knowing how much information would be lost to us forever. 
760 00:45:21,760 --> 00:45:25,000 Speaker 1: Stuff that could have been incredibly useful or even just 761 00:45:25,160 --> 00:45:29,759 Speaker 1: diverting could be gone, and we'd never have a way 762 00:45:29,760 --> 00:45:33,200 Speaker 1: of retrieving it again. And I am very thankful that 763 00:45:33,280 --> 00:45:36,440 Speaker 1: an organization like the Internet Archive exists. If you're not 764 00:45:36,520 --> 00:45:38,800 Speaker 1: familiar with it, if you've never used it, I recommend 765 00:45:38,800 --> 00:45:42,560 Speaker 1: you check it out and explore the Internet Archive. Look 766 00:45:42,600 --> 00:45:45,040 Speaker 1: at some of the things that are in that archive, 767 00:45:45,280 --> 00:45:47,480 Speaker 1: like some of the books, some of the recordings. There's 768 00:45:47,520 --> 00:45:49,440 Speaker 1: some great stuff. I think there's like a quarter of 769 00:45:49,480 --> 00:45:53,720 Speaker 1: a million live performances archived just on the Internet Archive, 770 00:45:54,080 --> 00:45:58,719 Speaker 1: like live music performances. That alone is super cool. Anyway, 771 00:45:58,920 --> 00:46:03,080 Speaker 1: I hope you found this episode informative and entertaining. I 772 00:46:03,120 --> 00:46:06,560 Speaker 1: hope you check out the Internet Archive. I also very much 773 00:46:06,600 --> 00:46:09,560 Speaker 1: hope that you are all well, and I will talk 774 00:46:09,600 --> 00:46:20,360 Speaker 1: to you again really soon. Tech Stuff is an iHeartRadio production. 775 00:46:20,640 --> 00:46:25,680 Speaker 1: For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, 776 00:46:25,800 --> 00:46:27,800 Speaker 1: or wherever you listen to your favorite shows.