1
00:00:04,480 --> 00:00:12,639
Speaker 1: Welcome to TechStuff, a production from iHeartRadio. Hey there,

2
00:00:12,640 --> 00:00:16,000
Speaker 1: and welcome to TechStuff. I'm your host, Jonathan Strickland.

3
00:00:16,040 --> 00:00:19,040
Speaker 1: I'm an executive producer with iHeart Podcasts. And how the

4
00:00:19,079 --> 00:00:23,280
Speaker 1: tech are you? So let's take a little literary trip.

5
00:00:23,600 --> 00:00:29,200
Speaker 1: In Anthony Burgess's A Clockwork Orange, the extremely wicked protagonist,

6
00:00:29,680 --> 00:00:32,920
Speaker 1: and that's putting it lightly, at one point early, early in

7
00:00:32,920 --> 00:00:36,760
Speaker 1: the novel, reflects on the nature of permanence. He thinks

8
00:00:36,800 --> 00:00:40,680
Speaker 1: the reader might not remember what milk bars were like

9
00:00:41,159 --> 00:00:45,360
Speaker 1: due to, quote, things changing so skorry these days and

10
00:00:45,479 --> 00:00:49,600
Speaker 1: everybody very quick to forget, newspapers not being read much

11
00:00:49,760 --> 00:00:54,120
Speaker 1: neither, end quote. Alex, in this case, is saying that

12
00:00:54,200 --> 00:00:58,040
Speaker 1: the combination of the world changing very quickly, where skorry is

13
00:00:58,080 --> 00:01:01,880
Speaker 1: derived from a Slavic word meaning swiftly or quickly, and

14
00:01:02,000 --> 00:01:05,720
Speaker 1: people having short memories means that referencing something that happened

15
00:01:05,760 --> 00:01:08,680
Speaker 1: even just a few years ago might mean you're met

16
00:01:08,680 --> 00:01:12,360
Speaker 1: with blank stares because the world has moved on. Now

17
00:01:12,520 --> 00:01:15,759
Speaker 1: take that same sentiment and crank it up to eleven

18
00:01:16,040 --> 00:01:18,840
Speaker 1: when you talk about the Internet in general and the

19
00:01:18,840 --> 00:01:21,600
Speaker 1: Web in particular. So, on the one hand, we know

20
00:01:22,000 --> 00:01:24,240
Speaker 1: that the rule of thumb is that once something gets

21
00:01:24,280 --> 00:01:27,920
Speaker 1: posted online, that's kind of it, right? It's sort of

22
00:01:27,959 --> 00:01:31,240
Speaker 1: perpetually online. Like that's kind of the joke. Like once

23
00:01:31,280 --> 00:01:33,520
Speaker 1: it's up, it's up, and you can take it down,

24
00:01:33,520 --> 00:01:35,280
Speaker 1: but there's going to be a copy of it somewhere.

25
00:01:35,720 --> 00:01:39,319
Speaker 1: So even if the originator tries to take down whatever

26
00:01:39,400 --> 00:01:43,440
Speaker 1: the stuff was, somebody's got it. But on the other hand,

27
00:01:43,440 --> 00:01:46,200
Speaker 1: we also know that so much stuff gets added every

28
00:01:46,240 --> 00:01:49,400
Speaker 1: single day to the Internet. There's actually a colossal mountain

29
00:01:49,400 --> 00:01:53,120
Speaker 1: of content out there that just keeps getting bigger moment

30
00:01:53,160 --> 00:01:55,960
Speaker 1: by moment, and everything that came before it can end

31
00:01:56,040 --> 00:01:59,480
Speaker 1: up getting buried in the process. And sometimes stuff can

32
00:01:59,560 --> 00:02:03,760
Speaker 1: be added and taken down without anyone being the wiser. Now,

33
00:02:03,800 --> 00:02:06,640
Speaker 1: on top of that, web pages obviously can change. A

34
00:02:06,720 --> 00:02:10,760
Speaker 1: website might adopt a new format or style, might incorporate

35
00:02:10,840 --> 00:02:15,000
Speaker 1: new technologies and interfaces that are added to web browsers,

36
00:02:15,360 --> 00:02:18,680
Speaker 1: or it might choose to remove sections that once might

37
00:02:18,720 --> 00:02:21,960
Speaker 1: have been relevant but maybe now not so much. Or

38
00:02:22,080 --> 00:02:27,079
Speaker 1: entire websites could disappear as servers go offline or companies

39
00:02:27,320 --> 00:02:32,040
Speaker 1: go bankrupt, or you know, web administrators just lose interest.

40
00:02:32,520 --> 00:02:36,520
Speaker 1: The entire spectrum of human output can be found on

41
00:02:36,560 --> 00:02:39,400
Speaker 1: the web. Not every instance of human output, but an

42
00:02:39,440 --> 00:02:44,440
Speaker 1: example of everything is out there. Everything from deep philosophical

43
00:02:44,520 --> 00:02:48,040
Speaker 1: musings to the most banal posts you know, which often

44
00:02:48,520 --> 00:02:51,320
Speaker 1: revolve around what someone is having for lunch. All of

45
00:02:51,320 --> 00:02:53,760
Speaker 1: that finds its way to the Internet. And while you

46
00:02:53,840 --> 00:02:56,600
Speaker 1: might argue that a lot of it, or perhaps even

47
00:02:56,680 --> 00:02:59,040
Speaker 1: most of it, isn't really worth the time it

48
00:02:59,080 --> 00:03:02,920
Speaker 1: takes to consume, let alone keep it around, there is

49
00:03:03,080 --> 00:03:06,160
Speaker 1: undeniably a huge amount of valuable data out there too,

50
00:03:06,639 --> 00:03:09,800
Speaker 1: but there's no guarantee that it will stay there or

51
00:03:09,880 --> 00:03:13,880
Speaker 1: remain easily findable. And that's where today's topic comes in.

52
00:03:13,960 --> 00:03:16,480
Speaker 1: I wanted to talk about a project that began back

53
00:03:16,520 --> 00:03:19,320
Speaker 1: in nineteen ninety six. It's a project that aims to

54
00:03:19,360 --> 00:03:22,520
Speaker 1: preserve as much of the Internet as possible in little

55
00:03:22,720 --> 00:03:26,600
Speaker 1: slices of time, little snapshots. Not only does that mean

56
00:03:26,639 --> 00:03:29,200
Speaker 1: you can potentially dig up something that hasn't been online

57
00:03:29,240 --> 00:03:31,919
Speaker 1: for years, but also you can get a look at

58
00:03:32,000 --> 00:03:35,080
Speaker 1: what different sites were like in various eras of the Web.

59
00:03:35,320 --> 00:03:37,600
Speaker 1: It could be a really eye-opening experience to see

60
00:03:37,640 --> 00:03:40,480
Speaker 1: something like Amazon and what it looked like, you know,

61
00:03:40,520 --> 00:03:43,960
Speaker 1: shortly after it launched, compared to what it looks like today.

62
00:03:44,400 --> 00:03:48,960
Speaker 1: So we are going to talk about the Internet Archive. Now.

63
00:03:48,960 --> 00:03:51,240
Speaker 1: To do that, we need to talk a little bit

64
00:03:51,240 --> 00:03:54,040
Speaker 1: about the people who founded the ding dang darn thing,

65
00:03:54,320 --> 00:03:58,520
Speaker 1: and that would be Brewster Kahle and Bruce Gilliat. So

66
00:03:58,680 --> 00:04:02,040
Speaker 1: Kahle graduated from MIT with a degree in computer science

67
00:04:02,040 --> 00:04:06,280
Speaker 1: and engineering. After he graduated, he joined fellow MIT graduate

68
00:04:06,400 --> 00:04:10,080
Speaker 1: Danny Hillis, who had created a company called Thinking Machines.

69
00:04:10,320 --> 00:04:13,960
Speaker 1: So this was a super computer company. His team specialized

70
00:04:13,960 --> 00:04:17,920
Speaker 1: in building massively parallel computer systems, mostly with the aim

71
00:04:17,960 --> 00:04:21,120
Speaker 1: of building machines for AI research and development. So yeah,

72
00:04:21,240 --> 00:04:24,480
Speaker 1: Kahle was working on the challenges of providing AI researchers

73
00:04:24,520 --> 00:04:28,040
Speaker 1: with the compute power they need, decades before our current

74
00:04:28,120 --> 00:04:33,040
Speaker 1: AI explosion. Bruce Gilliat is also a computer scientist, and

75
00:04:33,080 --> 00:04:35,160
Speaker 1: that's just about all I know about him. I mean,

76
00:04:35,320 --> 00:04:38,040
Speaker 1: I know he is, or at least was, married, and

77
00:04:38,120 --> 00:04:40,600
Speaker 1: I also know he owned a series of very impressive

78
00:04:40,600 --> 00:04:43,960
Speaker 1: houses in the San Francisco and San Jose areas because

79
00:04:44,000 --> 00:04:46,600
Speaker 1: it made the news whenever he sold one or bought

80
00:04:46,600 --> 00:04:49,679
Speaker 1: a new one. But other than that, there's precious little

81
00:04:49,680 --> 00:04:53,000
Speaker 1: information about him that I could find, which is somewhat ironic

82
00:04:53,040 --> 00:04:55,440
Speaker 1: when you consider that he has dedicated a lot of

83
00:04:55,440 --> 00:04:58,520
Speaker 1: time and effort to preserving information on the Internet. He

84
00:04:58,520 --> 00:05:00,839
Speaker 1: would go on to co-found the company called Alexa

85
00:05:00,920 --> 00:05:03,960
Speaker 1: Internet with Brewster Kahle, but that's getting ahead of ourselves.

86
00:05:04,080 --> 00:05:07,839
Speaker 1: So most of my story will center around Kahle simply

87
00:05:07,880 --> 00:05:10,520
Speaker 1: because out of the two co-founders, he's the one

88
00:05:10,520 --> 00:05:13,839
Speaker 1: who acted more as the face of the efforts, and Gilliat,

89
00:05:13,839 --> 00:05:15,880
Speaker 1: from what I can tell, has just been really good

90
00:05:15,880 --> 00:05:20,120
Speaker 1: about kind of maintaining a very private personal life. So

91
00:05:20,880 --> 00:05:24,960
Speaker 1: I don't mean to diminish Gilliat's contributions, but at the

92
00:05:24,960 --> 00:05:27,640
Speaker 1: same time, you know, I can only cover what I

93
00:05:27,640 --> 00:05:31,240
Speaker 1: can find. So in nineteen eighty nine, Kahle, along with

94
00:05:31,320 --> 00:05:35,080
Speaker 1: a colleague named Harry Morris, created an innovative tool for

95
00:05:35,200 --> 00:05:38,760
Speaker 1: the blossoming Internet. Now remember this is the Internet. It's

96
00:05:38,839 --> 00:05:42,119
Speaker 1: not the World Wide Web. The Web didn't exist yet;

97
00:05:42,240 --> 00:05:45,159
Speaker 1: the Internet did, and the tool they created was called

98
00:05:45,160 --> 00:05:51,960
Speaker 1: the Wide Area Information Server, or WAIS, W-A-I-S. So people

99
00:05:52,000 --> 00:05:55,040
Speaker 1: could set up an Internet server. They could host documents on

100
00:05:55,080 --> 00:05:59,960
Speaker 1: their servers. But finding these documents was really hard

101
00:06:00,720 --> 00:06:04,680
Speaker 1: because you didn't necessarily have hyperlinks connecting one document to

102
00:06:04,760 --> 00:06:07,920
Speaker 1: others and vice versa. You didn't have an easy way

103
00:06:07,960 --> 00:06:12,680
Speaker 1: of even navigating through different documents from one to the next.

104
00:06:13,160 --> 00:06:15,320
Speaker 1: So it was almost a case that you needed to

105
00:06:15,360 --> 00:06:19,080
Speaker 1: know where something was and what it was called first,

106
00:06:19,240 --> 00:06:22,440
Speaker 1: and then you could go to the relevant server and

107
00:06:22,480 --> 00:06:26,599
Speaker 1: retrieve that document. Otherwise the document would just remain quietly

108
00:06:26,680 --> 00:06:30,359
Speaker 1: sitting on some server somewhere and no one would know

109
00:06:30,400 --> 00:06:34,080
Speaker 1: about it. Now, that is antithetical to the entire purpose

110
00:06:34,160 --> 00:06:37,840
Speaker 1: of a wide area information sharing system, because, I mean,

111
00:06:37,880 --> 00:06:40,800
Speaker 1: the name tells us the whole purpose of this technology

112
00:06:40,839 --> 00:06:45,360
Speaker 1: is to allow information to be widely shared. Jeremy Norman's

113
00:06:45,400 --> 00:06:50,000
Speaker 1: History of Information lists WAIS as, quote, the first Internet

114
00:06:50,080 --> 00:06:54,120
Speaker 1: publishing system, just predating Gopher and the World Wide Web

115
00:06:54,320 --> 00:06:58,839
Speaker 1: end quote. In a recorded presentation to some Xerox employees,

116
00:06:59,000 --> 00:07:03,120
Speaker 1: Kahle laid out his personal perspective on what he wanted from

117
00:07:03,279 --> 00:07:06,159
Speaker 1: his experience on the Internet. So first up, he said

118
00:07:06,360 --> 00:07:09,520
Speaker 1: he wanted his own personal information to be easily accessible

119
00:07:09,960 --> 00:07:13,240
Speaker 1: by him specifically. Not that it should be easily accessible

120
00:07:13,280 --> 00:07:16,880
Speaker 1: to everybody, but specifically to him. He wanted the ability

121
00:07:16,920 --> 00:07:19,760
Speaker 1: to get access to all the different stuff he generates,

122
00:07:19,800 --> 00:07:22,280
Speaker 1: like articles and such, and to make it really easy

123
00:07:22,320 --> 00:07:25,080
Speaker 1: to do that. He also wanted the ability for publishers

124
00:07:25,120 --> 00:07:27,960
Speaker 1: to get their work to him. So in Kahle's mind,

125
00:07:28,280 --> 00:07:30,720
Speaker 1: the best approach would be for published works that are

126
00:07:30,760 --> 00:07:33,360
Speaker 1: relevant to his interests to find their way to him,

127
00:07:33,560 --> 00:07:36,120
Speaker 1: as opposed to Kahle having to go out and hunt

128
00:07:36,200 --> 00:07:39,480
Speaker 1: down these published works himself. And he pointed out this

129
00:07:39,600 --> 00:07:42,480
Speaker 1: is what publishers want too, because you wouldn't publish something

130
00:07:42,560 --> 00:07:45,239
Speaker 1: unless you wanted folks to actually read it. He also

131
00:07:45,320 --> 00:07:48,160
Speaker 1: said that he wanted this technology to be usable anywhere.

132
00:07:48,600 --> 00:07:51,200
Speaker 1: He wanted people to be able to access it no

133
00:07:51,240 --> 00:07:53,080
Speaker 1: matter what kind of device they were relying on. Now

134
00:07:53,160 --> 00:07:56,160
Speaker 1: he was specifically referencing laptops at the time, but he

135
00:07:56,280 --> 00:08:00,120
Speaker 1: was also saying that portable computer systems, essentially things that

136
00:08:00,120 --> 00:08:03,400
Speaker 1: would become smartphones and tablets, were on the horizon and

137
00:08:03,440 --> 00:08:05,880
Speaker 1: that these needed to be able to access that stuff too.

138
00:08:06,280 --> 00:08:09,080
Speaker 1: And he said that he wanted people to be able

139
00:08:09,080 --> 00:08:11,880
Speaker 1: to use what he had learned should he choose to

140
00:08:11,880 --> 00:08:15,440
Speaker 1: share the information, that if he had come up with

141
00:08:15,480 --> 00:08:17,600
Speaker 1: something that was useful and he wanted to share that,

142
00:08:17,640 --> 00:08:19,760
Speaker 1: he wanted other people to be able to access that.

143
00:08:20,160 --> 00:08:23,120
Speaker 1: Kahle didn't say that people should be compelled to share,

144
00:08:23,560 --> 00:08:26,000
Speaker 1: but if they wanted to it should be possible to

145
00:08:26,040 --> 00:08:30,560
Speaker 1: do so. WAIS was Kahle's attempt to bring these ideas

146
00:08:30,640 --> 00:08:34,199
Speaker 1: to life. In that presentation to the Xerox employees, he

147
00:08:34,320 --> 00:08:38,320
Speaker 1: defined WAIS as electronic publishing. He further defined that term

148
00:08:38,400 --> 00:08:41,880
Speaker 1: to mean the distribution of information. So whether the end

149
00:08:41,960 --> 00:08:45,080
Speaker 1: user was to look at this information on a computer

150
00:08:45,120 --> 00:08:48,280
Speaker 1: screen or they just chose to print out the information

151
00:08:48,640 --> 00:08:50,880
Speaker 1: and then read it that way, that was beside the point.

152
00:08:51,120 --> 00:08:55,559
Speaker 1: Electronic publishing was all about how information got from the

153
00:08:55,600 --> 00:08:58,760
Speaker 1: originator to the end user. That's what made it

154
00:08:58,920 --> 00:09:02,880
Speaker 1: e-publishing: that it was publishing over wires. Now, one thing

155
00:09:03,000 --> 00:09:06,800
Speaker 1: Kahle introduced in this presentation to Xerox was this concept

156
00:09:06,800 --> 00:09:10,760
Speaker 1: of conducting searches using natural language. This concept is one

157
00:09:10,800 --> 00:09:13,640
Speaker 1: that we're really familiar with today. You enter a query

158
00:09:13,800 --> 00:09:16,200
Speaker 1: into a search bar. You describe what it is that

159
00:09:16,240 --> 00:09:19,760
Speaker 1: you want to know or learn about, or have access to,

160
00:09:20,080 --> 00:09:23,400
Speaker 1: or retrieve or whatever. The search engine brings back search

161
00:09:23,440 --> 00:09:26,600
Speaker 1: results that are ordered by some kind of relevance depending

162
00:09:26,679 --> 00:09:29,960
Speaker 1: upon the search engine's, you know, various algorithms. How the

163
00:09:30,000 --> 00:09:33,760
Speaker 1: search engine determines relevance really depends upon the system itself,

164
00:09:33,880 --> 00:09:36,160
Speaker 1: of course. Like, you could run the same search across

165
00:09:36,400 --> 00:09:39,760
Speaker 1: different search engines and get very different results based upon

166
00:09:40,080 --> 00:09:45,280
Speaker 1: that methodology of determining relevance. If the system believes it's relevant,

167
00:09:45,480 --> 00:09:47,240
Speaker 1: it may or may not be relevant to what you

168
00:09:47,320 --> 00:09:50,520
Speaker 1: actually want. Like hopefully the two are aligned. If it's

169
00:09:50,520 --> 00:09:53,400
Speaker 1: a really good search engine, then you're going to get

170
00:09:53,480 --> 00:09:57,600
Speaker 1: something that is actually meaningful to you. Anyway, WAIS was

171
00:09:57,720 --> 00:10:01,720
Speaker 1: kind of following in that approach back before there was

172
00:10:01,760 --> 00:10:04,280
Speaker 1: a World Wide Web, you know, when you just needed

173
00:10:04,280 --> 00:10:08,200
Speaker 1: a way to find stuff that was being stored on

174
00:10:08,280 --> 00:10:11,880
Speaker 1: these Internet servers and to be able to retrieve these

175
00:10:11,920 --> 00:10:14,600
Speaker 1: documents to make use of them. Otherwise you had this

176
00:10:14,679 --> 00:10:19,360
Speaker 1: incredibly powerful communications tool, but it was so challenging to

177
00:10:19,480 --> 00:10:22,600
Speaker 1: use in a meaningful way that the information stored there

178
00:10:23,000 --> 00:10:26,560
Speaker 1: would be not that useful. I think of it like this:

179
00:10:26,679 --> 00:10:31,720
Speaker 1: imagine that there's this one remote library and it's tiny,

180
00:10:32,080 --> 00:10:36,440
Speaker 1: but it has the world's only copy of some text.

181
00:10:36,840 --> 00:10:39,280
Speaker 1: But this library's in the middle of nowhere. It's really

182
00:10:39,360 --> 00:10:42,160
Speaker 1: hard to get to. The fact that that library has

183
00:10:42,280 --> 00:10:45,800
Speaker 1: that document would not be terribly useful to most people,

184
00:10:45,920 --> 00:10:47,840
Speaker 1: and so it might as well not have the document

185
00:10:47,880 --> 00:10:50,120
Speaker 1: at all. That's kind of what WAIS was trying to

186
00:10:50,160 --> 00:10:52,920
Speaker 1: do: solve this problem of making it easier to

187
00:10:52,960 --> 00:10:57,400
Speaker 1: get access to this wealth of information that Kahle saw

188
00:10:57,720 --> 00:11:01,880
Speaker 1: was only going to get more complex and more full

189
00:11:01,960 --> 00:11:05,600
Speaker 1: of data. Well, we'll move away from WAIS, because we

190
00:11:05,600 --> 00:11:08,280
Speaker 1: could do a full episode about that. I will say

191
00:11:08,280 --> 00:11:11,960
Speaker 1: that Kahle and Morris, the founders of WAIS, the guys

192
00:11:11,960 --> 00:11:17,120
Speaker 1: who created the WAIS technologies, would actually leave Thinking Machines

193
00:11:17,320 --> 00:11:20,680
Speaker 1: and they would found a spinoff company just called WAIS Incorporated.

194
00:11:20,920 --> 00:11:23,439
Speaker 1: And it was around this point when the mysterious Bruce

195
00:11:23,480 --> 00:11:26,840
Speaker 1: Gilliat joined the team. And while the World Wide Web would

196
00:11:26,880 --> 00:11:29,840
Speaker 1: debut in the early nineties, which really opened up accessibility

197
00:11:29,840 --> 00:11:32,040
Speaker 1: to information on the Internet for a lot of people,

198
00:11:32,480 --> 00:11:35,840
Speaker 1: most of them for the first time, WAIS would continue

199
00:11:35,880 --> 00:11:38,920
Speaker 1: to remain relevant. In fact, it was relevant enough that

200
00:11:39,040 --> 00:11:42,480
Speaker 1: in nineteen ninety five AOL would come calling with an

201
00:11:42,480 --> 00:11:45,959
Speaker 1: offer to purchase the company for a cool fifteen million dollars.

202
00:11:46,000 --> 00:11:48,840
Speaker 1: If we adjust that for inflation to today's money, that would

203
00:11:48,880 --> 00:11:53,640
Speaker 1: be around thirty million bucks, in that ballpark. Now, a

204
00:11:53,640 --> 00:11:56,680
Speaker 1: lot of the folks at WAIS Incorporated would split off

205
00:11:56,760 --> 00:12:00,679
Speaker 1: to create new companies after this acquisition, and within a

206
00:12:00,800 --> 00:12:04,400
Speaker 1: year that included Kahle and Gilliat, who went on to

207
00:12:04,559 --> 00:12:10,000
Speaker 1: found a new company called Alexa Internet. And you might think, huh, Alexa,

208
00:12:10,120 --> 00:12:13,280
Speaker 1: you mean, like, the same name as the Amazon digital assistant?

209
00:12:13,679 --> 00:12:16,559
Speaker 1: And yes, exactly that, because, as it would turn out,

210
00:12:16,600 --> 00:12:21,840
Speaker 1: Amazon would ultimately acquire Alexa Internet just a few years

211
00:12:21,880 --> 00:12:25,080
Speaker 1: after it was founded. But the name derived from the

212
00:12:25,120 --> 00:12:29,800
Speaker 1: Library at Alexandria, the ancient library of Egypt that at

213
00:12:29,880 --> 00:12:33,240
Speaker 1: one point housed one of the world's largest collections of

214
00:12:33,320 --> 00:12:39,400
Speaker 1: accumulated knowledge. Now, around forty eight BCE, Julius Caesar, Julie

215
00:12:39,400 --> 00:12:42,960
Speaker 1: Baby, and his boys, they barged into Alexandria, and as

216
00:12:43,000 --> 00:12:46,840
Speaker 1: a consequence of their rowdy invasion, the library caught fire

217
00:12:47,200 --> 00:12:49,920
Speaker 1: and much of the collection burned. Sadly, that was not

218
00:12:49,960 --> 00:12:52,880
Speaker 1: the only indignity. In fact, it wasn't the first indignity

219
00:12:53,200 --> 00:12:57,120
Speaker 1: that the library suffered that would impact its relevance. Further

220
00:12:57,240 --> 00:13:00,000
Speaker 1: conflicts a couple of centuries later pretty much wiped out

221
00:13:00,160 --> 00:13:03,560
Speaker 1: whatever had been left from the previous calamities, and the

222
00:13:03,600 --> 00:13:07,079
Speaker 1: Library of Alexandria became kind of a touchstone for folks

223
00:13:07,080 --> 00:13:10,160
Speaker 1: who have stressed the importance of access to knowledge and

224
00:13:10,240 --> 00:13:13,240
Speaker 1: the protection of that knowledge, and that the consequences that

225
00:13:13,360 --> 00:13:15,920
Speaker 1: could follow from the loss of such knowledge can be

226
00:13:15,960 --> 00:13:20,200
Speaker 1: really dire. See also, like, the Middle Ages, the Dark Ages,

227
00:13:20,200 --> 00:13:24,120
Speaker 1: for example. That loss of knowledge is a really terrible thing.

228
00:13:24,520 --> 00:13:28,000
Speaker 1: So the impetus for Alexa Internet was that Kahle and

229
00:13:28,080 --> 00:13:31,760
Speaker 1: Gilliat wanted, in the words of the Web Design Museum, quote,

230
00:13:31,840 --> 00:13:35,960
Speaker 1: to develop advanced web navigation that would continually improve itself

231
00:13:36,080 --> 00:13:39,520
Speaker 1: on the basis of user-generated data, end quote, which

232
00:13:39,559 --> 00:13:42,679
Speaker 1: is a pretty advanced idea for nineteen ninety six when

233
00:13:42,720 --> 00:13:45,600
Speaker 1: the Web was still very young and the general public

234
00:13:45,679 --> 00:13:47,439
Speaker 1: was still just trying to get a grip on exactly

235
00:13:47,480 --> 00:13:51,320
Speaker 1: what the Web and by extension, the Internet were. One

236
00:13:51,360 --> 00:13:54,679
Speaker 1: of the first tools that Alexa Internet developed was a

237
00:13:54,720 --> 00:13:58,000
Speaker 1: browser toolbar. So installing this toolbar into a browser would

238
00:13:58,000 --> 00:14:01,120
Speaker 1: give the user access to a sort of crowd-powered

239
00:14:01,200 --> 00:14:04,640
Speaker 1: recommendation engine. In some ways, it's not that different from

240
00:14:04,840 --> 00:14:08,360
Speaker 1: sites like Digg and Reddit that would later rely on

241
00:14:08,440 --> 00:14:11,880
Speaker 1: the user community to actually work and to recommend links

242
00:14:11,920 --> 00:14:17,120
Speaker 1: to really interesting sites. This toolbar would recommend the sites

243
00:14:17,120 --> 00:14:20,760
Speaker 1: to users based upon how the overall community was browsing.

244
00:14:20,920 --> 00:14:24,160
Speaker 1: So the more people who were using this toolbar, the

245
00:14:24,200 --> 00:14:27,480
Speaker 1: more information was going into where they were going, and

246
00:14:27,520 --> 00:14:29,720
Speaker 1: thus you would get different recommendations. So if a lot

247
00:14:29,720 --> 00:14:32,440
Speaker 1: of people were navigating to a specific site for whatever reason,

248
00:14:32,680 --> 00:14:35,320
Speaker 1: you might get a recommendation to do the same. It

249
00:14:35,360 --> 00:14:38,160
Speaker 1: was an attempt at an organic way for folks to

250
00:14:38,240 --> 00:14:41,560
Speaker 1: suggest websites, kind of like a word of mouth campaign,

251
00:14:41,920 --> 00:14:45,920
Speaker 1: and Alexa Internet would also provide meta information about websites

252
00:14:45,960 --> 00:14:48,840
Speaker 1: to users if they wanted it. Meta information is information

253
00:14:48,920 --> 00:14:52,240
Speaker 1: about information, so this would include stuff like how many

254
00:14:52,440 --> 00:14:55,400
Speaker 1: web pages were part of an overall website, or how

255
00:14:55,440 --> 00:14:58,600
Speaker 1: many other websites were pointing back to the one you

256
00:14:58,640 --> 00:15:01,200
Speaker 1: were on, and so forth. A lot of the stuff

257
00:15:01,360 --> 00:15:04,840
Speaker 1: that Alexa Internet could tell you would reflect a specific

258
00:15:04,880 --> 00:15:07,640
Speaker 1: web page's relevance. It's the same sort of information that

259
00:15:07,640 --> 00:15:10,600
Speaker 1: search engines like Google would take into account when deciding

260
00:15:10,640 --> 00:15:14,480
Speaker 1: relevance for search results. And that meant that it didn't

261
00:15:14,480 --> 00:15:16,520
Speaker 1: take very long for Amazon to come around with an

262
00:15:16,560 --> 00:15:20,000
Speaker 1: offer to purchase Alexa Internet. I'll talk about that more,

263
00:15:20,120 --> 00:15:22,920
Speaker 1: as well as the development of the Internet Archive after

264
00:15:22,960 --> 00:15:26,360
Speaker 1: we come back from this quick break to thank our sponsors.

265
00:15:35,600 --> 00:15:40,000
Speaker 1: So Amazon in nineteen ninety nine takes a look at

266
00:15:40,080 --> 00:15:44,200
Speaker 1: Alexa Internet and says, Wow, this is pretty incredible. This

267
00:15:44,600 --> 00:15:49,480
Speaker 1: little company has created some means of checking for stuff

268
00:15:49,480 --> 00:15:53,840
Speaker 1: like relevance and metadata that could be really really useful

269
00:15:53,880 --> 00:15:57,280
Speaker 1: for us. And so Amazon made an offer that Alexa

270
00:15:57,320 --> 00:16:00,160
Speaker 1: Internet couldn't refuse: to acquire the company for the princely

271
00:16:00,240 --> 00:16:03,160
Speaker 1: sum of two hundred and fifty million dollars in

272
00:16:03,280 --> 00:16:07,680
Speaker 1: Amazon stock in May of ninety nine. So this is

273
00:16:07,880 --> 00:16:10,880
Speaker 1: a little different than the earlier deal we talked about

274
00:16:10,880 --> 00:16:14,840
Speaker 1: where AOL bought, you know, WAIS Incorporated, because they

275
00:16:14,840 --> 00:16:17,120
Speaker 1: bought it with two hundred and fifty million dollars' worth

276
00:16:17,200 --> 00:16:19,920
Speaker 1: of stock. If we just treated that like it was

277
00:16:19,960 --> 00:16:25,040
Speaker 1: a cash exchange, then if we adjust for inflation,

278
00:16:25,120 --> 00:16:28,240
Speaker 1: that's like around four hundred and sixty nine million dollars

279
00:16:28,240 --> 00:16:31,480
Speaker 1: worth of stock. But that's not really how you deal

280
00:16:31,520 --> 00:16:33,920
Speaker 1: with the value here, right. You have to think about

281
00:16:33,920 --> 00:16:36,680
Speaker 1: how much was the stock worth back in nineteen ninety

282
00:16:36,800 --> 00:16:39,600
Speaker 1: nine versus how much is the stock worth today? I

283
00:16:39,800 --> 00:16:43,480
Speaker 1: checked and I saw that in May of nineteen ninety nine,

284
00:16:43,560 --> 00:16:46,520
Speaker 1: Amazon stock was trading for around two dollars eighty nine

285
00:16:46,560 --> 00:16:49,400
Speaker 1: cents per share. These days, it's closer to one hundred

286
00:16:49,400 --> 00:16:53,840
Speaker 1: and eighty dollars per share. Plus, since that time, Amazon

287
00:16:53,920 --> 00:16:56,760
Speaker 1: had two different stock splits. There was a two to

288
00:16:56,760 --> 00:16:59,520
Speaker 1: one split in late ninety nine, and there was a

289
00:16:59,560 --> 00:17:03,240
Speaker 1: twenty to one stock split in twenty twenty two. When

290
00:17:03,240 --> 00:17:06,080
Speaker 1: you factor all that in, that two hundred and fifty

291
00:17:06,080 --> 00:17:10,840
Speaker 1: million dollars in stock ends up being a ton of wealth.

292
00:17:11,240 --> 00:17:13,760
Speaker 1: Like it's just a huge amount. It would take a

293
00:17:13,800 --> 00:17:17,040
Speaker 1: lot of calculating to get an estimate, and even then

294
00:17:17,359 --> 00:17:21,520
Speaker 1: it wouldn't really be accurate. Let's just say that deal is

295
00:17:21,560 --> 00:17:25,399
Speaker 1: worth a lot. So anyway, the important thing with the

296
00:17:25,400 --> 00:17:29,119
Speaker 1: Internet Archive is that Kahle and Gilliat, through their work

297
00:17:29,160 --> 00:17:32,359
Speaker 1: in creating tools for Alexa Internet, found themselves able to

298
00:17:32,400 --> 00:17:36,920
Speaker 1: create snapshots of the Web. So they were using Alexa

299
00:17:37,000 --> 00:17:40,560
Speaker 1: Internet to have a commercial business, and they established the

300
00:17:40,560 --> 00:17:45,480
Speaker 1: Internet Archive as a way of preserving information that had,

301
00:17:45,560 --> 00:17:48,680
Speaker 1: at some point or another found its home on the Internet.

302
00:17:48,960 --> 00:17:52,480
Speaker 1: So they were using Alexa Internet tech to crawl the

303
00:17:52,560 --> 00:17:55,080
Speaker 1: young Web in order to index everything, which is a

304
00:17:55,200 --> 00:17:58,040
Speaker 1: necessary step if you want to give people access to

305
00:17:58,119 --> 00:18:00,399
Speaker 1: the various documents posted on the web. You first have

306
00:18:00,440 --> 00:18:02,639
Speaker 1: to know what is there and where it is. To

307
00:18:02,720 --> 00:18:07,320
Speaker 1: do that, you've got to index everything. And then they said, well,

308
00:18:07,600 --> 00:18:09,760
Speaker 1: now that we are able to index this, we could

309
00:18:09,800 --> 00:18:14,000
Speaker 1: actually download these little snapshots and keep them. And according

310
00:18:14,000 --> 00:18:18,560
Speaker 1: to the Internet Archive, that would be important because the

311
00:18:18,640 --> 00:18:23,119
Speaker 1: average lifespan for a new web page was not very long,

312
00:18:23,400 --> 00:18:27,320
Speaker 1: So contrary to our belief that once something is posted

313
00:18:27,359 --> 00:18:30,480
Speaker 1: to the Internet, it's there forever, the archive found that

314
00:18:30,520 --> 00:18:34,560
Speaker 1: on average, new web pages stuck around for about seventy

315
00:18:34,680 --> 00:18:38,679
Speaker 1: seven days, which means it's less than three months, and

316
00:18:38,720 --> 00:18:42,639
Speaker 1: then poof, they would disappear. Like, maybe they would change drastically,

317
00:18:42,680 --> 00:18:46,679
Speaker 1: maybe they would just go away. Now, imagine that you

318
00:18:46,720 --> 00:18:49,800
Speaker 1: were to walk into a brick and mortar library, but

319
00:18:49,880 --> 00:18:52,000
Speaker 1: then you found out that on average the books in

320
00:18:52,040 --> 00:18:54,639
Speaker 1: that library would only stick around for three months before

321
00:18:54,680 --> 00:18:57,720
Speaker 1: being lost forever. And think of all the knowledge that

322
00:18:57,760 --> 00:19:01,200
Speaker 1: would disappear on a regular and ongoing basis. It

323
00:19:01,200 --> 00:19:03,840
Speaker 1: would be impossible to calculate the impact of that kind

324
00:19:03,840 --> 00:19:06,200
Speaker 1: of reality. It would be like losing the Library of

325
00:19:06,240 --> 00:19:10,679
Speaker 1: Alexandria regularly every three months. So Kahle had come to

326
00:19:10,720 --> 00:19:14,160
Speaker 1: the conclusion that knowledge should be preserved and made available

327
00:19:14,200 --> 00:19:17,399
Speaker 1: for posterity. This is similar to an idea that was

328
00:19:17,440 --> 00:19:20,880
Speaker 1: proposed by Stewart Brand back in the nineteen eighties. It's

329
00:19:20,920 --> 00:19:24,560
Speaker 1: a complicated idea that typically gets boiled down to the

330
00:19:24,600 --> 00:19:29,679
Speaker 1: saying information wants to be free. That's actually an oversimplification

331
00:19:29,720 --> 00:19:33,800
Speaker 1: of what Brand was really communicating. But his point was

332
00:19:33,800 --> 00:19:37,040
Speaker 1: that information's value is kind of like a paradox. The

333
00:19:37,119 --> 00:19:41,440
Speaker 1: information could be incredibly valuable, right, it could be absolutely critical,

334
00:19:41,480 --> 00:19:45,439
Speaker 1: and therefore it could be expensive, but the cost of

335
00:19:45,480 --> 00:19:50,040
Speaker 1: distributing information was consistently declining. It was getting easier and

336
00:19:50,200 --> 00:19:54,120
Speaker 1: cheaper to share information, and the benefits of making information

337
00:19:54,240 --> 00:19:59,560
Speaker 1: accessible are typically pretty tremendous. But information is only accessible

338
00:20:00,119 --> 00:20:03,560
Speaker 1: if someone is able to hold onto that info. Otherwise

339
00:20:03,560 --> 00:20:06,520
Speaker 1: it's lost, right? The Internet was such a volatile thing

340
00:20:06,560 --> 00:20:09,119
Speaker 1: that there was no guarantee that what you saw today

341
00:20:09,520 --> 00:20:13,000
Speaker 1: would be available tomorrow. In the days before the dynamic web,

342
00:20:13,680 --> 00:20:16,639
Speaker 1: it wasn't really unusual for someone to establish a web page,

343
00:20:16,880 --> 00:20:20,159
Speaker 1: to publish that page, and then later on to wipe

344
00:20:20,160 --> 00:20:24,480
Speaker 1: the slate clean or you know, otherwise alter vast portions

345
00:20:24,480 --> 00:20:27,040
Speaker 1: of that page in order to use that same web

346
00:20:27,400 --> 00:20:31,400
Speaker 1: landscape to host a totally different document. So the old

347
00:20:31,440 --> 00:20:34,720
Speaker 1: stuff would just disappear. And so Kahle and Gilliat created

348
00:20:35,000 --> 00:20:40,119
Speaker 1: the Internet Archive, a nonprofit organization dedicated to the archival

349
00:20:40,440 --> 00:20:44,399
Speaker 1: of information across the Internet. And I think most people

350
00:20:44,800 --> 00:20:49,040
Speaker 1: are familiar with it from the Wayback Machine, but

351
00:20:49,080 --> 00:20:52,240
Speaker 1: that's just one part of what the Internet Archive does.

352
00:20:52,600 --> 00:20:55,199
Speaker 1: As stated in the Library of Congress, the mission of

353
00:20:55,240 --> 00:20:59,480
Speaker 1: the Internet Archive was quote offering permanent access for researchers,

354
00:20:59,520 --> 00:21:03,040
Speaker 1: historians, and scholars to historical collections that exist in

355
00:21:03,119 --> 00:21:07,040
Speaker 1: digital format end quote. Kahle and Gilliat founded the Internet

356
00:21:07,119 --> 00:21:09,600
Speaker 1: Archive the same year they founded Alexa Internet. So that's

357
00:21:09,720 --> 00:21:14,440
Speaker 1: nineteen ninety six. And it wasn't easy. And why is that? Well,

358
00:21:14,880 --> 00:21:17,280
Speaker 1: you got to think about the challenge you face if

359
00:21:17,320 --> 00:21:20,919
Speaker 1: you want to archive everything on the Internet, or at

360
00:21:21,000 --> 00:21:24,480
Speaker 1: least everything that you're allowed to archive on the Internet.

361
00:21:24,600 --> 00:21:26,600
Speaker 1: We'll come back to that a couple of times. So,

362
00:21:26,640 --> 00:21:28,240
Speaker 1: for one thing, you need to create a way to

363
00:21:28,320 --> 00:21:31,920
Speaker 1: capture the content of a web page and to preserve

364
00:21:31,960 --> 00:21:35,119
Speaker 1: that for posterity. And you need a way for people

365
00:21:35,280 --> 00:21:39,560
Speaker 1: to access those archived web pages and to navigate them.

366
00:21:39,800 --> 00:21:43,639
Speaker 1: So Alexa Internet would end up developing these technologies and

367
00:21:43,680 --> 00:21:47,320
Speaker 1: commercializing them in various ways, and the Internet Archive was

368
00:21:47,359 --> 00:21:51,119
Speaker 1: made possible through these tools. So you could think of

369
00:21:51,160 --> 00:21:56,000
Speaker 1: Alexa Internet as being the funding machine for Internet Archive

370
00:21:56,119 --> 00:21:58,600
Speaker 1: in the beginning, at least as far as the tools

371
00:21:58,680 --> 00:22:02,080
Speaker 1: Internet Archive would use in order to achieve its mission. Now,

372
00:22:02,119 --> 00:22:05,720
Speaker 1: on the capturing front, Alexa Internet created a web crawler.

373
00:22:06,000 --> 00:22:10,760
Speaker 1: So for applications like web search engines, primarily web search engines,

374
00:22:11,040 --> 00:22:14,919
Speaker 1: web crawlers are the soldiers that they send out. A

375
00:22:14,960 --> 00:22:19,080
Speaker 1: web crawler's job is to index content across the Internet

376
00:22:19,160 --> 00:22:22,119
Speaker 1: and to capture information about what the various web pages

377
00:22:22,160 --> 00:22:26,199
Speaker 1: on the Internet are actually about. It's complicated, right. You

378
00:22:26,240 --> 00:22:29,520
Speaker 1: could just have a directory of web pages that's based

379
00:22:29,520 --> 00:22:32,119
Speaker 1: off the title of the web pages, but title and

380
00:22:32,240 --> 00:22:36,280
Speaker 1: content are not always in alignment. So web crawlers are

381
00:22:36,320 --> 00:22:40,399
Speaker 1: all about following the various branching pathways across the web.

382
00:22:40,480 --> 00:22:43,520
Speaker 1: They crawl through the web, in other words, indexing every

383
00:22:43,640 --> 00:22:47,080
Speaker 1: page as they do. So. Not everyone, however, wants their

384
00:22:47,080 --> 00:22:50,760
Speaker 1: web page indexed. So you can actually include some HTML

385
00:22:50,880 --> 00:22:54,840
Speaker 1: language in your web page that indicates that it's off

386
00:22:54,880 --> 00:22:58,760
Speaker 1: limits for indexing, and polite web crawlers, such as the

387
00:22:58,760 --> 00:23:03,000
Speaker 1: ones that Alexa Internet was using, will honor those instructions

388
00:23:03,040 --> 00:23:06,480
Speaker 1: and will not index that page. But other pages

389
00:23:06,760 --> 00:23:11,639
Speaker 1: that lack this specific instruction of hey, don't index this,

390
00:23:12,359 --> 00:23:15,920
Speaker 1: they're fair game. I like to think of web crawlers

391
00:23:16,000 --> 00:23:18,440
Speaker 1: kind of like Doctor Strange from the Marvel Universe the

392
00:23:18,560 --> 00:23:21,399
Speaker 1: Cinematic Universe in particular, the one where he uses his

393
00:23:21,520 --> 00:23:25,760
Speaker 1: time manipulation abilities to see where all the different possible

394
00:23:26,000 --> 00:23:29,800
Speaker 1: pathways can lead. The web crawlers do that across

395
00:23:29,880 --> 00:23:32,440
Speaker 1: the web. They explore all the nooks and crannies. They

396
00:23:32,480 --> 00:23:35,560
Speaker 1: follow each link, even the ones that no one

397
00:23:35,640 --> 00:23:38,520
Speaker 1: ever clicks on, they follow those too. And you know,

398
00:23:38,640 --> 00:23:41,359
Speaker 1: hats off to web crawlers for doing that to build

399
00:23:41,359 --> 00:23:44,240
Speaker 1: out these indices, because without it, web search wouldn't work,

400
00:23:44,560 --> 00:23:49,919
Speaker 1: and Alexa Internet wouldn't have been a thing anyway. Alexa

401
00:23:49,960 --> 00:23:53,520
Speaker 1: Internet and by extension, the Internet Archive used several different

402
00:23:53,520 --> 00:23:56,240
Speaker 1: web crawlers over the years, but they all basically do

403
00:23:56,359 --> 00:23:59,119
Speaker 1: the same thing, or they, you know, more accurately.

404
00:23:59,160 --> 00:24:02,800
Speaker 1: They all aimed to achieve the same results. So the

405
00:24:02,840 --> 00:24:06,280
Speaker 1: crawler starts with seed URLs. This is like the starting

406
00:24:06,320 --> 00:24:08,879
Speaker 1: point where you let them go, and then they follow

407
00:24:08,880 --> 00:24:11,920
Speaker 1: each link and they download documents to the archive's servers.

408
00:24:12,119 --> 00:24:15,640
Speaker 1: The crawlers also reference the links to ensure that they're

409
00:24:15,640 --> 00:24:20,119
Speaker 1: not double dipping on a specific crawl. So if you

410
00:24:20,160 --> 00:24:22,600
Speaker 1: have a ton of different sites that are all linking

411
00:24:22,680 --> 00:24:25,240
Speaker 1: to the same document, like let's say that someone has

412
00:24:25,440 --> 00:24:30,160
Speaker 1: published something, and hundreds of other resources on the internet

413
00:24:30,840 --> 00:24:34,960
Speaker 1: reference that published document. Well, that means there are all these

414
00:24:34,960 --> 00:24:38,360
Speaker 1: different pathways that lead to the same destination, right, and

415
00:24:38,720 --> 00:24:42,680
Speaker 1: it would be somewhat wasteful to capture this exact same

416
00:24:42,760 --> 00:24:48,160
Speaker 1: document multiple times during the same crawl, so there's cross

417
00:24:48,280 --> 00:24:51,400
Speaker 1: referencing that happens in order to prevent that from occurring.
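
The crawl described here (start from seed URLs, follow every link, and cross-reference so the same document is not captured twice in one crawl) can be sketched roughly as follows. This is a minimal illustration, not Alexa Internet's or the Internet Archive's actual crawler: the `FAKE_WEB` dictionary, the example URLs, and the `crawl` function are all hypothetical stand-ins, and a real crawler would fetch pages over the network.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would download pages over HTTP; this keeps the sketch runnable.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://doc.example/paper"],
    "http://b.example/": ["http://doc.example/paper", "http://c.example/"],
    "http://c.example/": ["http://doc.example/paper", "http://a.example/"],
    "http://doc.example/paper": [],
}


def crawl(seeds, web):
    """Breadth-first crawl from the seed URLs, capturing each URL at most once."""
    queue = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep their order
    seen = set(queue)                    # cross-reference of URLs already queued
    captured = []                        # order in which documents were "downloaded"
    while queue:
        url = queue.popleft()
        if url not in web:               # dead link: nothing to capture
            continue
        captured.append(url)
        for link in web[url]:
            if link not in seen:         # many pages may link to the same document
                seen.add(link)
                queue.append(link)
    return captured


snapshot = crawl(["http://a.example/"], FAKE_WEB)
```

Three different pages link to `http://doc.example/paper`, but the `seen` set ensures it is captured only once during the crawl.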

418
00:24:52,000 --> 00:24:55,159
Speaker 1: This process does work, but it also has limitations. So

419
00:24:55,240 --> 00:24:58,600
Speaker 1: for one thing, these crawls they do create snapshots of

420
00:24:58,600 --> 00:25:01,640
Speaker 1: the web in intervals. So if you use the Wayback Machine,

421
00:25:02,000 --> 00:25:04,359
Speaker 1: we'll talk more about that in a second. You'll see

422
00:25:04,400 --> 00:25:06,879
Speaker 1: that the history of a web page consists of a

423
00:25:07,040 --> 00:25:10,919
Speaker 1: series of dates from which the Internet Archive first received

424
00:25:10,960 --> 00:25:13,720
Speaker 1: a snapshot of that page, and it leads all the

425
00:25:13,760 --> 00:25:17,000
Speaker 1: way up to the most recent reference of that page,

426
00:25:17,040 --> 00:25:20,560
Speaker 1: the most recent snapshot. The various dates in the Wayback

427
00:25:20,640 --> 00:25:24,359
Speaker 1: machine are not necessarily relevant to any major changes that

428
00:25:24,480 --> 00:25:27,159
Speaker 1: happened on the web page itself. This is just when

429
00:25:27,640 --> 00:25:31,280
Speaker 1: the web crawlers went to that particular web page. So

430
00:25:31,880 --> 00:25:35,480
Speaker 1: it may be immediately after a massive change has been implemented,

431
00:25:35,520 --> 00:25:38,119
Speaker 1: it may be well after. In fact, there might be

432
00:25:38,240 --> 00:25:42,600
Speaker 1: a point where between web crawler visits a web page has

433
00:25:42,720 --> 00:25:45,520
Speaker 1: changed a couple of times. Well, that means that the

434
00:25:45,520 --> 00:25:48,320
Speaker 1: ones that are happening in between those changes aren't going

435
00:25:48,359 --> 00:25:51,200
Speaker 1: to be captured. It's just whatever was there the first

436
00:25:51,200 --> 00:25:53,760
Speaker 1: time the web crawler came through, and whatever was there

437
00:25:53,800 --> 00:25:57,200
Speaker 1: the next time the web crawler came through. So an interesting

438
00:25:57,240 --> 00:25:59,359
Speaker 1: thing is that if a particular page does have a

439
00:25:59,480 --> 00:26:02,960
Speaker 1: ton of other links pointing to it, that page is

440
00:26:03,000 --> 00:26:06,880
Speaker 1: more likely to have very frequent snapshots throughout its history,

441
00:26:07,280 --> 00:26:12,280
Speaker 1: because again, through subsequent crawls, there are various routes that

442
00:26:12,359 --> 00:26:15,320
Speaker 1: take web crawlers through that web page, so they're more

443
00:26:15,480 --> 00:26:18,919
Speaker 1: likely to capture a snapshot of it. For pages that

444
00:26:18,960 --> 00:26:21,639
Speaker 1: have fewer links pointing to them, maybe there aren't that

445
00:26:21,720 --> 00:26:25,520
Speaker 1: many other web pages out there that cite this particular page,

446
00:26:25,720 --> 00:26:28,919
Speaker 1: they're more likely to have sporadic updates throughout their history.

447
00:26:28,960 --> 00:26:31,679
Speaker 1: You might pull up a page in the Wayback Machine

448
00:26:31,680 --> 00:26:36,000
Speaker 1: and see that there's only maybe half a dozen captures

449
00:26:36,160 --> 00:26:39,840
Speaker 1: of that particular page, and that means that there could

450
00:26:39,840 --> 00:26:42,800
Speaker 1: be a lot of changes that were missed in between visits.
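
To make the point about missed changes concrete, here is a small sketch, assuming a page whose versions are published on known days and a crawler that visits on known days. The function name `captured_versions` and the day numbers are made up for illustration. A crawl only ever sees whichever version is live at the moment of the visit.

```python
from bisect import bisect_right


def captured_versions(change_days, crawl_days):
    """Return the indices of the page versions a crawler would actually see.

    change_days: sorted days on which a new version of the page went live
                 (version i is live from change_days[i] until the next change).
    crawl_days:  days on which the crawler visited the page.
    """
    seen = set()
    for day in crawl_days:
        # Latest version published on or before the day of the visit.
        idx = bisect_right(change_days, day) - 1
        if idx >= 0:
            seen.add(idx)
    return seen


changes = [0, 10, 20, 30]   # four versions of the page
crawls = [5, 35]            # only two crawler visits
captured = captured_versions(changes, crawls)
missed = sorted(set(range(len(changes))) - captured)
```

With these made-up numbers, the visits on day 5 and day 35 capture the first and last versions, and the two versions that went live on days 10 and 20 never make it into the archive.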

451
00:26:43,160 --> 00:26:47,040
Speaker 1: So not everything gets captured in the Internet Archive. I

452
00:26:47,080 --> 00:26:51,080
Speaker 1: think that some people work under the mistaken presumption that

453
00:26:51,720 --> 00:26:55,200
Speaker 1: anything that was ever published to the web is captured

454
00:26:55,280 --> 00:26:58,439
Speaker 1: and archived there. That's not the case. It's whatever was

455
00:26:58,480 --> 00:27:00,920
Speaker 1: there when the web crawlers came through. So

456
00:27:00,960 --> 00:27:03,359
Speaker 1: even the Internet Archive is not a perfect record of

457
00:27:03,440 --> 00:27:07,000
Speaker 1: everything that's ever happened on the web. Other elements, like

458
00:27:07,040 --> 00:27:09,639
Speaker 1: I said, could also be lost to time due to

459
00:27:09,680 --> 00:27:13,200
Speaker 1: the complexity of web navigation. For example, when web

460
00:27:13,240 --> 00:27:18,280
Speaker 1: designers started to incorporate things like Flash, which really is

461
00:27:18,320 --> 00:27:20,600
Speaker 1: no longer a thing but it was for a while,

462
00:27:20,880 --> 00:27:24,240
Speaker 1: or JavaScript, then the web crawlers that were being used

463
00:27:24,359 --> 00:27:26,880
Speaker 1: to index the web, a lot of them just couldn't

464
00:27:27,359 --> 00:27:30,879
Speaker 1: navigate these types of tools that were made through Flash

465
00:27:30,920 --> 00:27:34,840
Speaker 1: or JavaScript. So while human users could, and they could,

466
00:27:35,160 --> 00:27:39,680
Speaker 1: you know, interact with interfaces that had these tools created

467
00:27:39,720 --> 00:27:43,320
Speaker 1: through these various methods, web crawlers couldn't. And that meant

468
00:27:43,320 --> 00:27:46,680
Speaker 1: that if a website used like tools that were made

469
00:27:46,720 --> 00:27:50,800
Speaker 1: in JavaScript to act as the interface, the web crawler

470
00:27:50,880 --> 00:27:54,000
Speaker 1: might only be able to index the homepage, but not

471
00:27:54,080 --> 00:27:57,320
Speaker 1: any of the other links branching off from the homepage

472
00:27:57,359 --> 00:28:01,280
Speaker 1: because it couldn't navigate that same interface. So there's a

473
00:28:01,280 --> 00:28:04,199
Speaker 1: lot of stuff from that era that's lost to the

474
00:28:04,240 --> 00:28:07,320
Speaker 1: Internet Archive as well, simply because the crawlers just could

475
00:28:07,359 --> 00:28:11,560
Speaker 1: not navigate those pages. They were never captured. And like

476
00:28:11,600 --> 00:28:15,080
Speaker 1: I said, if you happen to have the instruction, the

477
00:28:15,200 --> 00:28:18,840
Speaker 1: HTML instruction not to index the site, well then that's

478
00:28:18,880 --> 00:28:21,119
Speaker 1: not going to be there either. Now let's move on

479
00:28:21,240 --> 00:28:25,160
Speaker 1: to another challenge, which is the storing of these files.

480
00:28:25,520 --> 00:28:29,960
Speaker 1: Indexing everything was one thing. How do you store everything

481
00:28:30,000 --> 00:28:32,960
Speaker 1: that can be indexed on the web in an archive?

482
00:28:33,880 --> 00:28:36,800
Speaker 1: That's what we're going to come back and explore after

483
00:28:36,840 --> 00:28:49,840
Speaker 1: we take another quick break to thank our sponsors. Okay,

484
00:28:50,360 --> 00:28:54,160
Speaker 1: so the Internet Archive, how do you store all the

485
00:28:54,200 --> 00:28:57,040
Speaker 1: information that you find across the web. Well, the big

486
00:28:57,080 --> 00:29:00,600
Speaker 1: one for web pages was that you had to figure

487
00:29:00,600 --> 00:29:03,840
Speaker 1: out where do you store and how do you organize

488
00:29:03,840 --> 00:29:06,640
Speaker 1: snapshots of the web so that one you have a

489
00:29:06,680 --> 00:29:09,320
Speaker 1: record of them, and two you can find what you're

490
00:29:09,360 --> 00:29:12,720
Speaker 1: looking for. You can navigate to the specific instance that

491
00:29:12,760 --> 00:29:16,000
Speaker 1: you're looking for. Keep in mind again, the archive's not

492
00:29:16,040 --> 00:29:18,800
Speaker 1: capturing everything. As I said before the break, there's a

493
00:29:18,840 --> 00:29:21,440
Speaker 1: lot of stuff that web crawlers could not access for

494
00:29:21,480 --> 00:29:25,000
Speaker 1: one reason or another. Those things would be either off

495
00:29:25,040 --> 00:29:28,080
Speaker 1: limits or inaccessible and thus would not be in the archive.

496
00:29:28,400 --> 00:29:31,880
Speaker 1: But everything else was still fair game. So to store

497
00:29:31,920 --> 00:29:35,880
Speaker 1: and organize everything, Alexa Internet created a new file format

498
00:29:36,000 --> 00:29:41,680
Speaker 1: called an ARC file, A-R-C. ARC files contain information about

499
00:29:41,720 --> 00:29:45,840
Speaker 1: all the stuff that's inside them, the metadata of the Internet.

500
00:29:46,000 --> 00:29:50,240
Speaker 1: So again, metadata is data about data. It makes the

501
00:29:50,280 --> 00:29:53,880
Speaker 1: small files inside the larger ARC files all self identifying,

502
00:29:54,000 --> 00:29:56,480
Speaker 1: so there's no need to actually build out an index.
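
As a rough illustration of what "self-identifying" means, the sketch below stores each document with a one-line header giving its URL, capture time, and size, so a reader can walk the container from start to finish without a separate index. This is a simplified stand-in and not the actual ARC layout: the header fields, the `write_record` and `read_records` names, and the example URLs and timestamps are assumptions made for the example.

```python
import io


def write_record(out, url, timestamp, body):
    """Append one self-describing record: a header line, the body, a newline."""
    header = f"{url} {timestamp} {len(body)}\n".encode("utf-8")
    out.write(header)
    out.write(body)
    out.write(b"\n")


def read_records(stream):
    """Yield (url, timestamp, body) by reading headers in order, no index needed."""
    while True:
        header = stream.readline()
        if not header:                      # end of the container
            break
        url, timestamp, length = header.decode("utf-8").rstrip("\n").rsplit(" ", 2)
        body = stream.read(int(length))     # the header says how many bytes to read
        stream.readline()                   # consume the newline after the body
        yield url, timestamp, body


container = io.BytesIO()
write_record(container, "http://a.example/", "19961101120000", b"<html>first</html>")
write_record(container, "http://b.example/", "19961101120500", b"<html>second\npage</html>")
container.seek(0)
records = list(read_records(container))
```

Because every record announces its own length, the reader can skip from one document to the next even when a body contains newlines of its own.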

503
00:29:56,760 --> 00:30:00,480
Speaker 1: The self identifying information includes stuff like the URL for

504
00:30:00,600 --> 00:30:03,640
Speaker 1: the file, like what the URL for that particular document is,

505
00:30:03,880 --> 00:30:06,680
Speaker 1: how big the document is, when it was retrieved, and

506
00:30:06,880 --> 00:30:10,160
Speaker 1: other stuff like that. Each ARC file would have a

507
00:30:10,200 --> 00:30:13,120
Speaker 1: capacity of around one hundred megabytes, and it was possible

508
00:30:13,160 --> 00:30:15,840
Speaker 1: for a single website to span multiple ARC files. I mean,

509
00:30:15,880 --> 00:30:18,120
Speaker 1: there's some big websites out there that have been around

510
00:30:18,120 --> 00:30:22,400
Speaker 1: for a long time, so yeah, sometimes a single ARC

511
00:30:22,480 --> 00:30:26,040
Speaker 1: file would just be a portion of that website. At first,

512
00:30:26,320 --> 00:30:30,160
Speaker 1: the Internet Archive stored all this information on magnetic tape,

513
00:30:30,440 --> 00:30:34,360
Speaker 1: So you would do this indexing of the web, all

514
00:30:34,400 --> 00:30:37,320
Speaker 1: these snapshots, and you would save it to magnetic tape.

515
00:30:37,400 --> 00:30:40,200
Speaker 1: I remember I used to work for a company, a

516
00:30:40,280 --> 00:30:44,120
Speaker 1: consulting firm that had magnetic tape backups. So it was

517
00:30:44,200 --> 00:30:48,040
Speaker 1: my job, one of my jobs to occasionally back up

518
00:30:48,120 --> 00:30:51,520
Speaker 1: all the data on our network to tape, and I

519
00:30:51,560 --> 00:30:54,720
Speaker 1: would have to swap tapes out and label them and

520
00:30:54,760 --> 00:30:58,240
Speaker 1: everything and archive them properly. The Internet Archive worked under

521
00:30:58,280 --> 00:31:01,560
Speaker 1: the same idea. It would capture a snapshot of all

522
00:31:01,680 --> 00:31:05,720
Speaker 1: the files across the web, save them to tape, and

523
00:31:06,280 --> 00:31:09,160
Speaker 1: that was how the Internet Archive kept track of things

524
00:31:09,200 --> 00:31:14,440
Speaker 1: for about three years. But eventually activity on the Internet

525
00:31:14,800 --> 00:31:16,760
Speaker 1: was such that that was not going to do it.

526
00:31:16,840 --> 00:31:19,640
Speaker 1: There were too many users who wanted to be able

527
00:31:19,880 --> 00:31:23,720
Speaker 1: to access things that were stored or saved within the

528
00:31:23,760 --> 00:31:27,680
Speaker 1: Internet Archive, and this method just couldn't keep up with

529
00:31:27,800 --> 00:31:30,560
Speaker 1: demand. And necessity, as we all know, is the mother

530
00:31:30,640 --> 00:31:33,800
Speaker 1: of invention. So the Internet Archive needed an alternative way

531
00:31:33,840 --> 00:31:37,080
Speaker 1: to store these snapshots. And of course, the Web was

532
00:31:37,640 --> 00:31:41,080
Speaker 1: really growing dramatically, which is putting it lightly, and there

533
00:31:41,120 --> 00:31:43,320
Speaker 1: was a real need to step things up considerably. So

534
00:31:43,360 --> 00:31:46,600
Speaker 1: to that end, the staff at Internet Archive developed a

535
00:31:46,640 --> 00:31:52,080
Speaker 1: storage system they called the PetaBox, and it was

536
00:31:52,120 --> 00:31:55,600
Speaker 1: called the PetaBox because it could house a petabyte of information.

537
00:31:55,960 --> 00:32:00,120
Speaker 1: A petabyte, in case you're curious, is a million gigabytes. Now,

538
00:32:00,160 --> 00:32:02,719
Speaker 1: the most recent data I have about the PetaBox storage

539
00:32:02,720 --> 00:32:05,920
Speaker 1: system actually comes from December twenty twenty one, so it's

540
00:32:05,960 --> 00:32:08,400
Speaker 1: a few years out of date. But at that time,

541
00:32:08,600 --> 00:32:11,760
Speaker 1: the Internet Archive was using two hundred and twelve petabytes

542
00:32:11,760 --> 00:32:15,160
Speaker 1: of storage, which is a lot. That wasn't all the

543
00:32:15,200 --> 00:32:20,000
Speaker 1: Wayback Machine, however. Only around fifty seven petabytes of that

544
00:32:20,440 --> 00:32:23,600
Speaker 1: was for the Wayback Machine. The rest was for other

545
00:32:23,680 --> 00:32:27,640
Speaker 1: things like archiving various forms of digital media as well

546
00:32:27,680 --> 00:32:32,920
Speaker 1: as what Internet Archive references as quote unquote unique data. Anyway,

547
00:32:33,320 --> 00:32:36,640
Speaker 1: the page on Internet Archive's site says that the data

548
00:32:36,680 --> 00:32:40,240
Speaker 1: centers (there are four of them) that house the PetaBox

549
00:32:40,280 --> 00:32:44,280
Speaker 1: storage system, don't use air conditioning, which helps keep electric

550
00:32:44,320 --> 00:32:48,440
Speaker 1: bills down. They actually let the heat from the data

551
00:32:48,480 --> 00:32:52,440
Speaker 1: storage devices provide heating for the buildings that they're stored

552
00:32:52,480 --> 00:32:56,200
Speaker 1: in. And, you know, this is all part of

553
00:32:56,240 --> 00:33:00,960
Speaker 1: a strategy to keep things at low cost but high

554
00:33:01,040 --> 00:33:05,440
Speaker 1: usability and high efficiency. So those are really the big requirements

555
00:33:05,480 --> 00:33:08,480
Speaker 1: for the PetaBox system. It has to be efficient. It

556
00:33:08,520 --> 00:33:12,520
Speaker 1: cannot require too much power to operate any single PetaBox.

557
00:33:12,760 --> 00:33:17,040
Speaker 1: Another requirement is that each rack of hard drive storage

558
00:33:17,080 --> 00:33:19,320
Speaker 1: has to hold a ton of hard drives. We're talking

559
00:33:19,440 --> 00:33:23,160
Speaker 1: like one hundred plus terabytes worth of hard drive space.

560
00:33:23,600 --> 00:33:26,920
Speaker 1: Another requirement is that to serve as an administrator, it

561
00:33:27,000 --> 00:33:30,640
Speaker 1: needs to be easy. Like, it can't be complicated to

562
00:33:30,880 --> 00:33:37,440
Speaker 1: administrate this storage system, and according to Internet Archive, the

563
00:33:37,480 --> 00:33:40,840
Speaker 1: structure of this is such that you need about one

564
00:33:41,000 --> 00:33:44,640
Speaker 1: administrator for every petabyte worth of data, so you know,

565
00:33:44,720 --> 00:33:47,840
Speaker 1: that's like two hundred administrators. Essentially, the whole goal was

566
00:33:47,880 --> 00:33:52,480
Speaker 1: to create systems that were relatively inexpensive, relatively efficient, and

567
00:33:52,640 --> 00:33:56,160
Speaker 1: relatively easy to use. At least from an administrative perspective.

568
00:33:56,480 --> 00:33:59,640
Speaker 1: That's a really tall order. It's hard to meet all those requirements,

569
00:34:00,560 --> 00:34:03,360
Speaker 1: but the folks at Internet Archive made it happen, and

570
00:34:03,480 --> 00:34:07,480
Speaker 1: it was such a useful approach to storage and to

571
00:34:07,680 --> 00:34:10,719
Speaker 1: being able to organize the files within storage so that

572
00:34:10,800 --> 00:34:14,160
Speaker 1: you didn't have to build out indices that ultimately Internet

573
00:34:14,280 --> 00:34:21,160
Speaker 1: Archive would deploy this same strategy for other organizations and institutions. Okay,

574
00:34:21,239 --> 00:34:26,040
Speaker 1: but that's all about, you know, collecting and storing all

575
00:34:26,080 --> 00:34:30,640
Speaker 1: the information across the Internet. How do you access it?

576
00:34:30,920 --> 00:34:33,440
Speaker 1: How, as a user, how, as a researcher, are you

577
00:34:33,520 --> 00:34:39,320
Speaker 1: able to tap into this? Because again, unless accessibility is easy,

578
00:34:39,960 --> 00:34:42,440
Speaker 1: then there's not much point to doing this. You're just

579
00:34:42,480 --> 00:34:46,279
Speaker 1: making a record that nobody can reference. Well, I would

580
00:34:46,360 --> 00:34:51,680
Speaker 1: argue the most famous of the ways to access information

581
00:34:51,800 --> 00:34:54,879
Speaker 1: contained within the Internet Archive is the Wayback Machine, which

582
00:34:54,920 --> 00:34:58,960
Speaker 1: is specifically for web pages. The Internet Archive first introduced

583
00:34:59,000 --> 00:35:02,279
Speaker 1: the Wayback Machine in two thousand and one, and the

584
00:35:02,320 --> 00:35:05,160
Speaker 1: way it works is pretty simple. There's a little it's

585
00:35:05,239 --> 00:35:07,520
Speaker 1: kind of like a search bar, but it's a URL bar.

586
00:35:07,680 --> 00:35:10,520
Speaker 1: You put in a URL for the web page that

587
00:35:10,560 --> 00:35:13,799
Speaker 1: you're interested in, and the Wayback Machine pulls up the

588
00:35:13,840 --> 00:35:17,040
Speaker 1: snapshots that are contained within the archive if there are

589
00:35:17,080 --> 00:35:20,120
Speaker 1: any snapshots. As I mentioned earlier, not everything is in there,

590
00:35:20,200 --> 00:35:22,440
Speaker 1: but if it is in there, you will see options

591
00:35:22,440 --> 00:35:25,600
Speaker 1: available to you to look at the page at different

592
00:35:25,640 --> 00:35:28,239
Speaker 1: points in history. One thing I like to do is

593
00:35:28,320 --> 00:35:31,920
Speaker 1: look back at how famous web pages have changed in

594
00:35:32,000 --> 00:35:34,680
Speaker 1: their design over the years. If you put in something

595
00:35:34,719 --> 00:35:38,360
Speaker 1: like really big like CNN dot com, you can see

596
00:35:38,360 --> 00:35:41,359
Speaker 1: how the look and interface of that site has transitioned

597
00:35:41,640 --> 00:35:44,920
Speaker 1: during different eras across the web. I also used to

598
00:35:44,960 --> 00:35:47,920
Speaker 1: do this with the old website I worked for, HowStuffWorks

599
00:35:47,960 --> 00:35:50,560
Speaker 1: dot com. I mean that's where tech stuff gets the

600
00:35:50,640 --> 00:35:53,880
Speaker 1: stuff in its name, from HowStuffWorks dot com. I

601
00:35:54,000 --> 00:35:57,160
Speaker 1: like using the Wayback Machine to look at what the

602
00:35:57,200 --> 00:35:59,719
Speaker 1: site looked like when I first joined, which was

603
00:36:00,040 --> 00:36:02,400
Speaker 1: in February two thousand and seven, in case you're curious.

604
00:36:02,680 --> 00:36:07,000
Speaker 1: It looks entirely different now than how it looked back then,

605
00:36:07,200 --> 00:36:09,360
Speaker 1: and through the Wayback Machine you can see what it

606
00:36:09,360 --> 00:36:12,400
Speaker 1: looked like back then. Also, these days, the Wayback Machine

607
00:36:12,440 --> 00:36:13,920
Speaker 1: is the only way I can see some of the

608
00:36:14,000 --> 00:36:17,840
Speaker 1: articles I wrote for that site, because the articles have

609
00:36:17,960 --> 00:36:23,040
Speaker 1: been either deleted or, more likely, rewritten over time. Now,

610
00:36:23,040 --> 00:36:24,839
Speaker 1: to be fair to HowStuffWorks, a lot of

611
00:36:24,840 --> 00:36:28,000
Speaker 1: my writing was in the computers and electronics sections, and

612
00:36:28,120 --> 00:36:32,520
Speaker 1: obviously things change in those fields very quickly, and something

613
00:36:32,560 --> 00:36:37,320
Speaker 1: that was relevant fifteen years ago is definitely not relevant today.

614
00:36:37,880 --> 00:36:41,040
Speaker 1: So you have to replace old stuff on a regular basis.

615
00:36:41,160 --> 00:36:43,319
Speaker 1: But it is kind of sad that a lot of

616
00:36:43,320 --> 00:36:45,360
Speaker 1: my work, a lot of my work for the first

617
00:36:45,719 --> 00:36:49,040
Speaker 1: you know, ten years of my career doing this kind

618
00:36:49,040 --> 00:36:52,560
Speaker 1: of stuff, is not accessible unless you use something like

619
00:36:52,600 --> 00:36:55,440
Speaker 1: the Wayback Machine. Now, one super neat thing about the

620
00:36:55,440 --> 00:36:58,680
Speaker 1: Wayback Machine is that you can still follow links that

621
00:36:58,719 --> 00:37:02,600
Speaker 1: are on pages. If the archive also has those linked

622
00:37:02,640 --> 00:37:05,600
Speaker 1: assets, then you're going to be

623
00:37:05,600 --> 00:37:08,120
Speaker 1: shown a record, and the record will be one that

624
00:37:08,160 --> 00:37:11,719
Speaker 1: was captured closest in time with the first page that

625
00:37:11,800 --> 00:37:15,319
Speaker 1: you were originally on. This sounds complicated. Let me give

626
00:37:15,320 --> 00:37:17,880
Speaker 1: an example; it makes it way easier. So let's say

627
00:37:18,080 --> 00:37:22,440
Speaker 1: that I visit the web capture, the snapshot, for HowStuffWorks

628
00:37:22,520 --> 00:37:26,680
Speaker 1: dot com's homepage on February nineteenth, two thousand and seven.

629
00:37:27,160 --> 00:37:30,440
Speaker 1: By the way, this snapshot on February nineteenth, two thousand

630
00:37:30,440 --> 00:37:33,400
Speaker 1: and seven is the closest date to when I started

631
00:37:33,480 --> 00:37:37,600
Speaker 1: working at that company that's in the archive. The actual

632
00:37:37,680 --> 00:37:40,960
Speaker 1: date when I started, the website was not captured on

633
00:37:41,000 --> 00:37:45,840
Speaker 1: that day. Anyway, by clicking around on this homepage, I

634
00:37:45,840 --> 00:37:49,399
Speaker 1: can actually follow links and it'll pull up archived links

635
00:37:49,440 --> 00:37:52,840
Speaker 1: of archived articles, which is really neat. And when I

636
00:37:52,880 --> 00:37:56,120
Speaker 1: did that, at one point, I clicked on a link

637
00:37:56,239 --> 00:38:01,320
Speaker 1: for more information or related articles to how helicopters work.

638
00:38:01,719 --> 00:38:06,320
Speaker 1: That page, the related page, was actually archived on February

639
00:38:06,360 --> 00:38:09,319
Speaker 1: twenty second, two thousand and seven. So one was on

640
00:38:09,360 --> 00:38:12,360
Speaker 1: February nineteenth, the other was February twenty second, but the

641
00:38:12,440 --> 00:38:16,800
Speaker 1: link still worked, right? Yes, these were two different pages

642
00:38:16,840 --> 00:38:20,560
Speaker 1: that were archived on two different days, but the nature

643
00:38:20,760 --> 00:38:26,120
Speaker 1: of the archive allows those links to still work between

644
00:38:26,160 --> 00:38:28,680
Speaker 1: the two, which is neat because I'm not just popping

645
00:38:28,719 --> 00:38:31,480
Speaker 1: around through a web of links. I'm also kind of

646
00:38:31,520 --> 00:38:36,040
Speaker 1: time traveling, right? I'm looking at a timeline of snapshots

647
00:38:36,280 --> 00:38:39,279
Speaker 1: that are all still interlinked together, even if they were

648
00:38:39,280 --> 00:38:42,839
Speaker 1: captured on different days. I think that's really cool. Now

649
00:38:42,880 --> 00:38:45,040
Speaker 1: it gets even more cool when you think about the

650
00:38:45,080 --> 00:38:48,440
Speaker 1: scale of this project. So, according to the Internet Archive itself,

651
00:38:48,640 --> 00:38:52,120
Speaker 1: the archive contains eight hundred and thirty five billion with

652
00:38:52,200 --> 00:38:55,439
Speaker 1: a B, web pages. And as I mentioned earlier, that

653
00:38:55,680 --> 00:38:58,360
Speaker 1: just makes up part of all the data that's stored

654
00:38:58,400 --> 00:39:02,040
Speaker 1: on Internet Archive servers, because the organization is also home

655
00:39:02,080 --> 00:39:05,640
Speaker 1: to more than forty four million books and other texts,

656
00:39:06,000 --> 00:39:11,040
Speaker 1: fifteen million audio recordings, more than ten million videos, and

657
00:39:11,120 --> 00:39:15,040
Speaker 1: more than a million different pieces of software. Again, some

658
00:39:15,120 --> 00:39:19,040
Speaker 1: of this stuff might not be recorded anywhere else. There

659
00:39:19,080 --> 00:39:22,160
Speaker 1: may not be duplicates or copies of some of this

660
00:39:22,200 --> 00:39:26,799
Speaker 1: stuff anywhere else. While you might have things like Blu-ray,

661
00:39:26,920 --> 00:39:31,160
Speaker 1: DVDs, or whatever of some of those videos, others

662
00:39:31,239 --> 00:39:36,080
Speaker 1: might not have anything. And history is filled with instances

663
00:39:36,120 --> 00:39:40,759
Speaker 1: of media companies generating stuff or others, you know, independent

664
00:39:40,920 --> 00:39:45,359
Speaker 1: people too, generating stuff but not keeping a copy for posterity,

665
00:39:45,440 --> 00:39:48,880
Speaker 1: and then it's here and it's gone. Sometimes that's on purpose.

666
00:39:49,280 --> 00:39:52,719
Speaker 1: Sometimes it's a statement, like you make something ephemeral for

667
00:39:52,760 --> 00:39:56,400
Speaker 1: that very reason. Other times it's out of convenience, like

668
00:39:56,719 --> 00:40:00,719
Speaker 1: there are stories about how the BBC would regularly reuse

669
00:40:00,880 --> 00:40:05,719
Speaker 1: tapes and tape over previous programming because there was no

670
00:40:05,800 --> 00:40:13,359
Speaker 1: thought about preservation or a home theater industry. So there

671
00:40:13,400 --> 00:40:17,200
Speaker 1: are entire eras of stuff like Doctor Who that are

672
00:40:17,239 --> 00:40:21,160
Speaker 1: just gone or believed to be gone because the BBC

673
00:40:21,280 --> 00:40:25,000
Speaker 1: would just tape over old tapes and so you lost

674
00:40:25,040 --> 00:40:29,440
Speaker 1: whatever was on there originally. That's why things like the

675
00:40:29,440 --> 00:40:32,880
Speaker 1: Internet Archive exist: to avoid that in the case

676
00:40:32,920 --> 00:40:35,680
Speaker 1: of stuff that's stored across the Internet, to make sure

677
00:40:35,800 --> 00:40:39,239
Speaker 1: that there is an accessible record of those things and

678
00:40:39,239 --> 00:40:41,920
Speaker 1: that they don't just disappear. In two thousand and seven,

679
00:40:42,120 --> 00:40:45,640
Speaker 1: the state of California recognized the Internet Archive as an

680
00:40:45,680 --> 00:40:49,480
Speaker 1: official library, which was important; it's not just an honorific.

681
00:40:49,760 --> 00:40:53,520
Speaker 1: It would allow the nonprofit organization to receive federal funding,

682
00:40:53,600 --> 00:40:56,360
Speaker 1: which is a pretty important development for the longevity of

683
00:40:56,400 --> 00:40:59,440
Speaker 1: the program. But while the usefulness of the organization is

684
00:40:59,440 --> 00:41:02,480
Speaker 1: beyond question, the methods that the Archive has used

685
00:41:02,680 --> 00:41:06,680
Speaker 1: have not always been met with universal approval. For example, recently,

686
00:41:06,920 --> 00:41:10,800
Speaker 1: the Internet Archive has been embroiled in a pretty nasty lawsuit.

687
00:41:10,960 --> 00:41:14,719
Speaker 1: It's called the Hachette versus Internet Archive suit, and it

688
00:41:14,760 --> 00:41:17,839
Speaker 1: revolves around a group of publishers that object to how

689
00:41:17,880 --> 00:41:21,880
Speaker 1: the Internet Archive scans physical books for the purposes of

690
00:41:22,000 --> 00:41:26,160
Speaker 1: lending them out as digital copies. Publishers are in the

691
00:41:26,160 --> 00:41:29,680
Speaker 1: business of publishing and selling copies of books, but for years,

692
00:41:29,680 --> 00:41:32,520
Speaker 1: libraries have existed in order to get copies of various

693
00:41:32,560 --> 00:41:35,200
Speaker 1: books and to make them available for lending. So libraries

694
00:41:35,239 --> 00:41:38,640
Speaker 1: have to purchase the books or have them donated to

695
00:41:38,760 --> 00:41:42,080
Speaker 1: the library, and then make those books available to lend

696
00:41:42,120 --> 00:41:46,399
Speaker 1: out to members of the library. The Internet Archive has

697
00:41:46,440 --> 00:41:49,879
Speaker 1: a controlled digital lending program to handle this sort of thing,

698
00:41:50,040 --> 00:41:54,800
Speaker 1: only we're talking about digital formats, not a physical copy

699
00:41:54,840 --> 00:41:58,600
Speaker 1: of a book. This is where things get tricky because obviously,

700
00:41:58,760 --> 00:42:02,000
Speaker 1: if you, as an American citizen at least, if you

701
00:42:02,040 --> 00:42:04,319
Speaker 1: go out and buy a copy of a book, you

702
00:42:04,360 --> 00:42:07,800
Speaker 1: can do whatever you like with your copy of that book,

703
00:42:08,040 --> 00:42:10,359
Speaker 1: apart from making your own copies of it and then

704
00:42:10,480 --> 00:42:14,120
Speaker 1: selling those. You can't do that. That's copyright infringement. But

705
00:42:14,200 --> 00:42:16,760
Speaker 1: if you own a physical copy of a book, you can.

706
00:42:16,920 --> 00:42:19,560
Speaker 1: You can keep it for yourself. You could lend it

707
00:42:19,600 --> 00:42:22,040
Speaker 1: to a friend and let them read it; they return

708
00:42:22,080 --> 00:42:24,440
Speaker 1: it to you later. You could give the book away.

709
00:42:24,880 --> 00:42:28,120
Speaker 1: You could resell your copy to someone else, even if

710
00:42:28,160 --> 00:42:30,520
Speaker 1: you're selling it for a fraction of what the book

711
00:42:30,600 --> 00:42:33,160
Speaker 1: is going for in bookstores. You could do that. You

712
00:42:33,160 --> 00:42:35,560
Speaker 1: could even burn the darn thing if you're so inclined.

713
00:42:35,960 --> 00:42:38,719
Speaker 1: Just don't do that. Don't burn books. But all of

714
00:42:38,760 --> 00:42:42,920
Speaker 1: those things are permitted with your personal copy of the book. However,

715
00:42:43,160 --> 00:42:46,520
Speaker 1: a digital copy, well, now we're starting to talk about

716
00:42:46,600 --> 00:42:49,840
Speaker 1: different rules. So yes, you can lend out a physical

717
00:42:49,840 --> 00:42:52,920
Speaker 1: copy of a book. That's allowed. That's fair use. But

718
00:42:53,280 --> 00:42:57,400
Speaker 1: actually it's not even fair use. That's under laws of property.

719
00:42:57,719 --> 00:43:00,640
Speaker 1: But we won't get into all that. A digital copy

720
00:43:00,760 --> 00:43:04,200
Speaker 1: is a lot trickier because it's easy to replicate, much

721
00:43:04,239 --> 00:43:07,520
Speaker 1: easier than replicating a physical copy of a book, and

722
00:43:07,560 --> 00:43:11,160
Speaker 1: so different rules have developed to handle digital information compared

723
00:43:11,200 --> 00:43:14,880
Speaker 1: to stuff that's in our physical meat space. So this

724
00:43:15,000 --> 00:43:18,600
Speaker 1: lawsuit argues that the Internet Archive first digitized physical books

725
00:43:18,640 --> 00:43:21,800
Speaker 1: without permission from the publishers, and that that was problem

726
00:43:21,880 --> 00:43:26,759
Speaker 1: number one. There's been some different arguments about that, like

727
00:43:27,000 --> 00:43:30,200
Speaker 1: if there was no ebook equivalent of the copy of

728
00:43:30,239 --> 00:43:33,799
Speaker 1: the book, if the publishers had not digitized that, that's

729
00:43:33,840 --> 00:43:37,560
Speaker 1: slightly different than if the publishers also offer an electronic

730
00:43:37,680 --> 00:43:40,560
Speaker 1: version of the physical books they sell. But the other

731
00:43:40,600 --> 00:43:43,880
Speaker 1: problem is that the Internet Archive received donations and funding

732
00:43:43,920 --> 00:43:46,480
Speaker 1: that in part stemmed from the practice of lending out

733
00:43:46,520 --> 00:43:49,640
Speaker 1: digitized books, so the publishers said that made the Internet

734
00:43:50,040 --> 00:43:54,200
Speaker 1: Archive's activities a commercial enterprise. In twenty twenty three, a

735
00:43:54,400 --> 00:43:57,400
Speaker 1: judge found in favor of the publishers, saying that the

736
00:43:57,400 --> 00:44:00,319
Speaker 1: Internet Archive failed to argue that their work fell under

737
00:44:00,360 --> 00:44:03,200
Speaker 1: the principles of fair use. Again, getting into fair use,

738
00:44:03,239 --> 00:44:06,000
Speaker 1: that's a whole thing, but generally speaking, fair use covers

739
00:44:06,040 --> 00:44:08,880
Speaker 1: a relatively narrow set of use cases in which the

740
00:44:09,000 --> 00:44:12,680
Speaker 1: copying, the use, or the distribution of a copyrighted

741
00:44:12,719 --> 00:44:16,080
Speaker 1: work does not count as copyright infringement. But it has

742
00:44:16,160 --> 00:44:20,200
Speaker 1: to meet certain criteria, and it's only ever decided in

743
00:44:20,239 --> 00:44:23,279
Speaker 1: a court of law. It's not something that you

744
00:44:23,280 --> 00:44:27,920
Speaker 1: can just apply for proactively. It's something that you use in

745
00:44:28,000 --> 00:44:31,200
Speaker 1: a defense if you're brought up on charges of copyright infringement.

746
00:44:31,400 --> 00:44:34,560
Speaker 1: So by the time you're actually talking fair use, it's

747
00:44:34,600 --> 00:44:38,480
Speaker 1: already pretty late in the game. But anyway, this particular

748
00:44:38,600 --> 00:44:43,000
Speaker 1: lawsuit is under appeal. The Internet Archive recently made final

749
00:44:43,120 --> 00:44:46,920
Speaker 1: arguments in the case. I have not seen anything about

750
00:44:46,960 --> 00:44:49,239
Speaker 1: the case being decided one way or the other since then,

751
00:44:49,560 --> 00:44:53,520
Speaker 1: so I'm not really sure which way it's going. Again.

752
00:44:54,200 --> 00:44:57,600
Speaker 1: I didn't see anything about a decision made, but then

753
00:44:57,760 --> 00:45:01,280
Speaker 1: most of the articles about this are about the initial trial

754
00:45:01,360 --> 00:45:04,439
Speaker 1: that happened in twenty twenty three, so hopefully I will

755
00:45:04,480 --> 00:45:08,000
Speaker 1: find some follow-up on this at some point. But

756
00:45:08,480 --> 00:45:11,239
Speaker 1: there's no denying the Internet Archive has done a tremendous

757
00:45:11,280 --> 00:45:14,359
Speaker 1: amount of work in the field of knowledge preservation and

758
00:45:14,440 --> 00:45:18,160
Speaker 1: knowledge accessibility. Without the Internet Archive, there's no way of

759
00:45:18,200 --> 00:45:21,400
Speaker 1: knowing how much information would be lost to us forever.

760
00:45:21,760 --> 00:45:25,000
Speaker 1: Stuff that could have been incredibly useful or even just

761
00:45:25,160 --> 00:45:29,759
Speaker 1: diverting could be gone, and we'd never have a way

762
00:45:29,760 --> 00:45:33,200
Speaker 1: of retrieving it again. And I am very thankful that

763
00:45:33,280 --> 00:45:36,440
Speaker 1: an organization like the Internet Archive exists. If you're not

764
00:45:36,520 --> 00:45:38,800
Speaker 1: familiar with it, if you've never used it, I recommend

765
00:45:38,800 --> 00:45:42,560
Speaker 1: you check it out and explore the Internet Archive. Look

766
00:45:42,600 --> 00:45:45,040
Speaker 1: at some of the things that are in that archive,

767
00:45:45,280 --> 00:45:47,480
Speaker 1: like some of the books, some of the recordings. There's

768
00:45:47,520 --> 00:45:49,440
Speaker 1: some great stuff. I think there's like a quarter of

769
00:45:49,480 --> 00:45:53,720
Speaker 1: a million live performances archived just on the Internet Archive,

770
00:45:54,080 --> 00:45:58,719
Speaker 1: like live music performances. That alone is super cool. Anyway,

771
00:45:58,920 --> 00:46:03,080
Speaker 1: I hope you found this episode informative and entertaining. I

772
00:46:03,120 --> 00:46:06,560
Speaker 1: hope you check out the Internet Archive. I also very much

773
00:46:06,600 --> 00:46:09,560
Speaker 1: hope that you are all well and I will talk

774
00:46:09,600 --> 00:46:20,360
Speaker 1: to you again really soon. TechStuff is an iHeartRadio production.

775
00:46:20,640 --> 00:46:25,680
Speaker 1: For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts,

776
00:46:25,800 --> 00:46:27,800
Speaker 1: or wherever you listen to your favorite shows.