WEBVTT - The Internet Archive

0:00:04.480 --> 0:00:12.639
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio. Hey there,

0:00:12.640 --> 0:00:16.000
<v Speaker 1>and welcome to TechStuff. I'm your host, Jonathan Strickland.

0:00:16.040 --> 0:00:19.040
<v Speaker 1>I'm an executive producer with iHeart Podcasts. And how the

0:00:19.079 --> 0:00:23.280
<v Speaker 1>tech are you? So let's take a little literary trip.

0:00:23.600 --> 0:00:29.200
<v Speaker 1>In Anthony Burgess's A Clockwork Orange, the extremely wicked protagonist

0:00:29.680 --> 0:00:32.920
<v Speaker 1>(that's putting it lightly), at one point early in

0:00:32.920 --> 0:00:36.760
<v Speaker 1>the novel, reflects on the nature of permanence. He thinks

0:00:36.800 --> 0:00:40.680
<v Speaker 1>the reader might not remember what milk bars were like

0:00:41.159 --> 0:00:45.360
<v Speaker 1>due to, quote, things changing so skorry these days and

0:00:45.479 --> 0:00:49.600
<v Speaker 1>everybody very quick to forget, newspapers not being read much

0:00:49.760 --> 0:00:54.120
<v Speaker 1>neither, end quote. Alex in this case is saying that

0:00:54.200 --> 0:00:58.040
<v Speaker 1>the combination of the world changing very quickly, skorry is

0:00:58.080 --> 0:01:01.880
<v Speaker 1>derived from a Slavic word meaning swiftly or quickly, and

0:01:02.000 --> 0:01:05.720
<v Speaker 1>people having short memories means that referencing something that happened

0:01:05.760 --> 0:01:08.680
<v Speaker 1>even just a few years ago might mean you're met

0:01:08.680 --> 0:01:12.360
<v Speaker 1>with blank stares because the world has moved on. Now

0:01:12.520 --> 0:01:15.759
<v Speaker 1>take that same sentiment and crank it up to eleven

0:01:16.040 --> 0:01:18.840
<v Speaker 1>when you talk about the Internet in general and the

0:01:18.840 --> 0:01:21.600
<v Speaker 1>Web in particular. So, on the one hand, we know

0:01:22.000 --> 0:01:24.240
<v Speaker 1>that the rule of thumb is that once something gets

0:01:24.280 --> 0:01:27.920
<v Speaker 1>posted online, that's kind of it, right? It's sort of

0:01:27.959 --> 0:01:31.240
<v Speaker 1>perpetually online. Like that's kind of the joke. Like once

0:01:31.280 --> 0:01:33.520
<v Speaker 1>it's up, it's up, and you can take it down,

0:01:33.520 --> 0:01:35.280
<v Speaker 1>but there's going to be a copy of it somewhere.

0:01:35.720 --> 0:01:39.319
<v Speaker 1>So even if the originator tries to take down whatever

0:01:39.400 --> 0:01:43.440
<v Speaker 1>the stuff was, somebody's got it. But on the other hand,

0:01:43.440 --> 0:01:46.200
<v Speaker 1>we also know that so much stuff gets added every

0:01:46.240 --> 0:01:49.400
<v Speaker 1>single day to the Internet. There's actually a colossal mountain

0:01:49.400 --> 0:01:53.120
<v Speaker 1>of content out there that just keeps getting bigger moment

0:01:53.160 --> 0:01:55.960
<v Speaker 1>by moment, and everything that came before it can end

0:01:56.040 --> 0:01:59.480
<v Speaker 1>up getting buried in the process. And sometimes stuff can

0:01:59.560 --> 0:02:03.760
<v Speaker 1>be added and taken down without anyone being the wiser. Now,

0:02:03.800 --> 0:02:06.640
<v Speaker 1>on top of that, web pages obviously can change. A

0:02:06.720 --> 0:02:10.760
<v Speaker 1>website might adopt a new format or style, might incorporate

0:02:10.840 --> 0:02:15.000
<v Speaker 1>new technologies and interfaces that are added to web browsers,

0:02:15.360 --> 0:02:18.680
<v Speaker 1>or it might choose to remove sections that once might

0:02:18.720 --> 0:02:21.960
<v Speaker 1>have been relevant but maybe now not so much. Or

0:02:22.080 --> 0:02:27.079
<v Speaker 1>entire websites could disappear as servers go offline or companies

0:02:27.320 --> 0:02:32.040
<v Speaker 1>go bankrupt, or you know, web administrators just lose interest.

0:02:32.520 --> 0:02:36.520
<v Speaker 1>The entire spectrum of human output can be found on

0:02:36.560 --> 0:02:39.400
<v Speaker 1>the web. Not every instance of human output, but an

0:02:39.440 --> 0:02:44.440
<v Speaker 1>example of everything is out there. Everything from deep philosophical

0:02:44.520 --> 0:02:48.040
<v Speaker 1>musings to the most banal posts, you know, which often

0:02:48.520 --> 0:02:51.320
<v Speaker 1>revolve around what someone is having for lunch. All of

0:02:51.320 --> 0:02:53.760
<v Speaker 1>that finds its way to the Internet. And while you

0:02:53.840 --> 0:02:56.600
<v Speaker 1>might argue that a lot of it, or perhaps even

0:02:56.680 --> 0:02:59.040
<v Speaker 1>most of it, isn't really worth the time it

0:02:59.080 --> 0:03:02.920
<v Speaker 1>takes to consume, let alone keep it around, there is

0:03:03.080 --> 0:03:06.160
<v Speaker 1>undeniably a huge amount of valuable data out there too,

0:03:06.639 --> 0:03:09.800
<v Speaker 1>but there's no guarantee that it will stay there or

0:03:09.880 --> 0:03:13.880
<v Speaker 1>remain easily findable. And that's where today's topic comes in.

0:03:13.960 --> 0:03:16.480
<v Speaker 1>I wanted to talk about a project that began back

0:03:16.520 --> 0:03:19.320
<v Speaker 1>in nineteen ninety six. It's a project that aims to

0:03:19.360 --> 0:03:22.520
<v Speaker 1>preserve as much of the Internet as possible in little

0:03:22.720 --> 0:03:26.600
<v Speaker 1>slices of time, little snapshots. Not only does that mean

0:03:26.639 --> 0:03:29.200
<v Speaker 1>you can potentially dig up something that hasn't been online

0:03:29.240 --> 0:03:31.919
<v Speaker 1>for years, but also you can get a look at

0:03:32.000 --> 0:03:35.080
<v Speaker 1>what different sites were like in various eras of the Web.

0:03:35.320 --> 0:03:37.600
<v Speaker 1>It could be a really eye-opening experience to see

0:03:37.640 --> 0:03:40.480
<v Speaker 1>something like Amazon and what it looked like, you know,

0:03:40.520 --> 0:03:43.960
<v Speaker 1>shortly after it launched, compared to what it looks like today.

0:03:44.400 --> 0:03:48.960
<v Speaker 1>So we are going to talk about the Internet Archive. Now.

0:03:48.960 --> 0:03:51.240
<v Speaker 1>To do that, we need to talk a little bit

0:03:51.240 --> 0:03:54.040
<v Speaker 1>about the people who founded the ding dang darn thing,

0:03:54.320 --> 0:03:58.520
<v Speaker 1>and that would be Brewster Kahle and Bruce Gilliat. So

0:03:58.680 --> 0:04:02.040
<v Speaker 1>Kahle graduated from MIT with a degree in computer science

0:04:02.040 --> 0:04:06.280
<v Speaker 1>and engineering. After he graduated, he joined fellow MIT graduate

0:04:06.400 --> 0:04:10.080
<v Speaker 1>Danny Hillis, who had created a company called Thinking Machines.

0:04:10.320 --> 0:04:13.960
<v Speaker 1>So this was a super computer company. His team specialized

0:04:13.960 --> 0:04:17.920
<v Speaker 1>in building massively parallel computer systems, mostly with the aim

0:04:17.960 --> 0:04:21.120
<v Speaker 1>of building machines for AI research and development. So yeah,

0:04:21.240 --> 0:04:24.480
<v Speaker 1>Kahle was working on the challenges of providing AI researchers

0:04:24.520 --> 0:04:28.040
<v Speaker 1>with the compute power they need, decades before our current

0:04:28.120 --> 0:04:33.040
<v Speaker 1>AI explosion. Bruce Gilliat is also a computer scientist, and

0:04:33.080 --> 0:04:35.160
<v Speaker 1>that's just about all I know about him. I mean,

0:04:35.320 --> 0:04:38.040
<v Speaker 1>I know he is, or at least was, married, and

0:04:38.120 --> 0:04:40.600
<v Speaker 1>I also know he owned a series of very impressive

0:04:40.600 --> 0:04:43.960
<v Speaker 1>houses in the San Francisco and San Jose areas because

0:04:44.000 --> 0:04:46.600
<v Speaker 1>it made the news whenever he sold one or bought

0:04:46.600 --> 0:04:49.679
<v Speaker 1>a new one. But other than that, there's precious little

0:04:49.680 --> 0:04:53.000
<v Speaker 1>information about him that I could find, which is somewhat ironic

0:04:53.040 --> 0:04:55.440
<v Speaker 1>when you consider that he has dedicated a lot of

0:04:55.440 --> 0:04:58.520
<v Speaker 1>time and effort to preserving information on the Internet. He

0:04:58.520 --> 0:05:00.839
<v Speaker 1>would go on to co-found the company called Alexa

0:05:00.920 --> 0:05:03.960
<v Speaker 1>Internet with Brewster Kahle, but that's getting ahead of ourselves.

0:05:04.080 --> 0:05:07.839
<v Speaker 1>So most of my story will center around Kahle simply

0:05:07.880 --> 0:05:10.520
<v Speaker 1>because out of the two co-founders, he's the one

0:05:10.520 --> 0:05:13.839
<v Speaker 1>who acted more as the face of the efforts, and Gilliat,

0:05:13.839 --> 0:05:15.880
<v Speaker 1>from what I can tell, has just been really good

0:05:15.880 --> 0:05:20.120
<v Speaker 1>about kind of maintaining a very personal private life. So

0:05:20.880 --> 0:05:24.960
<v Speaker 1>I don't mean to diminish Gilliat's contributions, but at the

0:05:24.960 --> 0:05:27.640
<v Speaker 1>same time, you know, I can only cover what I

0:05:27.640 --> 0:05:31.240
<v Speaker 1>can find. So in nineteen eighty nine, Kahle, along with

0:05:31.320 --> 0:05:35.080
<v Speaker 1>a colleague named Harry Morris, created an innovative tool for

0:05:35.200 --> 0:05:38.760
<v Speaker 1>the blossoming Internet. Now remember this is the Internet. It's

0:05:38.839 --> 0:05:42.119
<v Speaker 1>not the World Wide Web. The Web didn't exist yet;

0:05:42.240 --> 0:05:45.159
<v Speaker 1>the Internet did. And the tool they created was called

0:05:45.160 --> 0:05:51.960
<v Speaker 1>the Wide Area Information Server, or WAIS, W-A-I-S. So people

0:05:52.000 --> 0:05:55.040
<v Speaker 1>could create a web server. They could host documents on

0:05:55.080 --> 0:05:59.960
<v Speaker 1>their web servers. But finding these documents was really hard

0:06:00.720 --> 0:06:04.680
<v Speaker 1>because you didn't necessarily have hyperlinks connecting one document to

0:06:04.760 --> 0:06:07.920
<v Speaker 1>others and vice versa. You didn't have an easy way

0:06:07.960 --> 0:06:12.680
<v Speaker 1>of even navigating through different documents from one to the next.

0:06:13.160 --> 0:06:15.320
<v Speaker 1>So it was almost a case that you needed to

0:06:15.360 --> 0:06:19.080
<v Speaker 1>know where something was and what it was called first,

0:06:19.240 --> 0:06:22.440
<v Speaker 1>and then you could go to the relevant server and

0:06:22.480 --> 0:06:26.599
<v Speaker 1>retrieve that document. Otherwise the document would just remain quietly

0:06:26.680 --> 0:06:30.359
<v Speaker 1>sitting on some server somewhere and no one would know

0:06:30.400 --> 0:06:34.080
<v Speaker 1>about it. Now, that is antithetical to the entire purpose

0:06:34.160 --> 0:06:37.840
<v Speaker 1>of a wide area information sharing system, because, I mean,

0:06:37.880 --> 0:06:40.800
<v Speaker 1>the name tells us the whole purpose of this technology

0:06:40.839 --> 0:06:45.360
<v Speaker 1>is to allow information to be widely shared. Jeremy Norman's

0:06:45.400 --> 0:06:50.000
<v Speaker 1>History of Information lists WAIS as, quote, the first Internet

0:06:50.080 --> 0:06:54.120
<v Speaker 1>publishing system, just predating Gopher and the World Wide Web

0:06:54.320 --> 0:06:58.839
<v Speaker 1>end quote. In a recorded presentation to some Xerox employees,

0:06:59.000 --> 0:07:03.120
<v Speaker 1>Kahle laid out his personal perspective on what he wanted from

0:07:03.279 --> 0:07:06.159
<v Speaker 1>his experience on the Internet. So first up, he said

0:07:06.360 --> 0:07:09.520
<v Speaker 1>he wanted his own personal information to be easily accessible

0:07:09.960 --> 0:07:13.240
<v Speaker 1>by him specifically. Not that it should be easily accessible

0:07:13.280 --> 0:07:16.880
<v Speaker 1>to everybody, but specifically to him. He wanted the ability

0:07:16.920 --> 0:07:19.760
<v Speaker 1>to get access to all the different stuff he generates,

0:07:19.800 --> 0:07:22.280
<v Speaker 1>like articles and such, and to make it really easy

0:07:22.320 --> 0:07:25.080
<v Speaker 1>to do that. He also wanted the ability for publishers

0:07:25.120 --> 0:07:27.960
<v Speaker 1>to get their work to him. So in Kahle's mind,

0:07:28.280 --> 0:07:30.720
<v Speaker 1>the best approach would be for published works that are

0:07:30.760 --> 0:07:33.360
<v Speaker 1>relevant to his interests to find their way to him,

0:07:33.560 --> 0:07:36.120
<v Speaker 1>as opposed to Kahle having to go out and hunt

0:07:36.200 --> 0:07:39.480
<v Speaker 1>down these published works himself. And he pointed out this

0:07:39.600 --> 0:07:42.480
<v Speaker 1>is what publishers want too, because you wouldn't publish something

0:07:42.560 --> 0:07:45.239
<v Speaker 1>unless you wanted folks to actually read it. He also

0:07:45.320 --> 0:07:48.160
<v Speaker 1>said that he wanted this technology to be usable anywhere.

0:07:48.600 --> 0:07:51.200
<v Speaker 1>He wanted people to be able to access it no

0:07:51.240 --> 0:07:53.080
<v Speaker 1>matter what kind of device they were relying on. Now

0:07:53.160 --> 0:07:56.160
<v Speaker 1>he was specifically referencing laptops at the time, but he

0:07:56.280 --> 0:08:00.120
<v Speaker 1>was also saying that portable computer systems, essentially things that

0:08:00.120 --> 0:08:03.400
<v Speaker 1>would become smartphones and tablets, were on the horizon and

0:08:03.440 --> 0:08:05.880
<v Speaker 1>that these needed to be able to access that stuff too.

0:08:06.280 --> 0:08:09.080
<v Speaker 1>And he said that he wanted people to be able

0:08:09.080 --> 0:08:11.880
<v Speaker 1>to use what he had learned should he choose to

0:08:11.880 --> 0:08:15.440
<v Speaker 1>share the information, that if he had come up with

0:08:15.480 --> 0:08:17.600
<v Speaker 1>something that was useful and he wanted to share that,

0:08:17.640 --> 0:08:19.760
<v Speaker 1>he wanted other people to be able to access that.

0:08:20.160 --> 0:08:23.120
<v Speaker 1>Kahle didn't say that people should be compelled to share,

0:08:23.560 --> 0:08:26.000
<v Speaker 1>but if they wanted to it should be possible to

0:08:26.040 --> 0:08:30.560
<v Speaker 1>do so. WAIS was Kahle's attempt to bring these ideas

0:08:30.640 --> 0:08:34.199
<v Speaker 1>to life. In that presentation to the Xerox employees, he

0:08:34.320 --> 0:08:38.320
<v Speaker 1>defined WAIS as electronic publishing. He further defined that term

0:08:38.400 --> 0:08:41.880
<v Speaker 1>to mean the distribution of information. So whether the end

0:08:41.960 --> 0:08:45.080
<v Speaker 1>user was to look at this information on a computer

0:08:45.120 --> 0:08:48.280
<v Speaker 1>screen or they just chose to print out the information

0:08:48.640 --> 0:08:50.880
<v Speaker 1>and then read it that way, that was beside the point.

0:08:51.120 --> 0:08:55.559
<v Speaker 1>Electronic publishing was all about how information got from the

0:08:55.600 --> 0:08:58.760
<v Speaker 1>originator to the end user. That's what made it

0:08:58.920 --> 0:09:02.880
<v Speaker 1>e-publishing: that it was publishing over wires. Now, one thing

0:09:03.000 --> 0:09:06.800
<v Speaker 1>Kahle introduced in this presentation to Xerox was this concept

0:09:06.800 --> 0:09:10.760
<v Speaker 1>of conducting searches using natural language. This concept is one

0:09:10.800 --> 0:09:13.640
<v Speaker 1>that we're really familiar with today. You enter a query

0:09:13.800 --> 0:09:16.200
<v Speaker 1>into a search bar. You describe what it is that

0:09:16.240 --> 0:09:19.760
<v Speaker 1>you want to know or learn about, or have access to,

0:09:20.080 --> 0:09:23.400
<v Speaker 1>or retrieve or whatever. The search engine brings back search

0:09:23.440 --> 0:09:26.600
<v Speaker 1>results that are ordered by some kind of relevance depending

0:09:26.679 --> 0:09:29.960
<v Speaker 1>upon the search engine's, you know, various algorithms. How the

0:09:30.000 --> 0:09:33.760
<v Speaker 1>search engine determines relevance really depends upon the system itself,

0:09:33.880 --> 0:09:36.160
<v Speaker 1>of course. Like, you could run the same search across

0:09:36.400 --> 0:09:39.760
<v Speaker 1>different search engines and get very different results based upon

0:09:40.080 --> 0:09:45.280
<v Speaker 1>that methodology of determining relevance. If the system believes it's relevant,

0:09:45.480 --> 0:09:47.240
<v Speaker 1>it may or may not be relevant to what you

0:09:47.320 --> 0:09:50.520
<v Speaker 1>actually want. Like hopefully the two are aligned. If it's

0:09:50.520 --> 0:09:53.400
<v Speaker 1>a really good search engine, then you're going to get

0:09:53.480 --> 0:09:57.600
<v Speaker 1>something that is actually meaningful to you. Anyway, WAIS was

0:09:57.720 --> 0:10:01.720
<v Speaker 1>kind of following that approach back before there was

0:10:01.760 --> 0:10:04.280
<v Speaker 1>a World Wide Web, you know, when you just needed

0:10:04.280 --> 0:10:08.200
<v Speaker 1>a way to find stuff that was being stored on

0:10:08.280 --> 0:10:11.880
<v Speaker 1>these Internet servers and to be able to retrieve these

0:10:11.920 --> 0:10:14.600
<v Speaker 1>documents to make use of them. Otherwise you had this

0:10:14.679 --> 0:10:19.360
<v Speaker 1>incredibly powerful communications tool, but it was so challenging to

0:10:19.480 --> 0:10:22.600
<v Speaker 1>use in a meaningful way that the information stored there

0:10:23.000 --> 0:10:26.560
<v Speaker 1>would be not that useful. I think of it like this:

0:10:26.679 --> 0:10:31.720
<v Speaker 1>imagine that there's this one remote library and it's tiny,

0:10:32.080 --> 0:10:36.440
<v Speaker 1>but it has the world's only copy of some text.

0:10:36.840 --> 0:10:39.280
<v Speaker 1>But this library's in the middle of nowhere. It's really

0:10:39.360 --> 0:10:42.160
<v Speaker 1>hard to get to. The fact that that library has

0:10:42.280 --> 0:10:45.800
<v Speaker 1>that document would not be terribly useful to most people,

0:10:45.920 --> 0:10:47.840
<v Speaker 1>and so it might as well not have the document

0:10:47.880 --> 0:10:50.120
<v Speaker 1>at all. That's kind of what WAIS was trying to

0:10:50.160 --> 0:10:52.920
<v Speaker 1>do: solve this problem of making it easier to

0:10:52.960 --> 0:10:57.400
<v Speaker 1>get access to this wealth of information that Kahle saw

0:10:57.720 --> 0:11:01.880
<v Speaker 1>was only going to get more complex and more full

0:11:01.960 --> 0:11:05.600
<v Speaker 1>of data. Well, we'll move away from WAIS, because we

0:11:05.600 --> 0:11:08.280
<v Speaker 1>could do a full episode about that. I will say

0:11:08.280 --> 0:11:11.960
<v Speaker 1>that Kahle and Morris, the founders of WAIS, the guys

0:11:11.960 --> 0:11:17.120
<v Speaker 1>who created the WAIS technologies, would actually leave Thinking Machines

0:11:17.320 --> 0:11:20.680
<v Speaker 1>and they would found a spinoff company just called WAIS Incorporated.

0:11:20.920 --> 0:11:23.439
<v Speaker 1>And it was around this point when the mysterious Bruce

0:11:23.480 --> 0:11:26.840
<v Speaker 1>Gilliat joined the team. And while the World Wide Web would

0:11:26.880 --> 0:11:29.840
<v Speaker 1>debut in the early nineties, which really opened up accessibility

0:11:29.840 --> 0:11:32.040
<v Speaker 1>to information on the Internet for a lot of people,

0:11:32.480 --> 0:11:35.840
<v Speaker 1>most of them for the first time, WAIS would continue

0:11:35.880 --> 0:11:38.920
<v Speaker 1>to remain relevant. In fact, it was relevant enough that

0:11:39.040 --> 0:11:42.480
<v Speaker 1>in nineteen ninety five AOL would come calling with an

0:11:42.480 --> 0:11:45.959
<v Speaker 1>offer to purchase the company for a cool fifteen million dollars.

0:11:46.000 --> 0:11:48.840
<v Speaker 1>If we adjust that for inflation to today's money, that would

0:11:48.880 --> 0:11:53.640
<v Speaker 1>be around thirty million bucks, in that ballpark. Now, a

0:11:53.640 --> 0:11:56.680
<v Speaker 1>lot of the folks at WAIS Incorporated would split off

0:11:56.760 --> 0:12:00.679
<v Speaker 1>to create new companies after this acquisition, and within a

0:12:00.800 --> 0:12:04.400
<v Speaker 1>year that included Kahle and Gilliat, who went on to

0:12:04.559 --> 0:12:10.000
<v Speaker 1>found a new company called Alexa Internet. And you might think, huh, Alexa,

0:12:10.120 --> 0:12:13.280
<v Speaker 1>you mean like the same name as the Amazon digital assistant?

0:12:13.679 --> 0:12:16.559
<v Speaker 1>And yes, exactly that, because, as it would turn out,

0:12:16.600 --> 0:12:21.840
<v Speaker 1>Amazon would ultimately acquire Alexa Internet just a few years

0:12:21.880 --> 0:12:25.080
<v Speaker 1>after it was founded. But the name derived from the

0:12:25.120 --> 0:12:29.800
<v Speaker 1>Library at Alexandria, the ancient library of Egypt that at

0:12:29.880 --> 0:12:33.240
<v Speaker 1>one point housed one of the world's largest collections of

0:12:33.320 --> 0:12:39.400
<v Speaker 1>accumulated knowledge. Now, around forty eight BCE, Julius Caesar, Julie

0:12:39.400 --> 0:12:42.960
<v Speaker 1>Baby, and his boys, they barged into Alexandria, and as

0:12:43.000 --> 0:12:46.840
<v Speaker 1>a consequence of their rowdy invasion, the library caught fire

0:12:47.200 --> 0:12:49.920
<v Speaker 1>and much of the collection burned. Sadly, that was not

0:12:49.960 --> 0:12:52.880
<v Speaker 1>the only indignity. In fact, it wasn't the first indignity

0:12:53.200 --> 0:12:57.120
<v Speaker 1>that the library suffered that would impact its relevance. Further

0:12:57.240 --> 0:13:00.000
<v Speaker 1>conflicts a couple of centuries later pretty much wiped out

0:13:00.160 --> 0:13:03.560
<v Speaker 1>whatever had been left from the previous calamities, and the

0:13:03.600 --> 0:13:07.079
<v Speaker 1>Library of Alexandria became kind of a touchstone for folks

0:13:07.080 --> 0:13:10.160
<v Speaker 1>who have stressed the importance of access to knowledge and

0:13:10.240 --> 0:13:13.240
<v Speaker 1>the protection of that knowledge, and that the consequences that

0:13:13.360 --> 0:13:15.920
<v Speaker 1>could follow from the loss of such knowledge can be

0:13:15.960 --> 0:13:20.200
<v Speaker 1>really dire. See also, like, the Middle Ages, the Dark Ages,

0:13:20.200 --> 0:13:24.120
<v Speaker 1>for example, that loss of knowledge is a really terrible thing.

0:13:24.520 --> 0:13:28.000
<v Speaker 1>So the impetus for Alexa Internet was that Kahle and

0:13:28.080 --> 0:13:31.760
<v Speaker 1>Gilliat wanted, in the words of the Web Design Museum, quote,

0:13:31.840 --> 0:13:35.960
<v Speaker 1>to develop advanced web navigation that would continually improve itself

0:13:36.080 --> 0:13:39.520
<v Speaker 1>on the basis of user-generated data, end quote, which

0:13:39.559 --> 0:13:42.679
<v Speaker 1>is a pretty advanced idea for nineteen ninety six when

0:13:42.720 --> 0:13:45.600
<v Speaker 1>the Web was still very young and the general public

0:13:45.679 --> 0:13:47.439
<v Speaker 1>was still just trying to get a grip on exactly

0:13:47.480 --> 0:13:51.320
<v Speaker 1>what the Web and by extension, the Internet were. One

0:13:51.360 --> 0:13:54.679
<v Speaker 1>of the first tools that Alexa Internet developed was a

0:13:54.720 --> 0:13:58.000
<v Speaker 1>browser toolbar. So installing this toolbar into a browser would

0:13:58.000 --> 0:14:01.120
<v Speaker 1>give the user access to a sort of crowd-powered

0:14:01.200 --> 0:14:04.640
<v Speaker 1>recommendation engine. In some ways, it's not that different from

0:14:04.840 --> 0:14:08.360
<v Speaker 1>sites like Digg and Reddit that would later rely on

0:14:08.440 --> 0:14:11.880
<v Speaker 1>the user community to actually work and to recommend links

0:14:11.920 --> 0:14:17.120
<v Speaker 1>to really interesting sites. This toolbar would recommend the sites

0:14:17.120 --> 0:14:20.760
<v Speaker 1>to users based upon how the overall community was browsing.

0:14:20.920 --> 0:14:24.160
<v Speaker 1>So the more people who were using this toolbar, the

0:14:24.200 --> 0:14:27.480
<v Speaker 1>more information was going into where they were going, and

0:14:27.520 --> 0:14:29.720
<v Speaker 1>thus you would get different recommendations. So if a lot

0:14:29.720 --> 0:14:32.440
<v Speaker 1>of people were navigating to a specific site for whatever reason,

0:14:32.680 --> 0:14:35.320
<v Speaker 1>you might get a recommendation to do the same. It

0:14:35.360 --> 0:14:38.160
<v Speaker 1>was an attempt at an organic way for folks to

0:14:38.240 --> 0:14:41.560
<v Speaker 1>suggest websites, kind of like a word of mouth campaign,

0:14:41.920 --> 0:14:45.920
<v Speaker 1>and Alexa Internet would also provide meta information about websites

0:14:45.960 --> 0:14:48.840
<v Speaker 1>to users if they wanted it. Meta information is information

0:14:48.920 --> 0:14:52.240
<v Speaker 1>about information, so this would include stuff like how many

0:14:52.440 --> 0:14:55.400
<v Speaker 1>web pages were part of an overall website, or how

0:14:55.440 --> 0:14:58.600
<v Speaker 1>many other websites were pointing back to the one you

0:14:58.640 --> 0:15:01.200
<v Speaker 1>were on, and so forth. A lot of the stuff

0:15:01.360 --> 0:15:04.840
<v Speaker 1>that Alexa Internet could tell you would reflect a specific

0:15:04.880 --> 0:15:07.640
<v Speaker 1>web page's relevance. It's the same sort of information that

0:15:07.640 --> 0:15:10.600
<v Speaker 1>search engines like Google would take into account when deciding

0:15:10.640 --> 0:15:14.480
<v Speaker 1>relevance for search results. And that meant that it didn't

0:15:14.480 --> 0:15:16.520
<v Speaker 1>take very long for Amazon to come around with an

0:15:16.560 --> 0:15:20.000
<v Speaker 1>offer to purchase Alexa Internet. I'll talk about that more,

0:15:20.120 --> 0:15:22.920
<v Speaker 1>as well as the development of the Internet Archive after

0:15:22.960 --> 0:15:26.360
<v Speaker 1>we come back from this quick break to thank our sponsors.

0:15:35.600 --> 0:15:40.000
<v Speaker 1>So Amazon in nineteen ninety nine takes a look at

0:15:40.080 --> 0:15:44.200
<v Speaker 1>Alexa Internet and says, Wow, this is pretty incredible. This

0:15:44.600 --> 0:15:49.480
<v Speaker 1>little company has created some means of checking for stuff

0:15:49.480 --> 0:15:53.840
<v Speaker 1>like relevance and metadata that could be really really useful

0:15:53.880 --> 0:15:57.280
<v Speaker 1>for us, And so Amazon made an offer that Alexa

0:15:57.320 --> 0:16:00.160
<v Speaker 1>Internet couldn't refuse: to acquire the company for the

0:16:00.240 --> 0:16:03.160
<v Speaker 1>princely sum of two hundred and fifty million dollars in

0:16:03.280 --> 0:16:07.680
<v Speaker 1>Amazon stock in May of ninety nine. So this is

0:16:07.880 --> 0:16:10.880
<v Speaker 1>a little different than the earlier deal we talked about

0:16:10.880 --> 0:16:14.840
<v Speaker 1>where AOL bought, you know, WAIS Incorporated, because they

0:16:14.840 --> 0:16:17.120
<v Speaker 1>bought it with two hundred and fifty million dollars' worth

0:16:17.200 --> 0:16:19.920
<v Speaker 1>of stock. If we just treated that like it was

0:16:19.960 --> 0:16:25.040
<v Speaker 1>a cash exchange, then if we adjust for inflation,

0:16:25.120 --> 0:16:28.240
<v Speaker 1>that's like around four hundred and sixty nine million dollars

0:16:28.240 --> 0:16:31.480
<v Speaker 1>worth of stock. But that's not really how you deal

0:16:31.520 --> 0:16:33.920
<v Speaker 1>with the value here, right. You have to think about

0:16:33.920 --> 0:16:36.680
<v Speaker 1>how much was the stock worth back in nineteen ninety

0:16:36.800 --> 0:16:39.600
<v Speaker 1>nine versus how much is the stock worth today? I

0:16:39.800 --> 0:16:43.480
<v Speaker 1>checked and I saw that in May of nineteen ninety nine,

0:16:43.560 --> 0:16:46.520
<v Speaker 1>Amazon stock was trading for around two dollars eighty nine

0:16:46.560 --> 0:16:49.400
<v Speaker 1>cents per share. These days, it's closer to one hundred

0:16:49.400 --> 0:16:53.840
<v Speaker 1>and eighty dollars per share. Plus, in that time, Amazon

0:16:53.920 --> 0:16:56.760
<v Speaker 1>had two different stock splits. There was a two to

0:16:56.760 --> 0:16:59.520
<v Speaker 1>one split in late ninety nine, and there was a

0:16:59.560 --> 0:17:03.240
<v Speaker 1>twenty to one stock split in twenty twenty two. When

0:17:03.240 --> 0:17:06.080
<v Speaker 1>you factor all that up, that two hundred and fifty

0:17:06.080 --> 0:17:10.840
<v Speaker 1>million dollars in stock ends up being a ton of wealth.

0:17:11.240 --> 0:17:13.760
<v Speaker 1>Like it's just a huge amount. It would take a

0:17:13.800 --> 0:17:17.040
<v Speaker 1>lot of calculating to get an estimate, and even then

0:17:17.359 --> 0:17:21.520
<v Speaker 1>it wouldn't really be accurate. Just say that deal is

0:17:21.560 --> 0:17:25.399
<v Speaker 1>worth a lot. So anyway, the important thing with the

0:17:25.400 --> 0:17:29.119
<v Speaker 1>Internet Archive is that Kahle and Gilliat, through their work

0:17:29.160 --> 0:17:32.359
<v Speaker 1>in creating tools for Alexa Internet, found themselves able to

0:17:32.400 --> 0:17:36.920
<v Speaker 1>create snapshots of the Web. So they were using Alexa

0:17:37.000 --> 0:17:40.560
<v Speaker 1>Internet to have a commercial business, and they established the

0:17:40.560 --> 0:17:45.480
<v Speaker 1>Internet Archive as a way of preserving information that had,

0:17:45.560 --> 0:17:48.680
<v Speaker 1>at some point or another found its home on the Internet.

0:17:48.960 --> 0:17:52.480
<v Speaker 1>So they were using Alexa Internet tech to crawl the

0:17:52.560 --> 0:17:55.080
<v Speaker 1>young Web in order to index everything, which is a

0:17:55.200 --> 0:17:58.040
<v Speaker 1>necessary step if you want to give people access to

0:17:58.119 --> 0:18:00.399
<v Speaker 1>the various documents posted on the web. You first have

0:18:00.440 --> 0:18:02.639
<v Speaker 1>to know what is there and where it is. To

0:18:02.720 --> 0:18:07.320
<v Speaker 1>do that, you've got to index everything. And then they said, well,

0:18:07.600 --> 0:18:09.760
<v Speaker 1>now that we are able to index this, we could

0:18:09.800 --> 0:18:14.000
<v Speaker 1>actually download these little snapshots and keep them. And according

0:18:14.000 --> 0:18:18.560
<v Speaker 1>to the Internet Archive, that would be important because the

0:18:18.640 --> 0:18:23.119
<v Speaker 1>average lifespan for a new web page was not very long,

0:18:23.400 --> 0:18:27.320
<v Speaker 1>So contrary to our belief that once something is posted

0:18:27.359 --> 0:18:30.480
<v Speaker 1>to the Internet, it's there forever, the archive found that

0:18:30.520 --> 0:18:34.560
<v Speaker 1>on average, new web pages stuck around for about seventy

0:18:34.680 --> 0:18:38.679
<v Speaker 1>seven days, which means it's less than three months, and

0:18:38.720 --> 0:18:42.639
<v Speaker 1>then poof, they would disappear, like maybe they would change drastically,

0:18:42.680 --> 0:18:46.679
<v Speaker 1>maybe they would just go away. Now, imagine that you

0:18:46.720 --> 0:18:49.800
<v Speaker 1>were to walk into a brick and mortar library, but

0:18:49.880 --> 0:18:52.000
<v Speaker 1>then you found out that on average the books in

0:18:52.040 --> 0:18:54.639
<v Speaker 1>that library would only stick around for three months before

0:18:54.680 --> 0:18:57.720
<v Speaker 1>being lost forever. And think of all the knowledge that

0:18:57.760 --> 0:19:01.200
<v Speaker 1>would disappear on a regular basis and ongoing basis. It

0:19:01.200 --> 0:19:03.840
<v Speaker 1>would be impossible to calculate the impact of that kind

0:19:03.840 --> 0:19:06.200
<v Speaker 1>of reality. It would be like losing the Library of

0:19:06.240 --> 0:19:10.679
<v Speaker 1>Alexandria regularly every three months. So Kahle had come to

0:19:10.720 --> 0:19:14.160
<v Speaker 1>the conclusion that knowledge should be preserved and made available

0:19:14.200 --> 0:19:17.399
<v Speaker 1>for posterity. This is similar to an idea that was

0:19:17.440 --> 0:19:20.880
<v Speaker 1>proposed by Stewart Brand back in the nineteen eighties. It's

0:19:20.920 --> 0:19:24.560
<v Speaker 1>a complicated idea that typically gets boiled down to the

0:19:24.600 --> 0:19:29.679
<v Speaker 1>saying information wants to be free. That's actually an oversimplification

0:19:29.720 --> 0:19:33.800
<v Speaker 1>of what Brand was really communicating. But his point was

0:19:33.800 --> 0:19:37.040
<v Speaker 1>that information's value is kind of like a paradox. The

0:19:37.119 --> 0:19:41.440
<v Speaker 1>information could be incredibly valuable, right, it could be absolutely critical,

0:19:41.480 --> 0:19:45.439
<v Speaker 1>and therefore it could be expensive, but the cost of

0:19:45.480 --> 0:19:50.040
<v Speaker 1>distributing information was consistently declining. It was getting easier and

0:19:50.200 --> 0:19:54.120
<v Speaker 1>cheaper to share information, and the benefits of making information

0:19:54.240 --> 0:19:59.560
<v Speaker 1>accessible are typically pretty tremendous. But information is only accessible

0:20:00.119 --> 0:20:03.560
<v Speaker 1>if someone is able to hold onto that info. Otherwise

0:20:03.560 --> 0:20:06.520
<v Speaker 1>it's lost, right? The Internet was such a volatile thing

0:20:06.560 --> 0:20:09.119
<v Speaker 1>that there was no guarantee that what you saw today

0:20:09.520 --> 0:20:13.000
<v Speaker 1>would be available tomorrow. In the days before the dynamic web,

0:20:13.680 --> 0:20:16.639
<v Speaker 1>it wasn't really unusual for someone to establish a web page,

0:20:16.880 --> 0:20:20.159
<v Speaker 1>to publish that page, and then later on to wipe

0:20:20.160 --> 0:20:24.480
<v Speaker 1>the slate clean or you know, otherwise alter vast portions

0:20:24.480 --> 0:20:27.040
<v Speaker 1>of that page in order to use that same web

0:20:27.400 --> 0:20:31.400
<v Speaker 1>landscape to host a totally different document. So the old

0:20:31.440 --> 0:20:34.720
<v Speaker 1>stuff would just disappear. And so Kahle and Gilliat created

0:20:35.000 --> 0:20:40.119
<v Speaker 1>the Internet Archive, a nonprofit organization dedicated to the archival

0:20:40.440 --> 0:20:44.399
<v Speaker 1>of information across the Internet. And I think most people

0:20:44.800 --> 0:20:49.040
<v Speaker 1>are familiar with it from the Wayback Machine, but

0:20:49.080 --> 0:20:52.240
<v Speaker 1>that's just one part of what the Internet Archive does.

0:20:52.600 --> 0:20:55.199
<v Speaker 1>As stated in the Library of Congress, the mission of

0:20:55.240 --> 0:20:59.480
<v Speaker 1>the Internet Archive was quote offering permanent access for researchers,

0:20:59.520 --> 0:21:03.040
<v Speaker 1>historians, and scholars to historical collections that exist in

0:21:03.119 --> 0:21:07.040
<v Speaker 1>digital format end quote. Kahle and Gilliat founded the Internet

0:21:07.119 --> 0:21:09.600
<v Speaker 1>Archive the same year they founded Alexa Internet. So that's

0:21:09.720 --> 0:21:14.440
<v Speaker 1>nineteen ninety six. And it wasn't easy. And why is that? Well,

0:21:14.880 --> 0:21:17.280
<v Speaker 1>you got to think about the challenge you face if

0:21:17.320 --> 0:21:20.919
<v Speaker 1>you want to archive everything on the Internet, or at

0:21:21.000 --> 0:21:24.480
<v Speaker 1>least everything that you're allowed to archive on the Internet.

0:21:24.600 --> 0:21:26.600
<v Speaker 1>We'll come back to that a couple of times. So,

0:21:26.640 --> 0:21:28.240
<v Speaker 1>for one thing, you need to create a way to

0:21:28.320 --> 0:21:31.920
<v Speaker 1>capture the content of a web page and to preserve

0:21:31.960 --> 0:21:35.119
<v Speaker 1>that for posterity. And you need a way for people

0:21:35.280 --> 0:21:39.560
<v Speaker 1>to access those archived web pages and to navigate them.

0:21:39.800 --> 0:21:43.639
<v Speaker 1>So Alexa Internet would end up developing these technologies and

0:21:43.680 --> 0:21:47.320
<v Speaker 1>commercializing them in various ways, and the Internet Archive was

0:21:47.359 --> 0:21:51.119
<v Speaker 1>made possible through these tools. So you could think of

0:21:51.160 --> 0:21:56.000
<v Speaker 1>Alexa Internet as being the funding machine for Internet Archive

0:21:56.119 --> 0:21:58.600
<v Speaker 1>in the beginning, at least as far as the tools

0:21:58.680 --> 0:22:02.080
<v Speaker 1>Internet Archive would use in order to achieve its mission. Now,

0:22:02.119 --> 0:22:05.720
<v Speaker 1>on the capturing front, Alexa Internet created a web crawler.

0:22:06.000 --> 0:22:10.760
<v Speaker 1>So for applications like web search engines, primarily web search engines,

0:22:11.040 --> 0:22:14.919
<v Speaker 1>web crawlers are the soldiers that they send out. A

0:22:14.960 --> 0:22:19.080
<v Speaker 1>web crawler's job is to index content across the Internet

0:22:19.160 --> 0:22:22.119
<v Speaker 1>and to capture information about what the various web pages

0:22:22.160 --> 0:22:26.199
<v Speaker 1>on the Internet are actually about. It's complicated, right. You

0:22:26.240 --> 0:22:29.520
<v Speaker 1>could just have a directory of web pages that's based

0:22:29.520 --> 0:22:32.119
<v Speaker 1>off the title of the web pages, but title and

0:22:32.240 --> 0:22:36.280
<v Speaker 1>content are not always in alignment. So web crawlers are

0:22:36.320 --> 0:22:40.399
<v Speaker 1>all about following the various branching pathways across the web.

0:22:40.480 --> 0:22:43.520
<v Speaker 1>They crawl through the web, in other words, indexing every

0:22:43.640 --> 0:22:47.080
<v Speaker 1>page as they do so. Not everyone, however, wants their

0:22:47.080 --> 0:22:50.760
<v Speaker 1>web page indexed. So you can actually include some HTML

0:22:50.880 --> 0:22:54.840
<v Speaker 1>language in your web page that indicates that it's off

0:22:54.880 --> 0:22:58.760
<v Speaker 1>limits for indexing, and polite web crawlers, such as the

0:22:58.760 --> 0:23:03.000
<v Speaker 1>ones that Alexa Internet was using, will honor those instructions

0:23:03.040 --> 0:23:06.480
<v Speaker 1>and it will not index that page. But other pages

0:23:06.760 --> 0:23:11.639
<v Speaker 1>that lack this specific instruction of hey, don't index this,

0:23:12.359 --> 0:23:15.920
<v Speaker 1>they're fair game. I like to think of web crawlers

0:23:16.000 --> 0:23:18.440
<v Speaker 1>kind of like Doctor Strange from the Marvel Universe the

0:23:18.560 --> 0:23:21.399
<v Speaker 1>Cinematic Universe in particular, they all want. He uses his

0:23:21.520 --> 0:23:25.760
<v Speaker 1>time manipulation abilities to see where all the different possible

0:23:26.000 --> 0:23:29.800
<v Speaker 1>pathways can lead to. The web crawlers do that across

0:23:29.880 --> 0:23:32.440
<v Speaker 1>the web. They explore all the nooks and crannies. They

0:23:32.480 --> 0:23:35.560
<v Speaker 1>follow each link, even the ones that no one

0:23:35.640 --> 0:23:38.520
<v Speaker 1>ever clicks on, they follow those too. And you know,

0:23:38.640 --> 0:23:41.359
<v Speaker 1>hats off to web crawlers for doing that to build

0:23:41.359 --> 0:23:44.240
<v Speaker 1>out these indices, because without it, web search wouldn't work,

0:23:44.560 --> 0:23:49.919
<v Speaker 1>and Alexa Internet wouldn't have been a thing anyway. Alexa

0:23:49.960 --> 0:23:53.520
<v Speaker 1>Internet and by extension, the Internet Archive used several different

0:23:53.520 --> 0:23:56.240
<v Speaker 1>web crawlers over the years, but they all basically do

0:23:56.359 --> 0:23:59.119
<v Speaker 1>the same thing, or, you know, more accurately,

0:23:59.160 --> 0:24:02.800
<v Speaker 1>they all aimed to achieve the same results. So the

0:24:02.840 --> 0:24:06.280
<v Speaker 1>crawler starts with seed URLs. This is like the starting

0:24:06.320 --> 0:24:08.879
<v Speaker 1>point where you let them go, and then they follow

0:24:08.880 --> 0:24:11.920
<v Speaker 1>each link and they download documents to the archive's servers.

0:24:12.119 --> 0:24:15.640
<v Speaker 1>The crawlers also reference the links to ensure that they're

0:24:15.640 --> 0:24:20.119
<v Speaker 1>not double dipping on a specific crawl. So if you

0:24:20.160 --> 0:24:22.600
<v Speaker 1>have a ton of different sites that are all linking

0:24:22.680 --> 0:24:25.240
<v Speaker 1>to the same document, like let's say that someone has

0:24:25.440 --> 0:24:30.160
<v Speaker 1>published something, and hundreds of other resources on the internet

0:24:30.840 --> 0:24:34.960
<v Speaker 1>reference that published document. Well, that means there are all these

0:24:34.960 --> 0:24:38.360
<v Speaker 1>different pathways that lead to the same destination, right, and

0:24:38.720 --> 0:24:42.680
<v Speaker 1>it would be somewhat wasteful to capture this exact same

0:24:42.760 --> 0:24:48.160
<v Speaker 1>document multiple times during the same crawl, so there's cross

0:24:48.280 --> 0:24:51.400
<v Speaker 1>referencing that happens in order to prevent that from occurring.
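
NOTE
The crawl described above (start from seed URLs, follow each link, and cross-reference so nothing is captured twice in one crawl) can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the actual crawler used by Alexa Internet or the Internet Archive.
The TOY_WEB link graph, the NO_INDEX set, and the crawl function are made-up names, and no real network requests are made.
```python
from collections import deque
# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["c.example/report"],
    "b.example/": ["c.example/report", "b.example/private"],
    "b.example/private": [],
    "c.example/report": ["a.example/"],
}
# Hypothetical stand-in for pages that carry a "do not index" instruction.
NO_INDEX = {"b.example/private"}
def crawl(seeds, web, no_index):
    """Breadth-first crawl that captures each reachable page at most once."""
    seen = set(seeds)      # every URL already queued during this crawl
    queue = deque(seeds)   # URLs still waiting to be visited
    captured = []          # URLs captured, in visiting order
    while queue:
        url = queue.popleft()
        if url in no_index:
            continue       # a polite crawler skips opted-out pages
        captured.append(url)
        for link in web.get(url, []):
            if link not in seen:   # cross-reference: never queue a URL twice
                seen.add(link)
                queue.append(link)
    return captured
print(crawl(["a.example/"], TOY_WEB, NO_INDEX))
```
Two pages in this toy graph link to c.example/report, but the seen set means it is captured only once per crawl.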

0:24:52.000 --> 0:24:55.159
<v Speaker 1>This process does work, but it also has limitations. So

0:24:55.240 --> 0:24:58.600
<v Speaker 1>for one thing, these crawls they do create snapshots of

0:24:58.600 --> 0:25:01.640
<v Speaker 1>the web in intervals. So if you use the Wayback Machine,

0:25:02.000 --> 0:25:04.359
<v Speaker 1>we'll talk more about that in a second. You'll see

0:25:04.400 --> 0:25:06.879
<v Speaker 1>that the history of a web page consists of a

0:25:07.040 --> 0:25:10.919
<v Speaker 1>series of dates from which the Internet Archive first received

0:25:10.960 --> 0:25:13.720
<v Speaker 1>a snapshot of that page, and it leads all the

0:25:13.760 --> 0:25:17.000
<v Speaker 1>way up to the most recent reference of that page,

0:25:17.040 --> 0:25:20.560
<v Speaker 1>the most recent snapshot. The various dates in the Wayback

0:25:20.640 --> 0:25:24.359
<v Speaker 1>Machine are not necessarily relevant to any major changes that

0:25:24.480 --> 0:25:27.159
<v Speaker 1>happened on the web page itself. This is just when

0:25:27.640 --> 0:25:31.280
<v Speaker 1>the web crawlers went to that particular web page. So

0:25:31.880 --> 0:25:35.480
<v Speaker 1>it may be immediately after a massive change has been implemented,

0:25:35.520 --> 0:25:38.119
<v Speaker 1>it may be well after. In fact, there might be

0:25:38.240 --> 0:25:42.600
<v Speaker 1>a point where between web crawler visits a web page has

0:25:42.720 --> 0:25:45.520
<v Speaker 1>changed a couple of times. Well, that means that the

0:25:45.520 --> 0:25:48.320
<v Speaker 1>ones that are happening in between those changes aren't going

0:25:48.359 --> 0:25:51.200
<v Speaker 1>to be captured. It's just whatever was there the first

0:25:51.200 --> 0:25:53.760
<v Speaker 1>time the web crawler came through, and whatever was there

0:25:53.800 --> 0:25:57.200
<v Speaker 1>the next time the web crawler came through. So interesting

0:25:57.240 --> 0:25:59.359
<v Speaker 1>thing is that if a particular page does have a

0:25:59.480 --> 0:26:02.960
<v Speaker 1>ton of other links pointing to it, that page is

0:26:03.000 --> 0:26:06.880
<v Speaker 1>more likely to have very frequent snapshots throughout its history,

0:26:07.280 --> 0:26:12.280
<v Speaker 1>because again, through subsequent crawls, there are various routes that

0:26:12.359 --> 0:26:15.320
<v Speaker 1>take web crawlers through that web page, so they're more

0:26:15.480 --> 0:26:18.919
<v Speaker 1>likely to capture a snapshot of it. For pages that

0:26:18.960 --> 0:26:21.639
<v Speaker 1>have fewer links pointing to them, maybe there aren't that

0:26:21.720 --> 0:26:25.520
<v Speaker 1>many other web pages out there that cite this particular page,

0:26:25.720 --> 0:26:28.919
<v Speaker 1>they're more likely to have sporadic updates throughout their history.

0:26:28.960 --> 0:26:31.679
<v Speaker 1>You might pull up a page in the Wayback Machine

0:26:31.680 --> 0:26:36.000
<v Speaker 1>and see that there's only maybe half a dozen captures

0:26:36.160 --> 0:26:39.840
<v Speaker 1>of that particular page, and that means that there could

0:26:39.840 --> 0:26:42.800
<v Speaker 1>be a lot of changes that were missed in between visits.

0:26:43.160 --> 0:26:47.040
<v Speaker 1>So not everything gets captured in the Internet Archive. I

0:26:47.080 --> 0:26:51.080
<v Speaker 1>think that some people work under the mistaken presumption that

0:26:51.720 --> 0:26:55.200
<v Speaker 1>anything that was ever published to the web is captured

0:26:55.280 --> 0:26:58.439
<v Speaker 1>and archived there. That's not the case. It's whatever was

0:26:58.480 --> 0:27:00.920
<v Speaker 1>there when the web crawlers came through it. So, because

0:27:00.960 --> 0:27:03.359
<v Speaker 1>even the Internet Archive is not a perfect record of

0:27:03.440 --> 0:27:07.000
<v Speaker 1>everything that's ever happened on the web, other elements, like

0:27:07.040 --> 0:27:09.639
<v Speaker 1>I said, could also be lost to time due to

0:27:09.680 --> 0:27:13.200
<v Speaker 1>the complexity of web navigation. For example, so when web

0:27:13.240 --> 0:27:18.280
<v Speaker 1>designers started to incorporate things like Flash, which really is

0:27:18.320 --> 0:27:20.600
<v Speaker 1>no longer a thing but it was for a while,

0:27:20.880 --> 0:27:24.240
<v Speaker 1>or JavaScript, then the web crawlers that were being used

0:27:24.359 --> 0:27:26.880
<v Speaker 1>to index the web, a lot of them just couldn't

0:27:27.359 --> 0:27:30.879
<v Speaker 1>navigate these types of tools that were made through Flash

0:27:30.920 --> 0:27:34.840
<v Speaker 1>or JavaScript. So while human users could, and they could,

0:27:35.160 --> 0:27:39.680
<v Speaker 1>you know, interact with interfaces that had these tools created

0:27:39.720 --> 0:27:43.320
<v Speaker 1>through these various methods, web crawlers couldn't. And that meant

0:27:43.320 --> 0:27:46.680
<v Speaker 1>that if a website used like tools that were made

0:27:46.720 --> 0:27:50.800
<v Speaker 1>in JavaScript to act as the interface, the web crawler

0:27:50.880 --> 0:27:54.000
<v Speaker 1>might only be able to index the homepage, but not

0:27:54.080 --> 0:27:57.320
<v Speaker 1>any of the other links branching off from the homepage

0:27:57.359 --> 0:28:01.280
<v Speaker 1>because it couldn't navigate that same interface. So there's a

0:28:01.280 --> 0:28:04.199
<v Speaker 1>lot of stuff from that era that's lost to the

0:28:04.240 --> 0:28:07.320
<v Speaker 1>Internet Archive as well, simply because the crawlers just could

0:28:07.359 --> 0:28:11.560
<v Speaker 1>not navigate those pages. They were never captured. And like

0:28:11.600 --> 0:28:15.080
<v Speaker 1>I said, if you happen to have the instruction, the

0:28:15.200 --> 0:28:18.840
<v Speaker 1>HTML instruction not to index the site, well then that's

0:28:18.880 --> 0:28:21.119
<v Speaker 1>not going to be there either. Now let's move on

0:28:21.240 --> 0:28:25.160
<v Speaker 1>to another challenge, which is the storing of these files.

0:28:25.520 --> 0:28:29.960
<v Speaker 1>Indexing everything was one thing. How do you store everything

0:28:30.000 --> 0:28:32.960
<v Speaker 1>that can be indexed on the web in an archive?

0:28:33.880 --> 0:28:36.800
<v Speaker 1>That's what we're going to come back and explore after

0:28:36.840 --> 0:28:49.840
<v Speaker 1>we take another quick break to thank our sponsors. Okay,

0:28:50.360 --> 0:28:54.160
<v Speaker 1>so the Internet Archive, how do you store all the

0:28:54.200 --> 0:28:57.040
<v Speaker 1>information that you find across the web. Well, the big

0:28:57.080 --> 0:29:00.600
<v Speaker 1>one for web pages was that you had to figure

0:29:00.600 --> 0:29:03.840
<v Speaker 1>out where do you store and how do you organize

0:29:03.840 --> 0:29:06.640
<v Speaker 1>snapshots of the web so that one you have a

0:29:06.680 --> 0:29:09.320
<v Speaker 1>record of them, and two you can find what you're

0:29:09.360 --> 0:29:12.720
<v Speaker 1>looking for. You can navigate to the specific instance that

0:29:12.760 --> 0:29:16.000
<v Speaker 1>you're looking for. Keep in mind again, the archive's not

0:29:16.040 --> 0:29:18.800
<v Speaker 1>capturing everything. As I said before the break, there's a

0:29:18.840 --> 0:29:21.440
<v Speaker 1>lot of stuff that web crawlers could not access for

0:29:21.480 --> 0:29:25.000
<v Speaker 1>one reason or another. Those things would be either off

0:29:25.040 --> 0:29:28.080
<v Speaker 1>limits or inaccessible and thus would not be in the archive.

0:29:28.400 --> 0:29:31.880
<v Speaker 1>But everything else was still fair game. So to store

0:29:31.920 --> 0:29:35.880
<v Speaker 1>and organize everything, Alexa Internet created a new file format

0:29:36.000 --> 0:29:41.680
<v Speaker 1>called an ARC file, A-R-C. ARC files contain information about

0:29:41.720 --> 0:29:45.840
<v Speaker 1>all the stuff that's inside them, the metadata of the Internet.

0:29:46.000 --> 0:29:50.240
<v Speaker 1>So again, metadata is data about data. It makes the

0:29:50.280 --> 0:29:53.880
<v Speaker 1>small files inside the larger ARC files all self identifying,

0:29:54.000 --> 0:29:56.480
<v Speaker 1>so there's no need to actually build out an index.
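
NOTE
The idea of self-identifying records inside one larger container file can be sketched as follows.
This is a simplified illustration, assuming a made-up layout (one header line with URL, retrieval timestamp, and byte length, followed by the document bytes); it is not the real ARC specification.
The names write_record and read_records, the example URLs, and the timestamps are all hypothetical.
```python
import io
def write_record(out, url, fetched, body):
    """Append one record: a one-line header, then the raw document bytes."""
    header = f"{url} {fetched} {len(body)}\n".encode("utf-8")
    out.write(header + body + b"\n")
def read_records(stream):
    """Walk the container front to back, yielding (url, fetched, body)."""
    while True:
        header = stream.readline()
        if not header:
            break  # end of container
        url, fetched, length = header.decode("utf-8").split()
        body = stream.read(int(length))  # the header says how many bytes follow
        stream.read(1)  # consume the newline that separates records
        yield url, fetched, body
# Build a tiny in-memory container holding two captured documents.
container = io.BytesIO()
write_record(container, "http://a.example/", "19961101120000", b"<html>home</html>")
write_record(container, "http://a.example/faq", "19961102093000", b"<html>faq</html>")
container.seek(0)
for url, fetched, body in read_records(container):
    print(url, fetched, len(body))
```
Because each record carries its own URL, timestamp, and size, a reader can recover every document by scanning the file alone, without consulting a separate index.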

0:29:56.760 --> 0:30:00.480
<v Speaker 1>The self identifying information includes stuff like the URL for

0:30:00.600 --> 0:30:03.640
<v Speaker 1>the file, like what the URL for that particular document is,

0:30:03.880 --> 0:30:06.680
<v Speaker 1>how big the document is when it was retrieved, and

0:30:06.880 --> 0:30:10.160
<v Speaker 1>other stuff like that. Each ARC file would have a

0:30:10.200 --> 0:30:13.120
<v Speaker 1>capacity of around one hundred megabytes, and it was possible

0:30:13.160 --> 0:30:15.840
<v Speaker 1>for a single website to span multiple ARC files. I mean,

0:30:15.880 --> 0:30:18.120
<v Speaker 1>there's some big websites out there that have been around

0:30:18.120 --> 0:30:22.400
<v Speaker 1>for a long time, so yeah, sometimes a single ARC

0:30:22.480 --> 0:30:26.040
<v Speaker 1>file would just be a portion of that website. At first,

0:30:26.320 --> 0:30:30.160
<v Speaker 1>the Internet Archive stored all this information on magnetic tape,

0:30:30.440 --> 0:30:34.360
<v Speaker 1>So you would do this indexing of the web, all

0:30:34.400 --> 0:30:37.320
<v Speaker 1>these snapshots, and you would save it to magnetic tape.

0:30:37.400 --> 0:30:40.200
<v Speaker 1>I remember I used to work for a company, a

0:30:40.280 --> 0:30:44.120
<v Speaker 1>consulting firm that had magnetic tape backups. So it was

0:30:44.200 --> 0:30:48.040
<v Speaker 1>my job, one of my jobs to occasionally back up

0:30:48.120 --> 0:30:51.520
<v Speaker 1>all the data on our network to tape, and I

0:30:51.560 --> 0:30:54.720
<v Speaker 1>would have to swap tapes out and label them and

0:30:54.760 --> 0:30:58.240
<v Speaker 1>everything and archive them properly. The Internet Archive worked under

0:30:58.280 --> 0:31:01.560
<v Speaker 1>the same idea. It would capture a snapshot of all

0:31:01.680 --> 0:31:05.720
<v Speaker 1>the files across the web, save them to tape, and

0:31:06.280 --> 0:31:09.160
<v Speaker 1>that was how the Internet Archive kept track of things

0:31:09.200 --> 0:31:14.440
<v Speaker 1>for about three years. But eventually activity on the Internet

0:31:14.800 --> 0:31:16.760
<v Speaker 1>was such that that was not going to do it.

0:31:16.840 --> 0:31:19.640
<v Speaker 1>There were too many users who wanted to be able

0:31:19.880 --> 0:31:23.720
<v Speaker 1>to access things that were stored or saved within the

0:31:23.760 --> 0:31:27.680
<v Speaker 1>Internet Archive, and this method just couldn't keep up with

0:31:27.800 --> 0:31:30.560
<v Speaker 1>demand and necessity, as we all know, is the mother

0:31:30.640 --> 0:31:33.800
<v Speaker 1>of invention. So the Internet Archive needed an alternative way

0:31:33.840 --> 0:31:37.080
<v Speaker 1>to store these snapshots. And of course, the Web was

0:31:37.640 --> 0:31:41.080
<v Speaker 1>really growing dramatically, which is putting it lightly, and there

0:31:41.120 --> 0:31:43.320
<v Speaker 1>was a real need to step things up considerably. So

0:31:43.360 --> 0:31:46.600
<v Speaker 1>to that end, the staff at Internet Archive developed a

0:31:46.640 --> 0:31:52.080
<v Speaker 1>storage system they called the PetaBox PetaBox, and it was

0:31:52.120 --> 0:31:55.600
<v Speaker 1>called the PetaBox because it could house a petabyte of information.

0:31:55.960 --> 0:32:00.120
<v Speaker 1>A petabyte, in case you're curious, is a million gigabytes. Now,

0:32:00.160 --> 0:32:02.719
<v Speaker 1>the most recent data I have about the PetaBox storage

0:32:02.720 --> 0:32:05.920
<v Speaker 1>system actually comes from December twenty twenty one, so it's

0:32:05.960 --> 0:32:08.400
<v Speaker 1>a few years out of date. But at that time,

0:32:08.600 --> 0:32:11.760
<v Speaker 1>the Internet Archive was using two hundred and twelve petabytes

0:32:11.760 --> 0:32:15.160
<v Speaker 1>of storage, which is a lot. That wasn't all the

0:32:15.200 --> 0:32:20.000
<v Speaker 1>Wayback Machine, however. Only around fifty seven petabytes of that

0:32:20.440 --> 0:32:23.600
<v Speaker 1>was for the Wayback Machine. The rest was for other

0:32:23.680 --> 0:32:27.640
<v Speaker 1>things like archiving various forms of digital media as well

0:32:27.680 --> 0:32:32.920
<v Speaker 1>as what Internet Archive references as quote unquote unique data. Anyway,

0:32:33.320 --> 0:32:36.640
<v Speaker 1>the page on Internet Archive site says that the data

0:32:36.680 --> 0:32:40.240
<v Speaker 1>centers, there are four of them, that house the PetaBox

0:32:40.280 --> 0:32:44.280
<v Speaker 1>storage system, don't use air conditioning, which helps keep electric

0:32:44.320 --> 0:32:48.440
<v Speaker 1>bills down. They actually let the heat from the data

0:32:48.480 --> 0:32:52.440
<v Speaker 1>storage devices provide heating for the buildings that they're stored

0:32:52.480 --> 0:32:56.200
<v Speaker 1>in and that you know, this is all part of

0:32:56.240 --> 0:33:00.960
<v Speaker 1>a strategy to keep things at low cost but high

0:33:01.040 --> 0:33:05.440
<v Speaker 1>usability and high efficiency. So that's really the big requirements

0:33:05.480 --> 0:33:08.480
<v Speaker 1>for the PetaBox system. It has to be efficient. It

0:33:08.520 --> 0:33:12.520
<v Speaker 1>cannot require too much power to operate any single PetaBox.

0:33:12.760 --> 0:33:17.040
<v Speaker 1>Another requirement is that each rack of hard drive storage

0:33:17.080 --> 0:33:19.320
<v Speaker 1>has to hold a ton of hard drives. We're talking

0:33:19.440 --> 0:33:23.160
<v Speaker 1>like one hundred plus terabytes worth of hard drive space.

0:33:23.600 --> 0:33:26.920
<v Speaker 1>Another requirement is that to serve as an administrator, it

0:33:27.000 --> 0:33:30.640
<v Speaker 1>needs to be easy like it can't be complicated to

0:33:30.880 --> 0:33:37.440
<v Speaker 1>administrate this storage system, and according to Internet Archive, the

0:33:37.480 --> 0:33:40.840
<v Speaker 1>structure of this is such that you need about one

0:33:41.000 --> 0:33:44.640
<v Speaker 1>administrator for every petabyte worth of data, so you know,

0:33:44.720 --> 0:33:47.840
<v Speaker 1>that's like two hundred administrators. Essentially, the whole goal was

0:33:47.880 --> 0:33:52.480
<v Speaker 1>to create systems that were relatively inexpensive, relatively efficient, and

0:33:52.640 --> 0:33:56.160
<v Speaker 1>relatively easy to use. At least from an administrative perspective.

0:33:56.480 --> 0:33:59.640
<v Speaker 1>That's a really tall order. It's hard to meet all those

0:34:00.560 --> 0:34:03.360
<v Speaker 1>but the folks at Internet Archive made it happen, and

0:34:03.480 --> 0:34:07.480
<v Speaker 1>it was such a useful approach to storage and to

0:34:07.680 --> 0:34:10.719
<v Speaker 1>being able to organize the files within storage so that

0:34:10.800 --> 0:34:14.160
<v Speaker 1>you didn't have to build out indices that ultimately Internet

0:34:14.280 --> 0:34:21.160
<v Speaker 1>Archive would deploy this same strategy for other organizations and institutions. Okay,

0:34:21.239 --> 0:34:26.040
<v Speaker 1>but that's all about, you know, collecting and storing all

0:34:26.080 --> 0:34:30.640
<v Speaker 1>the information across the Internet. How do you access it?

0:34:30.920 --> 0:34:33.440
<v Speaker 1>How, as a user, how, as a researcher, are you

0:34:33.520 --> 0:34:39.320
<v Speaker 1>able to tap into this? Because again, unless accessibility is easy,

0:34:39.960 --> 0:34:42.440
<v Speaker 1>then there's not much point to doing this. You're just

0:34:42.480 --> 0:34:46.279
<v Speaker 1>making a record that nobody can reference. Well, I would

0:34:46.360 --> 0:34:51.680
<v Speaker 1>argue the most famous of the ways to access information

0:34:51.800 --> 0:34:54.879
<v Speaker 1>contained within the Internet Archive is the Wayback Machine, which

0:34:54.920 --> 0:34:58.960
<v Speaker 1>is specifically for web pages. The Internet Archive first introduced

0:34:59.000 --> 0:35:02.279
<v Speaker 1>the Wayback Machine in two thousand and one, and the

0:35:02.320 --> 0:35:05.160
<v Speaker 1>way it works is pretty simple. There's a little it's

0:35:05.239 --> 0:35:07.520
<v Speaker 1>kind of like a search bar, but it's a URL bar.

0:35:07.680 --> 0:35:10.520
<v Speaker 1>You put in a URL for the web page that

0:35:10.560 --> 0:35:13.799
<v Speaker 1>you're interested in, and the Wayback Machine pulls up the

0:35:13.840 --> 0:35:17.040
<v Speaker 1>snapshots that are contained within the archive if there are

0:35:17.080 --> 0:35:20.120
<v Speaker 1>any snapshots. As I mentioned earlier, not everything is in there,

0:35:20.200 --> 0:35:22.440
<v Speaker 1>but if it is in there, you will see options

0:35:22.440 --> 0:35:25.600
<v Speaker 1>available to you to look at the page at different

0:35:25.640 --> 0:35:28.239
<v Speaker 1>points in history. One thing I like to do is

0:35:28.320 --> 0:35:31.920
<v Speaker 1>look back at how famous web pages have changed in

0:35:32.000 --> 0:35:34.680
<v Speaker 1>their design over the years. If you put in something

0:35:34.719 --> 0:35:38.360
<v Speaker 1>like really big like CNN dot com, you can see

0:35:38.360 --> 0:35:41.359
<v Speaker 1>how the look and interface of that site has transitioned

0:35:41.640 --> 0:35:44.920
<v Speaker 1>during different eras across the web. I also used to

0:35:44.960 --> 0:35:47.920
<v Speaker 1>do this with the old website I worked for, HowStuffWorks

0:35:47.960 --> 0:35:50.560
<v Speaker 1>dot com. I mean, that's where TechStuff gets the

0:35:50.640 --> 0:35:53.880
<v Speaker 1>Stuff in its name, from HowStuffWorks dot com. I

0:35:54.000 --> 0:35:57.160
<v Speaker 1>like using the Wayback Machine to look at what the

0:35:57.200 --> 0:35:59.719
<v Speaker 1>site looked like when I first joined, which was a

0:36:00.040 --> 0:36:02.400
<v Speaker 1>in February two thousand and seven. In case you're curious.

0:36:02.680 --> 0:36:07.000
<v Speaker 1>It looks entirely different now than how it looked back then,

0:36:07.200 --> 0:36:09.360
<v Speaker 1>and through the Wayback Machine you can see what it

0:36:09.360 --> 0:36:12.400
<v Speaker 1>looked like back then. Also, these days, the Wayback Machine

0:36:12.440 --> 0:36:13.920
<v Speaker 1>is the only way I can see some of the

0:36:14.000 --> 0:36:17.840
<v Speaker 1>articles I wrote for that site, because the articles have

0:36:17.960 --> 0:36:23.040
<v Speaker 1>been either deleted or more likely rewritten over the time. Now.

0:36:23.040 --> 0:36:24.839
<v Speaker 1>To be fair to HowStuffWorks, a lot of

0:36:24.840 --> 0:36:28.000
<v Speaker 1>my writing was in the computers and electronics sections, and

0:36:28.120 --> 0:36:32.520
<v Speaker 1>obviously things change in those fields very quickly, and something

0:36:32.560 --> 0:36:37.320
<v Speaker 1>that was relevant fifteen years ago is definitely not relevant today.

0:36:37.880 --> 0:36:41.040
<v Speaker 1>So you have to replace old stuff on a regular basis.

0:36:41.160 --> 0:36:43.319
<v Speaker 1>But it is kind of sad that a lot of

0:36:43.320 --> 0:36:45.360
<v Speaker 1>my work, a lot of my work for the first

0:36:45.719 --> 0:36:49.040
<v Speaker 1>you know, ten years of my career doing this kind

0:36:49.040 --> 0:36:52.560
<v Speaker 1>of stuff, is not accessible unless you use something like

0:36:52.600 --> 0:36:55.440
<v Speaker 1>the Wayback Machine. Now, one super neat thing about the

0:36:55.440 --> 0:36:58.680
<v Speaker 1>Wayback Machine is that you can still follow links that

0:36:58.719 --> 0:37:02.600
<v Speaker 1>are on pages, like if the archive has those linked

0:37:02.640 --> 0:37:05.600
<v Speaker 1>assets also in the archive, then you're going to be

0:37:05.600 --> 0:37:08.120
<v Speaker 1>shown a record, and the record will be one that

0:37:08.160 --> 0:37:11.719
<v Speaker 1>was captured closest in time with the first page that

0:37:11.800 --> 0:37:15.319
<v Speaker 1>you were originally on. This sounds complicated. Let me give

0:37:15.320 --> 0:37:17.880
<v Speaker 1>an example. It makes it way easier. So let's say

0:37:18.080 --> 0:37:22.440
<v Speaker 1>that I visit the web capture the snapshot for HowStuffWorks

0:37:22.520 --> 0:37:26.680
<v Speaker 1>dot COM's homepage on February nineteenth, two thousand and seven.
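
NOTE
The "captured closest in time" lookup described above can be sketched roughly as follows.
This is a minimal illustration, assuming the capture dates for a linked page are available as a sorted, non-empty list; it does not claim to show how the Wayback Machine actually implements the lookup.
The function name closest_capture and the January and April dates are hypothetical; the February dates come from the example in this episode.
```python
from bisect import bisect_left
from datetime import date
def closest_capture(captures, target):
    """Return the capture date nearest to target; captures must be sorted and non-empty."""
    i = bisect_left(captures, target)            # first capture on or after target
    candidates = captures[max(i - 1, 0):i + 1]   # the neighbors on either side
    return min(candidates, key=lambda d: abs((d - target).days))
# Hypothetical capture history for one linked page.
linked_page_captures = [date(2007, 1, 5), date(2007, 2, 22), date(2007, 4, 1)]
# The homepage snapshot being browsed was taken on February 19, 2007.
print(closest_capture(linked_page_captures, date(2007, 2, 19)))
```
Only the two captures adjacent to the target date need to be compared, which is why a binary search over the sorted dates is enough.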

0:37:27.160 --> 0:37:30.440
<v Speaker 1>By the way, this snapshot on February nineteenth, two thousand

0:37:30.440 --> 0:37:33.400
<v Speaker 1>and seven is the closest date to when I started

0:37:33.480 --> 0:37:37.600
<v Speaker 1>working at that company that's in the archive. The actual

0:37:37.680 --> 0:37:40.960
<v Speaker 1>date when I started, the website was not captured on

0:37:41.000 --> 0:37:45.840
<v Speaker 1>that day. Anyway, by clicking around on this homepage, I

0:37:45.840 --> 0:37:49.399
<v Speaker 1>can actually follow links and it'll pull up archived links

0:37:49.440 --> 0:37:52.840
<v Speaker 1>of archived articles, which is really neat. And when I

0:37:52.880 --> 0:37:56.120
<v Speaker 1>did that, at one point, I clicked on a link

0:37:56.239 --> 0:38:01.320
<v Speaker 1>for more information or related articles to How Helicopters Work.

0:38:01.719 --> 0:38:06.320
<v Speaker 1>That page, the related page, was actually archived on February

0:38:06.360 --> 0:38:09.319
<v Speaker 1>twenty second, two thousand and seven. So one was on

0:38:09.360 --> 0:38:12.360
<v Speaker 1>February nineteenth, the other was February twenty second, but the

0:38:12.440 --> 0:38:16.800
<v Speaker 1>link still worked, right? Yes, these were two different pages

0:38:16.840 --> 0:38:20.560
<v Speaker 1>that were archived on two different days, but the nature

0:38:20.760 --> 0:38:26.120
<v Speaker 1>of the archive allows those links to still work between

0:38:26.160 --> 0:38:28.680
<v Speaker 1>the two, which is neat because I'm not just popping

0:38:28.719 --> 0:38:31.480
<v Speaker 1>around through a web of links. I'm also kind of

0:38:31.520 --> 0:38:36.040
<v Speaker 1>time traveling, right? I'm looking at a timeline of snapshots

0:38:36.280 --> 0:38:39.279
<v Speaker 1>that are all still interlinked together, even if they were

0:38:39.280 --> 0:38:42.839
<v Speaker 1>captured on different days. I think that's really cool. Now

0:38:42.880 --> 0:38:45.040
<v Speaker 1>it gets even more cool when you think about the

0:38:45.080 --> 0:38:48.440
<v Speaker 1>scale of this project. So, according to the Internet Archive itself,

0:38:48.640 --> 0:38:52.120
<v Speaker 1>the archive contains eight hundred and thirty five billion with

0:38:52.200 --> 0:38:55.439
<v Speaker 1>a B web pages. And as I mentioned earlier, that

0:38:55.680 --> 0:38:58.360
<v Speaker 1>just makes up part of all the data that's stored

0:38:58.400 --> 0:39:02.040
<v Speaker 1>on Internet Archive servers, because the organization is also home

0:39:02.080 --> 0:39:05.640
<v Speaker 1>to more than forty four million books and other texts,

0:39:06.000 --> 0:39:11.040
<v Speaker 1>fifteen million audio recordings, more than ten million videos, and

0:39:11.120 --> 0:39:15.040
<v Speaker 1>more than a million different pieces of software. Again, some

0:39:15.120 --> 0:39:19.040
<v Speaker 1>of this stuff might not be recorded anywhere else. There

0:39:19.080 --> 0:39:22.160
<v Speaker 1>may not be duplicates or copies of some of this

0:39:22.200 --> 0:39:26.799
<v Speaker 1>stuff anywhere else. While you might have things like Blu-ray,

0:39:26.920 --> 0:39:31.160
<v Speaker 1>DVDs, or whatever of some of those videos, others

0:39:31.239 --> 0:39:36.080
<v Speaker 1>might not have anything. And history is filled with instances

0:39:36.120 --> 0:39:40.759
<v Speaker 1>of media companies generating stuff or others, you know, independent

0:39:40.920 --> 0:39:45.359
<v Speaker 1>people too, generating stuff but not keeping a copy for posterity,

0:39:45.440 --> 0:39:48.880
<v Speaker 1>and then it's here and it's gone. Sometimes that's on purpose.

0:39:49.280 --> 0:39:52.719
<v Speaker 1>Sometimes it's a statement, like you make something ephemeral for

0:39:52.760 --> 0:39:56.400
<v Speaker 1>that very reason. Other times it's out of convenience, like

0:39:56.719 --> 0:40:00.719
<v Speaker 1>there are stories about how the BBC would regularly reuse

0:40:00.880 --> 0:40:05.719
<v Speaker 1>tapes and tape over previous programming because there was no

0:40:05.800 --> 0:40:13.359
<v Speaker 1>thought about preservation or a home theater industry. So there

0:40:13.400 --> 0:40:17.200
<v Speaker 1>are entire eras of stuff like Doctor Who that are

0:40:17.239 --> 0:40:21.160
<v Speaker 1>just gone or believed to be gone because the BBC

0:40:21.280 --> 0:40:25.000
<v Speaker 1>would just tape over old tapes and so you lost

0:40:25.040 --> 0:40:29.440
<v Speaker 1>whatever was on there originally. That's why things like the

0:40:29.440 --> 0:40:32.880
<v Speaker 1>Internet Archive exist: to avoid that in the case

0:40:32.920 --> 0:40:35.680
<v Speaker 1>of stuff that's stored across the Internet, to make sure

0:40:35.800 --> 0:40:39.239
<v Speaker 1>that there is an accessible record of those things and

0:40:39.239 --> 0:40:41.920
<v Speaker 1>that they don't just disappear. In two thousand and seven,

0:40:42.120 --> 0:40:45.640
<v Speaker 1>the state of California recognized the Internet Archive as an

0:40:45.680 --> 0:40:49.480
<v Speaker 1>official library, which was important. It's not just an honorarium.

0:40:49.760 --> 0:40:53.520
<v Speaker 1>It would allow the nonprofit organization to receive federal funding,

0:40:53.600 --> 0:40:56.360
<v Speaker 1>which is a pretty important development for the longevity of

0:40:56.400 --> 0:40:59.440
<v Speaker 1>the program. But while the usefulness of the organization is

0:40:59.440 --> 0:41:02.480
<v Speaker 1>beyond question, the methods that the Archive has used

0:41:02.680 --> 0:41:06.680
<v Speaker 1>have not always been met with universal approval. For example, recently,

0:41:06.920 --> 0:41:10.800
<v Speaker 1>the Internet Archive has been embroiled in a pretty nasty lawsuit.

0:41:10.960 --> 0:41:14.719
<v Speaker 1>It's called the Hachette versus Internet Archive suit, and it

0:41:14.760 --> 0:41:17.839
<v Speaker 1>revolves around a group of publishers that object to how

0:41:17.880 --> 0:41:21.880
<v Speaker 1>the Internet Archive scans physical books for the purposes of

0:41:22.000 --> 0:41:26.160
<v Speaker 1>lending them out as digital copies. Publishers are in the

0:41:26.160 --> 0:41:29.680
<v Speaker 1>business of publishing and selling copies of books, but for years,

0:41:29.680 --> 0:41:32.520
<v Speaker 1>libraries have existed in order to get copies of various

0:41:32.560 --> 0:41:35.200
<v Speaker 1>books and to make them available for lending. So libraries

0:41:35.239 --> 0:41:38.640
<v Speaker 1>have to purchase the books or have them donated to

0:41:38.760 --> 0:41:42.080
<v Speaker 1>the library, and then make those books available to lend

0:41:42.120 --> 0:41:46.399
<v Speaker 1>out to members of the library. The Internet Archive has

0:41:46.440 --> 0:41:49.879
<v Speaker 1>a controlled digital lending program to handle this sort of thing,

0:41:50.040 --> 0:41:54.800
<v Speaker 1>only we're talking about digital formats, not a physical copy

0:41:54.840 --> 0:41:58.600
<v Speaker 1>of a book. This is where things get tricky because obviously,

0:41:58.760 --> 0:42:02.000
<v Speaker 1>if you, as an American citizen at least, if you

0:42:02.040 --> 0:42:04.319
<v Speaker 1>go out and buy a copy of a book, you

0:42:04.360 --> 0:42:07.800
<v Speaker 1>can do whatever you like with your copy of that book,

0:42:08.040 --> 0:42:10.359
<v Speaker 1>apart from making your own copies of it and then

0:42:10.480 --> 0:42:14.120
<v Speaker 1>selling those. You can't do that. That's copyright infringement. But

0:42:14.200 --> 0:42:16.760
<v Speaker 1>if you own a physical copy of a book,

0:42:16.920 --> 0:42:19.560
<v Speaker 1>you can keep it for yourself. You could lend it

0:42:19.600 --> 0:42:22.040
<v Speaker 1>to a friend and let them read it, they return

0:42:22.080 --> 0:42:24.440
<v Speaker 1>it to you later. You could give the book away.

0:42:24.880 --> 0:42:28.120
<v Speaker 1>You could resell your copy to someone else, even if

0:42:28.160 --> 0:42:30.520
<v Speaker 1>you're selling it for a fraction of what the book

0:42:30.600 --> 0:42:33.160
<v Speaker 1>is going for in bookstores. You could do that. You

0:42:33.160 --> 0:42:35.560
<v Speaker 1>could even burn the darn thing if you're so inclined.

0:42:35.960 --> 0:42:38.719
<v Speaker 1>Just don't do that. Don't burn books. But all of

0:42:38.760 --> 0:42:42.920
<v Speaker 1>those things are permitted with your personal copy of the book. However,

0:42:43.160 --> 0:42:46.520
<v Speaker 1>a digital copy, well, now we're starting to talk about

0:42:46.600 --> 0:42:49.840
<v Speaker 1>different rules. So yes, you can lend out a physical

0:42:49.840 --> 0:42:52.920
<v Speaker 1>copy of a book. That's allowed. That's fair use. But

0:42:53.280 --> 0:42:57.400
<v Speaker 1>actually it's not even fair use. That's under laws of property.

0:42:57.719 --> 0:43:00.640
<v Speaker 1>But we won't get into all that. A digital copy

0:43:00.760 --> 0:43:04.200
<v Speaker 1>is a lot trickier because it's easy to replicate, much

0:43:04.239 --> 0:43:07.520
<v Speaker 1>easier than replicating a physical copy of a book, and

0:43:07.560 --> 0:43:11.160
<v Speaker 1>so different rules have developed to handle digital information compared

0:43:11.200 --> 0:43:14.880
<v Speaker 1>to stuff that's in our physical meatspace. So this

0:43:15.000 --> 0:43:18.600
<v Speaker 1>lawsuit argues that the Internet Archive first digitized physical books

0:43:18.640 --> 0:43:21.800
<v Speaker 1>without permission from the publishers, and that that was problem

0:43:21.880 --> 0:43:26.759
<v Speaker 1>number one. There's been some different arguments about that, like

0:43:27.000 --> 0:43:30.200
<v Speaker 1>if there was no ebook equivalent of the copy of

0:43:30.239 --> 0:43:33.799
<v Speaker 1>the book, if the publishers had not digitized that, that's

0:43:33.840 --> 0:43:37.560
<v Speaker 1>slightly different than if the publishers also offer an electronic

0:43:37.680 --> 0:43:40.560
<v Speaker 1>version of the physical books they sell. But the other

0:43:40.600 --> 0:43:43.880
<v Speaker 1>problem is that the Internet Archive received donations and funding

0:43:43.920 --> 0:43:46.480
<v Speaker 1>that in part stemmed from the practice of lending out

0:43:46.520 --> 0:43:49.640
<v Speaker 1>digitized books. So the publishers said that made the Internet

0:43:50.040 --> 0:43:54.200
<v Speaker 1>Archive's activities a commercial enterprise. In twenty twenty three, a

0:43:54.400 --> 0:43:57.400
<v Speaker 1>judge found in favor of the publishers, saying that the

0:43:57.400 --> 0:44:00.319
<v Speaker 1>Internet Archive failed to argue that their work fell under

0:44:00.360 --> 0:44:03.200
<v Speaker 1>the principles of fair use. Again, getting into fair use,

0:44:03.239 --> 0:44:06.000
<v Speaker 1>that's a whole thing, but generally speaking, fair use covers

0:44:06.040 --> 0:44:08.880
<v Speaker 1>a relatively narrow set of use cases in which the

0:44:09.000 --> 0:44:12.680
<v Speaker 1>copying or the use or the distribution of a copyrighted

0:44:12.719 --> 0:44:16.080
<v Speaker 1>work does not count as copyright infringement. But it has

0:44:16.160 --> 0:44:20.200
<v Speaker 1>to meet certain criteria, and it's only ever decided in

0:44:20.239 --> 0:44:23.279
<v Speaker 1>a court of law. It's not something that you

0:44:23.280 --> 0:44:27.920
<v Speaker 1>can just apply proactively. It's something that you use in

0:44:28.000 --> 0:44:31.200
<v Speaker 1>a defense if you're brought up on charges of copyright infringement.

0:44:31.400 --> 0:44:34.560
<v Speaker 1>So by the time you're actually talking fair use, it's

0:44:34.600 --> 0:44:38.480
<v Speaker 1>already pretty late in the game. But anyway, this particular

0:44:38.600 --> 0:44:43.000
<v Speaker 1>lawsuit is under appeal. The Internet Archive recently made final

0:44:43.120 --> 0:44:46.920
<v Speaker 1>arguments in the case. I have not seen anything about

0:44:46.960 --> 0:44:49.239
<v Speaker 1>the case being decided one way or the other since then,

0:44:49.560 --> 0:44:53.520
<v Speaker 1>so I'm not really sure which way it's going. Again,

0:44:54.200 --> 0:44:57.600
<v Speaker 1>I didn't see anything about a decision made, but then

0:44:57.760 --> 0:45:01.280
<v Speaker 1>most of the articles about this are about the initial trial

0:45:01.360 --> 0:45:04.439
<v Speaker 1>that happened in twenty twenty three, so hopefully I will

0:45:04.480 --> 0:45:08.000
<v Speaker 1>find some follow up on this at some point. But

0:45:08.480 --> 0:45:11.239
<v Speaker 1>there's no denying the Internet Archive has done a tremendous

0:45:11.280 --> 0:45:14.359
<v Speaker 1>amount of work in the field of knowledge preservation and

0:45:14.440 --> 0:45:18.160
<v Speaker 1>knowledge accessibility. Without the Internet Archive, there's no way of

0:45:18.200 --> 0:45:21.400
<v Speaker 1>knowing how much information would be lost to us forever.

0:45:21.760 --> 0:45:25.000
<v Speaker 1>Stuff that could have been incredibly useful or even just

0:45:25.160 --> 0:45:29.759
<v Speaker 1>diverting could be gone, and we'd never have a way

0:45:29.760 --> 0:45:33.200
<v Speaker 1>of retrieving it again. And I am very thankful that

0:45:33.280 --> 0:45:36.440
<v Speaker 1>an organization like the Internet Archive exists. If you're not

0:45:36.520 --> 0:45:38.800
<v Speaker 1>familiar with it, if you've never used it, I recommend

0:45:38.800 --> 0:45:42.560
<v Speaker 1>you check it out and explore the Internet Archive. Look

0:45:42.600 --> 0:45:45.040
<v Speaker 1>at some of the things that are in that archive,

0:45:45.280 --> 0:45:47.480
<v Speaker 1>like some of the books, some of the recordings. There's

0:45:47.520 --> 0:45:49.440
<v Speaker 1>some great stuff. I think there's like a quarter of

0:45:49.480 --> 0:45:53.720
<v Speaker 1>a million live performances archived just on the Internet Archive,

0:45:54.080 --> 0:45:58.719
<v Speaker 1>like live music performances. That alone is super cool. Anyway,

0:45:58.920 --> 0:46:03.080
<v Speaker 1>I hope you found this episode informative and entertaining. I

0:46:03.120 --> 0:46:06.560
<v Speaker 1>hope you check out the Internet Archive. I also very much

0:46:06.600 --> 0:46:09.560
<v Speaker 1>hope that you are all well and I will talk

0:46:09.600 --> 0:46:20.360
<v Speaker 1>to you again really soon. Tech Stuff is an iHeartRadio production.

0:46:20.640 --> 0:46:25.680
<v Speaker 1>For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts,

0:46:25.800 --> 0:46:27.800
<v Speaker 1>or wherever you listen to your favorite shows.