WEBVTT - Web Spider Is Our Hero

0:00:04.400 --> 0:00:12.520
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio. Hey there,

0:00:12.520 --> 0:00:16.240
<v Speaker 1>and welcome to TechStuff. I'm your host, Jonathan Strickland.

0:00:16.320 --> 0:00:18.880
<v Speaker 1>I'm an executive producer with iHeartRadio and I

0:00:19.000 --> 0:00:23.840
<v Speaker 1>love all things tech. And listener David reached out to

0:00:23.840 --> 0:00:26.040
<v Speaker 1>me on Twitter and said, I would like to hear

0:00:26.079 --> 0:00:30.160
<v Speaker 1>an episode on search engine spiders. He is our hero.

0:00:30.760 --> 0:00:33.800
<v Speaker 1>You got it, David. And if you get the "spider, he

0:00:33.920 --> 0:00:36.600
<v Speaker 1>is our hero" reference, let me know. So we're gonna

0:00:36.640 --> 0:00:40.920
<v Speaker 1>talk about the development of search engines and how they

0:00:40.960 --> 0:00:44.479
<v Speaker 1>work from admittedly a pretty high level, because to go

0:00:44.520 --> 0:00:48.240
<v Speaker 1>into great detail would probably take three or four episodes. Plus,

0:00:48.320 --> 0:00:52.640
<v Speaker 1>different search engines use slightly different strategies in order to

0:00:52.880 --> 0:00:57.640
<v Speaker 1>index and rank search results. And the reason I'm doing

0:00:57.680 --> 0:01:00.760
<v Speaker 1>all of that is because if we just talked about spiders,

0:01:01.200 --> 0:01:03.880
<v Speaker 1>it would be a fairly short episode. But what the

0:01:03.920 --> 0:01:07.720
<v Speaker 1>heck is a search engine spider? Well, to index the

0:01:07.800 --> 0:01:11.559
<v Speaker 1>contents of the World Wide Web, you need to search

0:01:11.600 --> 0:01:16.560
<v Speaker 1>around and find what's there first, right? You can't return

0:01:16.680 --> 0:01:19.319
<v Speaker 1>results without first knowing what is out there in the

0:01:19.360 --> 0:01:23.479
<v Speaker 1>first place. So a search engine spider is a bot

0:01:24.120 --> 0:01:28.200
<v Speaker 1>that does this. It crawls through the World Wide Web. Thus

0:01:28.240 --> 0:01:31.679
<v Speaker 1>the whole spider name. We'll learn more about what's actually

0:01:31.680 --> 0:01:34.119
<v Speaker 1>going on a little bit later, but to understand search,

0:01:34.600 --> 0:01:37.400
<v Speaker 1>we need a few more basics. So keep in mind

0:01:37.640 --> 0:01:40.440
<v Speaker 1>that all the stuff we see online, whether it's a

0:01:40.440 --> 0:01:43.840
<v Speaker 1>web page or it's a web service or whatever it

0:01:43.840 --> 0:01:47.600
<v Speaker 1>may be, it ultimately sits on a computer that is

0:01:47.600 --> 0:01:52.320
<v Speaker 1>connected to the Internet infrastructure, so it's connected to routers,

0:01:52.360 --> 0:01:56.000
<v Speaker 1>which are then connected to various servers and domain name

0:01:56.080 --> 0:01:59.080
<v Speaker 1>servers and all that kind of stuff. If you visit

0:01:59.120 --> 0:02:02.440
<v Speaker 1>the HowStuffWorks homepage, that's the site for the

0:02:02.440 --> 0:02:04.440
<v Speaker 1>company that I used to work for, don't work for

0:02:04.480 --> 0:02:08.600
<v Speaker 1>them anymore, but that website consists of pages that are

0:02:08.720 --> 0:02:12.520
<v Speaker 1>on a computer in a data center. If you happen

0:02:12.560 --> 0:02:15.160
<v Speaker 1>to know the URL for the site, so

0:02:15.280 --> 0:02:17.880
<v Speaker 1>you happen to know howstuffworks.com, you

0:02:17.919 --> 0:02:20.679
<v Speaker 1>can type that into a browser's URL bar or

0:02:20.800 --> 0:02:23.760
<v Speaker 1>address bar, and the browser will then take care of

0:02:23.960 --> 0:02:27.800
<v Speaker 1>sending the appropriate message to that computer. In this case,

0:02:27.840 --> 0:02:30.639
<v Speaker 1>we will call it a server, and the server will

0:02:30.720 --> 0:02:34.240
<v Speaker 1>then return the appropriate information to your browser: the web

0:02:34.280 --> 0:02:36.560
<v Speaker 1>page, maybe the home page for HowStuffWorks in

0:02:36.600 --> 0:02:39.720
<v Speaker 1>this case, and then you'll see the website. But all

0:02:39.760 --> 0:02:42.520
<v Speaker 1>of that requires that, first, you have to know that

0:02:42.639 --> 0:02:45.240
<v Speaker 1>there's a site there at all. Plus you have to

0:02:45.240 --> 0:02:47.119
<v Speaker 1>know the URL for it, and you might

0:02:47.280 --> 0:02:51.919
<v Speaker 1>not have that information. Before there even was a World Wide Web,

0:02:52.160 --> 0:02:54.360
<v Speaker 1>there was a need to know where you could find

0:02:54.400 --> 0:02:58.280
<v Speaker 1>stuff on the Internet. Now, remember, the Internet is older

0:02:58.320 --> 0:03:00.200
<v Speaker 1>than the Web, and the Internet and the Web are

0:03:00.280 --> 0:03:04.320
<v Speaker 1>not the same thing. The Web exists on top of

0:03:04.360 --> 0:03:07.120
<v Speaker 1>the Internet. The Internet consists of a lot of other stuff

0:03:07.160 --> 0:03:12.120
<v Speaker 1>besides the Web, right, like email and FTP servers. In fact,

0:03:12.760 --> 0:03:16.560
<v Speaker 1>we need to really talk about FTP servers. FTP stands

0:03:16.600 --> 0:03:21.200
<v Speaker 1>for File Transfer Protocol. So these are computers that house

0:03:21.520 --> 0:03:26.440
<v Speaker 1>certain files on them, and through FTP, this protocol

0:03:26.560 --> 0:03:30.160
<v Speaker 1>that allows for files to transfer from one computer to

0:03:30.320 --> 0:03:34.560
<v Speaker 1>another across a network connection, people can access files

0:03:34.560 --> 0:03:37.880
<v Speaker 1>and transfer them from the server to their own computer,

0:03:38.000 --> 0:03:41.240
<v Speaker 1>which in this case we would call a client. But again,

0:03:41.680 --> 0:03:45.400
<v Speaker 1>FTP is really only useful if you know the address

0:03:45.520 --> 0:03:49.520
<v Speaker 1>of the servers where the stuff is that you want. Right,

0:03:49.960 --> 0:03:52.480
<v Speaker 1>you can't just use FTP to pull a file out

0:03:52.520 --> 0:03:55.920
<v Speaker 1>of nowhere. You have to contact the proper server and

0:03:55.960 --> 0:04:00.960
<v Speaker 1>pull the relevant file from that server. Enter Alan Emtage,

0:04:01.200 --> 0:04:04.920
<v Speaker 1>who in nineteen nine was a graduate student at McGill

0:04:05.040 --> 0:04:09.000
<v Speaker 1>University in Montreal, Canada. He also worked as a systems

0:04:09.040 --> 0:04:13.280
<v Speaker 1>administrator for the School of Computer Science at the university,

0:04:13.320 --> 0:04:16.040
<v Speaker 1>and he was running into a challenge. It was his

0:04:16.200 --> 0:04:20.719
<v Speaker 1>job to locate software for professors, for staff, for students

0:04:20.760 --> 0:04:23.080
<v Speaker 1>at the university, but there was no easy way to

0:04:23.160 --> 0:04:26.640
<v Speaker 1>know where all the various files were on the network

0:04:26.760 --> 0:04:31.080
<v Speaker 1>of public FTP servers. Emtage decided there needed to be

0:04:31.200 --> 0:04:34.839
<v Speaker 1>a way to get a snapshot of which public FTP

0:04:35.080 --> 0:04:39.559
<v Speaker 1>servers had which files. There needed to be some sort

0:04:39.600 --> 0:04:43.760
<v Speaker 1>of directory, and since servers were popping up more frequently

0:04:43.960 --> 0:04:48.080
<v Speaker 1>as more people began to develop stuff for the Internet,

0:04:48.360 --> 0:04:50.360
<v Speaker 1>there also needed to be a good way to search

0:04:50.520 --> 0:04:54.680
<v Speaker 1>those lists to find something specific. Otherwise it would be

0:04:54.720 --> 0:04:57.479
<v Speaker 1>like reading through an entire phone book to find out

0:04:57.480 --> 0:05:00.520
<v Speaker 1>which person or business corresponded to a phone number you

0:05:00.600 --> 0:05:03.200
<v Speaker 1>happen to have seen. Let's say the phone number was

0:05:03.279 --> 0:05:06.000
<v Speaker 1>eight six seven five three oh nine, and you don't

0:05:06.040 --> 0:05:08.440
<v Speaker 1>know that that's Jenny's phone number. You just know it's

0:05:08.440 --> 0:05:11.560
<v Speaker 1>the number. So instead of calling the number and asking, Hey,

0:05:11.600 --> 0:05:13.960
<v Speaker 1>whose number is this, you get a phone book and

0:05:14.000 --> 0:05:16.720
<v Speaker 1>you start searching for eight six seven five three oh nine to find

0:05:16.760 --> 0:05:20.960
<v Speaker 1>the corresponding name. That is not efficient. In fact, in

0:05:21.000 --> 0:05:24.559
<v Speaker 1>the early days, information about servers frequently had no other

0:05:24.720 --> 0:05:29.000
<v Speaker 1>real channel to get to users other than word of mouth.

0:05:29.160 --> 0:05:32.800
<v Speaker 1>So there was a really good chance that there was

0:05:32.839 --> 0:05:35.680
<v Speaker 1>stuff that was relevant to you that you just had

0:05:35.720 --> 0:05:38.000
<v Speaker 1>no way of knowing about because you had to hear

0:05:38.040 --> 0:05:41.680
<v Speaker 1>it from somebody else first. Emtage, along with a couple

0:05:41.720 --> 0:05:44.360
<v Speaker 1>of other folks like Bill Heelan and J. Peter Deutsch,

0:05:44.760 --> 0:05:48.200
<v Speaker 1>began building a tool to solve this problem. They ended

0:05:48.279 --> 0:05:52.680
<v Speaker 1>up calling this tool Archie, which actually was not a

0:05:52.720 --> 0:05:56.800
<v Speaker 1>nod to the comic book character from Archie Comics. Instead,

0:05:57.040 --> 0:06:01.200
<v Speaker 1>Archie was a somewhat shortened form of the word archives.

0:06:01.920 --> 0:06:05.600
<v Speaker 1>They created programs that could look through the repositories of

0:06:05.720 --> 0:06:09.680
<v Speaker 1>public FTP sites and get an inventory of the files

0:06:09.720 --> 0:06:13.599
<v Speaker 1>stored on those servers. Or, as documented in the book

0:06:13.800 --> 0:06:17.320
<v Speaker 1>A Rough Guide to the Internet by Nicholas West, quote,

0:06:17.800 --> 0:06:22.440
<v Speaker 1>it combined a script based data gatherer which accessed listings

0:06:22.440 --> 0:06:26.600
<v Speaker 1>from anonymous sites with a script which matched regular expressions

0:06:26.600 --> 0:06:29.960
<v Speaker 1>which could retrieve file names matching a user query, end

0:06:30.040 --> 0:06:34.279
<v Speaker 1>quote. Simple, right? Now, in case you're like me and

0:06:34.480 --> 0:06:38.160
<v Speaker 1>what I just quoted sounded a little bit confusing, what

0:06:38.240 --> 0:06:40.719
<v Speaker 1>it really boils down to is to say they made

0:06:40.800 --> 0:06:44.800
<v Speaker 1>a computer program that followed some fairly simple rules. The

0:06:44.880 --> 0:06:47.720
<v Speaker 1>program made note of the file titles that were on

0:06:47.880 --> 0:06:51.520
<v Speaker 1>various FTP servers, kind of like a list of contents,

0:06:51.920 --> 0:06:55.680
<v Speaker 1>and they noted which files were on which servers. Another

0:06:55.680 --> 0:06:58.839
<v Speaker 1>part of the program arranged those findings into a database

0:06:59.360 --> 0:07:02.080
<v Speaker 1>not that much different from the types of spreadsheets you've

0:07:02.080 --> 0:07:05.520
<v Speaker 1>probably worked with in the past. Emtage and crew also

0:07:05.640 --> 0:07:08.960
<v Speaker 1>created a tool that would allow them to search this database.
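
NOTE Not from the episode: a minimal Python sketch of the idea just described, a gatherer that records which files sit on which servers, plus a regular-expression search over that list. The server names, file names, and function names are invented for illustration, and this is an assumption-laden toy, not Archie's actual code.
```python
import re
# Hypothetical listings: each server name maps to the file names it hosts.
LISTINGS = {
    "ftp.example.edu": ["readme.txt", "gnuchess-3.1.tar.Z"],
    "ftp.example.org": ["kermit.zip", "gnuchess-4.0.tar.Z"],
}
def build_index(listings):
    """Flatten the per-server listings into (file name, server) rows."""
    return [(name, server) for server, names in sorted(listings.items()) for name in names]
def search(index, pattern):
    """Return the rows whose file name matches the regular expression."""
    regex = re.compile(pattern)
    return [(name, server) for name, server in index if regex.search(name)]
index = build_index(LISTINGS)
hits = search(index, r"^gnuchess")
print(hits)
```
The search only looks at file names, which mirrors the limitation in the episode: you have to already know roughly what the file is called.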

0:07:09.440 --> 0:07:12.040
<v Speaker 1>Before long, other people began to hear that he had

0:07:12.080 --> 0:07:14.840
<v Speaker 1>this database and that they would ask him, Hey, can

0:07:14.880 --> 0:07:16.480
<v Speaker 1>you do a search for me, and they would give

0:07:16.520 --> 0:07:18.960
<v Speaker 1>him the search terms, and it started taking up a

0:07:18.960 --> 0:07:22.040
<v Speaker 1>lot of his time. So in an effort to streamline things,

0:07:22.320 --> 0:07:26.320
<v Speaker 1>he programmed a user interface or UI that would allow

0:07:26.360 --> 0:07:29.440
<v Speaker 1>people to conduct their own searches. They could just log

0:07:29.480 --> 0:07:32.200
<v Speaker 1>into this tool and then type in the file that

0:07:32.240 --> 0:07:34.960
<v Speaker 1>they were looking for and it would return the results

0:07:35.000 --> 0:07:37.960
<v Speaker 1>for them. So as long as they were sure about

0:07:38.000 --> 0:07:41.480
<v Speaker 1>the specific file they needed, they would get the results. Now,

0:07:41.560 --> 0:07:45.239
<v Speaker 1>most resources generally agree that Archie was the first real

0:07:45.480 --> 0:07:49.120
<v Speaker 1>search engine on the Internet, but it wasn't a web

0:07:49.320 --> 0:07:53.200
<v Speaker 1>search engine. The Web didn't exist yet. It wasn't long

0:07:53.240 --> 0:07:58.160
<v Speaker 1>before a couple of other tools followed. In some researchers

0:07:58.160 --> 0:08:01.200
<v Speaker 1>with the University of Minnesota developed a new tool to

0:08:01.440 --> 0:08:05.840
<v Speaker 1>organize and discover documents stored on servers, and the tool

0:08:05.960 --> 0:08:10.840
<v Speaker 1>was called the Gopher protocol. Servers were data repositories called

0:08:11.000 --> 0:08:14.800
<v Speaker 1>Gopher holes (eventually that's what they were called, anyway), and

0:08:14.880 --> 0:08:19.440
<v Speaker 1>Gopher organized everything into a hierarchical text based menu system.

0:08:19.520 --> 0:08:23.040
<v Speaker 1>So this was a specific strategy that was built on

0:08:23.080 --> 0:08:25.320
<v Speaker 1>top of the Internet. You can kind of think of

0:08:25.320 --> 0:08:29.400
<v Speaker 1>it as being in parallel with the Web. It predated

0:08:29.440 --> 0:08:31.840
<v Speaker 1>the Web, but the Web and Gopher would exist at

0:08:31.880 --> 0:08:34.240
<v Speaker 1>the same time, but they were not the same thing.

0:08:34.640 --> 0:08:38.679
<v Speaker 1>This was a different strategy in order to serve information

0:08:38.720 --> 0:08:45.360
<v Speaker 1>across the networks. Before long, like in more researchers developed

0:08:45.400 --> 0:08:48.600
<v Speaker 1>a search function to work on top of Gopher. Because again,

0:08:49.240 --> 0:08:52.319
<v Speaker 1>if you didn't know where something actually quote unquote lived

0:08:52.800 --> 0:08:56.160
<v Speaker 1>in the Gopher network, you would never be able to

0:08:56.200 --> 0:08:59.520
<v Speaker 1>find it unless you were just lucky. So this search

0:08:59.600 --> 0:09:03.400
<v Speaker 1>tool was called Veronica, and that is pretty cute because

0:09:04.080 --> 0:09:08.719
<v Speaker 1>Veronica is a character in the Archie comic books. And

0:09:08.800 --> 0:09:12.400
<v Speaker 1>while the search engine Archie did not pull its name

0:09:12.520 --> 0:09:15.840
<v Speaker 1>from Archie Comics, Veronica was a nod to the older

0:09:15.880 --> 0:09:19.160
<v Speaker 1>search engine as well as a nod to the comic book,

0:09:19.200 --> 0:09:23.920
<v Speaker 1>so it almost kind of retroactively made Archie relate back

0:09:23.960 --> 0:09:29.560
<v Speaker 1>to the comics. Later, computer geeks assigned a backronym to Veronica.

0:09:29.920 --> 0:09:33.080
<v Speaker 1>This is an acronym that you create after you've already

0:09:33.160 --> 0:09:35.080
<v Speaker 1>named a thing. So you've given the thing a name,

0:09:35.320 --> 0:09:37.360
<v Speaker 1>and then you're thinking, okay, well, what can we say

0:09:37.400 --> 0:09:41.120
<v Speaker 1>each of those letters stands for? That's a backronym. And

0:09:41.160 --> 0:09:45.600
<v Speaker 1>in this case, the revisionist name was Very Easy

0:09:45.840 --> 0:09:51.400
<v Speaker 1>Rodent-Oriented Net-wide Index to Computer Archives, or VERONICA. Super cute.

0:09:51.920 --> 0:09:56.040
<v Speaker 1>What Veronica did was fairly primitive. It created a database

0:09:56.160 --> 0:10:00.199
<v Speaker 1>of every file and every directory on every Gopher

0:10:00.280 --> 0:10:02.960
<v Speaker 1>server that was connected to the Internet, and it

0:10:02.960 --> 0:10:07.040
<v Speaker 1>would update dynamically as more servers joined the network. That

0:10:07.080 --> 0:10:10.599
<v Speaker 1>approach worked fairly well when there was still a relatively

0:10:10.679 --> 0:10:14.440
<v Speaker 1>small number of servers to keep track of, But as

0:10:14.520 --> 0:10:18.200
<v Speaker 1>more servers came online and joined this Gopher network, with

0:10:18.280 --> 0:10:22.199
<v Speaker 1>more documents stored on each server, it started to get

0:10:22.200 --> 0:10:26.800
<v Speaker 1>really, you know, challenging to manage Veronica. A secondary

0:10:26.840 --> 0:10:30.040
<v Speaker 1>Gopher search tool kind of addressed this problem, and this

0:10:30.080 --> 0:10:33.520
<v Speaker 1>one also took its name from a character from Archie comics,

0:10:34.040 --> 0:10:37.840
<v Speaker 1>Jughead. This one didn't create a full database of

0:10:37.960 --> 0:10:42.040
<v Speaker 1>everything that was on the Gopher network. Instead, as a user,

0:10:42.360 --> 0:10:45.920
<v Speaker 1>you would have to designate which Gopher server you wanted

0:10:45.960 --> 0:10:48.320
<v Speaker 1>to search, so you had to at least have some

0:10:48.480 --> 0:10:51.640
<v Speaker 1>general idea of where it was you needed to look.

0:10:52.360 --> 0:10:54.199
<v Speaker 1>But if you did know that, it was a much

0:10:54.240 --> 0:10:57.920
<v Speaker 1>faster approach than trying to search everything on the network

0:10:57.920 --> 0:11:01.760
<v Speaker 1>as a whole. Gopher had a major problem, and that

0:11:01.920 --> 0:11:06.079
<v Speaker 1>was that it was becoming increasingly less efficient and easy to navigate

0:11:06.600 --> 0:11:11.360
<v Speaker 1>the larger it got. It didn't scale well. Meanwhile, at

0:11:11.360 --> 0:11:14.240
<v Speaker 1>the same time that Gopher was growing, a guy named

0:11:14.400 --> 0:11:17.959
<v Speaker 1>Tim Berners-Lee over at CERN. You know, that's the

0:11:18.000 --> 0:11:22.120
<v Speaker 1>research facility that oversees stuff like the Large Hadron Collider. Well,

0:11:22.160 --> 0:11:26.240
<v Speaker 1>he was developing a different approach to storing and sharing

0:11:26.280 --> 0:11:30.199
<v Speaker 1>information across networks. Tim and his team at CERN developed

0:11:30.240 --> 0:11:34.160
<v Speaker 1>a protocol called Hypertext Transfer Protocol, or HTTP,

0:11:34.360 --> 0:11:39.040
<v Speaker 1>and Hypertext Markup Language, or HTML. Both

0:11:39.040 --> 0:11:41.240
<v Speaker 1>of these kind of grew out of stuff that CERN

0:11:41.320 --> 0:11:44.600
<v Speaker 1>had been using internally for a while. Now I'm guessing

0:11:44.640 --> 0:11:47.240
<v Speaker 1>those terms sound familiar to you, guys. These are the

0:11:47.280 --> 0:11:51.040
<v Speaker 1>two components that really formed the basis of web pages

0:11:51.120 --> 0:11:54.760
<v Speaker 1>and the World Wide Web. The markup language acts as the

0:11:54.800 --> 0:11:58.440
<v Speaker 1>set of instructions on how a computer, or more specifically,

0:11:58.480 --> 0:12:04.040
<v Speaker 1>how a browser is to interpret and display documents, eventually

0:12:04.040 --> 0:12:07.880
<v Speaker 1>including stuff like images and sound files. Although initially the

0:12:07.920 --> 0:12:11.640
<v Speaker 1>web was strictly text based and browsers were text based

0:12:11.640 --> 0:12:15.240
<v Speaker 1>as well. HTTP is the set of

0:12:15.360 --> 0:12:19.320
<v Speaker 1>rules and the processes through which a client that being

0:12:19.400 --> 0:12:23.680
<v Speaker 1>your web browser, can request a specific document from a server,

0:12:24.280 --> 0:12:27.600
<v Speaker 1>and how the server can then send that requested document

0:12:27.640 --> 0:12:30.400
<v Speaker 1>to the browser. The server sends the HTML

0:12:30.520 --> 0:12:33.440
<v Speaker 1>files to the client, and the client interprets those

0:12:33.559 --> 0:12:37.200
<v Speaker 1>HTML files in order to display the relevant web page

0:12:37.240 --> 0:12:42.040
<v Speaker 1>to the user. Hypertext refers to text that has a

0:12:42.120 --> 0:12:45.480
<v Speaker 1>link to some other text, and you can think of

0:12:45.520 --> 0:12:48.360
<v Speaker 1>it kind of like a footnote in a book. The

0:12:48.480 --> 0:12:52.840
<v Speaker 1>hypertext has an asterisk that corresponds to another piece of

0:12:52.880 --> 0:12:57.440
<v Speaker 1>information somewhere that is also marked by an asterisk, except

0:12:57.480 --> 0:13:02.600
<v Speaker 1>in this case the asterisks are invisible. It's highlighted text

0:13:02.720 --> 0:13:05.360
<v Speaker 1>or text in a different color, or it's underlined.

0:13:05.840 --> 0:13:08.480
<v Speaker 1>It's designated in some way to be different from all

0:13:08.520 --> 0:13:10.719
<v Speaker 1>the rest of the text. That's what lets you know

0:13:10.960 --> 0:13:14.800
<v Speaker 1>it's hypertext and it's linked to something else. Hypertext documents

0:13:14.840 --> 0:13:18.800
<v Speaker 1>connect to one another through hyperlinks. Those documents don't even

0:13:18.840 --> 0:13:20.640
<v Speaker 1>have to be on the same server. They can be

0:13:20.679 --> 0:13:23.240
<v Speaker 1>on opposite sides of the world. So this means you

0:13:23.240 --> 0:13:27.679
<v Speaker 1>can build a reference in one hypertext document to content

0:13:27.760 --> 0:13:31.640
<v Speaker 1>that's found on a totally different hypertext document. Clicking on

0:13:31.840 --> 0:13:35.840
<v Speaker 1>that hypertext activates the link. It sends a command to

0:13:36.360 --> 0:13:39.440
<v Speaker 1>the browser, which then relays that command to the server

0:13:39.800 --> 0:13:43.880
<v Speaker 1>that the client wants to see specific linked information, and

0:13:44.040 --> 0:13:47.600
<v Speaker 1>the server returns that. You can also link to locations

0:13:47.640 --> 0:13:50.080
<v Speaker 1>that are within the same page of a document, or

0:13:50.200 --> 0:13:53.440
<v Speaker 1>specific locations on other pages. It doesn't have to just

0:13:53.520 --> 0:13:55.559
<v Speaker 1>be click on this and you go to a new

0:13:55.559 --> 0:13:57.880
<v Speaker 1>web page. It might be click on this and you

0:13:58.120 --> 0:14:02.000
<v Speaker 1>skip down, you know, a significant number of paragraphs to

0:14:02.040 --> 0:14:05.360
<v Speaker 1>get to the relevant information. Really, all the link is

0:14:05.400 --> 0:14:09.920
<v Speaker 1>doing is telling the browser where some specific point in

0:14:09.960 --> 0:14:12.680
<v Speaker 1>a specific document happens to be and how to get there.
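
NOTE Not from the episode: a small Python sketch of what a link's address encodes, namely which server, which document, and optionally which spot inside that document. It uses the standard library's urllib.parse; the page address and link targets are hypothetical examples, not anything quoted in the show.
```python
from urllib.parse import urljoin, urlsplit
# Hypothetical address of the page we are currently reading.
page = "http://example.com/recipes/souffle.html"
# Hypothetical link targets: another page on the same server, a spot on
# this same page, and a spot on a page hosted by a different server.
links = ["eggs.html", "#step-3", "http://example.org/ovens.html#preheat"]
# Resolve each link against the current page, as a browser would.
resolved = [urljoin(page, href) for href in links]
for target in resolved:
    parts = urlsplit(target)
    print(parts.netloc, parts.path, parts.fragment or "(top of page)")
```
The fragment after the "#" is the "skip down a number of paragraphs" case: it names a point inside the document rather than a different document.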

0:14:13.000 --> 0:14:14.640
<v Speaker 1>It's kind of like if you were reading a book

0:14:14.640 --> 0:14:17.080
<v Speaker 1>that said, "Want to know more? Skip to page

0:14:17.080 --> 0:14:20.400
<v Speaker 1>nineteen and read the third paragraph." Or sometimes

0:14:20.440 --> 0:14:22.760
<v Speaker 1>I compare it to those old choose your own adventure

0:14:22.800 --> 0:14:25.000
<v Speaker 1>books where you get to the bottom of a page

0:14:25.080 --> 0:14:27.160
<v Speaker 1>and you have to make a decision, and based on

0:14:27.200 --> 0:14:28.960
<v Speaker 1>which decision you make, you have to turn to a

0:14:28.960 --> 0:14:32.280
<v Speaker 1>specific page to pick up the story again. Well, you

0:14:32.280 --> 0:14:35.360
<v Speaker 1>can quickly see how this would be really useful. Let's

0:14:35.360 --> 0:14:37.400
<v Speaker 1>say I want to make a web page that includes

0:14:37.520 --> 0:14:40.560
<v Speaker 1>directions on how to perform a particular process. We're gonna

0:14:40.600 --> 0:14:43.640
<v Speaker 1>call it baking a soufflé, and the steps I

0:14:43.760 --> 0:14:48.160
<v Speaker 1>list include references to other, maybe slightly less involved processes

0:14:48.240 --> 0:14:50.960
<v Speaker 1>that are part of this, and I don't go into

0:14:51.000 --> 0:14:54.240
<v Speaker 1>explaining how those work. Let's say, like I talk about

0:14:54.280 --> 0:14:56.080
<v Speaker 1>cracking eggs, but I don't tell you the best way

0:14:56.080 --> 0:14:59.440
<v Speaker 1>to crack an egg. However, I could create hypertext links

0:14:59.520 --> 0:15:03.120
<v Speaker 1>to other sets of instructions, maybe a specific page just

0:15:03.320 --> 0:15:06.040
<v Speaker 1>on different ways to crack an egg, And that way

0:15:06.040 --> 0:15:08.440
<v Speaker 1>you could go and look that up if you weren't confident.

0:15:08.760 --> 0:15:10.720
<v Speaker 1>So if you don't know how something works, you can

0:15:10.720 --> 0:15:13.280
<v Speaker 1>click on that other link and go to a page

0:15:13.320 --> 0:15:17.240
<v Speaker 1>that's dedicated to that in order to learn more. And yes,

0:15:17.560 --> 0:15:20.600
<v Speaker 1>I just described how the web works in general, which

0:15:20.640 --> 0:15:23.240
<v Speaker 1>is something I'm sure you all know at least at

0:15:23.280 --> 0:15:25.920
<v Speaker 1>some level, even if it's not you know, a formal one.

0:15:26.240 --> 0:15:29.400
<v Speaker 1>But if you've ever been on say Wikipedia, reading up

0:15:29.400 --> 0:15:32.360
<v Speaker 1>on a topic and saw a hypertext link and thought, yeah,

0:15:32.520 --> 0:15:34.840
<v Speaker 1>I should find out what this term means. I don't

0:15:34.920 --> 0:15:37.880
<v Speaker 1>understand it. So you click on that and you go

0:15:38.040 --> 0:15:40.400
<v Speaker 1>follow that so you can get better understanding, then that's

0:15:40.400 --> 0:15:43.400
<v Speaker 1>the use case I'm referring to. And there's a conversation

0:15:43.480 --> 0:15:46.440
<v Speaker 1>we could have about what actually goes on when you

0:15:46.520 --> 0:15:49.960
<v Speaker 1>click a link, but that would require a deeper dive

0:15:50.000 --> 0:15:53.160
<v Speaker 1>into how the web and by extension, how the Internet

0:15:53.200 --> 0:15:55.680
<v Speaker 1>works on a very technical level, and I think that

0:15:55.720 --> 0:15:57.640
<v Speaker 1>goes beyond the scope of what we're trying to do

0:15:57.680 --> 0:16:00.840
<v Speaker 1>in this episode. So I'm going to simplify, perhaps

0:16:00.840 --> 0:16:04.760
<v Speaker 1>to a ludicrous degree, and say that a link contains

0:16:04.800 --> 0:16:08.480
<v Speaker 1>within it the information about where another document, or even

0:16:08.560 --> 0:16:13.160
<v Speaker 1>a specific point within another document exists, and activating that

0:16:13.240 --> 0:16:16.160
<v Speaker 1>link by clicking on it in a browser initiates a

0:16:16.200 --> 0:16:20.240
<v Speaker 1>sequence that results in the browser requesting that specific document

0:16:20.400 --> 0:16:24.040
<v Speaker 1>from the appropriate server, which then sends that document to

0:16:24.120 --> 0:16:26.440
<v Speaker 1>the browser so that you can see it. A lot

0:16:26.480 --> 0:16:29.280
<v Speaker 1>more is going on to make this happen, but let's

0:16:29.280 --> 0:16:32.280
<v Speaker 1>just stick with that high level view. So the pair

0:16:32.360 --> 0:16:36.000
<v Speaker 1>of HTTP and HTML evolved at the same

0:16:36.040 --> 0:16:40.480
<v Speaker 1>time that Gopher was establishing itself, and some people stuck

0:16:40.520 --> 0:16:44.480
<v Speaker 1>with Gopher, but it just really never took off the

0:16:44.520 --> 0:16:47.760
<v Speaker 1>same way that the web did with HTML and HTTP.

0:16:48.320 --> 0:16:51.440
<v Speaker 1>The protocol and the markup language are what the web

0:16:51.560 --> 0:16:54.720
<v Speaker 1>is built upon, and we call it a web because

0:16:54.760 --> 0:16:57.920
<v Speaker 1>of that interconnectivity of documents. You can build out a

0:16:58.000 --> 0:17:01.160
<v Speaker 1>doc and then link that document to another doc, which

0:17:01.200 --> 0:17:04.080
<v Speaker 1>might be linked to a dozen others, and by following

0:17:04.119 --> 0:17:07.000
<v Speaker 1>those links, you can navigate from one document to the next.
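
NOTE Not from the episode: a minimal Python sketch of navigating from one document to the next by following links, written as a breadth-first walk over a tiny in-memory link graph. The page names and the crawl function are hypothetical, and a real spider would fetch pages over the network instead of reading a dictionary.
```python
from collections import deque
# Hypothetical link graph: each page name maps to the pages it links to.
LINKS = {
    "home": ["recipes", "about"],
    "recipes": ["souffle", "home"],
    "souffle": ["eggs"],
    "about": [],
    "eggs": [],
    "orphan": ["home"],  # no page links here, so link-following never reaches it
}
def crawl(start, links):
    """Visit every page reachable from start by following links, breadth first."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen
visited = crawl("home", LINKS)
print(visited)
```
The "orphan" page is included on purpose: a page that nothing links to stays invisible to anyone who only follows links.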

0:17:07.320 --> 0:17:09.680
<v Speaker 1>It really is similar to what happens to a lot

0:17:09.720 --> 0:17:12.240
<v Speaker 1>of people when they visit Wikipedia and they just start

0:17:12.320 --> 0:17:15.440
<v Speaker 1>following all sorts of links. But you might already see

0:17:15.440 --> 0:17:19.320
<v Speaker 1>a challenge with that kind of design. It works great

0:17:19.400 --> 0:17:23.399
<v Speaker 1>if you've got a centralized person or institution that's building

0:17:23.400 --> 0:17:27.000
<v Speaker 1>out the web, adding pages in a very logical way

0:17:27.440 --> 0:17:30.119
<v Speaker 1>and linking to them in a very logical way. But

0:17:30.280 --> 0:17:33.399
<v Speaker 1>one of Tim Berners-Lee's major goals was to create

0:17:33.400 --> 0:17:37.879
<v Speaker 1>a democratized system that didn't depend upon a centralized authority.

0:17:38.040 --> 0:17:40.360
<v Speaker 1>People should be able to build their own web pages

0:17:40.520 --> 0:17:43.120
<v Speaker 1>and host them on their own servers. But how would

0:17:43.200 --> 0:17:46.600
<v Speaker 1>anyone else find them if there were no links going

0:17:46.760 --> 0:17:50.119
<v Speaker 1>into those pages. If the web pages are made and

0:17:50.200 --> 0:17:52.960
<v Speaker 1>hosted independently of the first few pages on the web,

0:17:53.440 --> 0:17:57.800
<v Speaker 1>where is the connective tissue? The original solution wasn't a

0:17:57.880 --> 0:18:01.080
<v Speaker 1>search engine. It took a bit more of a hands-on

0:18:01.119 --> 0:18:04.480
<v Speaker 1>approach. I'll explain more, but first let's take a

0:18:04.560 --> 0:18:16.240
<v Speaker 1>quick break. In the early days, when people first started

0:18:16.280 --> 0:18:19.520
<v Speaker 1>building documents to host on the web, in other words,

0:18:19.800 --> 0:18:23.399
<v Speaker 1>the earliest web pages, Tim Berners-Lee would take it

0:18:23.440 --> 0:18:27.879
<v Speaker 1>upon himself to create an index hosted on CERN's own server.

0:18:28.359 --> 0:18:31.760
<v Speaker 1>Someone might send him a message saying that they had

0:18:31.920 --> 0:18:34.640
<v Speaker 1>built and were hosting a new web page, and they

0:18:34.680 --> 0:18:38.360
<v Speaker 1>could include the address or URL. Berners-Lee

0:18:38.440 --> 0:18:41.600
<v Speaker 1>would then add a hypertext link to a growing catalog

0:18:41.840 --> 0:18:45.000
<v Speaker 1>of those kinds of links on a page hosted by CERN.

0:18:45.240 --> 0:18:47.960
<v Speaker 1>So if you visited CERN's site, you could navigate to

0:18:48.000 --> 0:18:50.919
<v Speaker 1>that index and see the links to the other sites.

0:18:51.600 --> 0:18:55.600
<v Speaker 1>Tim came to call this a virtual library. He and

0:18:55.640 --> 0:18:59.399
<v Speaker 1>a group of volunteers oversaw its evolution. They organized it

0:18:59.440 --> 0:19:03.080
<v Speaker 1>into different areas of interest, with subject matter experts overseeing

0:19:03.160 --> 0:19:06.560
<v Speaker 1>specific categories, and a lot of these early pages belonged

0:19:06.560 --> 0:19:11.280
<v Speaker 1>to scientific research organizations or universities or publications, and all

0:19:11.359 --> 0:19:14.320
<v Speaker 1>that makes sense. CERN is the organization that oversees the

0:19:14.400 --> 0:19:17.240
<v Speaker 1>Large Hadron Collider, after all, so it's no surprise that

0:19:17.359 --> 0:19:21.480
<v Speaker 1>the early web really focused on science and academia. Also,

0:19:21.640 --> 0:19:24.280
<v Speaker 1>it's good to mention that the web in those early

0:19:24.359 --> 0:19:28.639
<v Speaker 1>days again was text based. Browsers were text based too,

0:19:28.760 --> 0:19:32.360
<v Speaker 1>that would not really change until another year like nine.

0:19:34.160 --> 0:19:38.159
<v Speaker 1>Like Gopher's design, the virtual library approach worked fairly well

0:19:38.359 --> 0:19:41.399
<v Speaker 1>when the Web was still small in scale. According to

0:19:41.440 --> 0:19:46.119
<v Speaker 1>the Virtual Library website, in August nine there were about

0:19:46.160 --> 0:19:50.639
<v Speaker 1>twenty web servers in existence total. A little more than

0:19:50.680 --> 0:19:54.120
<v Speaker 1>a year later, in October of nineteen, it was more

0:19:54.119 --> 0:19:58.800
<v Speaker 1>than two hundred web servers, so growth was still fairly modest,

0:19:58.960 --> 0:20:03.800
<v Speaker 1>but things kind of took off after that. By January

0:20:04.000 --> 0:20:09.360
<v Speaker 1>nine six, there were more than one hundred thousand web servers.

0:20:09.720 --> 0:20:13.480
<v Speaker 1>The following year there were more than six hundred fifty thousand.

0:20:13.560 --> 0:20:17.000
<v Speaker 1>It was growing so fast, and maintaining an index was

0:20:17.040 --> 0:20:23.000
<v Speaker 1>becoming increasingly more difficult, particularly by doing it, you know, manually.

0:20:23.400 --> 0:20:26.480
<v Speaker 1>The virtual library was taking shape around the same time

0:20:26.520 --> 0:20:29.600
<v Speaker 1>that students were building the Veronica search engine for Gopher,

0:20:29.640 --> 0:20:31.720
<v Speaker 1>so all this was happening around the same time. I

0:20:31.760 --> 0:20:35.880
<v Speaker 1>know it sounds like I'm going strictly chronologically, but that's

0:20:35.880 --> 0:20:39.840
<v Speaker 1>just not too helpful. We have to remember this

0:20:39.880 --> 0:20:43.439
<v Speaker 1>is all happening simultaneously. So as the web grew and

0:20:43.520 --> 0:20:48.320
<v Speaker 1>became more complex, indices were growing as well. Just navigating

0:20:48.359 --> 0:20:52.240
<v Speaker 1>an index to find what you wanted would become a challenge,

0:20:52.359 --> 0:20:55.120
<v Speaker 1>particularly if you weren't thinking in the same way as

0:20:55.160 --> 0:20:58.399
<v Speaker 1>the people who had organized the index. This is where

0:20:58.440 --> 0:21:02.880
<v Speaker 1>taxonomy comes in. Taxonomy refers to a system of classification.

0:21:03.160 --> 0:21:05.560
<v Speaker 1>A taxonomy is a set of rules we use to

0:21:05.720 --> 0:21:08.879
<v Speaker 1>organize stuff, and there is no one way to do

0:21:08.960 --> 0:21:12.840
<v Speaker 1>it correctly. So I'll give a simple example. Let's say

0:21:12.960 --> 0:21:16.240
<v Speaker 1>you've got a class of students and you have them

0:21:16.280 --> 0:21:20.200
<v Speaker 1>all divide up into smaller groups. You give each group

0:21:20.520 --> 0:21:23.960
<v Speaker 1>a pile of documents, the same documents per group, but

0:21:24.040 --> 0:21:28.680
<v Speaker 1>you tell the students it's their job to organize those documents. Well,

0:21:28.720 --> 0:21:31.080
<v Speaker 1>one group decides that they're going to organize all the

0:21:31.119 --> 0:21:34.840
<v Speaker 1>documents by alphabetizing them by title. The title of each

0:21:34.920 --> 0:21:38.640
<v Speaker 1>document will determine how they fall in the pile, so

0:21:39.240 --> 0:21:42.320
<v Speaker 1>theirs are all in alphabetical order. Another group decides that

0:21:42.320 --> 0:21:45.160
<v Speaker 1>they're gonna bundle their documents that all cover the same

0:21:45.200 --> 0:21:50.320
<v Speaker 1>subject matter together, and they'll alphabetize within subjects. So they

0:21:50.400 --> 0:21:53.840
<v Speaker 1>might have a stack that's just about biology, another stack

0:21:53.920 --> 0:21:57.240
<v Speaker 1>that's about chemistry, another one about material science, and so on.

0:21:57.920 --> 0:22:01.159
<v Speaker 1>A third group bundles their documents together by author.

0:22:01.440 --> 0:22:04.480
<v Speaker 1>They put all of the same author's works together, and

0:22:04.480 --> 0:22:08.800
<v Speaker 1>then maybe they alphabetize the authors. Yet another group focuses

0:22:08.880 --> 0:22:12.880
<v Speaker 1>on publication date. They arrange all their documents in order

0:22:12.920 --> 0:22:16.840
<v Speaker 1>of when they were published. So you can imagine combinations

0:22:16.840 --> 0:22:20.600
<v Speaker 1>of these approaches as well, right, such as ordering documents chronologically,

0:22:20.640 --> 0:22:23.880
<v Speaker 1>but then if two documents were published on the same date,

0:22:23.920 --> 0:22:26.479
<v Speaker 1>you then alphabetize them. That kind of thing. You can

0:22:26.520 --> 0:22:28.760
<v Speaker 1>think of those different sets of rules, and you have

0:22:28.800 --> 0:22:32.600
<v Speaker 1>to determine which rules are most important, right, which one

0:22:32.720 --> 0:22:36.400
<v Speaker 1>you do first, and which is secondary. Well, it's important

0:22:36.440 --> 0:22:40.720
<v Speaker 1>that these taxonomies, however you construct them, are consistent or

0:22:40.760 --> 0:22:43.640
<v Speaker 1>else it becomes a chaotic mess. But even a well

0:22:43.840 --> 0:22:47.560
<v Speaker 1>organized and maintained taxonomy can still be a challenge for

0:22:47.680 --> 0:22:51.040
<v Speaker 1>someone new coming into the system. And that's really what

0:22:51.040 --> 0:22:55.520
<v Speaker 1>I'm getting at here. A comprehensive index will seem incredibly

0:22:55.600 --> 0:22:59.840
<v Speaker 1>overwhelming to someone who's unfamiliar with the system's taxonomy, and

0:23:00.040 --> 0:23:03.480
<v Speaker 1>it will still seem like finding a specific document or

0:23:03.560 --> 0:23:07.560
<v Speaker 1>web page is an impossible task. Once that index grows

0:23:07.600 --> 0:23:11.880
<v Speaker 1>to a large enough size, clearly a search tool would

0:23:11.920 --> 0:23:14.320
<v Speaker 1>be a big solution to that problem. If you can

0:23:14.359 --> 0:23:17.600
<v Speaker 1>type a query into a search engine and you can

0:23:17.640 --> 0:23:20.240
<v Speaker 1>get the results you want, you don't have to comb

0:23:20.320 --> 0:23:23.639
<v Speaker 1>through an enormous index hoping that you're thinking in the

0:23:23.680 --> 0:23:27.240
<v Speaker 1>same way as the custodians of that index. But in

0:23:27.359 --> 0:23:31.199
<v Speaker 1>addition to those challenges was one of scale. Building this

0:23:31.320 --> 0:23:34.879
<v Speaker 1>index required a lot of volunteer effort because again it

0:23:34.920 --> 0:23:38.320
<v Speaker 1>was done by hand. The next step toward the development

0:23:38.320 --> 0:23:41.800
<v Speaker 1>of a search engine was a project that was tackled

0:23:41.840 --> 0:23:45.359
<v Speaker 1>by an MIT student, Matthew Gray. That was

0:23:45.400 --> 0:23:49.399
<v Speaker 1>the student. He designed a program he called the World Wide

0:23:49.600 --> 0:23:53.679
<v Speaker 1>Web Wanderer, and the purpose of this program was to

0:23:53.760 --> 0:23:58.240
<v Speaker 1>automatically navigate across the web, cataloging the web's growth by

0:23:58.280 --> 0:24:03.560
<v Speaker 1>registering new websites, web pages, and web servers. The World Wide

0:24:03.560 --> 0:24:07.879
<v Speaker 1>Web Wanderer is arguably the earliest automated spider or web

0:24:07.880 --> 0:24:11.640
<v Speaker 1>crawler designed for the Web. When Gray developed the program

0:24:11.640 --> 0:24:15.960
<v Speaker 1>back in the web browser Mosaic, which was the first

0:24:16.040 --> 0:24:19.479
<v Speaker 1>popular browser designed for Windows, was just a couple of

0:24:19.480 --> 0:24:23.520
<v Speaker 1>months old. Mosaic was also a graphical browser, and while

0:24:23.520 --> 0:24:26.720
<v Speaker 1>it wasn't the first graphical browser, it was the first

0:24:26.800 --> 0:24:30.520
<v Speaker 1>popular one available to the average person outside of places

0:24:30.600 --> 0:24:34.680
<v Speaker 1>like CERN. Gray wanted to automatically detect new websites for

0:24:34.800 --> 0:24:38.439
<v Speaker 1>discovery purposes, but before long the number of sites was

0:24:38.520 --> 0:24:42.200
<v Speaker 1>growing so quickly that he shifted his attention to charting

0:24:42.240 --> 0:24:45.639
<v Speaker 1>the growth of the web in general. Gray wrote his

0:24:45.680 --> 0:24:49.920
<v Speaker 1>program using the Perl computer language. That's P-E-R-L.

0:24:50.560 --> 0:24:54.439
<v Speaker 1>And here's a quick refresher on computer languages. We know

0:24:54.600 --> 0:24:59.959
<v Speaker 1>that typically computers process information in machine code, which mostly

0:25:00.040 --> 0:25:03.440
<v Speaker 1>for our cases means binary data, and that means all

0:25:03.480 --> 0:25:07.560
<v Speaker 1>the information going through the machines ultimately breaks down into

0:25:07.680 --> 0:25:10.200
<v Speaker 1>zeros and ones, and you can think of that like

0:25:10.280 --> 0:25:14.080
<v Speaker 1>a light switch flipped either off or on. Now, a

0:25:14.119 --> 0:25:17.000
<v Speaker 1>single off or on is easy, but if we want to

0:25:17.040 --> 0:25:22.040
<v Speaker 1>represent more complex ideas processes that kind of thing, we

0:25:22.119 --> 0:25:26.560
<v Speaker 1>need a lot of bits. A single alphanumeric character

0:25:26.680 --> 0:25:30.840
<v Speaker 1>in the ASCII code requires seven bits, so you

0:25:30.880 --> 0:25:35.240
<v Speaker 1>need a string of seven zeros or ones just to represent

0:25:35.320 --> 0:25:38.440
<v Speaker 1>a letter, number, or symbol in ASCII. So you

0:25:38.480 --> 0:25:42.880
<v Speaker 1>can imagine that programming in machine code would be really

0:25:42.920 --> 0:25:46.679
<v Speaker 1>tough for most humans because it would be so easy

0:25:46.800 --> 0:25:50.920
<v Speaker 1>to mistype a zero or a one, or to skip one.
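
NOTE
Editor's sketch, not part of the episode: the seven-bit ASCII point above can be illustrated in a few lines of Python. The function name ascii_bits is made up for this example; it simply prints the seven zeros and ones behind each character.
```python
# Illustrative only: show the seven-bit ASCII pattern behind each character.
def ascii_bits(text: str) -> list[str]:
    """Return the 7-bit binary string for each ASCII character in text."""
    bits = []
    for ch in text:
        code = ord(ch)  # numeric code of the character
        if code > 127:
            raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
        bits.append(format(code, "07b"))  # zero-padded to seven binary digits
    return bits
print(ascii_bits("Perl"))
```
Four letters already take twenty-eight zeros and ones, which is the typing burden the speaker is describing.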

0:25:51.280 --> 0:25:53.800
<v Speaker 1>And if you're, you know, typing out a really long sequence,

0:25:53.800 --> 0:25:56.680
<v Speaker 1>it's easy for you to overlook a zero or a one,

0:25:56.720 --> 0:26:01.440
<v Speaker 1>And that's why people developed programming languages. A programming language

0:26:01.480 --> 0:26:05.320
<v Speaker 1>creates a layer of abstraction between the programmer and the

0:26:05.400 --> 0:26:08.880
<v Speaker 1>machine or system of machines. It acts as a sort

0:26:08.920 --> 0:26:13.040
<v Speaker 1>of interpreter. It takes the intentions of the programmer and

0:26:13.200 --> 0:26:17.280
<v Speaker 1>turns them into processes a computer can respond to. Some

0:26:17.320 --> 0:26:20.960
<v Speaker 1>programming languages are closer to machine code. Those are low

0:26:21.040 --> 0:26:24.439
<v Speaker 1>level programming languages. They're typically really challenging to work with,

0:26:24.960 --> 0:26:28.520
<v Speaker 1>and others are more abstract and thus easier for us

0:26:28.560 --> 0:26:32.200
<v Speaker 1>to work with, and these are called high level programming languages.

0:26:32.240 --> 0:26:36.760
<v Speaker 1>Perl falls into that high-level category. Now, originally, Gray's

0:26:36.840 --> 0:26:40.240
<v Speaker 1>program would seek out links on web pages and then

0:26:40.320 --> 0:26:44.040
<v Speaker 1>note the web servers that were hosting those pages before

0:26:44.080 --> 0:26:47.639
<v Speaker 1>following the link over, and then it would repeat that process,

0:26:47.880 --> 0:26:50.760
<v Speaker 1>and the program was really just automating the process that

0:26:50.840 --> 0:26:53.080
<v Speaker 1>we would do manually if we were to look at

0:26:53.080 --> 0:26:56.080
<v Speaker 1>a web page, see a link, and click on it.
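
NOTE
Editor's sketch, not part of the episode and not Gray's actual Perl program: a minimal link-following crawler in Python, assuming a tiny made-up web held in memory (FAKE_WEB and the example hosts are hypothetical) instead of real network requests. It notes each page's host, follows every link it finds, and keeps a set of pages already seen so that no page is queued twice.
```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
# Hypothetical in-memory "web": URL mapped to HTML. A real crawler would fetch over HTTP.
FAKE_WEB = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="http://c.example/x">C</a>',
    "http://c.example/x": "no links here",
}
class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
def crawl(start_url, pages):
    """Follow links breadth-first; return (pages visited in order, set of hosts)."""
    visited, hosts = [], set()
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # dead link in the toy web
            continue
        visited.append(url)
        hosts.add(urlparse(url).netloc)  # note which server hosts the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return visited, hosts
urls, servers = crawl("http://a.example/", FAKE_WEB)
print(urls)
print(sorted(servers))
```
On this toy web the crawl visits four pages across three hosts. The seen set is the design choice worth noticing: without it, the two pages that link back to the start page would be fetched again and again.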

0:26:56.480 --> 0:26:59.640
<v Speaker 1>The program sought the links embedded in pages and then

0:26:59.680 --> 0:27:02.560
<v Speaker 1>activated those links to explore whichever documents were pulled

0:27:02.600 --> 0:27:05.520
<v Speaker 1>up as a result, and then would repeat that process

0:27:05.560 --> 0:27:08.000
<v Speaker 1>while building out this index, kind of leaving a trail

0:27:08.080 --> 0:27:11.440
<v Speaker 1>of where it had been. The program built what Gray

0:27:11.520 --> 0:27:15.879
<v Speaker 1>called the Wandex, W-A-N-D-E-X.

0:27:15.880 --> 0:27:19.159
<v Speaker 1>It was an index of web servers that were joining

0:27:19.160 --> 0:27:22.760
<v Speaker 1>the Internet. Not long after launching the Wanderer, Gray

0:27:22.800 --> 0:27:26.320
<v Speaker 1>built in the additional capability of capturing the URLs

0:27:26.480 --> 0:27:28.720
<v Speaker 1>that it was going through in addition to just the

0:27:28.760 --> 0:27:31.480
<v Speaker 1>web servers, so you can think like originally he was

0:27:31.520 --> 0:27:33.680
<v Speaker 1>just like, I wonder how many web servers are connected

0:27:33.720 --> 0:27:36.879
<v Speaker 1>to the Internet, how many are connected to it today,

0:27:37.040 --> 0:27:39.719
<v Speaker 1>how many will be connected tomorrow? That kind of thing,

0:27:39.760 --> 0:27:41.480
<v Speaker 1>and just sort of keeping track of how the web

0:27:41.560 --> 0:27:44.240
<v Speaker 1>was growing. Then he thought, I want to know actually

0:27:44.280 --> 0:27:46.240
<v Speaker 1>the URLs that exist too, so he's keeping

0:27:46.280 --> 0:27:50.439
<v Speaker 1>track of both. This didn't go totally smoothly, however. The

0:27:50.480 --> 0:27:54.680
<v Speaker 1>Wanderer was an energetic little spider. It would move through

0:27:54.800 --> 0:27:58.280
<v Speaker 1>links throughout the day. It would index the same pages

0:27:58.520 --> 0:28:02.600
<v Speaker 1>hundreds of times in the process, and this started to

0:28:02.680 --> 0:28:06.159
<v Speaker 1>cause network lag across the Internet, and it meant that

0:28:06.200 --> 0:28:08.480
<v Speaker 1>people who were just trying to navigate to those pages

0:28:08.760 --> 0:28:12.040
<v Speaker 1>were experiencing long delays as a result, and that made people,

0:28:12.960 --> 0:28:16.280
<v Speaker 1>let's say, a little miffed at Mr Gray, and he

0:28:16.320 --> 0:28:19.240
<v Speaker 1>was able to modify the spider's operation so it wouldn't

0:28:19.280 --> 0:28:23.720
<v Speaker 1>cause so much of a disruption, but that early enthusiastic

0:28:23.880 --> 0:28:26.720
<v Speaker 1>mistake kind of created a tough environment for other people

0:28:26.760 --> 0:28:29.639
<v Speaker 1>who wanted to create similar tools that would allow for

0:28:29.800 --> 0:28:33.600
<v Speaker 1>fully fledged searching on the Internet, and the concept of

0:28:33.640 --> 0:28:36.199
<v Speaker 1>spiders had kind of a big old X next to

0:28:36.280 --> 0:28:38.840
<v Speaker 1>it in the minds of many people. It became synonymous

0:28:38.880 --> 0:28:42.320
<v Speaker 1>with this idea of lag and just bad network performance.

0:28:42.920 --> 0:28:46.840
<v Speaker 1>The Wanderer didn't index sites for content either. It was

0:28:47.120 --> 0:28:49.760
<v Speaker 1>more about tracking the growth of the Internet as a whole.

0:28:49.800 --> 0:28:52.400
<v Speaker 1>It wasn't so much concerned with what was on pages,

0:28:53.000 --> 0:28:55.560
<v Speaker 1>so it didn't create a means to search for specific

0:28:55.560 --> 0:28:59.240
<v Speaker 1>web pages or subject matter. Meanwhile, over at the University

0:28:59.240 --> 0:29:03.080
<v Speaker 1>of Geneva, a developer named Oscar Nierstrasz developed a

0:29:03.120 --> 0:29:07.560
<v Speaker 1>tool that could search lists of websites and return results

0:29:07.600 --> 0:29:11.160
<v Speaker 1>based on a query. This tool didn't actually survey the

0:29:11.160 --> 0:29:14.560
<v Speaker 1>web as a whole. Instead, it would access lists other

0:29:14.640 --> 0:29:18.640
<v Speaker 1>organizations had made, such as the Virtual Library, so it

0:29:18.680 --> 0:29:22.680
<v Speaker 1>would reformat those lists as entries into a database and

0:29:22.720 --> 0:29:25.440
<v Speaker 1>that's what could be searched. It was called the W3

0:29:25.760 --> 0:29:30.240
<v Speaker 1>Catalog. It still wasn't quite a search engine as

0:29:30.280 --> 0:29:34.560
<v Speaker 1>we think of them today. Martijn Koster developed another tool

0:29:34.880 --> 0:29:40.040
<v Speaker 1>in late which he called the Archie-Like Indexing for

0:29:40.200 --> 0:29:44.440
<v Speaker 1>the Web, or ALIWEB. As the name suggests, he

0:29:44.520 --> 0:29:48.680
<v Speaker 1>was taking inspiration from the Gopher search tool of Archie.

0:29:48.720 --> 0:29:52.400
<v Speaker 1>ALIWEB needed web administrators to provide the location for

0:29:52.440 --> 0:29:55.800
<v Speaker 1>their site index files so that they could be included

0:29:55.960 --> 0:29:59.680
<v Speaker 1>in the ALIWEB search database. Users could also create

0:29:59.760 --> 0:30:03.040
<v Speaker 1>descriptions for the websites and add in keywords to

0:30:03.080 --> 0:30:06.080
<v Speaker 1>help with search. And this brings us into the world

0:30:06.240 --> 0:30:12.400
<v Speaker 1>of metadata. Metadata is information about information. So you've got

0:30:12.440 --> 0:30:15.360
<v Speaker 1>the core information of a web page, what we might consider,

0:30:15.440 --> 0:30:17.800
<v Speaker 1>you know, the content of the web page, the stuff

0:30:17.840 --> 0:30:19.960
<v Speaker 1>that you and I would actually read if we went there.

0:30:20.280 --> 0:30:23.160
<v Speaker 1>But then you've got the metadata and that describes the

0:30:23.200 --> 0:30:26.520
<v Speaker 1>information in some meaningful way. Now, if you were in

0:30:26.560 --> 0:30:29.920
<v Speaker 1>a physical library, the metadata would be the stuff that

0:30:30.040 --> 0:30:33.680
<v Speaker 1>you used to help locate where in that library a

0:30:33.760 --> 0:30:37.160
<v Speaker 1>specific book should be, and that could include stuff such

0:30:37.200 --> 0:30:40.960
<v Speaker 1>as the author name, the publication date, the subject matter,

0:30:41.120 --> 0:30:43.719
<v Speaker 1>that kind of thing. And it was a pretty good

0:30:43.760 --> 0:30:47.560
<v Speaker 1>idea, ALIWEB, but not many people knew about it,

0:30:47.680 --> 0:30:50.200
<v Speaker 1>and those who did know about it, not a lot

0:30:50.240 --> 0:30:52.719
<v Speaker 1>of them went through the trouble of submitting the information

0:30:52.840 --> 0:30:56.760
<v Speaker 1>to ALIWEB, so it didn't really see much widespread use.

0:30:57.520 --> 0:31:02.480
<v Speaker 1>Also in and then extending into was the development of

0:31:02.520 --> 0:31:04.840
<v Speaker 1>another search tool, and this was the brain child of

0:31:04.880 --> 0:31:08.280
<v Speaker 1>a guy named Jonathan Fletcher. He was a grad student

0:31:08.400 --> 0:31:12.440
<v Speaker 1>at the University of Stirling in Scotland. Fletcher's approach combined

0:31:12.480 --> 0:31:15.920
<v Speaker 1>the strategies of his predecessors. He built a web crawler

0:31:16.040 --> 0:31:19.680
<v Speaker 1>to find and index web pages. He designed the database

0:31:19.720 --> 0:31:24.120
<v Speaker 1>to be searchable, and he called it JumpStation. Unfortunately,

0:31:24.480 --> 0:31:28.040
<v Speaker 1>his efforts were limited by the budget that he got

0:31:28.080 --> 0:31:31.360
<v Speaker 1>from his university. He didn't have the resources to really

0:31:31.400 --> 0:31:33.640
<v Speaker 1>build out a tool that could index all of a

0:31:33.720 --> 0:31:38.200
<v Speaker 1>website's contents, so instead he designed JumpStation to parse

0:31:38.360 --> 0:31:42.040
<v Speaker 1>web page titles and headers, and that would still help

0:31:42.080 --> 0:31:46.160
<v Speaker 1>people find pages that, at least according to the title

0:31:46.240 --> 0:31:49.680
<v Speaker 1>and header were focused on whatever the area of interest was.

0:31:49.760 --> 0:31:52.000
<v Speaker 1>But it would also mean that other pages that might

0:31:52.040 --> 0:31:55.720
<v Speaker 1>have critically relevant information about the subject could be overlooked

0:31:55.720 --> 0:31:58.840
<v Speaker 1>because those terms just weren't in the title or header.

0:31:59.520 --> 0:32:02.240
<v Speaker 1>We're getting closer to the search engines that would most

0:32:02.280 --> 0:32:05.600
<v Speaker 1>resemble what we think of today, which, let's be honest,

0:32:05.720 --> 0:32:10.160
<v Speaker 1>is primarily Google, and we will learn more about those

0:32:10.240 --> 0:32:19.920
<v Speaker 1>after we take another quick break. I think you could

0:32:20.040 --> 0:32:24.360
<v Speaker 1>argue pretty convincingly that JumpStation was the first true

0:32:24.520 --> 0:32:27.720
<v Speaker 1>web search engine as we have come to understand them,

0:32:27.720 --> 0:32:30.680
<v Speaker 1>though it was limited since it couldn't crawl through and

0:32:30.720 --> 0:32:34.200
<v Speaker 1>index all the contents of a page. The next name

0:32:34.240 --> 0:32:36.840
<v Speaker 1>on our journey is one a lot of people will recognize,

0:32:37.120 --> 0:32:41.080
<v Speaker 1>and that is Yahoo. But Yahoo didn't start off as

0:32:41.120 --> 0:32:45.960
<v Speaker 1>a search engine. Rather, Yahoo was originally a web directory.

0:32:46.480 --> 0:32:50.360
<v Speaker 1>It started in as just a list of websites that

0:32:50.440 --> 0:32:53.160
<v Speaker 1>the founders of the site, that would be Jerry Yang

0:32:53.280 --> 0:32:57.240
<v Speaker 1>and David Filo, thought were interesting. They thought, this is

0:32:57.240 --> 0:32:59.240
<v Speaker 1>a cool website. I want more people to know about it,

0:32:59.240 --> 0:33:01.360
<v Speaker 1>so I'm gonna include it on my web page about

0:33:01.440 --> 0:33:05.960
<v Speaker 1>cool websites. So Yahoo started off as another curated list

0:33:06.240 --> 0:33:10.440
<v Speaker 1>of web pages. The search tool aspect of Yahoo would

0:33:10.480 --> 0:33:14.680
<v Speaker 1>follow in The search tool worked on the sites that

0:33:14.720 --> 0:33:19.320
<v Speaker 1>were curated in the human curated Yahoo directory, but if

0:33:19.320 --> 0:33:22.040
<v Speaker 1>a site wasn't in that directory, it wouldn't show up

0:33:22.040 --> 0:33:24.080
<v Speaker 1>in search results. So someone would have had to have

0:33:24.160 --> 0:33:28.440
<v Speaker 1>found the website already and then included it within Yahoo's

0:33:28.520 --> 0:33:33.200
<v Speaker 1>growing directory for it to register as a result. Following

0:33:33.280 --> 0:33:35.880
<v Speaker 1>Yahoo were a couple of other notable names. There was

0:33:36.000 --> 0:33:40.120
<v Speaker 1>Infoseek and WebCrawler. WebCrawler was the first search

0:33:40.160 --> 0:33:43.600
<v Speaker 1>engine I remember using. In fact, I stuck with WebCrawler

0:33:43.600 --> 0:33:47.920
<v Speaker 1>for a long time, even after the infamous Google

0:33:48.040 --> 0:33:52.320
<v Speaker 1>emerged and started making waves. WebCrawler did something that

0:33:52.360 --> 0:33:55.760
<v Speaker 1>other search engines had not yet done. Its index was

0:33:55.800 --> 0:33:59.520
<v Speaker 1>looking at the full content of a web page, including

0:33:59.520 --> 0:34:02.960
<v Speaker 1>the metadata on that page. So let's talk about

0:34:03.000 --> 0:34:06.440
<v Speaker 1>that for a second. Web spiders, when you get down

0:34:06.440 --> 0:34:09.399
<v Speaker 1>to it, are just bots that follow links, but some

0:34:09.440 --> 0:34:12.359
<v Speaker 1>web spiders can also make a full index of the

0:34:12.440 --> 0:34:17.080
<v Speaker 1>content found at each link's destination, essentially scanning all the

0:34:17.160 --> 0:34:20.000
<v Speaker 1>text that's within a web page and indexing it so

0:34:20.040 --> 0:34:25.440
<v Speaker 1>that that content is searchable and that searchable index forms

0:34:25.600 --> 0:34:28.040
<v Speaker 1>as a result of all this, and it can bring

0:34:28.080 --> 0:34:31.600
<v Speaker 1>back any results of any pages that contain a specific word.
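
NOTE
Editor's sketch, not part of the episode: the searchable index described above is commonly built as an inverted index, a map from each word to the pages that contain it. The page ids and page text below are invented for illustration, and the matching rule (every query word must appear) is an assumption, not any particular engine's behavior.
```python
import re
from collections import defaultdict
# Hypothetical pages standing in for documents a spider has already fetched.
PAGES = {
    "page1": "An analysis of Paradise Lost by John Milton",
    "page2": "Lost luggage at Paradise Island airport",
    "page3": "Milton Keynes travel guide",
}
def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())
def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index
def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())  # keep only pages that also have this word
    return results
index = build_index(PAGES)
print(sorted(search(index, "paradise lost")))
print(sorted(search(index, "Paradise Lost Milton analysis")))
```
The two-word query matches both the literature page and the lost-luggage page, while the four-word query narrows the result to page1 alone. Word matching by itself says nothing about relevance, which is why ranking becomes its own problem.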

0:34:32.360 --> 0:34:34.520
<v Speaker 1>Let's use an example. It makes it easier. So let's

0:34:34.520 --> 0:34:37.000
<v Speaker 1>say you're in a literature class and you're having a

0:34:37.040 --> 0:34:41.720
<v Speaker 1>real hard time understanding Milton's Paradise Lost. So you're looking

0:34:41.719 --> 0:34:43.960
<v Speaker 1>for some resources to help you get a better handle

0:34:44.360 --> 0:34:47.280
<v Speaker 1>on things. You go to a search engine on the Internet.

0:34:47.360 --> 0:34:49.719
<v Speaker 1>It doesn't really matter which one, and you type in

0:34:50.239 --> 0:34:56.120
<v Speaker 1>Paradise lost Milton analysis. You're trying to really cut down

0:34:56.200 --> 0:35:00.279
<v Speaker 1>on anything that might just mention paradise or lost or

0:35:00.320 --> 0:35:03.520
<v Speaker 1>anything like that. You really want to focus on this.

0:35:03.520 --> 0:35:05.759
<v Speaker 1>This part of the search engine is the UI or

0:35:05.800 --> 0:35:08.000
<v Speaker 1>the user interface, right, This is the part that we

0:35:08.160 --> 0:35:11.640
<v Speaker 1>as humans interact with in order to tell the engine

0:35:11.680 --> 0:35:14.520
<v Speaker 1>what it is we're looking for. The search engine then

0:35:14.600 --> 0:35:18.640
<v Speaker 1>goes and consults its index of the web. So no

0:35:18.680 --> 0:35:22.839
<v Speaker 1>matter which search engine you're using, it's not a representation

0:35:22.880 --> 0:35:26.239
<v Speaker 1>of every single web page that exists. It's every web

0:35:26.280 --> 0:35:31.280
<v Speaker 1>page that exists within that engine's index. So each search

0:35:31.320 --> 0:35:34.280
<v Speaker 1>engine has its own index, or in some cases search

0:35:34.280 --> 0:35:37.800
<v Speaker 1>engines are powered by other engines. It may be sharing

0:35:37.840 --> 0:35:41.080
<v Speaker 1>an index with another search engine, but it looks for

0:35:41.320 --> 0:35:44.960
<v Speaker 1>documents in that index that contain the words that you

0:35:45.040 --> 0:35:48.400
<v Speaker 1>have submitted in the UI, then it has to return

0:35:48.440 --> 0:35:51.880
<v Speaker 1>those results to you, which also means that the search

0:35:51.920 --> 0:35:55.359
<v Speaker 1>engine has to determine which of those search results are

0:35:55.480 --> 0:35:58.440
<v Speaker 1>likely to be the most relevant to your query. This

0:35:58.520 --> 0:36:01.000
<v Speaker 1>is actually harder to do than it sounds. If

0:36:01.040 --> 0:36:04.240
<v Speaker 1>a search engine is only looking for documents that happen

0:36:04.239 --> 0:36:07.399
<v Speaker 1>to contain the words that you've submitted, you could get

0:36:07.400 --> 0:36:10.360
<v Speaker 1>back pages that have little to no relevance to what

0:36:10.600 --> 0:36:15.239
<v Speaker 1>you actually wanted. Plus, some web page administrators, especially back

0:36:15.280 --> 0:36:18.920
<v Speaker 1>in the early days, were really trying to game the system.

0:36:18.960 --> 0:36:21.520
<v Speaker 1>They might use tricks in order to get more people

0:36:21.560 --> 0:36:23.799
<v Speaker 1>to come to that web page. And it might be

0:36:23.800 --> 0:36:27.520
<v Speaker 1>because their web pages had banner ads on them and

0:36:27.600 --> 0:36:31.600
<v Speaker 1>so more people visiting the page meant more money, or

0:36:31.880 --> 0:36:34.480
<v Speaker 1>maybe they just wanted bragging rights. Because some of you

0:36:34.480 --> 0:36:36.880
<v Speaker 1>guys might remember this. It used to be back in

0:36:36.880 --> 0:36:39.040
<v Speaker 1>the day that one of the standard features you would

0:36:39.040 --> 0:36:42.160
<v Speaker 1>see on web pages was the ever present web counter

0:36:42.760 --> 0:36:44.920
<v Speaker 1>that would tell you how many people had visited that

0:36:44.960 --> 0:36:48.960
<v Speaker 1>website since it had been created. And a few folks

0:36:49.320 --> 0:36:52.480
<v Speaker 1>were hoping to just spread malware by tricking people into

0:36:52.800 --> 0:36:57.160
<v Speaker 1>visiting a website and downloading some malicious program. And then

0:36:57.200 --> 0:36:59.520
<v Speaker 1>there were also link farms. These were sites that were

0:36:59.560 --> 0:37:03.919
<v Speaker 1>just one long list of links to other sites. More

0:37:03.960 --> 0:37:07.480
<v Speaker 1>on why that's important in just a second. One trick

0:37:08.120 --> 0:37:11.760
<v Speaker 1>was to include just a ton of different popular search

0:37:11.920 --> 0:37:15.240
<v Speaker 1>terms on a page, even if the page had nothing

0:37:15.280 --> 0:37:17.239
<v Speaker 1>to do with any of those search terms, and you

0:37:17.239 --> 0:37:19.920
<v Speaker 1>could even hide that. You can make the text and

0:37:20.000 --> 0:37:23.680
<v Speaker 1>background the same color, so a human visiting the website

0:37:23.680 --> 0:37:26.560
<v Speaker 1>and looking at it through a standard browser wouldn't see

0:37:26.600 --> 0:37:29.360
<v Speaker 1>anything because the background color and the text are the

0:37:29.440 --> 0:37:32.319
<v Speaker 1>same color. They see whatever the content of the web

0:37:32.360 --> 0:37:35.960
<v Speaker 1>page was, but they wouldn't see all these hidden keywords.

0:37:36.000 --> 0:37:38.440
<v Speaker 1>But a computer would totally see it. It would ignore

0:37:38.560 --> 0:37:41.520
<v Speaker 1>the fact that the font and the background color are

0:37:41.560 --> 0:37:43.960
<v Speaker 1>the same and it would just pick up on the text.
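
NOTE
Editor's sketch, not part of the episode: why hidden keywords reach an indexer. A naive text extractor, written here with Python's standard html.parser module, collects every piece of text and never looks at colors. The sample page and the class name TextExtractor are made up for this example.
```python
from html.parser import HTMLParser
class TextExtractor(HTMLParser):
    """Collect every piece of text on a page, ignoring all styling."""
    def __init__(self):
        super().__init__()
        self.words = []
    def handle_data(self, data):
        self.words.extend(data.lower().split())
# Hypothetical page: the second paragraph is white text on a white background.
PAGE = (
    '<body bgcolor="#ffffff">'
    "<p>Welcome to my page about model trains.</p>"
    '<p><font color="#ffffff">free music movies games</font></p>'
    "</body>"
)
extractor = TextExtractor()
extractor.feed(PAGE)
extractor.close()
print(extractor.words)
```
A person looking at this page in a browser sees only the model trains sentence, but the extracted word list also contains free, music, movies, and games, so an index built from it would return the page for those searches.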

0:37:44.719 --> 0:37:48.719
<v Speaker 1>So you would end up having these false returns on

0:37:48.800 --> 0:37:52.160
<v Speaker 1>search results because those keywords were there in the page,

0:37:52.320 --> 0:37:55.080
<v Speaker 1>they just weren't relevant to whatever the content was. Other

0:37:55.120 --> 0:37:59.600
<v Speaker 1>administrators would put keyword dumps into web page metadata,

0:37:59.760 --> 0:38:01.839
<v Speaker 1>so it wouldn't show up on the page itself at all.

0:38:01.880 --> 0:38:04.600
<v Speaker 1>It would all be in the background. Following a search

0:38:04.640 --> 0:38:08.000
<v Speaker 1>result like that would be really frustrating because you wouldn't

0:38:08.000 --> 0:38:10.080
<v Speaker 1>actually get whatever it was you were looking for, you

0:38:10.080 --> 0:38:12.880
<v Speaker 1>would get something else. It was a bait and switch.

0:38:13.280 --> 0:38:17.040
<v Speaker 1>So building search engines meant not only did the developers

0:38:17.080 --> 0:38:19.680
<v Speaker 1>need to figure out how to build indices that

0:38:20.200 --> 0:38:22.880
<v Speaker 1>could grow as the Web was growing, they also had

0:38:22.920 --> 0:38:25.680
<v Speaker 1>to figure out how to defeat strategies that were intended

0:38:25.719 --> 0:38:28.239
<v Speaker 1>to game the system. How can you make sure the

0:38:28.239 --> 0:38:31.680
<v Speaker 1>people who are using your search engine are actually getting

0:38:31.680 --> 0:38:34.160
<v Speaker 1>the stuff that they want, because if they're not getting

0:38:34.160 --> 0:38:36.960
<v Speaker 1>the stuff they want, they're gonna bounce. They're never going

0:38:37.040 --> 0:38:40.440
<v Speaker 1>to use your search engine again. I'm gonna use Google

0:38:40.880 --> 0:38:44.560
<v Speaker 1>as the example for this, because, I mean, let's be honest,

0:38:44.640 --> 0:38:47.759
<v Speaker 1>Google is dominant in that space. It's almost like it's

0:38:47.760 --> 0:38:50.239
<v Speaker 1>the only game in town. But just know that all

0:38:50.280 --> 0:38:54.239
<v Speaker 1>search engines, in general, were all trying variations on this

0:38:54.320 --> 0:38:58.719
<v Speaker 1>kind of general philosophy. Google's approach used a tool that

0:38:58.800 --> 0:39:01.880
<v Speaker 1>they called PageRank, which, as the name suggests, would

0:39:02.160 --> 0:39:05.600
<v Speaker 1>take the documents that came back from any given search,

0:39:06.160 --> 0:39:10.840
<v Speaker 1>then rank those search results before presenting them to the user.

0:39:11.920 --> 0:39:14.319
<v Speaker 1>So if you went to Google and you typed in

0:39:14.440 --> 0:39:19.400
<v Speaker 1>Paradise Lost Milton analysis, Google would consult its own index

0:39:19.480 --> 0:39:21.560
<v Speaker 1>of the web, and it would look for stuff like,

0:39:22.120 --> 0:39:25.120
<v Speaker 1>are the search terms showing up in the page? Are there

0:39:25.640 --> 0:39:28.640
<v Speaker 1>examples of words that are close together, because that might

0:39:28.680 --> 0:39:32.160
<v Speaker 1>indicate that this result is more relevant. Right, if these

0:39:32.200 --> 0:39:35.200
<v Speaker 1>words are all kind of next to each other, it's

0:39:35.239 --> 0:39:38.200
<v Speaker 1>more likely to be what the person was looking for,

0:39:38.280 --> 0:39:41.040
<v Speaker 1>as opposed to, Yeah, all four of those words are

0:39:41.040 --> 0:39:43.200
<v Speaker 1>showing up on this page, but they're so far apart,

0:39:43.800 --> 0:39:47.760
<v Speaker 1>then maybe this isn't even related to what the person

0:39:47.880 --> 0:39:51.319
<v Speaker 1>was looking for. That was part of PageRank. The

0:39:51.360 --> 0:39:54.279
<v Speaker 1>tool also would look at things like the title of

0:39:54.320 --> 0:39:56.520
<v Speaker 1>the page and maybe even the header, but it mostly

0:39:56.560 --> 0:40:00.719
<v Speaker 1>ignored the metadata because, you know, search engine designers

0:40:00.719 --> 0:40:03.239
<v Speaker 1>were picking up on the tricks people were using in

0:40:03.320 --> 0:40:06.560
<v Speaker 1>order to get more clicks. At the same time, the

0:40:06.560 --> 0:40:09.719
<v Speaker 1>search algorithm would assign ranks to pages based on a

0:40:09.760 --> 0:40:14.160
<v Speaker 1>few other points of criteria. The algorithm attempted to figure

0:40:14.160 --> 0:40:17.799
<v Speaker 1>out how reputable every page was, and it did so

0:40:17.880 --> 0:40:20.399
<v Speaker 1>in a couple of different ways. One was to look

0:40:20.440 --> 0:40:24.279
<v Speaker 1>at which other sites were linking to that page. If

0:40:24.320 --> 0:40:26.960
<v Speaker 1>the other sites that were linking to it were considered

0:40:27.040 --> 0:40:31.840
<v Speaker 1>generally reputable, that would improve the result's PageRank score.
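
NOTE
A rough sketch of the link idea just described, not Google's actual implementation. It assumes the simplified published form of the algorithm, where each page splits its score among the pages it links to and a damping value of 0.85 is applied. The link_scores function, the round count, and the page names are hypothetical and chosen only to echo the three-result example.
```python
def link_scores(links, damping=0.85, rounds=50):
    """links maps each page to the pages it links to; returns a score per page."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(rounds):
        # Every page starts each round with a small baseline score.
        nxt = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            # Assumption: a page with no outgoing links shares its score with everyone.
            targets = links.get(page) or pages
            share = damping * score[page] / len(targets)
            for t in targets:
                nxt[t] += share
        score = nxt
    return score
# Hypothetical link graph: most sites point at the university page, none at Billy Bob.
WEB = {
    "news": ["university", "discussion"],
    "library": ["university"],
    "university": ["discussion"],
    "discussion": ["university"],
    "billybob": ["university"],
}
ranked = sorted(link_scores(WEB).items(), key=lambda kv: kv[1], reverse=True)
print([name for name, _ in ranked])
```
In this toy graph the university page collects the most incoming links, the discussion site is boosted by a link from the high-scoring university page, and Billy Bob's page has no incoming links at all, so it keeps only the baseline score.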

0:40:32.200 --> 0:40:36.240
<v Speaker 1>So in our case, and this is a totally unrealistic example.

0:40:36.280 --> 0:40:41.120
<v Speaker 1>But let's say we've searched that Paradise Lost Milton analysis

0:40:41.760 --> 0:40:45.520
<v Speaker 1>and all we got back are three results, but Google

0:40:45.560 --> 0:40:48.040
<v Speaker 1>has to rank those results as one, two, and three.

0:40:48.400 --> 0:40:51.000
<v Speaker 1>One of those results is from a website dedicated to

0:40:51.040 --> 0:40:55.759
<v Speaker 1>Paradise Lost, the literary work and has literary analysis on it,

0:40:55.840 --> 0:40:58.160
<v Speaker 1>and it sits on a server that belongs to a

0:40:58.200 --> 0:41:02.320
<v Speaker 1>prestigious university. Let's say that the second result is coming

0:41:02.440 --> 0:41:06.800
<v Speaker 1>from a literary discussion site. It doesn't belong to a university,

0:41:06.840 --> 0:41:11.000
<v Speaker 1>but it does have critical analysis and an entry specifically

0:41:11.040 --> 0:41:14.360
<v Speaker 1>on Paradise Lost. And let's say that the third result

0:41:14.680 --> 0:41:18.240
<v Speaker 1>is Billy Bob's Homespun Guide to Milton and Crab Trap

0:41:18.400 --> 0:41:22.359
<v Speaker 1>Maintenance or something. Now, the algorithm is not smart enough

0:41:22.400 --> 0:41:25.880
<v Speaker 1>to actually read each of these sites as a human

0:41:25.920 --> 0:41:29.520
<v Speaker 1>would and judge them and analyze them and weigh the

0:41:29.640 --> 0:41:33.120
<v Speaker 1>value of each one, but it can see that the

0:41:33.239 --> 0:41:36.640
<v Speaker 1>university server is, you know, it belongs to a university.

0:41:36.640 --> 0:41:40.840
<v Speaker 1>It's generally treated as the property of a recognized authority,

0:41:41.080 --> 0:41:45.000
<v Speaker 1>and so it sees that other reputable sites are also

0:41:45.120 --> 0:41:48.360
<v Speaker 1>linking to that university's web pages, and to that Milton

0:41:48.440 --> 0:41:52.160
<v Speaker 1>page in particular. So it assigns that result a very

0:41:52.239 --> 0:41:56.000
<v Speaker 1>high PageRank, saying it's probably pretty darn good. That

0:41:56.080 --> 0:41:58.600
<v Speaker 1>also means it's going to appear higher on the list

0:41:58.640 --> 0:42:02.440
<v Speaker 1>of search results. Meanwhile, Billy Bob's is likely to appear

0:42:02.480 --> 0:42:04.840
<v Speaker 1>at the bottom of that list because very few people

0:42:04.960 --> 0:42:07.480
<v Speaker 1>are linking to it. It might be hosted on just

0:42:07.560 --> 0:42:11.239
<v Speaker 1>some server somewhere that happens to host a whole hodgepodge

0:42:11.239 --> 0:42:15.319
<v Speaker 1>of different web pages, and the page that's on that

0:42:15.440 --> 0:42:19.359
<v Speaker 1>site that has just sort of literary analysis discussions on it,

0:42:19.440 --> 0:42:22.560
<v Speaker 1>that one appears in the middle. Now, could Billy Bob's

0:42:22.560 --> 0:42:27.200
<v Speaker 1>page actually be the best resource? Yes, it could be,

0:42:27.560 --> 0:42:31.320
<v Speaker 1>but without a human or maybe a really incredibly advanced

0:42:31.400 --> 0:42:34.520
<v Speaker 1>AI to review the contents of that page and to

0:42:34.760 --> 0:42:39.560
<v Speaker 1>really understand them, the ranking approach seemed like the best

0:42:39.600 --> 0:42:43.240
<v Speaker 1>way to quickly organize results to give the best chance

0:42:43.320 --> 0:42:47.000
<v Speaker 1>that the returns were going to be relevant to the user. Now,

0:42:47.080 --> 0:42:51.440
<v Speaker 1>in that example I just gave, I mentioned three results. However,

0:42:51.480 --> 0:42:54.040
<v Speaker 1>if you were to really perform that search, because I

0:42:54.080 --> 0:42:57.480
<v Speaker 1>did it before I recorded this episode, you would get

0:42:57.600 --> 0:43:00.520
<v Speaker 1>millions of results. In fact, just for a laugh, I

0:43:00.560 --> 0:43:04.600
<v Speaker 1>went to Google, typed in Paradise Lost Milton analysis, and

0:43:04.640 --> 0:43:10.160
<v Speaker 1>I got quote about three point eight million results end quote,

0:43:10.680 --> 0:43:15.000
<v Speaker 1>that happened in less than one second. PageRank becomes

0:43:15.120 --> 0:43:18.880
<v Speaker 1>really important when you get to that level of response,

0:43:18.920 --> 0:43:23.040
<v Speaker 1>when you get to that many results, if you're talking

0:43:23.080 --> 0:43:26.279
<v Speaker 1>about that enormous amount of information, you really want the

0:43:26.360 --> 0:43:29.360
<v Speaker 1>most relevant choices to be near the top to save

0:43:29.400 --> 0:43:33.120
<v Speaker 1>yourself time. And that has created some pretty bad habits

0:43:33.160 --> 0:43:35.799
<v Speaker 1>for us as users. By the way, we've become so

0:43:35.920 --> 0:43:39.759
<v Speaker 1>used to search engines returning the most relevant results right

0:43:39.800 --> 0:43:42.399
<v Speaker 1>at the top that we don't necessarily bother to look

0:43:42.440 --> 0:43:45.279
<v Speaker 1>beyond the first few sites. There are a lot of

0:43:45.320 --> 0:43:49.480
<v Speaker 1>resources out there that have estimates on how many people

0:43:49.560 --> 0:43:53.320
<v Speaker 1>actually bother to ever go past the first page of results,

0:43:54.160 --> 0:43:57.720
<v Speaker 1>and some of them even say that as much as

0:43:57.760 --> 0:44:01.040
<v Speaker 1>ninety-five percent of all web traffic will just go to results that

0:44:01.080 --> 0:44:04.759
<v Speaker 1>appear on the first page for any given search, and

0:44:04.840 --> 0:44:08.400
<v Speaker 1>that means that all the other results that appear after

0:44:08.520 --> 0:44:12.560
<v Speaker 1>page one are sharing just five percent of the web traffic.

0:44:13.400 --> 0:44:16.760
<v Speaker 1>So when I did that Paradise Lost search, that first

0:44:16.760 --> 0:44:20.920
<v Speaker 1>page of results had nine websites linked to it, plus

0:44:20.920 --> 0:44:25.759
<v Speaker 1>a few videos. That means somewhere around three million, seven

0:44:26.320 --> 0:44:31.360
<v Speaker 1>hundred ninety-nine thousand, nine hundred ninety-one web pages are sharing just five percent of leftover

0:44:31.440 --> 0:44:36.319
<v Speaker 1>traffic that doesn't go to that first page. So they might

0:44:36.360 --> 0:44:40.920
<v Speaker 1>include incredible resources that are even more relevant than the

0:44:40.920 --> 0:44:44.359
<v Speaker 1>stuff that appears on page one, but very few people

0:44:44.360 --> 0:44:46.959
<v Speaker 1>are going to that. That's one bad habit that we've

0:44:47.000 --> 0:44:50.800
<v Speaker 1>all developed through using these search engines. On the flip side,

0:44:51.000 --> 0:44:55.520
<v Speaker 1>the message is that it's really important to get your

0:44:55.520 --> 0:44:58.279
<v Speaker 1>page to show up on that first screen of results.

0:44:58.880 --> 0:45:01.879
<v Speaker 1>If you're building a web page about a specific thing,

0:45:02.600 --> 0:45:04.880
<v Speaker 1>you want to be on that first page because otherwise

0:45:04.880 --> 0:45:07.640
<v Speaker 1>you're gonna have to hope people find your website

0:45:07.680 --> 0:45:11.319
<v Speaker 1>through some other means, you know, outside of search. That

0:45:11.400 --> 0:45:13.960
<v Speaker 1>gave birth to the industry of SEO, or

0:45:14.040 --> 0:45:18.319
<v Speaker 1>search engine optimization, which is a constantly evolving set of

0:45:18.360 --> 0:45:21.680
<v Speaker 1>practices that web designers try to follow in order to

0:45:21.760 --> 0:45:26.000
<v Speaker 1>rank better in search. And whenever a search engine, which

0:45:26.040 --> 0:45:29.400
<v Speaker 1>again these days we mostly just mean Google, whenever Google

0:45:29.480 --> 0:45:32.600
<v Speaker 1>makes a change in its algorithm, it can really upset

0:45:32.640 --> 0:45:35.000
<v Speaker 1>the apple cart, and it can push everyone back to

0:45:35.040 --> 0:45:39.320
<v Speaker 1>the drawing board. It can completely jumble up who appears

0:45:39.320 --> 0:45:42.040
<v Speaker 1>at the top of search results. Now, all of that

0:45:42.200 --> 0:45:44.719
<v Speaker 1>is another kettle of fish, So I'm going to leave

0:45:44.719 --> 0:45:46.839
<v Speaker 1>off of SEO and go back to that

0:45:46.920 --> 0:45:50.160
<v Speaker 1>on some other day, but more germane to our episode

0:45:50.200 --> 0:45:54.680
<v Speaker 1>here is that the spiders, those web crawling bots, are

0:45:54.719 --> 0:45:57.920
<v Speaker 1>what build out those indices that search engines use to

0:45:57.920 --> 0:46:00.520
<v Speaker 1>give us the results we ask for. There are some

0:46:00.600 --> 0:46:03.359
<v Speaker 1>things I did not cover, such as tags that web

0:46:03.360 --> 0:46:06.440
<v Speaker 1>developers can use to make sure that search engines just

0:46:06.520 --> 0:46:10.120
<v Speaker 1>pass over their websites or sometimes just pages within their

0:46:10.120 --> 0:46:13.759
<v Speaker 1>websites without adding them to an index, so they'll never

0:46:13.800 --> 0:46:16.360
<v Speaker 1>show up in search. But we could go over that

0:46:16.440 --> 0:46:19.040
<v Speaker 1>in a future episode, too. For now, it's kind of

0:46:19.080 --> 0:46:22.279
<v Speaker 1>time to wrap things up, So guys, I hope you

0:46:22.360 --> 0:46:25.400
<v Speaker 1>enjoyed this episode. If you have suggestions for future topics,

0:46:25.440 --> 0:46:28.600
<v Speaker 1>whether it's a specific technology, a trend in tech, a

0:46:28.640 --> 0:46:30.600
<v Speaker 1>person in tech. Maybe it's a company you want to

0:46:30.640 --> 0:46:33.600
<v Speaker 1>know more about, let me know. Drop me a line

0:46:33.640 --> 0:46:36.080
<v Speaker 1>on Facebook or Twitter. The handle of both of those

0:46:36.239 --> 0:46:39.640
<v Speaker 1>is TechStuff HSW, and I'll talk to

0:46:39.719 --> 0:46:48.280
<v Speaker 1>you again really soon. Tech Stuff is an I Heart

0:46:48.360 --> 0:46:52.120
<v Speaker 1>Radio production. For more podcasts from I Heart Radio, visit

0:46:52.120 --> 0:46:55.239
<v Speaker 1>the I Heart Radio app, Apple Podcasts, or wherever you

0:46:55.280 --> 0:47:00.279
<v Speaker 1>listen to your favorite shows.