1 00:00:04,400 --> 00:00:12,520 Speaker 1: Welcome to TechStuff, a production from iHeartRadio. Hey there, 2 00:00:12,520 --> 00:00:16,240 Speaker 1: and welcome to TechStuff. I'm your host, Jonathan Strickland. 3 00:00:16,320 --> 00:00:18,880 Speaker 1: I'm an executive producer with iHeartRadio and I 4 00:00:19,000 --> 00:00:23,840 Speaker 1: love all things tech. And listener David reached out to 5 00:00:23,840 --> 00:00:26,040 Speaker 1: me on Twitter and said, I would like to hear 6 00:00:26,079 --> 00:00:30,160 Speaker 1: an episode on search engine spiders. He is our hero. 7 00:00:30,760 --> 00:00:33,800 Speaker 1: You got it, David, and if you get the "spider, he 8 00:00:33,920 --> 00:00:36,600 Speaker 1: is our hero" reference, let me know. So we're gonna 9 00:00:36,640 --> 00:00:40,920 Speaker 1: talk about the development of search engines and how they 10 00:00:40,960 --> 00:00:44,479 Speaker 1: work from admittedly a pretty high level, because to go 11 00:00:44,520 --> 00:00:48,240 Speaker 1: into great detail would probably take three or four episodes. Plus, 12 00:00:48,320 --> 00:00:52,640 Speaker 1: different search engines use slightly different strategies in order to 13 00:00:52,880 --> 00:00:57,640 Speaker 1: index and rank search results. And the reason I'm doing 14 00:00:57,680 --> 00:01:00,760 Speaker 1: all of that is because if we just talked about spiders, 15 00:01:01,200 --> 00:01:03,880 Speaker 1: it would be a fairly short episode. But what the 16 00:01:03,920 --> 00:01:07,720 Speaker 1: heck is a search engine spider? Well, to index the 17 00:01:07,800 --> 00:01:11,559 Speaker 1: contents of the World Wide Web, you need to search 18 00:01:11,600 --> 00:01:16,560 Speaker 1: around and find what's there first, right? You can't return 19 00:01:16,680 --> 00:01:19,319 Speaker 1: results without first knowing what is out there in the 20 00:01:19,360 --> 00:01:23,479 Speaker 1: first place.
So a search engine spider is a bot 21 00:01:24,120 --> 00:01:28,200 Speaker 1: that does this. It crawls through the World Wide Web, thus 22 00:01:28,240 --> 00:01:31,679 Speaker 1: the whole spider name. We'll learn more about what's actually 23 00:01:31,680 --> 00:01:34,119 Speaker 1: going on a little bit later, but to understand search, 24 00:01:34,600 --> 00:01:37,400 Speaker 1: we need a few more basics. So keep in mind 25 00:01:37,640 --> 00:01:40,440 Speaker 1: that all the stuff we see online, whether it's a 26 00:01:40,440 --> 00:01:43,840 Speaker 1: web page or it's a web service or whatever it 27 00:01:43,840 --> 00:01:47,600 Speaker 1: may be, ultimately sits on a computer that is 28 00:01:47,600 --> 00:01:52,320 Speaker 1: connected to the Internet infrastructure, so it's connected to routers, 29 00:01:52,360 --> 00:01:56,000 Speaker 1: which then connect to various servers and domain name 30 00:01:56,080 --> 00:01:59,080 Speaker 1: servers and all that kind of stuff. If you visit 31 00:01:59,120 --> 00:02:02,440 Speaker 1: the HowStuffWorks homepage, that's the site for the 32 00:02:02,440 --> 00:02:04,440 Speaker 1: company that I used to work for, don't work for 33 00:02:04,480 --> 00:02:08,600 Speaker 1: them anymore, but that website consists of pages that are 34 00:02:08,720 --> 00:02:12,520 Speaker 1: on a computer in a data center. If you happen 35 00:02:12,560 --> 00:02:15,160 Speaker 1: to know the URL for the site, so 36 00:02:15,280 --> 00:02:17,880 Speaker 1: you happen to know howstuffworks.com, you 37 00:02:17,919 --> 00:02:20,679 Speaker 1: can type that into a browser URL bar 38 00:02:20,800 --> 00:02:23,760 Speaker 1: or address bar, and the browser will then take care of 39 00:02:23,960 --> 00:02:27,800 Speaker 1: sending the appropriate message to that computer.
In this case, 40 00:02:27,840 --> 00:02:30,639 Speaker 1: we will call it a server, and the server will 41 00:02:30,720 --> 00:02:34,240 Speaker 1: then return the appropriate information to your browser, the web 42 00:02:34,280 --> 00:02:36,560 Speaker 1: page, maybe the home page for HowStuffWorks in 43 00:02:36,600 --> 00:02:39,720 Speaker 1: this case, and then you'll see the website. But all 44 00:02:39,760 --> 00:02:42,520 Speaker 1: of that requires that, first, you have to know that 45 00:02:42,639 --> 00:02:45,240 Speaker 1: there's a site there at all. Plus you have to 46 00:02:45,240 --> 00:02:47,119 Speaker 1: know the URL for it, and you might 47 00:02:47,280 --> 00:02:51,919 Speaker 1: not have that information. Before there even was a World Wide Web, 48 00:02:52,160 --> 00:02:54,360 Speaker 1: there was a need to know where you could find 49 00:02:54,400 --> 00:02:58,280 Speaker 1: stuff on the Internet. Now, remember, the Internet is older 50 00:02:58,320 --> 00:03:00,200 Speaker 1: than the Web, and the Internet and the Web are 51 00:03:00,280 --> 00:03:04,320 Speaker 1: not the same thing. The Web exists on top of 52 00:03:04,360 --> 00:03:07,120 Speaker 1: the Internet. The Internet consists of a lot of other stuff 53 00:03:07,160 --> 00:03:12,120 Speaker 1: besides the Web, right, like email and FTP servers. In fact, 54 00:03:12,760 --> 00:03:16,560 Speaker 1: we need to really talk about FTP servers. FTP stands 55 00:03:16,600 --> 00:03:21,200 Speaker 1: for File Transfer Protocol. So these are computers that house 56 00:03:21,520 --> 00:03:26,440 Speaker 1: certain files on them, and FTP is the protocol 57 00:03:26,560 --> 00:03:30,160 Speaker 1: that allows for files to transfer from one computer to 58 00:03:30,320 --> 00:03:34,560 Speaker 1: another across a network connection.
People can thus access files 59 00:03:34,560 --> 00:03:37,880 Speaker 1: and transfer them from the server to their own computer, 60 00:03:38,000 --> 00:03:41,240 Speaker 1: which in this case we would call a client. But again, 61 00:03:41,680 --> 00:03:45,400 Speaker 1: FTP is really only useful if you know the address 62 00:03:45,520 --> 00:03:49,520 Speaker 1: of the servers where the stuff is that you want, right? 63 00:03:49,960 --> 00:03:52,480 Speaker 1: You can't just use FTP to pull a file out 64 00:03:52,520 --> 00:03:55,920 Speaker 1: of nowhere. You have to contact the proper server and 65 00:03:55,960 --> 00:04:00,960 Speaker 1: pull the relevant file from that server. Enter Alan Emtage, 66 00:04:01,200 --> 00:04:04,920 Speaker 1: who in 1989 was a graduate student at McGill 67 00:04:05,040 --> 00:04:09,000 Speaker 1: University in Montreal, Canada. He also worked as a systems 68 00:04:09,040 --> 00:04:13,280 Speaker 1: administrator for the School of Computer Science at the university, 69 00:04:13,320 --> 00:04:16,040 Speaker 1: and he was running into a challenge. It was his 70 00:04:16,200 --> 00:04:20,719 Speaker 1: job to locate software for professors, for staff, for students 71 00:04:20,760 --> 00:04:23,080 Speaker 1: at the university, but there was no easy way to 72 00:04:23,160 --> 00:04:26,640 Speaker 1: know where all the various files were on the network 73 00:04:26,760 --> 00:04:31,080 Speaker 1: of public FTP servers. Emtage decided there needed to be 74 00:04:31,200 --> 00:04:34,839 Speaker 1: a way to get a snapshot of which public FTP 75 00:04:35,080 --> 00:04:39,559 Speaker 1: servers had which files.
There needed to be some sort 76 00:04:39,600 --> 00:04:43,760 Speaker 1: of directory, and since servers were popping up more frequently 77 00:04:43,960 --> 00:04:48,080 Speaker 1: as more people began to develop stuff for the Internet, 78 00:04:48,360 --> 00:04:50,360 Speaker 1: there also needed to be a good way to search 79 00:04:50,520 --> 00:04:54,680 Speaker 1: those lists to find something specific. Otherwise it would be 80 00:04:54,720 --> 00:04:57,479 Speaker 1: like reading through an entire phone book to find out 81 00:04:57,480 --> 00:05:00,520 Speaker 1: which person or business corresponded to a phone number you 82 00:05:00,600 --> 00:05:03,200 Speaker 1: happen to have seen. Let's say the phone number was 83 00:05:03,279 --> 00:05:06,000 Speaker 1: eight six seven five three oh nine, and you don't 84 00:05:06,040 --> 00:05:08,440 Speaker 1: know that that's Jenny's phone number. You just know it's 85 00:05:08,440 --> 00:05:11,560 Speaker 1: the number. So instead of calling the number and asking, hey, 86 00:05:11,600 --> 00:05:13,960 Speaker 1: whose number is this, you get a phone book and 87 00:05:14,000 --> 00:05:16,720 Speaker 1: you start searching for eight six seven five three oh nine to find 88 00:05:16,760 --> 00:05:20,960 Speaker 1: the corresponding name. That is not efficient. In fact, in 89 00:05:21,000 --> 00:05:24,559 Speaker 1: the early days, information about servers frequently had no 90 00:05:24,720 --> 00:05:29,000 Speaker 1: real channel to get to users other than word of mouth. 91 00:05:29,160 --> 00:05:32,800 Speaker 1: So there was a really good chance that there was 92 00:05:32,839 --> 00:05:35,680 Speaker 1: stuff that was relevant to you that you just had 93 00:05:35,720 --> 00:05:38,000 Speaker 1: no way of knowing about because you had to hear 94 00:05:38,040 --> 00:05:41,680 Speaker 1: it from somebody else first. Emtage, along with a couple 95 00:05:41,720 --> 00:05:44,360 Speaker 1: of other folks like Bill Heelan and J.
Peter Deutsch, 96 00:05:44,760 --> 00:05:48,200 Speaker 1: began building a tool to solve this problem. They ended 97 00:05:48,279 --> 00:05:52,680 Speaker 1: up calling this tool Archie, which actually was not a 98 00:05:52,720 --> 00:05:56,800 Speaker 1: nod to the comic book character from Archie Comics. Instead, 99 00:05:57,040 --> 00:06:01,200 Speaker 1: Archie was a somewhat shortened form of the word archives. 100 00:06:01,920 --> 00:06:05,600 Speaker 1: They created programs that could look through the repositories of 101 00:06:05,720 --> 00:06:09,680 Speaker 1: public FTP sites and get an inventory of the files 102 00:06:09,720 --> 00:06:13,599 Speaker 1: stored on those servers, or, as documented in the book 103 00:06:13,800 --> 00:06:17,320 Speaker 1: A Rough Guide to the Internet by Nicholas West, quote, 104 00:06:17,800 --> 00:06:22,440 Speaker 1: it combined a script based data gatherer which accessed listings 105 00:06:22,440 --> 00:06:26,600 Speaker 1: from anonymous sites with a script which matched regular expressions 106 00:06:26,600 --> 00:06:29,960 Speaker 1: which could retrieve file names matching a user query, end 107 00:06:30,040 --> 00:06:34,279 Speaker 1: quote. Simple, right? Now, in case you're like me and 108 00:06:34,480 --> 00:06:38,160 Speaker 1: what I just quoted sounded a little bit confusing, what 109 00:06:38,240 --> 00:06:40,719 Speaker 1: it really boils down to is to say they made 110 00:06:40,800 --> 00:06:44,800 Speaker 1: a computer program that followed some fairly simple rules. The 111 00:06:44,880 --> 00:06:47,720 Speaker 1: program made note of the file titles that were on 112 00:06:47,880 --> 00:06:51,520 Speaker 1: various FTP servers, kind of like a list of contents, 113 00:06:51,920 --> 00:06:55,680 Speaker 1: and they noted which files were on which servers.
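The two pieces in that quote, a gatherer that records which files sit on which servers and a matcher that runs a regular expression over the file names, can be sketched roughly as follows. This is only an illustration under stated assumptions: the server names, file names, and the `build_inventory` and `search` helpers are all made up here, and nothing below reflects Archie's actual code.

```python
import re

# Hypothetical listings, as a gatherer might have collected them from
# anonymous FTP sites. Server names and file names are invented.
LISTINGS = {
    "ftp.example.edu": ["README", "gnuplot-3.0.tar.Z", "kermit.tar.Z"],
    "archive.example.org": ["gnuplot-2.0.tar.Z", "tetris.zip"],
}


def build_inventory(listings):
    """Flatten the per-server listings into (server, file name) rows."""
    return [(server, name) for server, names in listings.items() for name in names]


def search(inventory, pattern):
    """Return the rows whose file name matches the regular expression."""
    regex = re.compile(pattern)
    return [(server, name) for server, name in inventory if regex.search(name)]


inventory = build_inventory(LISTINGS)
matches = search(inventory, r"^gnuplot-.*\.tar\.Z$")
for server, name in matches:
    print(f"{server}: {name}")
```

The point of the sketch is the split: the inventory is gathered once, ahead of time, and every later query only scans that inventory instead of contacting the servers again.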
Another 114 00:06:55,680 --> 00:06:58,839 Speaker 1: part of the program arranged those findings into a database 115 00:06:59,360 --> 00:07:02,080 Speaker 1: not that much different from the types of spreadsheets you've 116 00:07:02,080 --> 00:07:05,520 Speaker 1: probably worked with in the past. Emtage and crew also 117 00:07:05,640 --> 00:07:08,960 Speaker 1: created a tool that would allow them to search this database. 118 00:07:09,440 --> 00:07:12,040 Speaker 1: Before long, other people began to hear that he had 119 00:07:12,080 --> 00:07:14,840 Speaker 1: this database, and they would ask him, hey, can 120 00:07:14,880 --> 00:07:16,480 Speaker 1: you do a search for me, and they would give 121 00:07:16,520 --> 00:07:18,960 Speaker 1: him the search terms, and it started taking up a 122 00:07:18,960 --> 00:07:22,040 Speaker 1: lot of his time. So in an effort to streamline things, 123 00:07:22,320 --> 00:07:26,320 Speaker 1: he programmed a user interface or UI that would allow 124 00:07:26,360 --> 00:07:29,440 Speaker 1: people to conduct their own searches. They could just log 125 00:07:29,480 --> 00:07:32,200 Speaker 1: into this tool and then type in the file that 126 00:07:32,240 --> 00:07:34,960 Speaker 1: they were looking for and it would return the results 127 00:07:35,000 --> 00:07:37,960 Speaker 1: for them. So as long as they were sure about 128 00:07:38,000 --> 00:07:41,480 Speaker 1: the specific file they needed, they would get the results. Now, 129 00:07:41,560 --> 00:07:45,239 Speaker 1: most resources generally agree that Archie was the first real 130 00:07:45,480 --> 00:07:49,120 Speaker 1: search engine on the Internet, but it wasn't a web 131 00:07:49,320 --> 00:07:53,200 Speaker 1: search engine. The Web didn't exist yet. It wasn't long 132 00:07:53,240 --> 00:07:58,160 Speaker 1: before a couple of other tools followed.
In 1991, some researchers 133 00:07:58,160 --> 00:08:01,200 Speaker 1: with the University of Minnesota developed a new tool to 134 00:08:01,440 --> 00:08:05,840 Speaker 1: organize and discover documents stored on servers, and the tool 135 00:08:05,960 --> 00:08:10,840 Speaker 1: was called the Gopher protocol. Servers were data repositories called 136 00:08:11,000 --> 00:08:14,800 Speaker 1: Gopher holes, eventually that's what they were called anyway, and 137 00:08:14,880 --> 00:08:19,440 Speaker 1: Gopher organized everything into a hierarchical text based menu system. 138 00:08:19,520 --> 00:08:23,040 Speaker 1: So this was a specific strategy that was built on 139 00:08:23,080 --> 00:08:25,320 Speaker 1: top of the Internet. You can kind of think of 140 00:08:25,320 --> 00:08:29,400 Speaker 1: it as being in parallel with the Web. It predated 141 00:08:29,440 --> 00:08:31,840 Speaker 1: the Web, but the Web and Gopher would exist at 142 00:08:31,880 --> 00:08:34,240 Speaker 1: the same time, but they were not the same thing. 143 00:08:34,640 --> 00:08:38,679 Speaker 1: This was a different strategy in order to serve information 144 00:08:38,720 --> 00:08:45,360 Speaker 1: across the networks. Before long, like in 1992, more researchers developed 145 00:08:45,400 --> 00:08:48,600 Speaker 1: a search function to work on top of Gopher. Because again, 146 00:08:49,240 --> 00:08:52,319 Speaker 1: if you didn't know where something actually, quote unquote, lived 147 00:08:52,800 --> 00:08:56,160 Speaker 1: in the Gopher network, you would never be able to 148 00:08:56,200 --> 00:08:59,520 Speaker 1: find it unless you were just lucky. So this search 149 00:08:59,600 --> 00:09:03,400 Speaker 1: tool was called Veronica, and that is pretty cute because 150 00:09:04,080 --> 00:09:08,719 Speaker 1: Veronica is a character in the Archie comic books.
And 151 00:09:08,800 --> 00:09:12,400 Speaker 1: while the search engine Archie did not pull its name 152 00:09:12,520 --> 00:09:15,840 Speaker 1: from Archie Comics, Veronica was a nod to the older 153 00:09:15,880 --> 00:09:19,160 Speaker 1: search engine as well as a nod to the comic book, 154 00:09:19,200 --> 00:09:23,920 Speaker 1: so it almost kind of retroactively made Archie relate back 155 00:09:23,960 --> 00:09:29,560 Speaker 1: to the comics. Later, computer geeks assigned a backronym to Veronica. 156 00:09:29,920 --> 00:09:33,080 Speaker 1: This is an acronym that you create after you've already 157 00:09:33,160 --> 00:09:35,080 Speaker 1: named a thing. So you've given the thing a name, 158 00:09:35,320 --> 00:09:37,360 Speaker 1: and then you're thinking, okay, well, what can we say 159 00:09:37,400 --> 00:09:41,120 Speaker 1: each of those letters stands for? That's a backronym. And 160 00:09:41,160 --> 00:09:45,600 Speaker 1: in this case, the revisionist name was Very Easy Rodent 161 00:09:45,840 --> 00:09:51,400 Speaker 1: Oriented Net-Wide Index to Computer Archives, or Veronica. Super cute. 162 00:09:51,920 --> 00:09:56,040 Speaker 1: What Veronica did was fairly primitive. It created a database 163 00:09:56,160 --> 00:10:00,199 Speaker 1: of every file and every directory on every Gopher 164 00:10:00,280 --> 00:10:02,960 Speaker 1: server that was connected to the Internet, and it 165 00:10:02,960 --> 00:10:07,040 Speaker 1: would update dynamically as more servers joined the network.
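One common way to make a database of "every file and every directory on every server" searchable by keyword is an inverted index, which maps each word to the entries that contain it. The sketch below assumes that design purely for illustration; the transcript does not say how Veronica stored its data, and the servers, paths, titles, and the `build_index` and `lookup` helpers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical menu entries: (server, path, title). All names are invented.
ENTRIES = [
    ("gopher.example.edu", "/weather/today", "Weather forecast for today"),
    ("gopher.example.edu", "/library/hours", "Library opening hours"),
    ("gopher.example.org", "/campus/weather", "Campus weather station"),
]


def build_index(entries):
    """Map each lowercase title word to the set of (server, path) entries using it."""
    index = defaultdict(set)
    for server, path, title in entries:
        for word in title.lower().split():
            index[word].add((server, path))
    return index


def lookup(index, query):
    """Return entries whose titles contain every word in the query, sorted."""
    words = query.lower().split()
    if not words:
        return []
    results = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(results)


index = build_index(ENTRIES)
print(lookup(index, "weather"))
```

Every new server adds more entries to the same index, which hints at why a single network-wide index gets harder to maintain as the network grows.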
That 166 00:10:07,080 --> 00:10:10,599 Speaker 1: approach worked fairly well when there was still a relatively 167 00:10:10,679 --> 00:10:14,440 Speaker 1: small number of servers to keep track of. But as 168 00:10:14,520 --> 00:10:18,200 Speaker 1: more servers came online and joined this Gopher network, with 169 00:10:18,280 --> 00:10:22,199 Speaker 1: more documents stored on each server, it started to get 170 00:10:22,200 --> 00:10:26,800 Speaker 1: really, you know, challenging to manage Veronica. A secondary 171 00:10:26,840 --> 00:10:30,040 Speaker 1: Gopher search tool kind of addressed this problem, and this 172 00:10:30,080 --> 00:10:33,520 Speaker 1: one also took its name from a character from Archie Comics, 173 00:10:34,040 --> 00:10:37,840 Speaker 1: Jughead. This one didn't create a full database of 174 00:10:37,960 --> 00:10:42,040 Speaker 1: everything that was on the Gopher network. Instead, as a user, 175 00:10:42,360 --> 00:10:45,920 Speaker 1: you would have to designate which Gopher server you wanted 176 00:10:45,960 --> 00:10:48,320 Speaker 1: to search, so you had to at least have some 177 00:10:48,480 --> 00:10:51,640 Speaker 1: general idea of where it was you needed to look. 178 00:10:52,360 --> 00:10:54,199 Speaker 1: But if you did know that, it was a much 179 00:10:54,240 --> 00:10:57,920 Speaker 1: faster approach than trying to search everything on the network 180 00:10:57,920 --> 00:11:01,760 Speaker 1: as a whole. Gopher had a major problem, and that 181 00:11:01,920 --> 00:11:06,079 Speaker 1: was that it was becoming increasingly less efficient and less easy to navigate 182 00:11:06,600 --> 00:11:11,360 Speaker 1: the larger it got. It didn't scale well. Meanwhile, at 183 00:11:11,360 --> 00:11:14,240 Speaker 1: the same time that Gopher was growing, a guy named 184 00:11:14,400 --> 00:11:17,959 Speaker 1: Tim Berners-Lee over at CERN.
You know, that's the 185 00:11:18,000 --> 00:11:22,120 Speaker 1: research facility that oversees stuff like the Large Hadron Collider. Well, 186 00:11:22,160 --> 00:11:26,240 Speaker 1: he was developing a different approach to storing and sharing 187 00:11:26,280 --> 00:11:30,199 Speaker 1: information across networks. Tim and his team at CERN developed 188 00:11:30,240 --> 00:11:34,160 Speaker 1: a protocol called Hypertext Transfer Protocol, or 189 00:11:34,360 --> 00:11:39,040 Speaker 1: HTTP, and Hypertext Markup Language, or HTML. Both 190 00:11:39,040 --> 00:11:41,240 Speaker 1: of these kind of grew out of stuff that CERN 191 00:11:41,320 --> 00:11:44,600 Speaker 1: had been using internally for a while. Now I'm guessing 192 00:11:44,640 --> 00:11:47,240 Speaker 1: those terms sound familiar to you guys. These are the 193 00:11:47,280 --> 00:11:51,040 Speaker 1: two components that really formed the basis of web pages 194 00:11:51,120 --> 00:11:54,760 Speaker 1: and the World Wide Web. The markup language acts as the 195 00:11:54,800 --> 00:11:58,440 Speaker 1: set of instructions on how a computer, or more specifically, 196 00:11:58,480 --> 00:12:04,040 Speaker 1: how a browser, is to interpret and display documents, eventually 197 00:12:04,040 --> 00:12:07,880 Speaker 1: including stuff like images and sound files, although initially the 198 00:12:07,920 --> 00:12:11,640 Speaker 1: web was strictly text based and browsers were text based 199 00:12:11,640 --> 00:12:15,240 Speaker 1: as well. HTTP is the set of 200 00:12:15,360 --> 00:12:19,320 Speaker 1: rules and the processes through which a client, that being 201 00:12:19,400 --> 00:12:23,680 Speaker 1: your web browser, can request a specific document from a server, 202 00:12:24,280 --> 00:12:27,600 Speaker 1: and how the server can then send that requested document 203 00:12:27,640 --> 00:12:30,400 Speaker 1: to the browser.
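To make the request and response idea a little more concrete, here is a minimal sketch of what the text of such an exchange can look like. It assumes the later HTTP/1.1 message layout rather than the very first version of the protocol, it never touches the network, and the host, path, canned response, and the `build_request` and `parse_response` helpers are all made up for illustration.

```python
def build_request(host, path):
    """Build the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


request = build_request("www.example.com", "/index.html")

# A canned response standing in for what a server might send back.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)
status, headers, body = parse_response(raw_response)
print(status, headers["content-type"], body)
```

The client's side is a short text message naming the document it wants; the server's side is a status line, some headers, and then the document itself.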
The server sends the 204 00:12:30,520 --> 00:12:33,440 Speaker 1: HTML files to the client, and the client interprets those 205 00:12:33,559 --> 00:12:37,200 Speaker 1: HTML files in order to display the relevant web page 206 00:12:37,240 --> 00:12:42,040 Speaker 1: to the user. Hypertext refers to text that has a 207 00:12:42,120 --> 00:12:45,480 Speaker 1: link to some other text, and you can think of 208 00:12:45,520 --> 00:12:48,360 Speaker 1: it kind of like a footnote in a book. The 209 00:12:48,480 --> 00:12:52,840 Speaker 1: hypertext has an asterisk that corresponds to another piece of 210 00:12:52,880 --> 00:12:57,440 Speaker 1: information somewhere that is also marked by an asterisk, except 211 00:12:57,480 --> 00:13:02,600 Speaker 1: in this case the asterisks are invisible. It's highlighted text 212 00:13:02,720 --> 00:13:05,360 Speaker 1: or text in a different color, or it's underlined. 213 00:13:05,840 --> 00:13:08,480 Speaker 1: It's designated in some way to be different from all 214 00:13:08,520 --> 00:13:10,719 Speaker 1: the rest of the text. That's what lets you know 215 00:13:10,960 --> 00:13:14,800 Speaker 1: it's hypertext and it's linked to something else. Hypertext documents 216 00:13:14,840 --> 00:13:18,800 Speaker 1: connect to one another through hyperlinks. Those documents don't even 217 00:13:18,840 --> 00:13:20,640 Speaker 1: have to be on the same server. They can be 218 00:13:20,679 --> 00:13:23,240 Speaker 1: on opposite sides of the world. So this means you 219 00:13:23,240 --> 00:13:27,679 Speaker 1: can build a reference in one hypertext document to content 220 00:13:27,760 --> 00:13:31,640 Speaker 1: that's found in a totally different hypertext document. Clicking on 221 00:13:31,840 --> 00:13:35,840 Speaker 1: that hypertext activates the link.
It sends a command to 222 00:13:36,360 --> 00:13:39,440 Speaker 1: the browser, which then tells the server 223 00:13:39,800 --> 00:13:43,880 Speaker 1: that the client wants to see specific linked information, and 224 00:13:44,040 --> 00:13:47,600 Speaker 1: the server returns that. You can also link to locations 225 00:13:47,640 --> 00:13:50,080 Speaker 1: that are within the same page of a document, or 226 00:13:50,200 --> 00:13:53,440 Speaker 1: specific locations on other pages. It doesn't have to just 227 00:13:53,520 --> 00:13:55,559 Speaker 1: be click on this and you go to a new 228 00:13:55,559 --> 00:13:57,880 Speaker 1: web page. It might be click on this and you 229 00:13:58,120 --> 00:14:02,000 Speaker 1: skip down, you know, a significant number of paragraphs to 230 00:14:02,040 --> 00:14:05,360 Speaker 1: get to the relevant information. Really, all the link is 231 00:14:05,400 --> 00:14:09,920 Speaker 1: doing is telling the browser where some specific point in 232 00:14:09,960 --> 00:14:12,680 Speaker 1: a specific document happens to be and how to get there. 233 00:14:13,000 --> 00:14:14,640 Speaker 1: It's kind of like if you were reading a book 234 00:14:14,640 --> 00:14:17,080 Speaker 1: that said, want to know more? Skip to page 235 00:14:17,080 --> 00:14:20,400 Speaker 1: nineteen and read the third paragraph. Or sometimes 236 00:14:20,440 --> 00:14:22,760 Speaker 1: I compare it to those old choose your own adventure 237 00:14:22,800 --> 00:14:25,000 Speaker 1: books where you get to the bottom of a page 238 00:14:25,080 --> 00:14:27,160 Speaker 1: and you have to make a decision, and based on 239 00:14:27,200 --> 00:14:28,960 Speaker 1: which decision you make, you have to turn to a 240 00:14:28,960 --> 00:14:32,280 Speaker 1: specific page to pick up the story again. Well, you 241 00:14:32,280 --> 00:14:35,360 Speaker 1: can quickly see how this would be really useful.
Let's 242 00:14:35,360 --> 00:14:37,400 Speaker 1: say I want to make a web page that includes 243 00:14:37,520 --> 00:14:40,560 Speaker 1: directions on how to perform a particular process. We're gonna 244 00:14:40,600 --> 00:14:43,640 Speaker 1: call it baking a soufflé. And the steps I 245 00:14:43,760 --> 00:14:48,160 Speaker 1: list include references to other, maybe slightly less involved processes 246 00:14:48,240 --> 00:14:50,960 Speaker 1: that are part of this, and I don't go into 247 00:14:51,000 --> 00:14:54,240 Speaker 1: explaining how those work. Let's say, like, I talk about 248 00:14:54,280 --> 00:14:56,080 Speaker 1: cracking eggs, but I don't tell you the best way 249 00:14:56,080 --> 00:14:59,440 Speaker 1: to crack an egg. However, I could create hypertext links 250 00:14:59,520 --> 00:15:03,120 Speaker 1: to other sets of instructions, maybe a specific page just 251 00:15:03,320 --> 00:15:06,040 Speaker 1: on different ways to crack an egg. And that way 252 00:15:06,040 --> 00:15:08,440 Speaker 1: you could go and look that up if you weren't confident. 253 00:15:08,760 --> 00:15:10,720 Speaker 1: So if you don't know how something works, you can 254 00:15:10,720 --> 00:15:13,280 Speaker 1: click on that other link and go to a page 255 00:15:13,320 --> 00:15:17,240 Speaker 1: that's dedicated to that in order to learn more. And yes, 256 00:15:17,560 --> 00:15:20,600 Speaker 1: I just described how the web works in general, which 257 00:15:20,640 --> 00:15:23,240 Speaker 1: is something I'm sure you all know at least at 258 00:15:23,280 --> 00:15:25,920 Speaker 1: some level, even if it's not, you know, a formal one. 259 00:15:26,240 --> 00:15:29,400 Speaker 1: But if you've ever been on, say, Wikipedia, reading up 260 00:15:29,400 --> 00:15:32,360 Speaker 1: on a topic and saw a hypertext link and thought, yeah, 261 00:15:32,520 --> 00:15:34,840 Speaker 1: I should find out what this term means.
I don't 262 00:15:34,920 --> 00:15:37,880 Speaker 1: understand it, so you click on that and you go 263 00:15:38,040 --> 00:15:40,400 Speaker 1: follow that so you can get a better understanding, then that's 264 00:15:40,400 --> 00:15:43,400 Speaker 1: the use case I'm referring to. And there's a conversation 265 00:15:43,480 --> 00:15:46,440 Speaker 1: we could have about what actually goes on when you 266 00:15:46,520 --> 00:15:49,960 Speaker 1: click a link, but that would require a deeper dive 267 00:15:50,000 --> 00:15:53,160 Speaker 1: into how the web and, by extension, how the Internet 268 00:15:53,200 --> 00:15:55,680 Speaker 1: works on a very technical level, and I think that 269 00:15:55,720 --> 00:15:57,640 Speaker 1: goes beyond the scope of what we're trying to do 270 00:15:57,680 --> 00:16:00,840 Speaker 1: in this episode. So I'm going to simplify, perhaps 271 00:16:00,840 --> 00:16:04,760 Speaker 1: to a ludicrous degree, and say that a link contains 272 00:16:04,800 --> 00:16:08,480 Speaker 1: within it the information about where another document, or even 273 00:16:08,560 --> 00:16:13,160 Speaker 1: a specific point within another document, exists, and activating that 274 00:16:13,240 --> 00:16:16,160 Speaker 1: link by clicking on it in a browser initiates a 275 00:16:16,200 --> 00:16:20,240 Speaker 1: sequence that results in the browser requesting that specific document 276 00:16:20,400 --> 00:16:24,040 Speaker 1: from the appropriate server, which then sends that document to 277 00:16:24,120 --> 00:16:26,440 Speaker 1: the browser so that you can see it. A lot 278 00:16:26,480 --> 00:16:29,280 Speaker 1: more is going on to make this happen, but let's 279 00:16:29,280 --> 00:16:32,280 Speaker 1: just stick with that high level view.
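As a rough illustration of a link carrying "where another document, or even a specific point within another document, exists," the sketch below pulls the link targets out of a small made-up page and splits each one into a document address and an optional point inside that document. It uses Python's standard `html.parser` and `urllib.parse` modules; the page, the addresses, and the `LinkCollector` and `resolve_links` names are assumptions for the example only.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def resolve_links(base_url, html):
    """Return (document URL, fragment) pairs for each link on the page."""
    collector = LinkCollector()
    collector.feed(html)
    resolved = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # relative links resolve against the page
        document, fragment = urldefrag(absolute)  # split off the "#point" part
        resolved.append((document, fragment))
    return resolved


PAGE = """
<p>First, <a href="/eggs.html#cracking">crack the eggs</a>.</p>
<p>See <a href="#step-3">step 3</a> or
<a href="http://other.example.org/ovens.html">another site</a>.</p>
"""

links = resolve_links("http://recipes.example.com/souffle.html", PAGE)
for document, fragment in links:
    print(document, fragment or "(top of page)")
```

The first link points at a spot inside another page on the same server, the second at a spot further down the same page, and the third at a page on a different server entirely.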
So the pair 280 00:16:32,360 --> 00:16:36,000 Speaker 1: of HTTP and HTML evolved at the same 281 00:16:36,040 --> 00:16:40,480 Speaker 1: time that Gopher was establishing itself, and some people stuck 282 00:16:40,520 --> 00:16:44,480 Speaker 1: with Gopher, but it just really never took off the 283 00:16:44,520 --> 00:16:47,760 Speaker 1: same way that the web did with HTML and HTTP. 284 00:16:48,320 --> 00:16:51,440 Speaker 1: The protocol and the markup language are what the web 285 00:16:51,560 --> 00:16:54,720 Speaker 1: is built upon, and we call it a web because 286 00:16:54,760 --> 00:16:57,920 Speaker 1: of that interconnectivity of documents. You can build out a 287 00:16:58,000 --> 00:17:01,160 Speaker 1: doc and then link that document to another doc, which 288 00:17:01,200 --> 00:17:04,080 Speaker 1: might be linked to a dozen others, and by following 289 00:17:04,119 --> 00:17:07,000 Speaker 1: those links, you can navigate from one document to the next. 290 00:17:07,320 --> 00:17:09,680 Speaker 1: It really is similar to what happens to a lot 291 00:17:09,720 --> 00:17:12,240 Speaker 1: of people when they visit Wikipedia and they just start 292 00:17:12,320 --> 00:17:15,440 Speaker 1: following all sorts of links. But you might already see 293 00:17:15,440 --> 00:17:19,320 Speaker 1: a challenge with that kind of design. It works great 294 00:17:19,400 --> 00:17:23,399 Speaker 1: if you've got a centralized person or institution that's building 295 00:17:23,400 --> 00:17:27,000 Speaker 1: out the web, adding pages in a very logical way 296 00:17:27,440 --> 00:17:30,119 Speaker 1: and linking to them in a very logical way. But 297 00:17:30,280 --> 00:17:33,399 Speaker 1: one of Tim Berners-Lee's major goals was to create 298 00:17:33,400 --> 00:17:37,879 Speaker 1: a democratized system that didn't depend upon a centralized authority.
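Navigating from one document to the next by following links can be sketched as a breadth-first walk over a tiny, made-up link graph. The page names and the `crawl` helper are assumptions for illustration and do not describe any real crawler; the thing to notice is that a page nothing links to never gets visited.

```python
from collections import deque

# A hypothetical, in-memory "web": each page maps to the pages it links to.
LINKS = {
    "home": ["about", "recipes"],
    "about": ["home"],
    "recipes": ["eggs", "home"],
    "eggs": [],
    "island": ["home"],  # links out, but nothing links in
}


def crawl(links, start):
    """Visit every page reachable from start by following links, breadth-first."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


visited = crawl(LINKS, "home")
unreached = sorted(set(LINKS) - set(visited))
print(visited)
print(unreached)
```

In this toy graph the walk starting from "home" finds four pages and leaves "island" undiscovered, even though "island" itself links back to "home".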
299 00:17:38,040 --> 00:17:40,360 Speaker 1: People should be able to build their own web pages 300 00:17:40,520 --> 00:17:43,120 Speaker 1: and host them on their own servers. But how would 301 00:17:43,200 --> 00:17:46,600 Speaker 1: anyone else find them if there were no links going 302 00:17:46,760 --> 00:17:50,119 Speaker 1: into those pages? If the web pages are made and 303 00:17:50,200 --> 00:17:52,960 Speaker 1: hosted independently of the first few pages on the web, 304 00:17:53,440 --> 00:17:57,800 Speaker 1: where is the connective tissue? The original solution wasn't a 305 00:17:57,880 --> 00:18:01,080 Speaker 1: search engine. It took a bit more of a hands 306 00:18:01,119 --> 00:18:04,480 Speaker 1: on approach. I'll explain more, but first let's take a 307 00:18:04,560 --> 00:18:16,240 Speaker 1: quick break. In the early days, when people first started 308 00:18:16,280 --> 00:18:19,520 Speaker 1: building documents to host on the web, in other words, 309 00:18:19,800 --> 00:18:23,399 Speaker 1: the earliest web pages, Tim Berners-Lee would take it 310 00:18:23,440 --> 00:18:27,879 Speaker 1: upon himself to create an index hosted on CERN's own server. 311 00:18:28,359 --> 00:18:31,760 Speaker 1: Someone might send him a message saying that they had 312 00:18:31,920 --> 00:18:34,640 Speaker 1: built and were hosting a new web page, and they 313 00:18:34,680 --> 00:18:38,360 Speaker 1: could include the address or URL. Berners-Lee 314 00:18:38,440 --> 00:18:41,600 Speaker 1: would then add a hypertext link to a growing catalog 315 00:18:41,840 --> 00:18:45,000 Speaker 1: of those kinds of links on a page hosted by CERN. 316 00:18:45,240 --> 00:18:47,960 Speaker 1: So if you visited CERN's site, you could navigate to 317 00:18:48,000 --> 00:18:50,919 Speaker 1: that index and see the links to the other sites. 318 00:18:51,600 --> 00:18:55,600 Speaker 1: Tim came to call this a virtual library.
He and 319 00:18:55,640 --> 00:18:59,399 Speaker 1: a group of volunteers oversaw its evolution. They organized it 320 00:18:59,440 --> 00:19:03,080 Speaker 1: into different areas of interest, with subject matter experts overseeing 321 00:19:03,160 --> 00:19:06,560 Speaker 1: specific categories, and a lot of these early pages belonged 322 00:19:06,560 --> 00:19:11,280 Speaker 1: to scientific research organizations or universities or publications, and all 323 00:19:11,359 --> 00:19:14,320 Speaker 1: that makes sense. CERN is the organization that oversees the 324 00:19:14,400 --> 00:19:17,240 Speaker 1: Large Hadron Collider, after all, so it's no surprise that 325 00:19:17,359 --> 00:19:21,480 Speaker 1: the early web really focused on science and academia. Also, 326 00:19:21,640 --> 00:19:24,280 Speaker 1: it's good to mention that the web in those early 327 00:19:24,359 --> 00:19:28,639 Speaker 1: days, again, was text based. Browsers were text based too. 328 00:19:28,760 --> 00:19:32,360 Speaker 1: That would not really change until another year, like 1993. 329 00:19:34,160 --> 00:19:38,159 Speaker 1: Like Gopher's design, the virtual library approach worked fairly well 330 00:19:38,359 --> 00:19:41,399 Speaker 1: when the Web was still small in scale. According to 331 00:19:41,440 --> 00:19:46,119 Speaker 1: the Virtual Library website, in August 1992 there were about 332 00:19:46,160 --> 00:19:50,639 Speaker 1: twenty web servers in existence total. A little more than 333 00:19:50,680 --> 00:19:54,120 Speaker 1: a year later, in October of 1993, it was more 334 00:19:54,119 --> 00:19:58,800 Speaker 1: than two hundred web servers, so growth was still fairly modest, 335 00:19:58,960 --> 00:20:03,800 Speaker 1: but things kind of took off after that. By January 336 00:20:04,000 --> 00:20:09,360 Speaker 1: 1996, there were more than one hundred thousand web servers.
337 00:20:09,720 --> 00:20:13,480 Speaker 1: The following year there were more than six hundred fifty thousand. 338 00:20:13,560 --> 00:20:17,000 Speaker 1: It was growing so fast, and maintaining an index was 339 00:20:17,040 --> 00:20:23,000 Speaker 1: becoming increasingly difficult, particularly by doing it, you know, manually. 340 00:20:23,400 --> 00:20:26,480 Speaker 1: The virtual library was taking shape around the same time 341 00:20:26,520 --> 00:20:29,600 Speaker 1: that students were building the Veronica search engine for Gopher, 342 00:20:29,640 --> 00:20:31,720 Speaker 1: so all this was happening around the same time. I 343 00:20:31,760 --> 00:20:35,880 Speaker 1: know it sounds like I'm going strictly chronologically, but that's 344 00:20:35,880 --> 00:20:39,840 Speaker 1: just not too helpful. We have to remember this 345 00:20:39,880 --> 00:20:43,439 Speaker 1: is all happening simultaneously. So as the web grew and 346 00:20:43,520 --> 00:20:48,320 Speaker 1: became more complex, indices were growing as well. Just navigating 347 00:20:48,359 --> 00:20:52,240 Speaker 1: an index to find what you wanted would become a challenge, 348 00:20:52,359 --> 00:20:55,120 Speaker 1: particularly if you weren't thinking in the same way as 349 00:20:55,160 --> 00:20:58,399 Speaker 1: the people who had organized the index. This is where 350 00:20:58,440 --> 00:21:02,880 Speaker 1: taxonomy comes in. Taxonomy refers to a system of classification. 351 00:21:03,160 --> 00:21:05,560 Speaker 1: A taxonomy is a set of rules we use to 352 00:21:05,720 --> 00:21:08,879 Speaker 1: organize stuff, and there is no one way to do 353 00:21:08,960 --> 00:21:12,840 Speaker 1: it correctly. So I'll give a simple example. Let's say 354 00:21:12,960 --> 00:21:16,240 Speaker 1: you've got a class of students and you have them 355 00:21:16,280 --> 00:21:20,200 Speaker 1: all divide up into smaller groups. 
You give each group 356 00:21:20,520 --> 00:21:23,960 Speaker 1: a pile of documents, the same documents per group, but 357 00:21:24,040 --> 00:21:28,680 Speaker 1: you tell the students it's their job to organize those documents. Well, 358 00:21:28,720 --> 00:21:31,080 Speaker 1: one group decides that they're going to organize all the 359 00:21:31,119 --> 00:21:34,840 Speaker 1: documents by alphabetizing them by title. The title of each 360 00:21:34,920 --> 00:21:38,640 Speaker 1: document will determine how they fall in the pile, so 361 00:21:39,240 --> 00:21:42,320 Speaker 1: theirs are all in alphabetical order. Another group decides that 362 00:21:42,320 --> 00:21:45,160 Speaker 1: they're gonna bundle their documents that all cover the same 363 00:21:45,200 --> 00:21:50,320 Speaker 1: subject matter together, and they'll alphabetize within subjects. So they 364 00:21:50,400 --> 00:21:53,840 Speaker 1: might have a stack that's just about biology, another stack 365 00:21:53,920 --> 00:21:57,240 Speaker 1: that's about chemistry, another one about material science, and so on. 366 00:21:57,920 --> 00:22:01,159 Speaker 1: A third group bundles their documents together by author. 367 00:22:01,440 --> 00:22:04,480 Speaker 1: They put all of the same author's works together, and 368 00:22:04,480 --> 00:22:08,800 Speaker 1: then maybe they alphabetize the authors. Yet another group focuses 369 00:22:08,880 --> 00:22:12,880 Speaker 1: on publication date. They arrange all their documents in order 370 00:22:12,920 --> 00:22:16,840 Speaker 1: of when they were published. So you can imagine combinations 371 00:22:16,840 --> 00:22:20,600 Speaker 1: of these approaches as well, right, such as ordering documents chronologically, 372 00:22:20,640 --> 00:22:23,880 Speaker 1: but then if two documents were published on the same date, 373 00:22:23,920 --> 00:22:26,479 Speaker 1: you then alphabetize them. That kind of thing. 
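The student groups above are each applying a different sort key, and the combined rule (date first, title as the tiebreaker) is a two-part key. Here is a minimal Python sketch of that idea. The documents, authors, and dates are made up for illustration and don't come from the episode.

```python
from datetime import date

# Hypothetical documents: (title, author, subject, publication date).
documents = [
    ("Cell Structure", "Rivera", "biology", date(1993, 5, 1)),
    ("Alloys at Scale", "Chen", "materials science", date(1993, 5, 1)),
    ("Bonding Basics", "Adams", "chemistry", date(1992, 8, 15)),
]

# One group's rule: alphabetize by title.
by_title = sorted(documents, key=lambda d: d[0])

# Another group's rule: bundle by subject, then alphabetize within each subject.
by_subject_then_title = sorted(documents, key=lambda d: (d[2], d[0]))

# Combined rule: publication date first, title only to break ties.
by_date_then_title = sorted(documents, key=lambda d: (d[3], d[0]))

for title, author, subject, published in by_date_then_title:
    print(published, title)
```

The order of the fields inside the key tuple is the "which rule comes first" decision: swap them and the same pile of documents ends up in a different order.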
You can 374 00:22:26,520 --> 00:22:28,760 Speaker 1: think of those different sets of rules, and you have 375 00:22:28,800 --> 00:22:32,600 Speaker 1: to determine which rules are most important, right, which one 376 00:22:32,720 --> 00:22:36,400 Speaker 1: you do first, and which is secondary. Well, it's important 377 00:22:36,440 --> 00:22:40,720 Speaker 1: that these taxonomies, however you construct them, are consistent, or 378 00:22:40,760 --> 00:22:43,640 Speaker 1: else it becomes a chaotic mess. But even a well- 379 00:22:43,840 --> 00:22:47,560 Speaker 1: organized and maintained taxonomy can still be a challenge for 380 00:22:47,680 --> 00:22:51,040 Speaker 1: someone new coming into the system. And that's really what 381 00:22:51,040 --> 00:22:55,520 Speaker 1: I'm getting at here. A comprehensive index will seem incredibly 382 00:22:55,600 --> 00:22:59,840 Speaker 1: overwhelming to someone who's unfamiliar with the system's taxonomy, and 383 00:23:00,040 --> 00:23:03,480 Speaker 1: it will still seem like finding a specific document or 384 00:23:03,560 --> 00:23:07,560 Speaker 1: web page is an impossible task once that index grows 385 00:23:07,600 --> 00:23:11,880 Speaker 1: to a large enough size. Clearly, a search tool would 386 00:23:11,920 --> 00:23:14,320 Speaker 1: be a big solution to that problem. If you can 387 00:23:14,359 --> 00:23:17,600 Speaker 1: type a query into a search engine and you can 388 00:23:17,640 --> 00:23:20,240 Speaker 1: get the results you want, you don't have to comb 389 00:23:20,320 --> 00:23:23,639 Speaker 1: through an enormous index hoping that you're thinking in the 390 00:23:23,680 --> 00:23:27,240 Speaker 1: same way as the custodians of that index. But in 391 00:23:27,359 --> 00:23:31,199 Speaker 1: addition to those challenges was one of scale. 
Building this 392 00:23:31,320 --> 00:23:34,879 Speaker 1: index required a lot of volunteer effort because again it 393 00:23:34,920 --> 00:23:38,320 Speaker 1: was done by hand. The next step toward the development 394 00:23:38,320 --> 00:23:41,800 Speaker 1: of a search engine was a project that was tackled 395 00:23:41,840 --> 00:23:45,359 Speaker 1: by an MIT student, Matthew Gray. That was 396 00:23:45,400 --> 00:23:49,399 Speaker 1: the student. He designed a program he called the World Wide 397 00:23:49,600 --> 00:23:53,679 Speaker 1: Web Wanderer, and the purpose of this program was to 398 00:23:53,760 --> 00:23:58,240 Speaker 1: automatically navigate across the web, cataloging the web's growth by 399 00:23:58,280 --> 00:24:03,560 Speaker 1: registering new websites, web pages, and web servers. The World Wide 400 00:24:03,560 --> 00:24:07,879 Speaker 1: Web Wanderer is arguably the earliest automated spider or web 401 00:24:07,880 --> 00:24:11,640 Speaker 1: crawler designed for the Web. When Gray developed the program 402 00:24:11,640 --> 00:24:15,960 Speaker 1: back in the web browser Mosaic, which was the first 403 00:24:16,040 --> 00:24:19,479 Speaker 1: popular browser designed for Windows, was just a couple of 404 00:24:19,480 --> 00:24:23,520 Speaker 1: months old. Mosaic was also a graphical browser, and while 405 00:24:23,520 --> 00:24:26,720 Speaker 1: it wasn't the first graphical browser, it was the first 406 00:24:26,800 --> 00:24:30,520 Speaker 1: popular one available to the average person outside of places 407 00:24:30,600 --> 00:24:34,680 Speaker 1: like CERN. Gray wanted to automatically detect new websites for 408 00:24:34,800 --> 00:24:38,439 Speaker 1: discovery purposes, but before long the number of sites was 409 00:24:38,520 --> 00:24:42,200 Speaker 1: growing so quickly that he shifted his attention to charting 410 00:24:42,240 --> 00:24:45,639 Speaker 1: the growth of the web in general. 
Gray wrote his 411 00:24:45,680 --> 00:24:49,920 Speaker 1: program using the Perl computer language. That's P-E-R-L. 412 00:24:50,560 --> 00:24:54,439 Speaker 1: And here's a quick refresher on computer languages. We know 413 00:24:54,600 --> 00:24:59,959 Speaker 1: that typically computers process information in machine code, which mostly 414 00:25:00,040 --> 00:25:03,440 Speaker 1: for our cases means binary data, and that means all 415 00:25:03,480 --> 00:25:07,560 Speaker 1: the information going through the machines ultimately breaks down into 416 00:25:07,680 --> 00:25:10,200 Speaker 1: zeros and ones, and you can think of that like 417 00:25:10,280 --> 00:25:14,080 Speaker 1: a light switch flipped either off or on. Now, a 418 00:25:14,119 --> 00:25:17,000 Speaker 1: single off or on is easy, but if we want to 419 00:25:17,040 --> 00:25:22,040 Speaker 1: represent more complex ideas, processes, that kind of thing, we 420 00:25:22,119 --> 00:25:26,560 Speaker 1: need a lot of bits. A single alphanumeric character 421 00:25:26,680 --> 00:25:30,840 Speaker 1: in the ASCII code requires seven bits, so you 422 00:25:30,880 --> 00:25:35,240 Speaker 1: need a string of seven zeros or ones just to represent 423 00:25:35,320 --> 00:25:38,440 Speaker 1: a letter, number, or symbol in ASCII. So you 424 00:25:38,480 --> 00:25:42,880 Speaker 1: can imagine that programming in machine code would be really 425 00:25:42,920 --> 00:25:46,679 Speaker 1: tough for most humans because it would be so easy 426 00:25:46,800 --> 00:25:50,920 Speaker 1: to mistype a zero or a one, or to skip one. 427 00:25:51,280 --> 00:25:53,800 Speaker 1: And if you're, you know, typing out a really long sequence, 428 00:25:53,800 --> 00:25:56,680 Speaker 1: it's easy for you to overlook a zero or a one, 429 00:25:56,720 --> 00:26:01,440 Speaker 1: And that's why people developed programming languages. 
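To make the seven-bit point concrete, here is a small Python sketch that prints the bit pattern for a few ASCII characters. The helper name `ascii_bits` is just an illustration, and the example is in Python rather than the Perl that Gray used.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary pattern for a single ASCII character."""
    code = ord(ch)  # numeric code for the character
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    return format(code, "07b")  # zero-padded to exactly seven bits

for ch in "Hi!":
    print(repr(ch), ascii_bits(ch))
```

Three characters already take twenty-one zeros and ones, which gives a feel for why nobody wants to type a whole program that way.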
A programming language 430 00:26:01,480 --> 00:26:05,320 Speaker 1: creates a layer of abstraction between the programmer and the 431 00:26:05,400 --> 00:26:08,880 Speaker 1: machine or system of machines. It acts as a sort 432 00:26:08,920 --> 00:26:13,040 Speaker 1: of interpreter. It takes the intentions of the programmer and 433 00:26:13,200 --> 00:26:17,280 Speaker 1: turns them into processes a computer can respond to. Some 434 00:26:17,320 --> 00:26:20,960 Speaker 1: programming languages are closer to machine code. Those are low- 435 00:26:21,040 --> 00:26:24,439 Speaker 1: level programming languages. They're typically really challenging to work with, 436 00:26:24,960 --> 00:26:28,520 Speaker 1: and others are more abstract and thus easier for us 437 00:26:28,560 --> 00:26:32,200 Speaker 1: to work with, and these are called high-level programming languages. 438 00:26:32,240 --> 00:26:36,760 Speaker 1: Perl falls into that high-level category. Now, originally, Gray's 439 00:26:36,840 --> 00:26:40,240 Speaker 1: program would seek out links on web pages and then 440 00:26:40,320 --> 00:26:44,040 Speaker 1: note the web servers that were hosting those pages before 441 00:26:44,080 --> 00:26:47,639 Speaker 1: following the link over, and then it would repeat that process, 442 00:26:47,880 --> 00:26:50,760 Speaker 1: and the program was really just automating the process that 443 00:26:50,840 --> 00:26:53,080 Speaker 1: we would do manually if we were to look at 444 00:26:53,080 --> 00:26:56,080 Speaker 1: a web page, see a link, and click on it. 
445 00:26:56,480 --> 00:26:59,640 Speaker 1: The program sought the links embedded in pages and then 446 00:26:59,680 --> 00:27:02,560 Speaker 1: activated those links to explore whichever documents were pulled 447 00:27:02,600 --> 00:27:05,520 Speaker 1: up as a result, and then would repeat that process 448 00:27:05,560 --> 00:27:08,000 Speaker 1: while building out this index, kind of leaving a trail 449 00:27:08,080 --> 00:27:11,440 Speaker 1: of where it had been. The program built what Gray 450 00:27:11,520 --> 00:27:15,879 Speaker 1: called the Wandex, W-A-N-D-E-X. 451 00:27:15,880 --> 00:27:19,159 Speaker 1: It was an index of web servers that were joining 452 00:27:19,160 --> 00:27:22,760 Speaker 1: the Internet. Not long after launching the Wanderer, Gray 453 00:27:22,800 --> 00:27:26,320 Speaker 1: built in the additional capability of capturing the URLs 454 00:27:26,480 --> 00:27:28,720 Speaker 1: that it was going through in addition to just the 455 00:27:28,760 --> 00:27:31,480 Speaker 1: web servers, so you can think like originally he was 456 00:27:31,520 --> 00:27:33,680 Speaker 1: just like, I wonder how many web servers are connected 457 00:27:33,720 --> 00:27:36,879 Speaker 1: to the Internet, how many are connected to it today, 458 00:27:37,040 --> 00:27:39,719 Speaker 1: how many will be connected tomorrow? That kind of thing, 459 00:27:39,760 --> 00:27:41,480 Speaker 1: and just sort of keeping track of how the web 460 00:27:41,560 --> 00:27:44,240 Speaker 1: was growing. Then he thought, I want to know actually 461 00:27:44,280 --> 00:27:46,240 Speaker 1: the URLs that exist too, so he's keeping 462 00:27:46,280 --> 00:27:50,439 Speaker 1: track of both. This didn't go totally smoothly, however. The 463 00:27:50,480 --> 00:27:54,680 Speaker 1: Wanderer was an energetic little spider. It would move through 464 00:27:54,800 --> 00:27:58,280 Speaker 1: links throughout the day. 
It would index the same pages 465 00:27:58,520 --> 00:28:02,600 Speaker 1: hundreds of times in the process, and this started to 466 00:28:02,680 --> 00:28:06,159 Speaker 1: cause network lag across the Internet, and it meant that 467 00:28:06,200 --> 00:28:08,480 Speaker 1: people who were just trying to navigate to those pages 468 00:28:08,760 --> 00:28:12,040 Speaker 1: were experiencing long delays as a result, and that made people, 469 00:28:12,960 --> 00:28:16,280 Speaker 1: let's say, a little miffed at Mr. Gray, and he 470 00:28:16,320 --> 00:28:19,240 Speaker 1: was able to modify the spider's operation so it wouldn't 471 00:28:19,280 --> 00:28:23,720 Speaker 1: cause so much of a disruption, but that early enthusiastic 472 00:28:23,880 --> 00:28:26,720 Speaker 1: mistake kind of created a tough environment for other people 473 00:28:26,760 --> 00:28:29,639 Speaker 1: who wanted to create similar tools that would allow for 474 00:28:29,800 --> 00:28:33,600 Speaker 1: fully fledged searching on the Internet, and the concept of 475 00:28:33,640 --> 00:28:36,199 Speaker 1: spiders had kind of a big old X next to 476 00:28:36,280 --> 00:28:38,840 Speaker 1: it in the minds of many people. It became synonymous 477 00:28:38,880 --> 00:28:42,320 Speaker 1: with this idea of lag and just bad network performance. 478 00:28:42,920 --> 00:28:46,840 Speaker 1: The Wanderer didn't index sites for content either. It was 479 00:28:47,120 --> 00:28:49,760 Speaker 1: more about tracking the growth of the Internet as a whole. 480 00:28:49,800 --> 00:28:52,400 Speaker 1: It wasn't so much concerned with what was on pages, 481 00:28:53,000 --> 00:28:55,560 Speaker 1: so it didn't create a means to search for specific 482 00:28:55,560 --> 00:28:59,240 Speaker 1: web pages or subject matter. 
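The follow-a-link, note-the-server, repeat loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gray's actual program, which was written in Perl: the toy web, the example.com-style addresses, and the `crawl` function are all hypothetical, and nothing here touches a real network. The `seen_urls` set is one simple way to avoid hitting the same page over and over.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical toy web: each URL maps to the links found on that page.
TOY_WEB = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/paper"],
    "http://c.example/paper": [],
}

def crawl(start_url, fetch_links):
    """Follow links breadth-first, recording each URL and each host once."""
    seen_urls = {start_url}
    hosts = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        hosts.add(urlparse(url).netloc)  # note the server hosting this page
        for link in fetch_links(url):
            if link not in seen_urls:    # skip pages already queued or visited
                seen_urls.add(link)
                queue.append(link)
    return seen_urls, hosts

# fetch_links stands in for "download the page and pull out its links".
urls, hosts = crawl("http://a.example/", lambda u: TOY_WEB.get(u, []))
print(sorted(hosts))
print(sorted(urls))
```

A crawler running against real servers would also want to pause between requests to the same host, which is the other half of not being a nuisance.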
Meanwhile, over at the University 483 00:28:59,240 --> 00:29:03,080 Speaker 1: of Geneva, a developer named Oscar Nierstrasz developed a 484 00:29:03,120 --> 00:29:07,560 Speaker 1: tool that could search lists of websites and return results 485 00:29:07,600 --> 00:29:11,160 Speaker 1: based on a query. This tool didn't actually survey the 486 00:29:11,160 --> 00:29:14,560 Speaker 1: web as a whole. Instead, it would access lists other 487 00:29:14,640 --> 00:29:18,640 Speaker 1: organizations had made, such as the Virtual Library, so it 488 00:29:18,680 --> 00:29:22,680 Speaker 1: would reformat those lists as entries into a database and 489 00:29:22,720 --> 00:29:25,440 Speaker 1: that's what could be searched. It was called the 490 00:29:25,760 --> 00:29:30,240 Speaker 1: W3Catalog. It still wasn't quite a search engine as 491 00:29:30,280 --> 00:29:34,560 Speaker 1: we think of them today. Martijn Koster developed another tool 492 00:29:34,880 --> 00:29:40,040 Speaker 1: in late which he called the Archie-Like Indexing for 493 00:29:40,200 --> 00:29:44,440 Speaker 1: the Web, or ALIWEB. As the name suggests, he 494 00:29:44,520 --> 00:29:48,680 Speaker 1: was taking inspiration from the Gopher search tool of Archie. 495 00:29:48,720 --> 00:29:52,400 Speaker 1: ALIWEB needed web administrators to provide the location for 496 00:29:52,440 --> 00:29:55,800 Speaker 1: their site index files so that they could be included 497 00:29:55,960 --> 00:29:59,680 Speaker 1: in the ALIWEB search database. Users could also create 498 00:29:59,760 --> 00:30:03,040 Speaker 1: descriptions for the websites and add in keywords to 499 00:30:03,080 --> 00:30:06,080 Speaker 1: help with search. And this brings us into the world 500 00:30:06,240 --> 00:30:12,400 Speaker 1: of metadata. Metadata is information about information. 
So you've got 501 00:30:12,440 --> 00:30:15,360 Speaker 1: the core information of a web page, what we might consider, 502 00:30:15,440 --> 00:30:17,800 Speaker 1: you know, the content of the web page, the stuff 503 00:30:17,840 --> 00:30:19,960 Speaker 1: that you and I would actually read if we went there. 504 00:30:20,280 --> 00:30:23,160 Speaker 1: But then you've got the metadata and that describes the 505 00:30:23,200 --> 00:30:26,520 Speaker 1: information in some meaningful way. Now, if you were in 506 00:30:26,560 --> 00:30:29,920 Speaker 1: a physical library, the metadata would be the stuff that 507 00:30:30,040 --> 00:30:33,680 Speaker 1: you used to help locate where in that library a 508 00:30:33,760 --> 00:30:37,160 Speaker 1: specific book should be, and that could include stuff such 509 00:30:37,200 --> 00:30:40,960 Speaker 1: as the author name, the publication date, the subject matter, 510 00:30:41,120 --> 00:30:43,719 Speaker 1: that kind of thing. And it was a pretty good 511 00:30:43,760 --> 00:30:47,560 Speaker 1: idea, ALIWEB, but not many people knew about it, 512 00:30:47,680 --> 00:30:50,200 Speaker 1: and those who did know about it, not a lot 513 00:30:50,240 --> 00:30:52,719 Speaker 1: of them went through the trouble of submitting the information 514 00:30:52,840 --> 00:30:56,760 Speaker 1: to ALIWEB, so it didn't really see much widespread use. 515 00:30:57,520 --> 00:31:02,480 Speaker 1: Also in and then extending into was the development of 516 00:31:02,520 --> 00:31:04,840 Speaker 1: another search tool, and this was the brainchild of 517 00:31:04,880 --> 00:31:08,280 Speaker 1: a guy named Jonathan Fletcher. He was a grad student 518 00:31:08,400 --> 00:31:12,440 Speaker 1: at the University of Stirling in Scotland. Fletcher's approach combined 519 00:31:12,480 --> 00:31:15,920 Speaker 1: the strategies of his predecessors. He built a web crawler 520 00:31:16,040 --> 00:31:19,680 Speaker 1: to find and index web pages. 
He designed the database 521 00:31:19,720 --> 00:31:24,120 Speaker 1: to be searchable and he called it JumpStation. Unfortunately, 522 00:31:24,480 --> 00:31:28,040 Speaker 1: his efforts were limited by the budget that he got 523 00:31:28,080 --> 00:31:31,360 Speaker 1: from his university. He didn't have the resources to really 524 00:31:31,400 --> 00:31:33,640 Speaker 1: build out a tool that could index all of a 525 00:31:33,720 --> 00:31:38,200 Speaker 1: website's contents, so instead he designed JumpStation to parse 526 00:31:38,360 --> 00:31:42,040 Speaker 1: web page titles and headers, and that would still help 527 00:31:42,080 --> 00:31:46,160 Speaker 1: people find pages that, at least according to the title 528 00:31:46,240 --> 00:31:49,680 Speaker 1: and header, were focused on whatever the area of interest was. 529 00:31:49,760 --> 00:31:52,000 Speaker 1: But it would also mean that other pages that might 530 00:31:52,040 --> 00:31:55,720 Speaker 1: have critically relevant information about the subject could be overlooked 531 00:31:55,720 --> 00:31:58,840 Speaker 1: because those terms just weren't in the title or header. 532 00:31:59,520 --> 00:32:02,240 Speaker 1: We're getting closer to the search engines that would most 533 00:32:02,280 --> 00:32:05,600 Speaker 1: resemble what we think of today, which, let's be honest, 534 00:32:05,720 --> 00:32:10,160 Speaker 1: is primarily Google, and we will learn more about those 535 00:32:10,240 --> 00:32:19,920 Speaker 1: after we take another quick break. I think you could 536 00:32:20,040 --> 00:32:24,360 Speaker 1: argue pretty convincingly that JumpStation was the first true 537 00:32:24,520 --> 00:32:27,720 Speaker 1: web search engine as we have come to understand them, 538 00:32:27,720 --> 00:32:30,680 Speaker 1: though it was limited since it couldn't crawl through and 539 00:32:30,720 --> 00:32:34,200 Speaker 1: index all the contents of a page. 
The next name 540 00:32:34,240 --> 00:32:36,840 Speaker 1: on our journey is one a lot of people will recognize, 541 00:32:37,120 --> 00:32:41,080 Speaker 1: and that is Yahoo. But Yahoo didn't start off as 542 00:32:41,120 --> 00:32:45,960 Speaker 1: a search engine. Rather, Yahoo was originally a web directory. 543 00:32:46,480 --> 00:32:50,360 Speaker 1: It started in as just a list of websites that 544 00:32:50,440 --> 00:32:53,160 Speaker 1: the founders of the site, that would be Jerry Yang 545 00:32:53,280 --> 00:32:57,240 Speaker 1: and David Filo, thought were interesting. They're like, this is 546 00:32:57,240 --> 00:32:59,240 Speaker 1: a cool website. I want more people to know about it, 547 00:32:59,240 --> 00:33:01,360 Speaker 1: so I'm gonna include it on my web page about 548 00:33:01,440 --> 00:33:05,960 Speaker 1: cool websites. So Yahoo started off as another curated list 549 00:33:06,240 --> 00:33:10,440 Speaker 1: of web pages. The search tool aspect of Yahoo would 550 00:33:10,480 --> 00:33:14,680 Speaker 1: follow in The search tool worked on the sites that 551 00:33:14,720 --> 00:33:19,320 Speaker 1: were curated in the human-curated Yahoo directory, but if 552 00:33:19,320 --> 00:33:22,040 Speaker 1: a site wasn't in that directory, it wouldn't show up 553 00:33:22,040 --> 00:33:24,080 Speaker 1: in search results. So someone would have had to have 554 00:33:24,160 --> 00:33:28,440 Speaker 1: found the website already and then included it within Yahoo's 555 00:33:28,520 --> 00:33:33,200 Speaker 1: growing directory for it to register as a result. Following 556 00:33:33,280 --> 00:33:35,880 Speaker 1: Yahoo were a couple of other notable names. There was 557 00:33:36,000 --> 00:33:40,120 Speaker 1: Infoseek and WebCrawler. WebCrawler was the first search 558 00:33:40,160 --> 00:33:43,600 Speaker 1: engine I remember using. 
In fact, I stuck with 559 00:33:43,600 --> 00:33:47,920 Speaker 1: WebCrawler for a long time, even after the infamous Google 560 00:33:48,040 --> 00:33:52,320 Speaker 1: emerged and started making waves. WebCrawler did something that 561 00:33:52,360 --> 00:33:55,760 Speaker 1: other search engines had not yet done. Its index was 562 00:33:55,800 --> 00:33:59,520 Speaker 1: looking at the full content of a web page, including 563 00:33:59,520 --> 00:34:02,960 Speaker 1: the metadata on that page. So let's talk about 564 00:34:03,000 --> 00:34:06,440 Speaker 1: that for a second. Web spiders, when you get down 565 00:34:06,440 --> 00:34:09,399 Speaker 1: to it, are just bots that follow links, but some 566 00:34:09,440 --> 00:34:12,359 Speaker 1: web spiders can also make a full index of the 567 00:34:12,440 --> 00:34:17,080 Speaker 1: content found at each link's destination, essentially scanning all the 568 00:34:17,160 --> 00:34:20,000 Speaker 1: text that's within a web page and indexing it so 569 00:34:20,040 --> 00:34:25,440 Speaker 1: that that content is searchable, and that searchable index forms 570 00:34:25,600 --> 00:34:28,040 Speaker 1: as a result of all this, and it can bring 571 00:34:28,080 --> 00:34:31,600 Speaker 1: back any results of any pages that contain a specific word. 572 00:34:32,360 --> 00:34:34,520 Speaker 1: Let's use an example. It makes it easier. So let's 573 00:34:34,520 --> 00:34:37,000 Speaker 1: say you're in a literature class and you're having a 574 00:34:37,040 --> 00:34:41,720 Speaker 1: real hard time understanding Milton's Paradise Lost. So you're looking 575 00:34:41,719 --> 00:34:43,960 Speaker 1: for some resources to help you get a better handle 576 00:34:44,360 --> 00:34:47,280 Speaker 1: on things. You go to a search engine on the Internet. 577 00:34:47,360 --> 00:34:49,719 Speaker 1: It doesn't really matter which one, and you type in 578 00:34:50,239 --> 00:34:56,120 Speaker 1: Paradise Lost Milton analysis. 
You're trying to really cut down 579 00:34:56,200 --> 00:35:00,279 Speaker 1: on anything that might just mention paradise or loss or 580 00:35:00,320 --> 00:35:03,520 Speaker 1: anything like that. You really want to focus on this. 581 00:35:03,520 --> 00:35:05,759 Speaker 1: This part of the search engine is the UI or 582 00:35:05,800 --> 00:35:08,000 Speaker 1: the user interface, right? This is the part that we 583 00:35:08,160 --> 00:35:11,640 Speaker 1: as humans interact with in order to tell the engine 584 00:35:11,680 --> 00:35:14,520 Speaker 1: what it is we're looking for. The search engine then 585 00:35:14,600 --> 00:35:18,640 Speaker 1: goes and consults its index of the web. So no 586 00:35:18,680 --> 00:35:22,839 Speaker 1: matter which search engine you're using, it's not a representation 587 00:35:22,880 --> 00:35:26,239 Speaker 1: of every single web page that exists. It's every web 588 00:35:26,280 --> 00:35:31,280 Speaker 1: page that exists within that engine's index. So each search 589 00:35:31,320 --> 00:35:34,280 Speaker 1: engine has its own index, or in some cases search 590 00:35:34,280 --> 00:35:37,800 Speaker 1: engines are powered by other engines. It may be sharing 591 00:35:37,840 --> 00:35:41,080 Speaker 1: an index with another search engine, but it looks for 592 00:35:41,320 --> 00:35:44,960 Speaker 1: documents in that index that contain the words that you 593 00:35:45,040 --> 00:35:48,400 Speaker 1: have submitted in the UI. Then it has to return 594 00:35:48,440 --> 00:35:51,880 Speaker 1: those results to you, which also means that the search 595 00:35:51,920 --> 00:35:55,359 Speaker 1: engine has to determine which of those search results are 596 00:35:55,480 --> 00:35:58,440 Speaker 1: likely to be the most relevant to your query. 
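The lookup step described above, finding the documents in the index that contain the submitted words, is commonly done with what's called an inverted index: a map from each word to the pages that contain it. Here is a minimal Python sketch under stated assumptions. The page addresses and text are invented, the function names are illustrative, and a real engine's index is far more elaborate.

```python
from collections import defaultdict

# Hypothetical page texts keyed by URL.
PAGES = {
    "http://uni.example/milton": "An analysis of Paradise Lost by John Milton",
    "http://travel.example/beach": "Paradise found: a beach guide",
    "http://lit.example/epics": "Milton and the epic tradition, with analysis",
}

def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word.strip(".,:;!?")].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())  # keep only pages that have this term too
    return results

index = build_index(PAGES)
print(search(index, "Paradise Lost Milton analysis"))
print(search(index, "paradise"))
```

Searching for "paradise" alone brings back the beach guide too, which is exactly why the extra query words help, and why ranking the matches is a separate problem.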
This 597 00:35:58,520 --> 00:36:01,000 Speaker 1: is actually harder to do than it sounds. If 598 00:36:01,040 --> 00:36:04,240 Speaker 1: a search engine is only looking for documents that happen 599 00:36:04,239 --> 00:36:07,399 Speaker 1: to contain the words that you've submitted, you could get 600 00:36:07,400 --> 00:36:10,360 Speaker 1: back pages that have little to no relevance to what 601 00:36:10,600 --> 00:36:15,239 Speaker 1: you actually wanted. Plus, some web page administrators, especially back 602 00:36:15,280 --> 00:36:18,920 Speaker 1: in the early days, were really trying to game the system. 603 00:36:18,960 --> 00:36:21,520 Speaker 1: They might use tricks in order to get more people 604 00:36:21,560 --> 00:36:23,799 Speaker 1: to come to that web page. And it might be 605 00:36:23,800 --> 00:36:27,520 Speaker 1: because their web pages had banner ads on them and 606 00:36:27,600 --> 00:36:31,600 Speaker 1: so more people visiting the page meant more money, or 607 00:36:31,880 --> 00:36:34,480 Speaker 1: maybe they just wanted bragging rights. Because some of you 608 00:36:34,480 --> 00:36:36,880 Speaker 1: guys might remember this. It used to be back in 609 00:36:36,880 --> 00:36:39,040 Speaker 1: the day that one of the standard features you would 610 00:36:39,040 --> 00:36:42,160 Speaker 1: see on web pages was the ever-present web counter 611 00:36:42,760 --> 00:36:44,920 Speaker 1: that would tell you how many people had visited that 612 00:36:44,960 --> 00:36:48,960 Speaker 1: website since it had been created. And a few folks 613 00:36:49,320 --> 00:36:52,480 Speaker 1: were hoping to just spread malware by tricking people into 614 00:36:52,800 --> 00:36:57,160 Speaker 1: visiting a website and downloading some malicious program. And then 615 00:36:57,200 --> 00:36:59,520 Speaker 1: there were also link farms. These were sites that were 616 00:36:59,560 --> 00:37:03,919 Speaker 1: just one long list of links to other sites. 
More 617 00:37:03,960 --> 00:37:07,480 Speaker 1: on why that's important in just a second. One trick 618 00:37:08,120 --> 00:37:11,760 Speaker 1: was to include just a ton of different popular search 619 00:37:11,920 --> 00:37:15,240 Speaker 1: terms on a page, even if the page had nothing 620 00:37:15,280 --> 00:37:17,239 Speaker 1: to do with any of those search terms, and you 621 00:37:17,239 --> 00:37:19,920 Speaker 1: could even hide that. You can make the text and 622 00:37:20,000 --> 00:37:23,680 Speaker 1: background the same color, so a human visiting the website 623 00:37:23,680 --> 00:37:26,560 Speaker 1: and looking at it through a standard browser wouldn't see 624 00:37:26,600 --> 00:37:29,360 Speaker 1: anything because the background color and the text are the 625 00:37:29,440 --> 00:37:32,319 Speaker 1: same color. They see whatever the content of the web 626 00:37:32,360 --> 00:37:35,960 Speaker 1: page was, but they wouldn't see all these hidden keywords. 627 00:37:36,000 --> 00:37:38,440 Speaker 1: But a computer would totally see it. It would ignore 628 00:37:38,560 --> 00:37:41,520 Speaker 1: the fact that the font and the background color are 629 00:37:41,560 --> 00:37:43,960 Speaker 1: the same and it would just pick up on the text. 630 00:37:44,719 --> 00:37:48,719 Speaker 1: So you would end up having these false returns on 631 00:37:48,800 --> 00:37:52,160 Speaker 1: search results because those keywords were there in the page, 632 00:37:52,320 --> 00:37:55,080 Speaker 1: they just weren't relevant to whatever the content was. Other 633 00:37:55,120 --> 00:37:59,600 Speaker 1: administrators would put keyword dumps into web page metadata, 634 00:37:59,760 --> 00:38:01,839 Speaker 1: so it wouldn't show up on the page itself at all. 635 00:38:01,880 --> 00:38:04,600 Speaker 1: It would all be in the background. 
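To see why hidden keywords worked, here is a small Python sketch of a naive indexer that pulls every word out of a page's markup and never looks at colors. The page is a made-up example and the `TextExtractor` class is only an illustration; it uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect every word of text in a page, ignoring how it is styled."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # Called for each run of text between tags; styling never reaches here.
        self.words.extend(data.split())

# Hypothetical page: white background, with one paragraph in white text.
PAGE = """
<body style="background:#fff">
  <p>Crab trap maintenance tips</p>
  <p style="color:#fff">free music movies games</p>
</body>
"""

parser = TextExtractor()
parser.feed(PAGE)
print(parser.words)
```

A person looking at that page in a browser sees one line about crab traps. The extractor sees eight words, four of which have nothing to do with the page.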
Following a search 636 00:38:04,640 --> 00:38:08,000 Speaker 1: result like that would be really frustrating because you wouldn't 637 00:38:08,000 --> 00:38:10,080 Speaker 1: actually get whatever it was you were looking for, you 638 00:38:10,080 --> 00:38:12,880 Speaker 1: would get something else. It was a bait and switch. 639 00:38:13,280 --> 00:38:17,040 Speaker 1: So building search engines meant not only did the developers 640 00:38:17,080 --> 00:38:19,680 Speaker 1: need to figure out how to build indices that 641 00:38:20,200 --> 00:38:22,880 Speaker 1: could grow as the Web was growing, they also had 642 00:38:22,920 --> 00:38:25,680 Speaker 1: to figure out how to defeat strategies that were intended 643 00:38:25,719 --> 00:38:28,239 Speaker 1: to game the system. How can you make sure the 644 00:38:28,239 --> 00:38:31,680 Speaker 1: people who are using your search engine are actually getting 645 00:38:31,680 --> 00:38:34,160 Speaker 1: the stuff that they want, because if they're not getting 646 00:38:34,160 --> 00:38:36,960 Speaker 1: the stuff they want, they're gonna bounce. They're never going 647 00:38:37,040 --> 00:38:40,440 Speaker 1: to use your search engine again. I'm gonna use Google 648 00:38:40,880 --> 00:38:44,560 Speaker 1: as the example for this, because I mean, let's be honest, 649 00:38:44,640 --> 00:38:47,759 Speaker 1: Google is dominant in that space. It's almost like it's 650 00:38:47,760 --> 00:38:50,239 Speaker 1: the only game in town. But just know that all 651 00:38:50,280 --> 00:38:54,239 Speaker 1: search engines, in general, were all trying variations on this 652 00:38:54,320 --> 00:38:58,719 Speaker 1: kind of general philosophy. 
Google's approach used a tool that 653 00:38:58,800 --> 00:39:01,880 Speaker 1: they called PageRank, which, as the name suggests, would 654 00:39:02,160 --> 00:39:05,600 Speaker 1: take the documents that came back from any given search, 655 00:39:06,160 --> 00:39:10,840 Speaker 1: then rank those search results before presenting them to the user. 656 00:39:11,920 --> 00:39:14,319 Speaker 1: So if you went to Google and you typed in 657 00:39:14,440 --> 00:39:19,400 Speaker 1: Paradise Lost Milton analysis, Google would consult its own index 658 00:39:19,480 --> 00:39:21,560 Speaker 1: of the web, and it would look for stuff like, 659 00:39:22,120 --> 00:39:25,120 Speaker 1: are the search terms showing up in the page? Are there 660 00:39:25,640 --> 00:39:28,640 Speaker 1: examples of words that are close together? Because that might 661 00:39:28,680 --> 00:39:32,160 Speaker 1: indicate that this result is more relevant. Right, if these 662 00:39:32,200 --> 00:39:35,200 Speaker 1: words are all kind of next to each other, it's 663 00:39:35,239 --> 00:39:38,200 Speaker 1: more likely to be what the person was looking for, 664 00:39:38,280 --> 00:39:41,040 Speaker 1: as opposed to, yeah, all four of those words are 665 00:39:41,040 --> 00:39:43,200 Speaker 1: showing up on this page, but they're so far apart 666 00:39:43,800 --> 00:39:47,760 Speaker 1: that maybe this isn't even related to what the person 667 00:39:47,880 --> 00:39:51,319 Speaker 1: was looking for. That was part of PageRank. The 668 00:39:51,360 --> 00:39:54,279 Speaker 1: tool also would look at things like the title of 669 00:39:54,320 --> 00:39:56,520 Speaker 1: the page and maybe even the header, but it mostly 670 00:39:56,560 --> 00:40:00,719 Speaker 1: ignored the metadata because, you know, search engine designers 671 00:40:00,719 --> 00:40:03,239 Speaker 1: were picking up on the tricks people were using in 672 00:40:03,320 --> 00:40:06,560 Speaker 1: order to get more clicks. 
At the same time, the 673 00:40:06,560 --> 00:40:09,719 Speaker 1: search algorithm would assign ranks to pages based on a 674 00:40:09,760 --> 00:40:14,160 Speaker 1: few other points of criteria. The algorithm attempted to figure 675 00:40:14,160 --> 00:40:17,799 Speaker 1: out how reputable every page was, and it did so 676 00:40:17,880 --> 00:40:20,399 Speaker 1: in a couple of different ways. One was to look 677 00:40:20,440 --> 00:40:24,279 Speaker 1: at which other sites were linking to that page. If 678 00:40:24,320 --> 00:40:26,960 Speaker 1: the other sites that were linking to it were considered 679 00:40:27,040 --> 00:40:31,840 Speaker 1: generally reputable, that would improve the result's PageRank score. 680 00:40:32,200 --> 00:40:36,240 Speaker 1: So in our case, and this is a totally unrealistic example, 681 00:40:36,280 --> 00:40:41,120 Speaker 1: but let's say we've searched that Paradise Lost Milton analysis 682 00:40:41,760 --> 00:40:45,520 Speaker 1: and all we got back are three results, but Google 683 00:40:45,560 --> 00:40:48,040 Speaker 1: has to rank those results as one, two, and three. 684 00:40:48,400 --> 00:40:51,000 Speaker 1: One of those results is from a website dedicated to 685 00:40:51,040 --> 00:40:55,759 Speaker 1: Paradise Lost, the literary work, and has literary analysis on it, 686 00:40:55,840 --> 00:40:58,160 Speaker 1: and it sits on a server that belongs to a 687 00:40:58,200 --> 00:41:02,320 Speaker 1: prestigious university. Let's say that the second result is coming 688 00:41:02,440 --> 00:41:06,800 Speaker 1: from a literary discussion site. It doesn't belong to a university, 689 00:41:06,840 --> 00:41:11,000 Speaker 1: but it does have critical analysis and an entry specifically 690 00:41:11,040 --> 00:41:14,360 Speaker 1: on Paradise Lost.
And let's say that the third result 691 00:41:14,680 --> 00:41:18,240 Speaker 1: is Billy Bob's Homespun Guide to Milton and Crab Trap 692 00:41:18,400 --> 00:41:22,359 Speaker 1: Maintenance or something. Now, the algorithm is not smart enough 693 00:41:22,400 --> 00:41:25,880 Speaker 1: to actually read each of these sites as a human 694 00:41:25,920 --> 00:41:29,520 Speaker 1: would and judge them and analyze them and weigh the 695 00:41:29,640 --> 00:41:33,120 Speaker 1: value of each one, but it can see that the 696 00:41:33,239 --> 00:41:36,640 Speaker 1: university server is, you know, it belongs to a university. 697 00:41:36,640 --> 00:41:40,840 Speaker 1: It's generally treated as the property of a recognized authority, 698 00:41:41,080 --> 00:41:45,000 Speaker 1: and it sees that other reputable sites are also 699 00:41:45,120 --> 00:41:48,360 Speaker 1: linking to that university's web pages, and to that Milton 700 00:41:48,440 --> 00:41:52,160 Speaker 1: page in particular. So it assigns that result a very 701 00:41:52,239 --> 00:41:56,000 Speaker 1: high PageRank, saying it's probably pretty darn good. That 702 00:41:56,080 --> 00:41:58,600 Speaker 1: also means it's going to appear higher on the list 703 00:41:58,640 --> 00:42:02,440 Speaker 1: of search results. Meanwhile, Billy Bob's is likely to appear 704 00:42:02,480 --> 00:42:04,840 Speaker 1: at the bottom of that list because very few people 705 00:42:04,960 --> 00:42:07,480 Speaker 1: are linking to it. It might be hosted on just 706 00:42:07,560 --> 00:42:11,239 Speaker 1: some server somewhere that happens to host a whole hodgepodge 707 00:42:11,239 --> 00:42:15,319 Speaker 1: of different web pages, and the page that's on that 708 00:42:15,440 --> 00:42:19,359 Speaker 1: site that has just sort of literary analysis discussions on it, 709 00:42:19,440 --> 00:42:22,560 Speaker 1: that one appears in the middle.
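The link-based part of that three-result example can be sketched with the simplified, published textbook form of PageRank: each page repeatedly passes a share of its score to the pages it links to. This is a rough sketch under stated assumptions, not Google's production system, which combines many more signals. The site names, the link graph, the damping value of 0.85, and the iteration count are all invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated score passing.

    links maps each page name to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph: several sites link to the university page,
# two link to the literary forum, and nobody links to Billy Bob.
example_links = {
    "encyclopedia": ["university", "lit_forum"],
    "library": ["university"],
    "news_site": ["university", "lit_forum"],
    "university": ["library"],
    "lit_forum": ["university"],
    "billy_bob": ["university"],
}

scores = pagerank(example_links)
results = ["billy_bob", "lit_forum", "university"]
print(sorted(results, key=scores.get, reverse=True))
# ['university', 'lit_forum', 'billy_bob']
```

With this made-up graph the university page comes out first, the literary forum second, and Billy Bob's page last, purely because of who links to whom. Nothing in the code reads the pages themselves.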
Now, could Billy Bob's 710 00:42:22,560 --> 00:42:27,200 Speaker 1: page actually be the best resource? Yes, it could be, 711 00:42:27,560 --> 00:42:31,320 Speaker 1: but without a human or maybe a really incredibly advanced 712 00:42:31,400 --> 00:42:34,520 Speaker 1: AI to review the contents of that page and to 713 00:42:34,760 --> 00:42:39,560 Speaker 1: really understand them, the ranking approach seemed like the best 714 00:42:39,600 --> 00:42:43,240 Speaker 1: way to quickly organize results to give the best chance 715 00:42:43,320 --> 00:42:47,000 Speaker 1: that the returns were going to be relevant to the user. Now, 716 00:42:47,080 --> 00:42:51,440 Speaker 1: in that example I just gave, I mentioned three results. However, 717 00:42:51,480 --> 00:42:54,040 Speaker 1: if you were to really perform that search, because I 718 00:42:54,080 --> 00:42:57,480 Speaker 1: did it before I recorded this episode, you would get 719 00:42:57,600 --> 00:43:00,520 Speaker 1: millions of results. In fact, just for a laugh, I 720 00:43:00,560 --> 00:43:04,600 Speaker 1: went to Google, typed in Paradise Lost Milton analysis, and 721 00:43:04,640 --> 00:43:10,160 Speaker 1: I got quote about three point eight million results end quote, 722 00:43:10,680 --> 00:43:15,000 Speaker 1: and that happened in less than one second. PageRank becomes 723 00:43:15,120 --> 00:43:18,880 Speaker 1: really important when you get to that level of response, 724 00:43:18,920 --> 00:43:23,040 Speaker 1: when you get to that many results. If you're talking 725 00:43:23,080 --> 00:43:26,279 Speaker 1: about that enormous amount of information, you really want the 726 00:43:26,360 --> 00:43:29,360 Speaker 1: most relevant choices to be near the top to save 727 00:43:29,400 --> 00:43:33,120 Speaker 1: yourself time. And that has created some pretty bad habits 728 00:43:33,160 --> 00:43:35,799 Speaker 1: for us as users.
By the way, we've become so 729 00:43:35,920 --> 00:43:39,759 Speaker 1: used to search engines returning the most relevant results right 730 00:43:39,800 --> 00:43:42,399 Speaker 1: at the top that we don't necessarily bother to look 731 00:43:42,440 --> 00:43:45,279 Speaker 1: beyond the first few sites. There are a lot of 732 00:43:45,320 --> 00:43:49,480 Speaker 1: resources out there that have estimates on how many people 733 00:43:49,560 --> 00:43:53,320 Speaker 1: actually bother to ever go past the first page of results, 734 00:43:54,160 --> 00:43:57,720 Speaker 1: and some of them even say that as much as ninety-five percent 735 00:43:57,760 --> 00:44:01,040 Speaker 1: of all web traffic will just go to results that 736 00:44:01,080 --> 00:44:04,759 Speaker 1: appear on the first page for any given search, and 737 00:44:04,840 --> 00:44:08,400 Speaker 1: that means that all the other results that appear after 738 00:44:08,520 --> 00:44:12,560 Speaker 1: page one are sharing just five percent of the web traffic. 739 00:44:13,400 --> 00:44:16,760 Speaker 1: So when I did that Paradise Lost search, that first 740 00:44:16,760 --> 00:44:20,920 Speaker 1: page of results had nine websites linked on it, plus 741 00:44:20,920 --> 00:44:25,759 Speaker 1: a few videos. That means somewhere around three million, seven 742 00:44:26,320 --> 00:44:31,360 Speaker 1: nine web pages are sharing just the five percent of leftover 743 00:44:31,440 --> 00:44:36,319 Speaker 1: traffic that doesn't go to that first page. So they might 744 00:44:36,360 --> 00:44:40,920 Speaker 1: include incredible resources that are even more relevant than the 745 00:44:40,920 --> 00:44:44,359 Speaker 1: stuff that appears on page one, but very few people 746 00:44:44,360 --> 00:44:46,959 Speaker 1: are going to that. That's one bad habit that we've 747 00:44:47,000 --> 00:44:50,800 Speaker 1: all developed through using these search engines.
On the flip side, 748 00:44:51,000 --> 00:44:55,520 Speaker 1: the message is that it's really important to get your 749 00:44:55,520 --> 00:44:58,279 Speaker 1: page to show up on that first screen of results. 750 00:44:58,880 --> 00:45:01,879 Speaker 1: If you're building a web page about a specific thing, 751 00:45:02,600 --> 00:45:04,880 Speaker 1: you want to be on that first page because otherwise 752 00:45:04,880 --> 00:45:07,640 Speaker 1: you're gonna have to hope people find your website 753 00:45:07,680 --> 00:45:11,319 Speaker 1: through some other means, you know, outside of search. That 754 00:45:11,400 --> 00:45:13,960 Speaker 1: gave birth to the industry of SEO, or 755 00:45:14,040 --> 00:45:18,319 Speaker 1: search engine optimization, which is a constantly evolving set of 756 00:45:18,360 --> 00:45:21,680 Speaker 1: practices that web designers try to follow in order to 757 00:45:21,760 --> 00:45:26,000 Speaker 1: rank better in search. And whenever a search engine, which 758 00:45:26,040 --> 00:45:29,400 Speaker 1: again these days we mostly just mean Google, whenever Google 759 00:45:29,480 --> 00:45:32,600 Speaker 1: makes a change in its algorithm, it can really upset 760 00:45:32,640 --> 00:45:35,000 Speaker 1: the apple cart, and it can push everyone back to 761 00:45:35,040 --> 00:45:39,320 Speaker 1: the drawing board. It can completely jumble up who appears 762 00:45:39,320 --> 00:45:42,040 Speaker 1: at the top of search results.
Now, all of that 763 00:45:42,200 --> 00:45:44,719 Speaker 1: is another kettle of fish, so I'm going to leave 764 00:45:44,719 --> 00:45:46,839 Speaker 1: off of SEO and go back to that 765 00:45:46,920 --> 00:45:50,160 Speaker 1: on some other day. But more germane to our episode 766 00:45:50,200 --> 00:45:54,680 Speaker 1: here is that the spiders, those web crawling bots, are 767 00:45:54,719 --> 00:45:57,920 Speaker 1: what build out those indices that search engines use to 768 00:45:57,920 --> 00:46:00,520 Speaker 1: give us the results we ask for. There are some 769 00:46:00,600 --> 00:46:03,359 Speaker 1: things I did not cover, such as tags that web 770 00:46:03,360 --> 00:46:06,440 Speaker 1: developers can use to make sure that search engines just 771 00:46:06,520 --> 00:46:10,120 Speaker 1: pass over their websites, or sometimes just pages within their 772 00:46:10,120 --> 00:46:13,759 Speaker 1: websites, without adding them to an index, so they'll never 773 00:46:13,800 --> 00:46:16,360 Speaker 1: show up in search. But we could go over that 774 00:46:16,440 --> 00:46:19,040 Speaker 1: in a future episode, too. For now, it's kind of 775 00:46:19,080 --> 00:46:22,279 Speaker 1: time to wrap things up. So guys, I hope you 776 00:46:22,360 --> 00:46:25,400 Speaker 1: enjoyed this episode. If you have suggestions for future topics, 777 00:46:25,440 --> 00:46:28,600 Speaker 1: whether it's a specific technology, a trend in tech, a 778 00:46:28,640 --> 00:46:30,600 Speaker 1: person in tech, maybe it's a company you want to 779 00:46:30,640 --> 00:46:33,600 Speaker 1: know more about, let me know. Drop me a line 780 00:46:33,640 --> 00:46:36,080 Speaker 1: on Facebook or Twitter. The handle for both of those 781 00:46:36,239 --> 00:46:39,640 Speaker 1: is TechStuff HSW, and I'll talk to 782 00:46:39,719 --> 00:46:48,280 Speaker 1: you again really soon. TechStuff is an 783 00:46:48,360 --> 00:46:52,120 Speaker 1: iHeartRadio production.
For more podcasts from iHeartRadio, visit 784 00:46:52,120 --> 00:46:55,239 Speaker 1: the iHeartRadio app, Apple Podcasts, or wherever you 785 00:46:55,280 --> 00:47:00,279 Speaker 1: listen to your favorite shows.