1 00:00:04,240 --> 00:00:07,240 Speaker 1: Welcome to TechStuff, a production of iHeartRadio's 2 00:00:07,320 --> 00:00:14,200 Speaker 1: HowStuffWorks. Hey there, and welcome to TechStuff. 3 00:00:14,240 --> 00:00:17,520 Speaker 1: I'm your host, Jonathan Strickland. I'm an executive producer with 4 00:00:17,560 --> 00:00:19,920 Speaker 1: HowStuffWorks and iHeartRadio, and I love 5 00:00:20,040 --> 00:00:23,360 Speaker 1: all things tech, and today I thought I'd talk a 6 00:00:23,360 --> 00:00:26,680 Speaker 1: bit about Internet search engines and how Google was able 7 00:00:26,760 --> 00:00:30,400 Speaker 1: to sort of take the lead amongst a pack of competitors, 8 00:00:30,880 --> 00:00:34,480 Speaker 1: most of which came out well before Google did. Now, 9 00:00:34,479 --> 00:00:37,240 Speaker 1: these days, lots of people use Google as a word 10 00:00:37,360 --> 00:00:40,239 Speaker 1: for web searching in general, even though the company does 11 00:00:40,560 --> 00:00:43,159 Speaker 1: way more than web search, and there are still plenty of 12 00:00:43,200 --> 00:00:46,199 Speaker 1: competitors that are active out there. I'm 13 00:00:46,240 --> 00:00:49,400 Speaker 1: sure Microsoft would rather we all talk about Binging the 14 00:00:49,440 --> 00:00:52,519 Speaker 1: heck out of things, but that doesn't happen. I think 15 00:00:52,520 --> 00:00:54,520 Speaker 1: we're now at the point where people will talk about Googling, 16 00:00:54,640 --> 00:00:57,040 Speaker 1: even if they're using a different search engine. So how 17 00:00:57,040 --> 00:01:00,120 Speaker 1: did that happen? How did we get to that point? Well, 18 00:01:00,120 --> 00:01:02,280 Speaker 1: to explain how we got there, it's a good idea 19 00:01:02,280 --> 00:01:04,520 Speaker 1: to walk down memory lane. I mean, you know, I 20 00:01:04,560 --> 00:01:07,000 Speaker 1: love to do this. 
Every episode begins with a history 21 00:01:07,080 --> 00:01:09,760 Speaker 1: lesson, and to really look at how the idea of 22 00:01:09,760 --> 00:01:12,280 Speaker 1: search engines developed and what things were like in the 23 00:01:12,319 --> 00:01:16,279 Speaker 1: early days of the public Internet and the Web. Now, first, 24 00:01:16,560 --> 00:01:20,120 Speaker 1: the idea of search engines predates both of those concepts 25 00:01:20,120 --> 00:01:22,839 Speaker 1: by quite some time, and it arose out of necessity. 26 00:01:22,880 --> 00:01:26,520 Speaker 1: It kind of evolved out of older methods of indexing. 27 00:01:26,640 --> 00:01:31,039 Speaker 1: So the predecessors to search engines are the various library 28 00:01:31,120 --> 00:01:35,760 Speaker 1: classification systems. Three big ones are the Dewey Decimal system, 29 00:01:36,200 --> 00:01:40,240 Speaker 1: the Library of Congress system, and the Superintendent of Documents 30 00:01:40,280 --> 00:01:44,120 Speaker 1: system. The first two of those designate books with 31 00:01:44,200 --> 00:01:47,760 Speaker 1: call numbers according to subject matter, so you divide the 32 00:01:47,800 --> 00:01:51,640 Speaker 1: books up based upon whatever subject they cover. This can 33 00:01:51,640 --> 00:01:56,480 Speaker 1: get a little complicated. It is, and no pun intended, subjective. 34 00:01:57,000 --> 00:01:59,840 Speaker 1: You have to determine where the book best fits 35 00:02:00,280 --> 00:02:04,760 Speaker 1: in the grand taxonomy of subjects. Meanwhile, the Superintendent 36 00:02:04,800 --> 00:02:07,240 Speaker 1: of Documents system is totally different. It doesn't divide books 37 00:02:07,320 --> 00:02:11,079 Speaker 1: up by subject. It divides up books by the issuing 38 00:02:11,120 --> 00:02:15,399 Speaker 1: agency responsible for the publication of the work. 
So they're 39 00:02:15,520 --> 00:02:19,160 Speaker 1: just divided up by where the book came from, not 40 00:02:19,240 --> 00:02:22,480 Speaker 1: what the book covers. Whatever the system, the purpose is 41 00:02:22,520 --> 00:02:24,440 Speaker 1: the same. It's to make it possible for someone to 42 00:02:24,480 --> 00:02:29,160 Speaker 1: track down a specific work in an enormous collection of works, 43 00:02:29,520 --> 00:02:32,080 Speaker 1: or to figure out where to place a new work 44 00:02:32,280 --> 00:02:36,359 Speaker 1: within an existing collection. By classifying each work and then 45 00:02:36,400 --> 00:02:41,360 Speaker 1: designating the physical location for that piece, people can find stuff. Otherwise, 46 00:02:41,400 --> 00:02:43,280 Speaker 1: you just have an enormous pile of books with no 47 00:02:43,520 --> 00:02:47,119 Speaker 1: organizational system at all, and finding anything would take ages. Now, 48 00:02:47,200 --> 00:02:50,160 Speaker 1: someday I'll have to do an episode about these systems 49 00:02:50,160 --> 00:02:52,679 Speaker 1: in more detail, to talk about how they were developed 50 00:02:52,720 --> 00:02:55,560 Speaker 1: and how they've evolved over time, because it's actually a 51 00:02:55,600 --> 00:02:58,480 Speaker 1: pretty interesting story. But we're gonna jump forward a bit, 52 00:02:58,800 --> 00:03:01,799 Speaker 1: not quite up to the computer age, however. Rather, 53 00:03:01,880 --> 00:03:04,799 Speaker 1: we're gonna jump forward to the nineteen forties. That's when 54 00:03:04,800 --> 00:03:08,600 Speaker 1: a forward-thinking fellow named Vannevar Bush wrote an article 55 00:03:08,639 --> 00:03:11,600 Speaker 1: for The Atlantic Monthly. The piece had the title As 56 00:03:11,680 --> 00:03:16,320 Speaker 1: We May Think, and it contained some fairly prescient 57 00:03:16,400 --> 00:03:20,119 Speaker 1: ideas. 
Bush recognized that as we increased our knowledge, 58 00:03:20,440 --> 00:03:24,560 Speaker 1: we were beginning to specialize in certain fields out of necessity. 59 00:03:24,560 --> 00:03:29,000 Speaker 1: You couldn't just be a general knowledge master. Eventually 60 00:03:29,400 --> 00:03:32,480 Speaker 1: we were developing our knowledge in different areas 61 00:03:32,960 --> 00:03:36,120 Speaker 1: so far that you had to specialize. You couldn't 62 00:03:36,120 --> 00:03:38,520 Speaker 1: be an expert in everything. To get a really 63 00:03:38,520 --> 00:03:42,200 Speaker 1: deep understanding about a particular field, such as physics or chemistry, 64 00:03:42,600 --> 00:03:45,480 Speaker 1: we might dedicate all our resources to that pursuit as 65 00:03:45,480 --> 00:03:48,880 Speaker 1: an individual. Meanwhile, there are other people who are exploring 66 00:03:49,000 --> 00:03:54,280 Speaker 1: different subjects, like pure mathematics or cosmology or something like that. Now, this, 67 00:03:54,680 --> 00:03:57,560 Speaker 1: Bush argued, presented a new challenge. How do we create 68 00:03:57,640 --> 00:04:02,240 Speaker 1: a usable record of our discoveries, one that's easily navigable 69 00:04:02,440 --> 00:04:06,480 Speaker 1: and remains relevant over time? While an older library classification 70 00:04:06,520 --> 00:04:11,120 Speaker 1: system might encompass several categories, it couldn't get as granular 71 00:04:11,200 --> 00:04:14,120 Speaker 1: as our knowledge was growing to be. For example, the 72 00:04:14,160 --> 00:04:18,440 Speaker 1: Library of Congress classification system has twenty-one categories that 73 00:04:18,520 --> 00:04:21,200 Speaker 1: you can use to group books together. 
But as our 74 00:04:21,240 --> 00:04:25,400 Speaker 1: research and discoveries homed in on ever more precise slices 75 00:04:25,480 --> 00:04:29,839 Speaker 1: of those categories, the system becomes less relevant because 76 00:04:29,960 --> 00:04:35,160 Speaker 1: you've got, you know, minor categories within those major categories, 77 00:04:35,560 --> 00:04:38,919 Speaker 1: so it gets harder to start classifying things. Bush said 78 00:04:39,120 --> 00:04:41,920 Speaker 1: we needed to have a record that could be continuously 79 00:04:42,080 --> 00:04:47,679 Speaker 1: extended and easy to consult. But he went even further 80 00:04:47,760 --> 00:04:50,640 Speaker 1: out than that. He said, to make it a really 81 00:04:50,839 --> 00:04:53,719 Speaker 1: useful record, we need to structure it to respond to 82 00:04:53,760 --> 00:04:56,200 Speaker 1: our queries in a way similar to how the human 83 00:04:56,240 --> 00:05:00,680 Speaker 1: mind works. Bush argued that we think through association. We 84 00:05:00,760 --> 00:05:05,800 Speaker 1: associate ideas with each other, sometimes in pretty unusual ways, 85 00:05:05,839 --> 00:05:09,040 Speaker 1: in ways that might seem intuitive to us, but on 86 00:05:09,080 --> 00:05:11,920 Speaker 1: the very surface of it, there doesn't seem to 87 00:05:11,920 --> 00:05:14,960 Speaker 1: be any relation between those ideas. And you may have 88 00:05:15,080 --> 00:05:17,479 Speaker 1: experienced this, where you're thinking about one thing and you 89 00:05:17,560 --> 00:05:20,040 Speaker 1: just start to think about a different thing that doesn't 90 00:05:20,080 --> 00:05:22,520 Speaker 1: seem to be related, and then you're able to relate 91 00:05:22,520 --> 00:05:26,560 Speaker 1: the two. This is really human ingenuity. It's where innovation 92 00:05:26,760 --> 00:05:30,520 Speaker 1: really takes off. 
Well, Bush thought that it would probably be impossible 93 00:05:30,720 --> 00:05:33,159 Speaker 1: for us to create an artificial system that could replicate 94 00:05:33,200 --> 00:05:35,800 Speaker 1: that tendency, but we could at the very least design 95 00:05:35,920 --> 00:05:39,520 Speaker 1: something that acknowledges that human trait so it works better 96 00:05:39,640 --> 00:05:42,400 Speaker 1: for us. So if we did that, if we decided 97 00:05:42,440 --> 00:05:45,240 Speaker 1: to search a record for a particular type of information, 98 00:05:45,760 --> 00:05:48,760 Speaker 1: we might also see the opportunity to search for tangential 99 00:05:48,839 --> 00:05:52,320 Speaker 1: data that is relevant to our needs. A good system 100 00:05:52,360 --> 00:05:54,640 Speaker 1: would be able to anticipate that and serve up the 101 00:05:54,680 --> 00:05:58,200 Speaker 1: information for us. So Bush proposed a hypothetical system called 102 00:05:58,360 --> 00:06:02,240 Speaker 1: Memex, M-E-M-E-X, and that would use 103 00:06:02,279 --> 00:06:08,000 Speaker 1: associative factors to organize information in a virtually limitless storage space. Again, 104 00:06:08,000 --> 00:06:11,039 Speaker 1: this is hypothetical. It would be a system that one 105 00:06:11,080 --> 00:06:13,680 Speaker 1: could reference and send a retrieval command to get the 106 00:06:13,720 --> 00:06:16,560 Speaker 1: most relevant information related to whatever it was you were 107 00:06:16,560 --> 00:06:20,200 Speaker 1: asking for in your query. Essentially, he was talking about a 108 00:06:20,200 --> 00:06:24,560 Speaker 1: conceptual model that the Internet attempts to realize. Now skip 109 00:06:24,560 --> 00:06:27,280 Speaker 1: ahead to the nineteen sixties. Then you've got a computer 110 00:06:27,320 --> 00:06:32,240 Speaker 1: scientist named Gerry Salton. 
Gerry Salton taught at Cornell University, 111 00:06:32,279 --> 00:06:36,480 Speaker 1: and he developed an indexing strategy using a vector space model. 112 00:06:37,040 --> 00:06:39,680 Speaker 1: Now this gets a bit mind-bendy for people who 113 00:06:39,720 --> 00:06:43,080 Speaker 1: haven't worked with vector space models, but follow me here. Now, 114 00:06:43,120 --> 00:06:47,400 Speaker 1: start with an imaginary virtual space kind of analogous to 115 00:06:47,520 --> 00:06:50,680 Speaker 1: the physical space we live in in our day-to- 116 00:06:50,760 --> 00:06:54,839 Speaker 1: day lives. Now, in our reality, we can perceive three dimensions, 117 00:06:54,960 --> 00:06:57,440 Speaker 1: and we experience a fourth one, that of time, but 118 00:06:57,520 --> 00:07:01,400 Speaker 1: we cannot directly perceive any more than that ourselves, so 119 00:07:01,440 --> 00:07:03,680 Speaker 1: most of the time we associate the physical world with 120 00:07:03,800 --> 00:07:08,000 Speaker 1: three physical dimensions. Now, in the information retrieval method that Salton 121 00:07:08,080 --> 00:07:11,280 Speaker 1: set up, he defined the number of dimensions within his 122 00:07:11,520 --> 00:07:15,960 Speaker 1: virtual space by the number of terms in a retrieval request. 123 00:07:16,240 --> 00:07:20,800 Speaker 1: So if your request included five terms, the vector space 124 00:07:20,800 --> 00:07:25,119 Speaker 1: model would have five dimensions. Documents within the model would 125 00:07:25,160 --> 00:07:29,920 Speaker 1: virtually appear as vectors within the space according to which 126 00:07:29,920 --> 00:07:33,200 Speaker 1: of the search terms were present within those documents and 127 00:07:33,240 --> 00:07:36,520 Speaker 1: how frequently they were present within the documents. The 128 00:07:36,640 --> 00:07:40,240 Speaker 1: queries and the documents are both vectors of the term counts. 
129 00:07:40,240 --> 00:07:42,400 Speaker 1: And just in case you're as rusty on your physics 130 00:07:42,480 --> 00:07:45,000 Speaker 1: terms as I am, a vector is a quantity that 131 00:07:45,040 --> 00:07:50,120 Speaker 1: has a magnitude and a direction. So your terms have vectors, 132 00:07:50,160 --> 00:07:52,720 Speaker 1: your documents have vectors, and the goal is to identify 133 00:07:52,760 --> 00:07:55,640 Speaker 1: the documents that are most similar to the initial query 134 00:07:55,720 --> 00:07:58,560 Speaker 1: in an effort to retrieve the most relevant results, while 135 00:07:58,640 --> 00:08:01,240 Speaker 1: leaving out anything that doesn't meet the criteria or doesn't 136 00:08:01,240 --> 00:08:04,880 Speaker 1: meet a predetermined threshold of relevance. So you might say, 137 00:08:05,160 --> 00:08:09,640 Speaker 1: I need to have an X percent match for the retrieval 138 00:08:09,800 --> 00:08:12,120 Speaker 1: to actually come through, and anything that doesn't meet that 139 00:08:12,200 --> 00:08:15,520 Speaker 1: threshold gets discarded. It's not served to me, 140 00:08:16,040 --> 00:08:18,400 Speaker 1: and that saves you time when you start sorting through 141 00:08:18,480 --> 00:08:21,920 Speaker 1: the results to see if any of those actually represent 142 00:08:21,960 --> 00:08:24,960 Speaker 1: the information you were actually looking for. Now, suffice it 143 00:08:25,000 --> 00:08:27,760 Speaker 1: to say, this model really looks for the presence of 144 00:08:27,800 --> 00:08:32,080 Speaker 1: specific terms, but not necessarily their use within the document, 145 00:08:32,120 --> 00:08:35,280 Speaker 1: their context. So you could end up retrieving a document 146 00:08:35,320 --> 00:08:38,400 Speaker 1: that technically contains all the terms you used in the search, 147 00:08:38,880 --> 00:08:42,600 Speaker 1: but it has no real relevance to your actual needs. 
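The vector space idea described above can be sketched in a few lines of Python. This is a minimal illustration, not Salton's actual system: it assumes plain whitespace tokenization, raw term counts, and cosine similarity as the measure of how close a document vector sits to the query vector. The function names (`term_vector`, `cosine_similarity`, `retrieve`) and the default threshold are hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def term_vector(text, terms):
    """Count how often each query term appears in the text (one dimension per term)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in terms]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best match first.

    The dimensions of the space are the distinct terms in the query,
    so a five-term query yields a five-dimensional space.
    """
    terms = sorted(set(query.lower().split()))
    query_vec = term_vector(query, terms)
    scored = []
    for name, text in documents.items():
        score = cosine_similarity(query_vec, term_vector(text, terms))
        if score >= threshold:  # discard anything below the relevance threshold
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the vectors hold only term counts, a document that happens to contain every query term scores well whether or not it is actually about the topic, which is the context limitation mentioned in the episode.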
148 00:08:42,640 --> 00:08:47,280 Speaker 1: So that is a limitation of this model, but still 149 00:08:47,360 --> 00:08:50,000 Speaker 1: it was a pretty good starting point, so Salton's work 150 00:08:50,040 --> 00:08:53,840 Speaker 1: was incredibly important. Another big thinker who helped shape the 151 00:08:53,880 --> 00:08:57,040 Speaker 1: course of what would become the Internet and the Web 152 00:08:57,400 --> 00:08:59,959 Speaker 1: is a guy named Ted Nelson, who in the nineteen 153 00:09:00,000 --> 00:09:03,160 Speaker 1: sixties proposed an idea he called Xanadu. And I'm 154 00:09:03,200 --> 00:09:05,920 Speaker 1: not talking about the cheesy movie starring Olivia Newton-John 155 00:09:06,000 --> 00:09:09,360 Speaker 1: about roller-skating Greek muses, but as a side note, 156 00:09:09,360 --> 00:09:12,880 Speaker 1: I really love that movie. Now, Nelson's Xanadu was 157 00:09:12,920 --> 00:09:16,280 Speaker 1: a hypothetical computer-based writing system that would have a 158 00:09:16,360 --> 00:09:20,640 Speaker 1: means to link different documents within a global depository. So 159 00:09:20,760 --> 00:09:23,720 Speaker 1: essentially he was talking about hypertext links, which would allow 160 00:09:23,800 --> 00:09:27,480 Speaker 1: users to navigate from document to document and to relate documents together, 161 00:09:28,160 --> 00:09:32,800 Speaker 1: so that you could have a collection of documents about 162 00:09:32,840 --> 00:09:35,720 Speaker 1: the same sort of subject matter and make it 163 00:09:35,800 --> 00:09:39,040 Speaker 1: very easy to reference different research. It would also allow 164 00:09:39,160 --> 00:09:41,679 Speaker 1: document creators to add their work to a growing collection 165 00:09:41,720 --> 00:09:44,800 Speaker 1: of documents about similar subjects. 
Now, while the Web would 166 00:09:44,840 --> 00:09:48,960 Speaker 1: incorporate many of Nelson's ideas, he has stated that the 167 00:09:49,040 --> 00:09:51,880 Speaker 1: Web falls far short of what Xanadu was meant 168 00:09:52,000 --> 00:09:54,840 Speaker 1: to do. Still, those links would become very important for 169 00:09:54,880 --> 00:09:56,840 Speaker 1: the Web. Heck, I mean, you could argue the links 170 00:09:56,880 --> 00:09:58,880 Speaker 1: are what make it a web in the first place. 171 00:09:59,240 --> 00:10:03,600 Speaker 1: The World Wide Web is a series of documents published 172 00:10:03,600 --> 00:10:08,280 Speaker 1: on servers that have connective tissue between them. That's the 173 00:10:08,320 --> 00:10:11,600 Speaker 1: web that you navigate. So links would be crucial in 174 00:10:11,640 --> 00:10:15,200 Speaker 1: Google's eventual success, as we'll see. Now, in the nineteen seventies, 175 00:10:15,559 --> 00:10:18,080 Speaker 1: the agency that would become DARPA, which at the time 176 00:10:18,120 --> 00:10:22,040 Speaker 1: was just ARPA, funded the development of the ARPANET, 177 00:10:22,360 --> 00:10:26,400 Speaker 1: which would be the predecessor to the Internet. Computer scientists 178 00:10:26,480 --> 00:10:28,840 Speaker 1: worked on the rules that machines would have to follow 179 00:10:28,920 --> 00:10:31,319 Speaker 1: in order to communicate with one another over a network. 180 00:10:31,800 --> 00:10:34,080 Speaker 1: This was a non-trivial problem at the time because 181 00:10:34,679 --> 00:10:38,360 Speaker 1: computers were dependent upon proprietary systems that were not compatible 182 00:10:38,440 --> 00:10:42,520 Speaker 1: with computers from other manufacturers. So, in other words, they 183 00:10:42,559 --> 00:10:44,680 Speaker 1: were talking in different languages, and you had to find 184 00:10:44,720 --> 00:10:48,600 Speaker 1: a common means of communication between these different machines. 
Solving 185 00:10:48,600 --> 00:10:51,080 Speaker 1: those problems laid the foundation for the Internet that was 186 00:10:51,120 --> 00:10:54,600 Speaker 1: to follow. Now, skipping ahead to the late nineteen eighties, 187 00:10:54,920 --> 00:10:57,640 Speaker 1: this is still before the Web was a thing, but 188 00:10:58,040 --> 00:11:03,360 Speaker 1: college students Alan Emtage and Bill Heelan recognized the need 189 00:11:03,559 --> 00:11:06,520 Speaker 1: for a tool to search file databases effectively. They were 190 00:11:06,520 --> 00:11:09,880 Speaker 1: part of a project at the McGill University School of 191 00:11:09,880 --> 00:11:13,000 Speaker 1: Computer Science to develop that kind of a tool. It 192 00:11:13,040 --> 00:11:15,840 Speaker 1: would become known as Archie, and it was meant to 193 00:11:15,880 --> 00:11:20,280 Speaker 1: search archives of files. The original version was pretty primitive. 194 00:11:20,679 --> 00:11:24,200 Speaker 1: It would essentially just send an automated request to a 195 00:11:24,240 --> 00:11:27,640 Speaker 1: File Transfer Protocol server and it would just say, hey, 196 00:11:27,679 --> 00:11:30,280 Speaker 1: give me a list of all the files that are 197 00:11:30,320 --> 00:11:33,839 Speaker 1: stored on your server. That's it, just give me a 198 00:11:33,920 --> 00:11:35,720 Speaker 1: laundry list of all the files that are on there. 199 00:11:36,160 --> 00:11:39,240 Speaker 1: And once a month it would send this request, 200 00:11:39,760 --> 00:11:42,080 Speaker 1: and so really it was just a list of the 201 00:11:42,160 --> 00:11:46,160 Speaker 1: documents that were available on that FTP server, not anything more, 202 00:11:46,559 --> 00:11:49,320 Speaker 1: you know, sophisticated than that. 
But it would grow to 203 00:11:49,320 --> 00:11:52,120 Speaker 1: become a query search tool, allowing users to look for 204 00:11:52,160 --> 00:11:55,600 Speaker 1: files containing specific terms in them or with specific titles. 205 00:11:56,240 --> 00:11:59,839 Speaker 1: Other schools would develop similar search tools in the following years, 206 00:12:00,200 --> 00:12:04,360 Speaker 1: naming them after characters from Archie comics like Veronica and 207 00:12:04,480 --> 00:12:07,280 Speaker 1: Jughead. Now this is despite the fact that Emtage 208 00:12:07,360 --> 00:12:10,959 Speaker 1: said he intended no association with Archie comics at all. 209 00:12:11,000 --> 00:12:14,680 Speaker 1: He chose the name Archie because it's archive but without 210 00:12:14,679 --> 00:12:17,520 Speaker 1: the V. But sometimes memes just take hold, even if 211 00:12:17,520 --> 00:12:21,760 Speaker 1: they're based off a misunderstanding. Also, both Veronica and Jughead 212 00:12:22,000 --> 00:12:26,360 Speaker 1: searched for files in the Gopher index system, a predecessor 213 00:12:26,440 --> 00:12:29,080 Speaker 1: and alternative to the World Wide Web. I did an episode 214 00:12:29,080 --> 00:12:32,040 Speaker 1: about Gopher a couple of years ago, I think, so 215 00:12:32,080 --> 00:12:33,640 Speaker 1: you can search the archives if you want to hear 216 00:12:33,679 --> 00:12:38,240 Speaker 1: about that. Now, this leads us to when Tim Berners-Lee 217 00:12:38,320 --> 00:12:41,920 Speaker 1: built and published the world's first web page. Berners-Lee 218 00:12:41,960 --> 00:12:45,000 Speaker 1: had done some work with hypertext documents at CERN 219 00:12:45,160 --> 00:12:48,400 Speaker 1: as a contractor in the early eighties. The goal then 220 00:12:48,600 --> 00:12:51,559 Speaker 1: was to help researchers share information with each other as 221 00:12:51,559 --> 00:12:54,480 Speaker 1: they were smashing particles against each other really, really hard. 
222 00:12:54,920 --> 00:12:58,840 Speaker 1: By the end of that decade, Berners-Lee was thinking about pairing the hypertext capabilities 223 00:12:58,840 --> 00:13:01,800 Speaker 1: with the Internet to allow for an interconnected series of 224 00:13:01,840 --> 00:13:06,000 Speaker 1: documents hosted on networked Internet servers, and thus the World 225 00:13:06,040 --> 00:13:08,760 Speaker 1: Wide Web was born. It wouldn't take long for others 226 00:13:08,800 --> 00:13:11,320 Speaker 1: to jump on the idea, and that meant it wouldn't 227 00:13:11,360 --> 00:13:13,840 Speaker 1: be long before people needed a tool to search the 228 00:13:13,880 --> 00:13:17,520 Speaker 1: growing collection of documents on the Internet. And that kind 229 00:13:17,520 --> 00:13:19,880 Speaker 1: of sets me up for the next section, which I 230 00:13:19,920 --> 00:13:22,360 Speaker 1: will tackle in just a moment after we take this 231 00:13:22,440 --> 00:13:32,880 Speaker 1: quick break. So in the earliest days of the Web, 232 00:13:32,960 --> 00:13:36,120 Speaker 1: when calling it a web might have been a little generous, 233 00:13:36,440 --> 00:13:40,120 Speaker 1: CERN maintained a list of web servers that hosted web pages. 234 00:13:40,480 --> 00:13:44,280 Speaker 1: This was all part of the World Wide Web Virtual Library, 235 00:13:44,480 --> 00:13:49,240 Speaker 1: or VLib, or sometimes WWW VLib. This was the 236 00:13:49,280 --> 00:13:52,560 Speaker 1: first index of web content, and it relied upon real- 237 00:13:52,600 --> 00:13:54,959 Speaker 1: life human beings to build out the index. As more 238 00:13:54,960 --> 00:13:58,240 Speaker 1: web pages were published, people volunteered their time to build 239 00:13:58,240 --> 00:14:01,760 Speaker 1: out the index. So this wasn't automated. People were actually 240 00:14:02,240 --> 00:14:06,199 Speaker 1: doing this by hand, adding the names and the 241 00:14:06,280 --> 00:14:10,680 Speaker 1: addresses of these different sites to this index. 
Next we 242 00:14:10,720 --> 00:14:15,120 Speaker 1: have Matthew Gray's World Wide Web Wanderer. Now, this was a 243 00:14:15,200 --> 00:14:18,000 Speaker 1: bot, or an autonomous program on a network that can 244 00:14:18,080 --> 00:14:21,800 Speaker 1: interact in some significant way with the information on the network. 245 00:14:22,160 --> 00:14:24,520 Speaker 1: And we deal with bots all the time. Sometimes it's 246 00:14:24,520 --> 00:14:27,680 Speaker 1: in the background and humans don't really notice, and sometimes, 247 00:14:28,080 --> 00:14:30,720 Speaker 1: like with chatbots, it's very much in front of us. 248 00:14:31,200 --> 00:14:34,000 Speaker 1: The bot that Matthew Gray created would navigate the World 249 00:14:34,080 --> 00:14:37,040 Speaker 1: Wide Web to keep count of how many active servers 250 00:14:37,080 --> 00:14:39,960 Speaker 1: there were in any network. It was essentially measuring the 251 00:14:40,000 --> 00:14:43,600 Speaker 1: growth of the Web over time by counting up these servers. 252 00:14:43,920 --> 00:14:46,560 Speaker 1: As more servers came online, we learned that the World 253 00:14:46,560 --> 00:14:49,920 Speaker 1: Wide Web was growing. Gray upgraded the bot to actually 254 00:14:49,920 --> 00:14:52,360 Speaker 1: capture the URLs of web pages, because earlier 255 00:14:52,360 --> 00:14:55,720 Speaker 1: it was just counting stuff. It wasn't actually making note 256 00:14:55,760 --> 00:14:58,520 Speaker 1: of anything in particular, and so it got a little 257 00:14:58,560 --> 00:15:02,800 Speaker 1: more sophisticated. Gray built out a database of these captured 258 00:15:02,880 --> 00:15:05,680 Speaker 1: URLs, called Wandex. The bot would 259 00:15:05,720 --> 00:15:09,440 Speaker 1: ping servers multiple times each day, and it actually became 260 00:15:09,440 --> 00:15:12,240 Speaker 1: a problem that it was pinging so frequently. 
And a ping 261 00:15:12,320 --> 00:15:14,640 Speaker 1: is just a quick message that essentially says, hey, are 262 00:15:14,680 --> 00:15:17,240 Speaker 1: you there, and then it's waiting for a response of yeah, 263 00:15:17,360 --> 00:15:19,880 Speaker 1: I'm here, it's all good. But it was doing this 264 00:15:19,960 --> 00:15:23,480 Speaker 1: so many times each day that it was actually starting 265 00:15:23,520 --> 00:15:26,040 Speaker 1: to create lag on the Internet. Of course, this is 266 00:15:26,040 --> 00:15:29,240 Speaker 1: in the very, very early days, so whoopsie daisy there. 267 00:15:29,680 --> 00:15:33,760 Speaker 1: Now, toward the end of nineteen ninety-three, some early web search 268 00:15:33,840 --> 00:15:36,520 Speaker 1: tools were starting to make their way to the general public, though 269 00:15:36,840 --> 00:15:39,320 Speaker 1: keep in mind that in the very early days of 270 00:15:39,360 --> 00:15:43,040 Speaker 1: the World Wide Web, the general public accessing web pages was 271 00:15:43,080 --> 00:15:46,280 Speaker 1: really just a tiny fraction of the overall population. It's 272 00:15:46,320 --> 00:15:50,720 Speaker 1: like college students, some early adopters, some folks with various 273 00:15:50,800 --> 00:15:55,520 Speaker 1: government agencies, and a few companies, but not a whole lot. 274 00:15:55,640 --> 00:15:58,120 Speaker 1: It was largely a mysterious thing. You know, this is 275 00:15:58,160 --> 00:16:01,120 Speaker 1: when people were just starting to hear the terms 276 00:16:01,360 --> 00:16:05,280 Speaker 1: World Wide Web and information superhighway, because the Internet had 277 00:16:05,320 --> 00:16:07,400 Speaker 1: been around for a while, but most people didn't have 278 00:16:07,400 --> 00:16:11,320 Speaker 1: any regular way to access it. So these tools could 279 00:16:11,360 --> 00:16:15,880 Speaker 1: help you find stuff, but they weren't super sophisticated. 
There 280 00:16:15,960 --> 00:16:19,960 Speaker 1: was the World Wide Web Worm, which would pull together lists 281 00:16:20,040 --> 00:16:22,200 Speaker 1: of titles and URLs for web pages. 282 00:16:23,040 --> 00:16:26,080 Speaker 1: There was JumpStation, which would pull down information about 283 00:16:26,080 --> 00:16:29,800 Speaker 1: web pages' titles and header sections, so sort of like 284 00:16:30,000 --> 00:16:32,160 Speaker 1: the title of the web page and a brief description 285 00:16:32,200 --> 00:16:34,400 Speaker 1: of what the web page was supposed to be. But 286 00:16:34,520 --> 00:16:37,000 Speaker 1: both of those tools were very simple, and they would 287 00:16:37,040 --> 00:16:40,160 Speaker 1: present results in the order that they were found by 288 00:16:40,280 --> 00:16:44,920 Speaker 1: the tools, so there was no ranking of the search results. 289 00:16:45,160 --> 00:16:48,320 Speaker 1: It was all a first come, first served 290 00:16:48,400 --> 00:16:52,960 Speaker 1: kind of approach. So it might be that your results 291 00:16:53,000 --> 00:16:56,160 Speaker 1: all had whatever your query was in them, but the 292 00:16:56,160 --> 00:16:58,760 Speaker 1: most relevant ones could be buried much further down the 293 00:16:58,840 --> 00:17:02,080 Speaker 1: list because they weren't ranked in any way. Then there 294 00:17:02,160 --> 00:17:05,520 Speaker 1: was the RBSE spider, which actually attempted to rank 295 00:17:05,600 --> 00:17:08,760 Speaker 1: results by relevance. But all three of these were limited 296 00:17:08,760 --> 00:17:10,560 Speaker 1: in what they could do, and often you needed to 297 00:17:10,600 --> 00:17:13,800 Speaker 1: know what you were looking for exactly in order to 298 00:17:13,840 --> 00:17:16,560 Speaker 1: get a hit. In other words, you couldn't just do 299 00:17:16,760 --> 00:17:21,800 Speaker 1: a string of words. 
You certainly couldn't write in natural 300 00:17:21,920 --> 00:17:25,080 Speaker 1: language what your query was, so you might have to 301 00:17:25,080 --> 00:17:27,439 Speaker 1: put in the actual title of a page in order 302 00:17:27,520 --> 00:17:31,159 Speaker 1: to get the response back. So you would have to 303 00:17:31,200 --> 00:17:33,240 Speaker 1: know what the page's title is, but you 304 00:17:33,320 --> 00:17:35,200 Speaker 1: obviously don't know what the URL is, or 305 00:17:35,200 --> 00:17:38,439 Speaker 1: else you would just navigate to the page directly. You'd 306 00:17:38,520 --> 00:17:41,640 Speaker 1: just type in the address in your browser's address bar 307 00:17:41,680 --> 00:17:45,600 Speaker 1: and go there. So it was kind of limited in 308 00:17:45,680 --> 00:17:49,080 Speaker 1: its utility. If you were to search for anything outside of 309 00:17:49,160 --> 00:17:51,280 Speaker 1: the actual title of a page, you might not find 310 00:17:51,280 --> 00:17:55,680 Speaker 1: any hits, even if such pages actually existed out there. Also, 311 00:17:55,720 --> 00:17:59,600 Speaker 1: in nineteen ninety-three, some Stanford undergraduates decided to take the work they 312 00:17:59,640 --> 00:18:02,800 Speaker 1: had been doing on a project called Architext and develop 313 00:18:02,840 --> 00:18:05,720 Speaker 1: a web-crawling search tool based off of that work. 314 00:18:06,320 --> 00:18:10,880 Speaker 1: Architext was all about using statistical analysis of word relationships 315 00:18:10,920 --> 00:18:14,000 Speaker 1: in an effort to kind of build a basic understanding 316 00:18:14,040 --> 00:18:18,040 Speaker 1: of what the subject matter was, and that would then 317 00:18:18,160 --> 00:18:21,760 Speaker 1: be able to help you create more relevant search results 318 00:18:21,760 --> 00:18:26,480 Speaker 1: on queries. 
So you run a search request and this 319 00:18:26,560 --> 00:18:32,720 Speaker 1: tool would statistically analyze various indexed pages in its database 320 00:18:33,320 --> 00:18:37,760 Speaker 1: and return the results that appeared to be the most relevant. 321 00:18:37,800 --> 00:18:40,760 Speaker 1: It was an interesting approach. It was definitely one that 322 00:18:40,840 --> 00:18:44,879 Speaker 1: was needed because it wasn't just listing the sites 323 00:18:44,920 --> 00:18:49,000 Speaker 1: chronologically based on how they were obtained. But it would 324 00:18:49,000 --> 00:18:53,880 Speaker 1: take about two years for this project to actually turn 325 00:18:53,920 --> 00:18:58,119 Speaker 1: into something that the group could unveil, and when 326 00:18:58,160 --> 00:19:02,880 Speaker 1: they did, they called the tool Excite and they held 327 00:19:02,920 --> 00:19:07,159 Speaker 1: a commercial release for the product in n But in 328 00:19:07,200 --> 00:19:11,640 Speaker 1: between the founding and the release of Excite, we hit 329 00:19:11,760 --> 00:19:18,159 Speaker 1: a banner year for early search engines. Nineteen ninety-four was 330 00:19:18,200 --> 00:19:22,960 Speaker 1: the year that WebCrawler, Lycos, Infoseek, and Yahoo 331 00:19:23,160 --> 00:19:26,320 Speaker 1: all got their start. Now, in the case of Yahoo, 332 00:19:26,480 --> 00:19:29,320 Speaker 1: the company was not relying on bots to crawl through 333 00:19:29,359 --> 00:19:33,040 Speaker 1: web servers to index all the pages that the bots 334 00:19:33,119 --> 00:19:38,000 Speaker 1: came across.
Instead, Yahoo initially was relying on actual human 335 00:19:38,040 --> 00:19:41,680 Speaker 1: beings to curate an index, so they were actually going 336 00:19:41,720 --> 00:19:44,600 Speaker 1: to web pages deciding whether or not those web pages 337 00:19:44,680 --> 00:19:48,280 Speaker 1: were good enough to be listed on Yahoo on the 338 00:19:48,359 --> 00:19:51,359 Speaker 1: various subjects that Yahoo was covering, and then they would 339 00:19:51,400 --> 00:19:54,600 Speaker 1: be grouped together if they passed muster. Now, there are 340 00:19:54,680 --> 00:19:57,479 Speaker 1: pros and cons to that approach. One of the pros 341 00:19:57,560 --> 00:20:00,320 Speaker 1: is that because it is human curated, there's a much 342 00:20:00,359 --> 00:20:03,800 Speaker 1: better possibility that the web pages on Yahoo's lists 343 00:20:03,840 --> 00:20:07,119 Speaker 1: were good ones with good content. But the con side was 344 00:20:07,160 --> 00:20:09,439 Speaker 1: that as the web grew and began growing at an 345 00:20:09,480 --> 00:20:13,320 Speaker 1: even faster rate, it really limited Yahoo's usefulness. It would 346 00:20:13,359 --> 00:20:16,160 Speaker 1: only be later that Yahoo would branch out into 347 00:20:16,200 --> 00:20:19,160 Speaker 1: web search in general, and even then it relied very 348 00:20:19,160 --> 00:20:21,959 Speaker 1: heavily on third parties for the actual search tools. They 349 00:20:21,960 --> 00:20:26,400 Speaker 1: didn't really dive into developing their own. They were more 350 00:20:26,440 --> 00:20:31,560 Speaker 1: about making deals with other search engines to power their search. 351 00:20:31,560 --> 00:20:34,680 Speaker 1: In fact, that happened on and off throughout Yahoo's entire existence. 352 00:20:35,359 --> 00:20:38,960 Speaker 1: But let's get back to WebCrawler, Lycos and Infoseek.
Now, 353 00:20:39,000 --> 00:20:41,960 Speaker 1: of those three, WebCrawler was the first to provide full 354 00:20:42,080 --> 00:20:45,600 Speaker 1: text search of web pages, so not just headers and titles. 355 00:20:45,960 --> 00:20:49,359 Speaker 1: You could search terms, and if they appeared in the 356 00:20:49,440 --> 00:20:52,840 Speaker 1: web page at all, then, in theory, WebCrawler would be 357 00:20:52,880 --> 00:20:55,080 Speaker 1: able to bring that back as long as it was 358 00:20:55,160 --> 00:20:59,080 Speaker 1: indexed in WebCrawler's index. It was the work of a 359 00:20:59,600 --> 00:21:03,600 Speaker 1: University of Washington student named Brian Pinkerton, and Pinkerton's 360 00:21:03,600 --> 00:21:07,679 Speaker 1: WebCrawler built out this big index of pages, and 361 00:21:07,720 --> 00:21:11,879 Speaker 1: Pinkerton started rather modestly. He first released a list of 362 00:21:11,920 --> 00:21:15,560 Speaker 1: the top twenty five websites on the web on March fifteenth, 363 00:21:16,960 --> 00:21:19,040 Speaker 1: and the following month he announced that WebCrawler's 364 00:21:19,080 --> 00:21:24,320 Speaker 1: index included four thousand websites, and by June of ninety four, 365 00:21:24,720 --> 00:21:28,199 Speaker 1: he made the index searchable for everyone. So again, this 366 00:21:28,280 --> 00:21:31,119 Speaker 1: is just a slice of all the websites that were 367 00:21:31,119 --> 00:21:34,560 Speaker 1: out there, but it was a decent enough slice to 368 00:21:34,640 --> 00:21:37,679 Speaker 1: start off with, and the endeavor proved to be successful.
369 00:21:37,720 --> 00:21:40,879 Speaker 1: Pinkerton received financial investments from a couple of big companies, 370 00:21:41,119 --> 00:21:43,240 Speaker 1: and within a year he had managed to support the 371 00:21:43,280 --> 00:21:47,280 Speaker 1: service through advertising revenue, a model that other search engines 372 00:21:47,320 --> 00:21:49,280 Speaker 1: would follow, so he was able to actually make money 373 00:21:49,280 --> 00:21:54,480 Speaker 1: by serving up advertising on his search engine pages. By June, 374 00:21:54,760 --> 00:21:57,480 Speaker 1: AOL had become interested in WebCrawler and would 375 00:21:57,520 --> 00:22:00,440 Speaker 1: purchase the company. AOL would later sell 376 00:22:00,480 --> 00:22:02,840 Speaker 1: the company a little less than two years later to 377 00:22:02,920 --> 00:22:05,760 Speaker 1: Excite, that company that I had mentioned earlier in this episode. 378 00:22:05,800 --> 00:22:07,439 Speaker 1: I'll get back to them and to WebCrawler a 379 00:22:07,440 --> 00:22:09,760 Speaker 1: bit later, but I will say that WebCrawler was 380 00:22:09,840 --> 00:22:12,360 Speaker 1: my search engine of choice when I first started using 381 00:22:12,400 --> 00:22:15,399 Speaker 1: the web in the mid nineteen nineties. I was actually 382 00:22:15,440 --> 00:22:18,120 Speaker 1: pretty slow to move over to that crazy Google thing 383 00:22:18,160 --> 00:22:21,280 Speaker 1: that we're gonna get to later in this episode. Lycos, meanwhile, 384 00:22:21,640 --> 00:22:25,800 Speaker 1: started off as a project at Carnegie Mellon University. Michael 385 00:22:25,880 --> 00:22:28,880 Speaker 1: Mauldin headed up the project, and the name came from 386 00:22:29,000 --> 00:22:34,520 Speaker 1: wolf spiders, which have the scientific family name Lycosidae.
When 387 00:22:34,640 --> 00:22:37,560 Speaker 1: Lycos became a company, Bob Davis took the helm to 388 00:22:37,680 --> 00:22:40,920 Speaker 1: turn it into a revenue generating business that gets 389 00:22:40,920 --> 00:22:43,679 Speaker 1: cash from advertising, like WebCrawler, and it also was 390 00:22:43,760 --> 00:22:46,920 Speaker 1: a success, and by the end of nine the Lycos 391 00:22:47,040 --> 00:22:50,720 Speaker 1: index was the largest web search index available on the web. 392 00:22:51,040 --> 00:22:55,679 Speaker 1: It held more than sixty million documents in it. The 393 00:22:55,720 --> 00:22:59,280 Speaker 1: service grew tremendously, as did the company, and the full 394 00:22:59,320 --> 00:23:01,400 Speaker 1: story of Lycos is one I'll have to cover 395 00:23:01,440 --> 00:23:04,480 Speaker 1: in another episode because it gets pretty bonkers. But for 396 00:23:04,520 --> 00:23:06,320 Speaker 1: this episode, it's just important to note that it was 397 00:23:06,359 --> 00:23:11,119 Speaker 1: another early search service that grew and became diversified and 398 00:23:11,200 --> 00:23:15,479 Speaker 1: tried to do lots of other stuff. Steve Kirsch 399 00:23:15,880 --> 00:23:18,880 Speaker 1: would be the guy behind Infoseek. That one originally 400 00:23:18,960 --> 00:23:22,520 Speaker 1: launched as a pay for use service, so its 401 00:23:22,560 --> 00:23:27,160 Speaker 1: original revenue model wasn't advertising. Instead, you would pay 402 00:23:27,240 --> 00:23:30,960 Speaker 1: to use it.
Now that only lasted about half a year, 403 00:23:30,960 --> 00:23:32,640 Speaker 1: a little more than half a year, before Kirsch 404 00:23:32,760 --> 00:23:36,240 Speaker 1: dropped the fee and it became free to use, and 405 00:23:36,240 --> 00:23:40,200 Speaker 1: by February the service became known as Infoseek Search, 406 00:23:41,119 --> 00:23:46,000 Speaker 1: and also Netscape and Infoseek negotiated a deal in which 407 00:23:46,040 --> 00:23:49,200 Speaker 1: Infoseek would become the default search engine in Netscape's 408 00:23:49,400 --> 00:23:53,960 Speaker 1: web browser, so that really helped Infoseek's penetration quite 409 00:23:53,960 --> 00:23:57,360 Speaker 1: a bit in those days. Now, one thing Infoseek incorporated 410 00:23:57,440 --> 00:24:00,159 Speaker 1: in its service after a couple of years is the 411 00:24:00,200 --> 00:24:03,440 Speaker 1: option to use Boolean operators. Now, these are a collection 412 00:24:03,480 --> 00:24:06,440 Speaker 1: of simple words that can help you narrow down searches. 413 00:24:07,040 --> 00:24:11,800 Speaker 1: The words include AND, OR, and NOT, so with an 414 00:24:11,840 --> 00:24:16,560 Speaker 1: AND operator you are narrowing your focus. So if you 415 00:24:16,600 --> 00:24:22,320 Speaker 1: search for the terms Superman AND movies, the results you 416 00:24:22,359 --> 00:24:25,040 Speaker 1: get should be relevant to both of those terms. You 417 00:24:25,080 --> 00:24:29,880 Speaker 1: should only get results that include information about Superman and movies. 418 00:24:30,760 --> 00:24:33,960 Speaker 1: If you're looking for a specific Superman movie, hopefully those would 419 00:24:34,040 --> 00:24:37,480 Speaker 1: be right in that list.
Some of them should have 420 00:24:37,560 --> 00:24:39,800 Speaker 1: the information you're looking for, and maybe you still 421 00:24:39,800 --> 00:24:41,760 Speaker 1: have to do some digging to find them, because you're 422 00:24:41,760 --> 00:24:44,600 Speaker 1: going to get all the web pages that have both 423 00:24:44,800 --> 00:24:48,639 Speaker 1: Superman and movies inside of them. Now you could make 424 00:24:48,640 --> 00:24:53,200 Speaker 1: it more specific. You could say Superman AND movies AND 425 00:24:53,520 --> 00:24:57,400 Speaker 1: Christopher Reeve. That would end up narrowing the results 426 00:24:57,800 --> 00:25:00,960 Speaker 1: to only those pages that have all three of 427 00:25:01,000 --> 00:25:05,040 Speaker 1: those terms inside of them. The Boolean operator OR does 428 00:25:05,080 --> 00:25:07,919 Speaker 1: the opposite. It broadens your search. Maybe you want to 429 00:25:07,920 --> 00:25:12,600 Speaker 1: search for Batman OR Superman. Then you should get results 430 00:25:12,640 --> 00:25:17,280 Speaker 1: that have either or both Superman or Batman in them. 431 00:25:17,320 --> 00:25:19,639 Speaker 1: So you would get all the Superman results, all the 432 00:25:19,680 --> 00:25:22,359 Speaker 1: Batman results. You'd also get all the Superman and 433 00:25:22,400 --> 00:25:27,240 Speaker 1: Batman results, so you're increasing the number that you receive. 434 00:25:27,800 --> 00:25:31,679 Speaker 1: The NOT Boolean operator helps you eliminate options from search.
435 00:25:31,800 --> 00:25:35,840 Speaker 1: So if you searched comic books NOT Superman, you should 436 00:25:35,840 --> 00:25:39,240 Speaker 1: get results about comic books that don't mention or include 437 00:25:39,440 --> 00:25:43,000 Speaker 1: Superman in the web pages, so it should be discussions 438 00:25:43,000 --> 00:25:45,960 Speaker 1: about comic books, but they're not Superman comic books, or 439 00:25:45,960 --> 00:25:50,240 Speaker 1: at least Superman's name isn't appearing in the web page. Now, 440 00:25:50,240 --> 00:25:52,679 Speaker 1: Boolean search is still a great tool to help you 441 00:25:52,760 --> 00:25:55,199 Speaker 1: get the results you want, but as time has gone on, 442 00:25:55,280 --> 00:25:59,120 Speaker 1: search has become much more sophisticated, so it's not really 443 00:25:59,160 --> 00:26:02,560 Speaker 1: as necessary to become familiar with Boolean search. It's good 444 00:26:02,600 --> 00:26:05,560 Speaker 1: to know how to use it, but it's not key, 445 00:26:05,600 --> 00:26:10,240 Speaker 1: because search has not only grown more sophisticated, it's also growing 446 00:26:10,320 --> 00:26:14,320 Speaker 1: more intrusive. A lot of searches today rely on information 447 00:26:14,600 --> 00:26:18,520 Speaker 1: that various browsers and web pages are gathering about you, 448 00:26:19,040 --> 00:26:23,000 Speaker 1: so they're using your past behavior as a predictive tool 449 00:26:23,320 --> 00:26:26,879 Speaker 1: to help serve up relevant results. But that's a topic 450 00:26:26,920 --> 00:26:30,879 Speaker 1: for a different podcast episode. I'll do another podcast episode 451 00:26:30,880 --> 00:26:35,120 Speaker 1: about this at some point.
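To make the Boolean operators described above a bit more concrete, here is a minimal sketch in Python. It is not how Infoseek or any real engine implemented them; it simply assumes a tiny made-up set of pages (`PAGES`) and hypothetical helper names (`build_index`, `search_and`, `search_or`, `search_not`), and treats each operator as a set operation on a word-to-pages index.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search_and(index, *terms):
    """AND narrows: keep only pages that contain every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """OR broadens: keep pages that contain at least one term."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))


def search_not(index, include, exclude):
    """NOT eliminates: pages with `include` that never mention `exclude`."""
    return index.get(include.lower(), set()) - index.get(exclude.lower(), set())


# Hypothetical pages, invented purely for illustration.
PAGES = {
    "p1": "superman movies starring christopher reeve",
    "p2": "batman movies and comics",
    "p3": "superman comics from the golden age",
    "p4": "comics about batman and superman",
    "p5": "comics for new readers",
}
INDEX = build_index(PAGES)

print(sorted(search_and(INDEX, "superman", "movies")))
print(sorted(search_or(INDEX, "batman", "superman")))
print(sorted(search_not(INDEX, "comics", "superman")))
```

In this toy data, the AND query matches only the one page that has both words, the OR query matches every page that mentions either hero, and the NOT query keeps the comics pages that never mention Superman.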
Infoseek had a search 452 00:26:35,160 --> 00:26:38,200 Speaker 1: tool that allowed users to include different modifiers on search 453 00:26:38,240 --> 00:26:41,800 Speaker 1: results to narrow down the returned sites, which was becoming 454 00:26:41,800 --> 00:26:45,000 Speaker 1: important because the web was growing enormously in the mid 455 00:26:45,040 --> 00:26:47,560 Speaker 1: to late nineties and would only continue to do so. 456 00:26:48,160 --> 00:26:51,240 Speaker 1: The Walt Disney Company took notice of Infoseek and would 457 00:26:51,240 --> 00:26:55,680 Speaker 1: purchase more than of the company, effectively incorporating the business 458 00:26:55,680 --> 00:26:59,359 Speaker 1: into the media empire ruled by the hand of the mouse. 459 00:27:00,280 --> 00:27:03,560 Speaker 1: Infoseek at that point had made several acquisitions of its own, 460 00:27:03,560 --> 00:27:07,359 Speaker 1: including sites like ESPN dot com and ABC news dot com, 461 00:27:07,400 --> 00:27:11,440 Speaker 1: which then became part of Disney's media empire, and Infoseek 462 00:27:11,440 --> 00:27:14,879 Speaker 1: would get rolled into Disney's Go dot com network of 463 00:27:14,920 --> 00:27:19,679 Speaker 1: services and sites, and eventually, after several years, it 464 00:27:19,680 --> 00:27:24,440 Speaker 1: would disappear into that network of sites. Infoseek would begin 465 00:27:24,480 --> 00:27:28,960 Speaker 1: to offer up manually curated search results along with automated ones. Again, 466 00:27:29,040 --> 00:27:31,320 Speaker 1: this was an effort to return the most relevant results.
467 00:27:31,720 --> 00:27:33,840 Speaker 1: You'll see if you look at the history of search 468 00:27:33,880 --> 00:27:36,479 Speaker 1: engines that a lot of them kind of experimented with 469 00:27:36,640 --> 00:27:40,879 Speaker 1: this human curated approach, because a real issue 470 00:27:41,000 --> 00:27:43,960 Speaker 1: was that you would use these search engines and you 471 00:27:43,960 --> 00:27:46,200 Speaker 1: would get a ton of results and only a few 472 00:27:46,240 --> 00:27:48,639 Speaker 1: of them ever seemed to be even remotely connected to 473 00:27:48,680 --> 00:27:52,680 Speaker 1: what you wanted. So putting humans in charge of that 474 00:27:52,920 --> 00:27:57,600 Speaker 1: made it a little easier to deliver relevant results, because 475 00:27:57,760 --> 00:28:01,679 Speaker 1: humans understand context. They understand when a site is actually 476 00:28:02,119 --> 00:28:06,400 Speaker 1: about something versus when a site just mentions something offhand, 477 00:28:06,520 --> 00:28:10,480 Speaker 1: but it's not really about that thing. You even saw 478 00:28:10,520 --> 00:28:15,800 Speaker 1: this relatively recently. I remember, shortly after I started at 479 00:28:15,840 --> 00:28:20,480 Speaker 1: HowStuffWorks, how the service Mahalo was kind of struggling, 480 00:28:21,080 --> 00:28:23,840 Speaker 1: but it was also a human curated search engine. And 481 00:28:24,000 --> 00:28:26,800 Speaker 1: we're talking like two thousand seven when I was looking 482 00:28:26,840 --> 00:28:29,960 Speaker 1: into that. My friend Veronica Belmont used to work 483 00:28:30,000 --> 00:28:33,320 Speaker 1: for that company, so it was still something that people 484 00:28:33,320 --> 00:28:36,720 Speaker 1: were trying even as late as the late two thousands era, 485 00:28:37,000 --> 00:28:39,719 Speaker 1: and by late two thousands, I mean the first decade 486 00:28:39,720 --> 00:28:43,760 Speaker 1: of the two thousands.
Anyway, Infoseek tried that out. 487 00:28:43,920 --> 00:28:47,280 Speaker 1: And also one of the engineers from Infoseek, Li Yanhong, 488 00:28:47,960 --> 00:28:50,920 Speaker 1: relocated to China and became a co-founder of a 489 00:28:50,960 --> 00:28:54,160 Speaker 1: different search engine company called Baidu, B A I 490 00:28:54,360 --> 00:28:59,480 Speaker 1: D U, a company that has become truly enormous, with 491 00:28:59,560 --> 00:29:04,840 Speaker 1: assets approaching a value of three hundred billion dollars. That's 492 00:29:04,840 --> 00:29:10,000 Speaker 1: actually more than what Google's parent company, Alphabet, has at 493 00:29:10,080 --> 00:29:13,200 Speaker 1: its disposal. So you could argue Baidu won the 494 00:29:13,240 --> 00:29:16,800 Speaker 1: search wars, but then Baidu is not widely known 495 00:29:16,840 --> 00:29:21,320 Speaker 1: in the West. It's a very huge company over in Asia, 496 00:29:21,440 --> 00:29:25,200 Speaker 1: but not as well known here. Back to our 497 00:29:25,200 --> 00:29:30,440 Speaker 1: search engine history. Excite, the company I talked about earlier, 498 00:29:30,440 --> 00:29:33,200 Speaker 1: finally debuted and it did well. In fact, it did 499 00:29:33,280 --> 00:29:36,200 Speaker 1: so well that it would end up purchasing WebCrawler 500 00:29:36,400 --> 00:29:41,280 Speaker 1: in But by nine its numbers were starting to decline 501 00:29:41,280 --> 00:29:45,120 Speaker 1: thanks to you know who. That rhymes with Shmoogle. And 502 00:29:45,160 --> 00:29:48,120 Speaker 1: it merged with a company called at home dot com, 503 00:29:48,440 --> 00:29:51,320 Speaker 1: the at symbol home dot com. It was a deal 504 00:29:51,360 --> 00:29:56,280 Speaker 1: that was worth nearly seven billion dollars, but that deal 505 00:29:56,360 --> 00:29:59,960 Speaker 1: did not ultimately work out.
The merged company would file 506 00:30:00,160 --> 00:30:02,960 Speaker 1: for bankruptcy in two thousand one, one of the many 507 00:30:03,240 --> 00:30:07,400 Speaker 1: victims of the dot com bubble bursting. That was 508 00:30:07,440 --> 00:30:10,680 Speaker 1: at least one of the big contributing factors to that. 509 00:30:10,880 --> 00:30:13,120 Speaker 1: The company also just had a lot of debt even 510 00:30:13,120 --> 00:30:16,760 Speaker 1: heading into two thousand and two thousand one, so that was 511 00:30:16,840 --> 00:30:20,760 Speaker 1: kind of the nail in the coffin. Now, Infospace, which 512 00:30:20,880 --> 00:30:24,400 Speaker 1: once upon a time owned what would become Stuff Media, 513 00:30:24,960 --> 00:30:29,560 Speaker 1: so technically I was an Infospace employee for a short while, 514 00:30:29,960 --> 00:30:35,600 Speaker 1: purchased Excite's assets and domain names, and so WebCrawler 515 00:30:35,920 --> 00:30:42,920 Speaker 1: and Excite all became wrapped up with Infospace's offerings, 516 00:30:43,120 --> 00:30:47,360 Speaker 1: and, yeah, technically, it's still part 517 00:30:47,800 --> 00:30:50,120 Speaker 1: of that. You can still use some of that, although 518 00:30:50,640 --> 00:30:52,960 Speaker 1: it's a much different tool than what it used 519 00:30:53,000 --> 00:30:57,840 Speaker 1: to be. Also in AltaVista emerged from the Western 520 00:30:57,920 --> 00:31:01,040 Speaker 1: Research Laboratory at the Digital Equipment Corporation, or 521 00:31:01,080 --> 00:31:05,360 Speaker 1: DEC. AltaVista allowed for natural language queries, meaning 522 00:31:05,360 --> 00:31:07,400 Speaker 1: you could type in a query similar to how you 523 00:31:07,480 --> 00:31:10,280 Speaker 1: would ask a person to look for something for you. 524 00:31:10,520 --> 00:31:12,720 Speaker 1: You didn't have to focus on asking in a way 525 00:31:12,760 --> 00:31:15,640 Speaker 1: that would only make sense to a machine.
This is 526 00:31:15,760 --> 00:31:18,680 Speaker 1: that barrier of entry we often see with technology, where 527 00:31:19,120 --> 00:31:23,400 Speaker 1: we have to adjust our behavior so that whatever technology 528 00:31:23,440 --> 00:31:27,240 Speaker 1: we're working with understands, quote unquote, what we want from it. 529 00:31:27,760 --> 00:31:30,920 Speaker 1: AltaVista was trying to reverse that, to make 530 00:31:31,280 --> 00:31:35,280 Speaker 1: the technology attempt to understand what we want, rather than 531 00:31:35,400 --> 00:31:38,840 Speaker 1: making us work so that the technology can understand us. 532 00:31:39,320 --> 00:31:41,480 Speaker 1: The researchers who designed it had to do a full 533 00:31:41,520 --> 00:31:47,200 Speaker 1: scale web crawl in August, and indexed ten million pages in 534 00:31:47,240 --> 00:31:50,640 Speaker 1: that web crawl, and this was compelling enough to launch 535 00:31:50,720 --> 00:31:54,920 Speaker 1: as a spinoff company by AltaVista was powering search 536 00:31:54,960 --> 00:31:57,959 Speaker 1: results for Yahoo. So, like I mentioned earlier, where Yahoo 537 00:31:58,000 --> 00:32:01,520 Speaker 1: would use other companies to run their web search, 538 00:32:01,560 --> 00:32:04,720 Speaker 1: AltaVista was one of those, but also at that time, 539 00:32:04,800 --> 00:32:09,200 Speaker 1: Compaq would acquire DEC, which in turn owned AltaVista, 540 00:32:09,720 --> 00:32:12,680 Speaker 1: and Compaq turned AltaVista into more of a portal service 541 00:32:13,160 --> 00:32:16,400 Speaker 1: than a search engine, a true search engine, which 542 00:32:16,400 --> 00:32:20,000 Speaker 1: put it more in direct competition with Yahoo, and AltaVista's 543 00:32:20,040 --> 00:32:23,360 Speaker 1: numbers went into decline, possibly because of that shift to 544 00:32:23,440 --> 00:32:27,520 Speaker 1: a portal service rather than a more straightforward search tool.
545 00:32:28,040 --> 00:32:31,640 Speaker 1: Now we're not quite done covering all the major players 546 00:32:31,640 --> 00:32:34,320 Speaker 1: in the space before Google came on board. I'm going 547 00:32:34,360 --> 00:32:37,320 Speaker 1: to cover a couple more right after we take this 548 00:32:37,400 --> 00:32:48,560 Speaker 1: quick break. Okay, so in addition to the services I've 549 00:32:48,600 --> 00:32:51,200 Speaker 1: already mentioned, there were a couple more. There was 550 00:32:51,320 --> 00:32:54,680 Speaker 1: Inktomi, a project that was headed by Eric 551 00:32:54,680 --> 00:32:59,520 Speaker 1: Brewer and Paul Gauthier. They founded Inktomi in the 552 00:32:59,520 --> 00:33:02,600 Speaker 1: two of them had been working on a parallel processing computing 553 00:33:02,640 --> 00:33:06,240 Speaker 1: project for DARPA when they came up with this approach 554 00:33:06,320 --> 00:33:10,120 Speaker 1: to search, and rather than launching a dedicated search tool 555 00:33:10,200 --> 00:33:13,440 Speaker 1: of their own, they said, oh, well, we'll offer to 556 00:33:13,600 --> 00:33:18,440 Speaker 1: use our technology to power other people's search engines. So essentially, 557 00:33:18,760 --> 00:33:21,840 Speaker 1: you put up the front end and we'll power the 558 00:33:21,880 --> 00:33:26,280 Speaker 1: back end. And one of those was run by a 559 00:33:26,320 --> 00:33:29,280 Speaker 1: company called HotWired, and they introduced a search tool 560 00:33:29,360 --> 00:33:33,240 Speaker 1: called HotBot. Inktomi worked largely as sort of 561 00:33:33,280 --> 00:33:36,560 Speaker 1: a business to business entity, growing far beyond a search 562 00:33:36,600 --> 00:33:39,560 Speaker 1: engine company. But the dot com crash of two thousand 563 00:33:39,640 --> 00:33:42,480 Speaker 1: one also hit Inktomi really hard, and a couple 564 00:33:42,520 --> 00:33:46,240 Speaker 1: of years later it was swept up by Yahoo.
So 565 00:33:46,320 --> 00:33:48,000 Speaker 1: you see a lot of these companies end 566 00:33:48,080 --> 00:33:50,280 Speaker 1: up kind of getting gulped up by each other. Now, 567 00:33:50,280 --> 00:33:53,160 Speaker 1: the last of our pre Google search engines that I'm 568 00:33:53,200 --> 00:33:56,840 Speaker 1: going to talk about is Ask Jeeves. Later on, it 569 00:33:56,960 --> 00:34:00,000 Speaker 1: was just known as Ask. It launched in nineteen 570 00:34:00,120 --> 00:34:03,840 Speaker 1: ninety-seven, having been developed by David Warthen and Garrett Gruener, 571 00:34:04,280 --> 00:34:06,400 Speaker 1: and like some of the other services I mentioned in 572 00:34:06,440 --> 00:34:10,200 Speaker 1: this episode, it would present curated lists that were created 573 00:34:10,200 --> 00:34:14,360 Speaker 1: by sort of an editorial board, along with some paid listings. 574 00:34:14,440 --> 00:34:17,840 Speaker 1: So if you're a company that wanted your website to 575 00:34:17,880 --> 00:34:24,920 Speaker 1: be listed alongside quote unquote legitimate search returns, you 576 00:34:25,000 --> 00:34:29,000 Speaker 1: could pony up the cash and have your website put on 577 00:34:29,040 --> 00:34:33,120 Speaker 1: that list. That still happens today on search engines. It happens 578 00:34:33,160 --> 00:34:36,080 Speaker 1: today on Google, where you'll see the first couple of 579 00:34:36,120 --> 00:34:38,960 Speaker 1: results tend to be ones that say, you know, ad 580 00:34:39,480 --> 00:34:41,520 Speaker 1: at the end of them. Google has to label them 581 00:34:41,560 --> 00:34:45,560 Speaker 1: as ads, not as just natural search results based on 582 00:34:45,600 --> 00:34:49,279 Speaker 1: your query. Though sometimes those ads actually are the things 583 00:34:49,320 --> 00:34:51,640 Speaker 1: you're looking for, so it's not always a bad thing, 584 00:34:51,880 --> 00:34:55,080 Speaker 1: but it is good to just pay attention.
So eventually 585 00:34:55,440 --> 00:34:58,399 Speaker 1: Ask would develop its own search engine technology that would 586 00:34:58,400 --> 00:35:03,440 Speaker 1: automate things. They stopped relying exclusively on people curating lists, 587 00:35:03,840 --> 00:35:07,399 Speaker 1: and Ask would go on to acquire Excite, so you saw, 588 00:35:07,400 --> 00:35:10,080 Speaker 1: you know, Excite bought WebCrawler, and Ask would later on 589 00:35:10,200 --> 00:35:12,640 Speaker 1: buy Excite. So you see, there's a lot of shuffling 590 00:35:12,920 --> 00:35:18,040 Speaker 1: with these companies. And then came Google, which had started 591 00:35:18,080 --> 00:35:21,920 Speaker 1: as a research project at Stanford. Larry Page and Sergey 592 00:35:21,920 --> 00:35:24,600 Speaker 1: Brin had developed the tool and they were running it 593 00:35:24,680 --> 00:35:26,960 Speaker 1: out of a garage for a little while. They had 594 00:35:26,960 --> 00:35:30,680 Speaker 1: built a search tool they originally called BackRub, and their 595 00:35:30,680 --> 00:35:32,680 Speaker 1: goal was to create a search engine that could index 596 00:35:32,719 --> 00:35:35,000 Speaker 1: the web and then present the most relevant results to 597 00:35:35,200 --> 00:35:38,680 Speaker 1: any query. But how would you do that? Well, the 598 00:35:38,760 --> 00:35:42,480 Speaker 1: actual answer, if we're being totally transparent, is kind of 599 00:35:42,480 --> 00:35:46,640 Speaker 1: like Coke's secret formula, in that we know in general 600 00:35:46,920 --> 00:35:49,080 Speaker 1: what has to go into it, but we don't know 601 00:35:49,160 --> 00:35:52,440 Speaker 1: the specifics that would allow us to replicate the results precisely. 602 00:35:52,840 --> 00:35:58,000 Speaker 1: The algorithm that Google uses is peculiar to Google, and 603 00:35:58,080 --> 00:36:01,520 Speaker 1: they also change it a lot.
They tweak it, so 604 00:36:01,800 --> 00:36:03,880 Speaker 1: even if we did learn how it used to work, 605 00:36:04,040 --> 00:36:07,560 Speaker 1: it doesn't work that way anymore. So Brin and Page would 606 00:36:07,600 --> 00:36:11,000 Speaker 1: refer to this process as PageRank. And here's how 607 00:36:11,040 --> 00:36:15,240 Speaker 1: it worked from a theoretical standpoint. So first, you index 608 00:36:15,280 --> 00:36:18,680 Speaker 1: the web. So you need to get kind of 609 00:36:18,719 --> 00:36:24,279 Speaker 1: a complete look at all the websites that 610 00:36:24,360 --> 00:36:27,080 Speaker 1: are available out there on the web, an inventory, if 611 00:36:27,120 --> 00:36:29,719 Speaker 1: you will, of all the web. To do this, you 612 00:36:29,760 --> 00:36:32,480 Speaker 1: send out bots to index all the pages that are 613 00:36:32,480 --> 00:36:35,120 Speaker 1: listed on the web that you can find. Now, you 614 00:36:35,160 --> 00:36:39,520 Speaker 1: can actually, in the HTML of a web page, 615 00:36:39,560 --> 00:36:42,600 Speaker 1: designate it so that it will instruct bots to 616 00:36:42,680 --> 00:36:46,359 Speaker 1: ignore the page and not index it. So you can 617 00:36:46,400 --> 00:36:48,759 Speaker 1: do that and it won't show up on any search 618 00:36:48,840 --> 00:36:51,440 Speaker 1: result page because the bot will see that message and 619 00:36:51,480 --> 00:36:54,480 Speaker 1: will just move on. This is useful if you want 620 00:36:54,480 --> 00:36:56,839 Speaker 1: a page that only people who know about it can 621 00:36:56,920 --> 00:36:59,840 Speaker 1: navigate to, and you don't want folks just stumbling 622 00:37:00,000 --> 00:37:02,799 Speaker 1: on it through search, So that is an option. So 623 00:37:02,920 --> 00:37:05,960 Speaker 1: for all of the pages that are discoverable.
The bots 624 00:37:05,960 --> 00:37:08,879 Speaker 1: will crawl through, they follow all the links, they try 625 00:37:08,920 --> 00:37:11,279 Speaker 1: and index out the web and get as good 626 00:37:11,280 --> 00:37:13,839 Speaker 1: a snapshot of what the World Wide Web is as 627 00:37:13,960 --> 00:37:17,319 Speaker 1: is possible. Now, these bots aren't just looking for the 628 00:37:17,360 --> 00:37:20,200 Speaker 1: location of the web pages, like what server those web 629 00:37:20,200 --> 00:37:23,800 Speaker 1: pages are stored on, or even just getting a full 630 00:37:23,960 --> 00:37:27,600 Speaker 1: understanding of what the text is inside those pages, so 631 00:37:27,640 --> 00:37:30,000 Speaker 1: that when you do a search query and you put 632 00:37:30,000 --> 00:37:33,120 Speaker 1: your search terms in, they can return the pages that 633 00:37:33,200 --> 00:37:36,520 Speaker 1: have those search terms. They're also looking for links, both 634 00:37:36,560 --> 00:37:39,120 Speaker 1: going into the page and coming from the page to 635 00:37:39,160 --> 00:37:42,080 Speaker 1: go elsewhere, and the links will become a really important 636 00:37:42,120 --> 00:37:45,200 Speaker 1: part of PageRank. So here's the basic idea. Brin 637 00:37:45,280 --> 00:37:48,160 Speaker 1: and Page figured out that if a web page about 638 00:37:48,200 --> 00:37:52,319 Speaker 1: a given subject is really good, other pages tend to 639 00:37:52,400 --> 00:37:57,000 Speaker 1: link to it. They do so because they recognize the quality, 640 00:37:57,440 --> 00:38:02,040 Speaker 1: and that helps boost the page's position in search results. 641 00:38:02,040 --> 00:38:04,440 Speaker 1: So let's use an example to kind of understand this.
642 00:38:05,000 --> 00:38:08,560 Speaker 1: Let's say you are one of these early web developers 643 00:38:08,640 --> 00:38:11,720 Speaker 1: in the late nineteen nineties, and you're also a big 644 00:38:11,800 --> 00:38:14,799 Speaker 1: music fan, so you decide to create a blog that's 645 00:38:14,840 --> 00:38:18,000 Speaker 1: completely focused on the music industry, and you cover the 646 00:38:18,040 --> 00:38:20,840 Speaker 1: news in the industry. You post reviews of albums that 647 00:38:20,880 --> 00:38:23,520 Speaker 1: you've listened to. Maybe you even do some interviews with 648 00:38:23,560 --> 00:38:26,000 Speaker 1: people who are in the industry. And as you write 649 00:38:26,000 --> 00:38:29,640 Speaker 1: this blog, other people take notice. Some of them also 650 00:38:29,680 --> 00:38:32,239 Speaker 1: have a web presence and cover the industry, and they 651 00:38:32,280 --> 00:38:35,080 Speaker 1: really dig your stuff, so they link to your page. 652 00:38:35,120 --> 00:38:38,080 Speaker 1: They say, there's a really cool music industry blog. It's 653 00:38:38,080 --> 00:38:40,799 Speaker 1: being run by this person over here. Follow this link 654 00:38:40,800 --> 00:38:44,239 Speaker 1: to go check it out. Google's bots would register that. 655 00:38:44,320 --> 00:38:47,120 Speaker 1: They would see that those links were out there pointing 656 00:38:47,160 --> 00:38:49,879 Speaker 1: to your page, and the more sites that link back 657 00:38:49,920 --> 00:38:52,400 Speaker 1: to your blog, the higher your blog would rank in 658 00:38:52,440 --> 00:38:56,840 Speaker 1: search results. So if someone searched music industry news or 659 00:38:56,880 --> 00:38:59,640 Speaker 1: something along those lines, there's a chance that your blog 660 00:38:59,680 --> 00:39:02,560 Speaker 1: would pop up fairly high in results.
Now how high 661 00:39:02,640 --> 00:39:07,000 Speaker 1: would be dependent on something other than just how many 662 00:39:07,160 --> 00:39:10,640 Speaker 1: pages are linking to you. That's one factor that matters 663 00:39:10,680 --> 00:39:12,880 Speaker 1: a lot, the number of sites linking to your page. 664 00:39:13,120 --> 00:39:17,160 Speaker 1: But the other one is how trustworthy those linking sites were. 665 00:39:17,719 --> 00:39:22,000 Speaker 1: So let's consider two scenarios. In our first scenario, you've 666 00:39:22,040 --> 00:39:24,680 Speaker 1: got your music blog and you've got a lot of 667 00:39:24,719 --> 00:39:27,279 Speaker 1: sites that are linking to your page, but they're all 668 00:39:27,680 --> 00:39:30,960 Speaker 1: small time sites. Like, some are personal sites, run by 669 00:39:31,000 --> 00:39:33,239 Speaker 1: people who are interested in music, but they don't really 670 00:39:33,239 --> 00:39:36,520 Speaker 1: have any presence in the industry and no one's really 671 00:39:36,600 --> 00:39:40,960 Speaker 1: linking to their page, so they're not ranked super high 672 00:39:41,040 --> 00:39:44,279 Speaker 1: in Google's estimation. Some of them might be even worse 673 00:39:44,280 --> 00:39:47,120 Speaker 1: than that. Some of them might be link farms. Link farms, 674 00:39:47,400 --> 00:39:49,840 Speaker 1: you don't really see them that much these days, but 675 00:39:50,200 --> 00:39:53,239 Speaker 1: in the nineties they were everywhere. They only existed to 676 00:39:53,360 --> 00:39:57,040 Speaker 1: link to other pages, and it was in an effort 677 00:39:57,040 --> 00:40:01,160 Speaker 1: to boost those other pages' rankings in search.
So if 678 00:40:01,239 --> 00:40:04,000 Speaker 1: you navigated to one, let's say you do a search 679 00:40:04,040 --> 00:40:06,920 Speaker 1: for a term and you click on the link, you 680 00:40:07,000 --> 00:40:12,400 Speaker 1: end up looking at a bunch of completely disconnected titles 681 00:40:12,400 --> 00:40:15,120 Speaker 1: and URLs and that's it. There's no 682 00:40:15,239 --> 00:40:18,399 Speaker 1: other content on the page. It's just a listing of 683 00:40:18,480 --> 00:40:22,440 Speaker 1: links to different sites with no rhyme or reason to them. 684 00:40:22,480 --> 00:40:26,840 Speaker 1: Those would also be very low in Google's trustworthiness according 685 00:40:26,840 --> 00:40:30,040 Speaker 1: to its algorithm, because obviously the only reason they exist 686 00:40:30,200 --> 00:40:33,440 Speaker 1: is to try and game the system, to try and say, well, 687 00:40:33,520 --> 00:40:37,160 Speaker 1: let's just add a lot more links to this page 688 00:40:37,600 --> 00:40:41,960 Speaker 1: and that will boost its relevance. So if that 689 00:40:42,040 --> 00:40:44,120 Speaker 1: were the case, if most of the links going to 690 00:40:44,200 --> 00:40:48,440 Speaker 1: your page were either from small potatoes websites or they 691 00:40:48,440 --> 00:40:51,680 Speaker 1: were from link farms, your PageRank wouldn't be boosted 692 00:40:51,800 --> 00:40:54,640 Speaker 1: very high. It might be higher than it would be 693 00:40:54,680 --> 00:40:56,640 Speaker 1: if there were no links going to your page at all, 694 00:40:56,719 --> 00:41:00,359 Speaker 1: but it's not a huge help. Now let's consider scenario two. 695 00:41:01,080 --> 00:41:03,720 Speaker 1: Let's say your blog only has a few sites linking 696 00:41:03,719 --> 00:41:06,240 Speaker 1: to it, a couple of dozen maybe. But those sites 697 00:41:06,239 --> 00:41:10,840 Speaker 1: are doozies.
Maybe they include record labels that are in 698 00:41:10,880 --> 00:41:14,520 Speaker 1: the music industry. Maybe it's other outlets that cover music news. 699 00:41:14,920 --> 00:41:18,080 Speaker 1: Maybe it includes some news websites that use your blog 700 00:41:18,120 --> 00:41:21,040 Speaker 1: as a source for stories. Now those sites have a 701 00:41:21,120 --> 00:41:25,120 Speaker 1: much higher level of trustworthiness for Google, and so, or 702 00:41:25,280 --> 00:41:27,839 Speaker 1: you know, in Google's estimation, I should say, and those 703 00:41:27,880 --> 00:41:31,160 Speaker 1: links matter more. So maybe in scenario one you have 704 00:41:31,200 --> 00:41:34,400 Speaker 1: a thousand tiny sites linking to you, and in scenario two 705 00:41:34,560 --> 00:41:36,080 Speaker 1: you just have a couple of dozen of the really 706 00:41:36,120 --> 00:41:39,600 Speaker 1: big sites linking to you. PageRank would favor scenario 707 00:41:39,760 --> 00:41:43,040 Speaker 1: two over scenario one, reasoning that if your blog is 708 00:41:43,080 --> 00:41:45,640 Speaker 1: good enough to get the attention and support of those 709 00:41:45,680 --> 00:41:49,480 Speaker 1: trusted entities, it must be a really good resource, and 710 00:41:49,520 --> 00:41:52,919 Speaker 1: so your site would get boosted in search results. Now 711 00:41:53,000 --> 00:41:56,480 Speaker 1: that helped address a troublesome trend with search. I mentioned 712 00:41:56,520 --> 00:41:59,840 Speaker 1: link farms. That was one problem. So any search engine 713 00:41:59,840 --> 00:42:04,399 Speaker 1: that looked at back linking, um, could be fooled through 714 00:42:04,480 --> 00:42:07,720 Speaker 1: link farms that were just there to boost that number. 715 00:42:08,680 --> 00:42:12,040 Speaker 1: In the nineties, it wasn't unusual to encounter that.
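The two scenarios above can be sketched with the simplified, textbook form of PageRank: every page splits its own score evenly among the pages it links to, and the scores are recomputed until they settle. This is a toy illustration, not Google's production algorithm. The damping value of 0.85, the iteration count, and the page names (`blog_many`, `blog_few`, `label`, and so on) are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated score passing.

    links maps each page name to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: blog_many has three links from sites nobody links to,
# while blog_few has a single link from a label site that six fans link to.
toy_web = {
    "tiny1": ["blog_many"],
    "tiny2": ["blog_many"],
    "tiny3": ["blog_many"],
    "label": ["blog_few"],
    "fan1": ["label"], "fan2": ["label"], "fan3": ["label"],
    "fan4": ["label"], "fan5": ["label"], "fan6": ["label"],
}
ranks = pagerank(toy_web)
```

In this toy web, `blog_few` ends up with a higher score than `blog_many` even though it has fewer inbound links, because its one link comes from a page that is itself well linked. The numbers are scaled way down from the thousand-versus-a-couple-dozen example, but the principle is the same.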
I 716 00:42:12,120 --> 00:42:13,759 Speaker 1: can't tell you how many times it happened to me 717 00:42:13,800 --> 00:42:15,520 Speaker 1: when I was doing a search for, you know, a 718 00:42:15,520 --> 00:42:19,200 Speaker 1: fairly obscure type of topic, and I just would come 719 00:42:19,200 --> 00:42:22,160 Speaker 1: across a link farm to all sorts of stuff, 720 00:42:22,239 --> 00:42:24,600 Speaker 1: most of which was totally not relevant to what 721 00:42:24,680 --> 00:42:29,760 Speaker 1: I wanted. Um, those were really frustrating, and so 722 00:42:29,760 --> 00:42:32,239 Speaker 1: that was one thing that people would do to try 723 00:42:32,239 --> 00:42:38,680 Speaker 1: and game the system. But another was an equally annoying tactic. Uh, 724 00:42:39,280 --> 00:42:42,160 Speaker 1: people wanted folks to come to their web pages really badly. 725 00:42:42,520 --> 00:42:45,000 Speaker 1: In the old, old days, there were even 726 00:42:45,160 --> 00:42:47,799 Speaker 1: web page counters. It looked like a little 727 00:42:47,800 --> 00:42:50,279 Speaker 1: digit counter that would tell you how many people had 728 00:42:50,320 --> 00:42:52,840 Speaker 1: been to that website, and it became kind of a 729 00:42:52,880 --> 00:42:56,520 Speaker 1: badge of honor among early web developers if that number 730 00:42:56,520 --> 00:42:59,400 Speaker 1: were particularly high, because it showed that a lot of 731 00:42:59,400 --> 00:43:02,239 Speaker 1: people were visiting your site, and it was kind of 732 00:43:02,239 --> 00:43:05,960 Speaker 1: a prestige thing, um, and also could mean money, because 733 00:43:05,960 --> 00:43:08,640 Speaker 1: if you were using web advertising to support your 734 00:43:08,680 --> 00:43:12,280 Speaker 1: web site and that number was getting really, really high, 735 00:43:12,520 --> 00:43:14,600 Speaker 1: then you had more page views, and more page 736 00:43:14,640 --> 00:43:17,799 Speaker 1: 
views would mean more cash from the advertisers. So there 737 00:43:17,920 --> 00:43:21,279 Speaker 1: was an actual, you know, financial reason to try and 738 00:43:21,320 --> 00:43:23,480 Speaker 1: get more people to come to your web page, and 739 00:43:23,520 --> 00:43:30,600 Speaker 1: not everybody played fair and square. Sometimes web developers would 740 00:43:30,600 --> 00:43:35,920 Speaker 1: include an incredibly long list of popular search terms on 741 00:43:36,000 --> 00:43:38,720 Speaker 1: the web page. Usually it would be at the very bottom 742 00:43:38,760 --> 00:43:42,600 Speaker 1: of the web page in tiny font, and so that's 743 00:43:42,600 --> 00:43:45,040 Speaker 1: the only place where your search terms would show up. 744 00:43:45,480 --> 00:43:47,080 Speaker 1: The rest of the web page would be about something 745 00:43:47,239 --> 00:43:50,040 Speaker 1: entirely different, and then you'd do a search on the 746 00:43:50,040 --> 00:43:52,279 Speaker 1: web page for the terms you were looking for.
It 747 00:43:52,320 --> 00:43:55,480 Speaker 1: turns out they're just in this list of random or 748 00:43:55,520 --> 00:43:59,200 Speaker 1: seemingly random search terms. It's really the most popular search 749 00:43:59,320 --> 00:44:01,719 Speaker 1: terms that people could come across, and they were just 750 00:44:02,880 --> 00:44:05,280 Speaker 1: dumping them all at the bottom of their web pages, 751 00:44:05,360 --> 00:44:07,560 Speaker 1: and that way their web page would pop up in 752 00:44:07,600 --> 00:44:10,640 Speaker 1: all these sorts of searches, and people would end up 753 00:44:10,680 --> 00:44:13,759 Speaker 1: going to their web page without knowing that it wasn't 754 00:44:13,840 --> 00:44:16,719 Speaker 1: really about what they were hoping for. That was really 755 00:44:16,760 --> 00:44:20,080 Speaker 1: frustrating for a lot of people, including myself, because, you know, 756 00:44:20,200 --> 00:44:22,600 Speaker 1: obviously you're searching for something because you want to 757 00:44:22,640 --> 00:44:25,040 Speaker 1: get that content, but then you end up going to 758 00:44:25,080 --> 00:44:27,200 Speaker 1: a web page that's not about that at all. It's 759 00:44:27,239 --> 00:44:29,760 Speaker 1: not a good experience, so it was a terrible way 760 00:44:29,960 --> 00:44:32,719 Speaker 1: to have people come to your web page. However, if 761 00:44:32,719 --> 00:44:35,400 Speaker 1: your goal was just to get those views so that 762 00:44:35,480 --> 00:44:38,719 Speaker 1: you could get that ad money, people were willing to 763 00:44:38,719 --> 00:44:43,319 Speaker 1: do it. Um, maybe it was a successful strategy for 764 00:44:43,360 --> 00:44:46,000 Speaker 1: people who were maybe running an online store, but I 765 00:44:46,000 --> 00:44:47,879 Speaker 1: can't imagine it would have worked too well.
I mean, 766 00:44:48,120 --> 00:44:51,840 Speaker 1: if I'm looking for information about quantum mechanics and I 767 00:44:51,960 --> 00:44:55,440 Speaker 1: end up being dumped in some store that's selling baseball 768 00:44:55,480 --> 00:44:58,440 Speaker 1: caps that have nothing to do with anything, I'm probably 769 00:44:58,480 --> 00:45:01,200 Speaker 1: just gonna be mad. But anyway, that was one of 770 00:45:01,239 --> 00:45:05,400 Speaker 1: the other approaches people were taking, trying to include 771 00:45:05,400 --> 00:45:07,520 Speaker 1: this text. Sometimes they would even hide it. They would 772 00:45:07,520 --> 00:45:11,080 Speaker 1: have a big section of the web page where the 773 00:45:11,239 --> 00:45:15,239 Speaker 1: font had the same color as the background, so 774 00:45:15,280 --> 00:45:18,200 Speaker 1: you couldn't see it just when you're reading through the 775 00:45:18,239 --> 00:45:21,319 Speaker 1: web page, but it could be read by bots as 776 00:45:21,360 --> 00:45:26,640 Speaker 1: they're crawling through all this material. Uh, so search engine 777 00:45:26,960 --> 00:45:31,640 Speaker 1: developers got into kind of a seesaw battle with web 778 00:45:31,640 --> 00:45:34,960 Speaker 1: developers to try and get around these tricks. One of 779 00:45:35,000 --> 00:45:37,840 Speaker 1: the things they started to do, and Google was one of 780 00:45:37,880 --> 00:45:41,400 Speaker 1: them, was focus on the text in the actual body 781 00:45:41,520 --> 00:45:45,000 Speaker 1: of the document itself and then ignore information that might 782 00:45:45,040 --> 00:45:47,399 Speaker 1: be in the headers or footers, which was typically where 783 00:45:47,400 --> 00:45:52,160 Speaker 1: people were putting these laundry lists of popular search terms.
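As a rough illustration of that countermeasure, here is a minimal sketch that collects only the words from the main body of a page and skips anything inside header or footer sections. It uses the modern `<header>` and `<footer>` HTML elements as a stand-in, which is an assumption: pages in the nineties marked up those regions differently, and how any real search engine decided what to skip isn't covered here. The names `BodyTextExtractor` and `indexable_words` are hypothetical.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Hypothetical helper: keeps page text except what sits in skipped sections."""

    # Assumption: these tags mark the regions an indexer chooses to ignore.
    SKIPPED = {"header", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped sections we are currently inside
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags contain no text, so they never open a skipped section.
        pass

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.words.extend(data.lower().split())


def indexable_words(html):
    """Return the lowercase words an indexer would keep for this page."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.words
```

With this approach, a laundry list of popular search terms stuffed into a footer never makes it into the index, so the page can only match queries on the words in its actual content. It does nothing about text hidden by color within the body, which would need a separate check.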
784 00:45:52,480 --> 00:45:54,640 Speaker 1: So Google got around that by saying, Okay, well, we're 785 00:45:54,680 --> 00:45:57,000 Speaker 1: no longer worried about the text that's in the header 786 00:45:57,120 --> 00:45:59,840 Speaker 1: or the footer. We're just concentrating on what's in the 787 00:45:59,840 --> 00:46:05,440 Speaker 1: body of the page. And Google's approach really improved upon relevance. 788 00:46:05,480 --> 00:46:08,759 Speaker 1: The search results were just better than most of the competitors. 789 00:46:08,920 --> 00:46:11,160 Speaker 1: You know, you were more likely to come across 790 00:46:11,239 --> 00:46:13,960 Speaker 1: the stuff that, you know, represented what you wanted, 791 00:46:14,480 --> 00:46:17,640 Speaker 1: and so Google was able to tap into advertising revenue 792 00:46:17,719 --> 00:46:21,520 Speaker 1: because they were able to really give people what they wanted. 793 00:46:22,400 --> 00:46:25,799 Speaker 1: Advertisers wanted to be included with that, and Google began 794 00:46:25,880 --> 00:46:29,280 Speaker 1: listing ad-supported results with the top returns for queries. 795 00:46:29,680 --> 00:46:32,800 Speaker 1: So it meant that, you know, with the stuff that people 796 00:46:32,840 --> 00:46:36,080 Speaker 1: most wanted to see, you would get ads served right 797 00:46:36,160 --> 00:46:40,279 Speaker 1: with that. Uh, that's a very attractive proposition, and it 798 00:46:40,320 --> 00:46:42,719 Speaker 1: positioned the company well enough to survive the dot com 799 00:46:42,719 --> 00:46:45,440 Speaker 1: bubble burst of two thousand and two thousand one, and 800 00:46:45,560 --> 00:46:48,200 Speaker 1: many of its competitors either merged with other companies, as 801 00:46:48,239 --> 00:46:52,080 Speaker 1: I mentioned, or they completely went under.
But Google remained 802 00:46:53,160 --> 00:46:56,440 Speaker 1: around and then was able to actually seriously grow in 803 00:46:56,480 --> 00:46:59,319 Speaker 1: the two thousands. Uh, there were a couple of discussions 804 00:46:59,320 --> 00:47:02,600 Speaker 1: with other companies early on, including Excite, that could 805 00:47:02,600 --> 00:47:05,040 Speaker 1: have led to Google getting acquired, but none of that 806 00:47:05,120 --> 00:47:08,160 Speaker 1: came to fruition, and Google remained its own company and 807 00:47:08,200 --> 00:47:13,280 Speaker 1: continued to build on its success. And Google would evolve 808 00:47:13,320 --> 00:47:16,160 Speaker 1: its algorithm, trying to crack the nut of deciphering the 809 00:47:16,239 --> 00:47:20,200 Speaker 1: meaning of text inside web pages. So not just here 810 00:47:20,200 --> 00:47:23,160 Speaker 1: are the web pages that include the terms that you 811 00:47:23,239 --> 00:47:26,080 Speaker 1: searched for, but here are the ones that include them in 812 00:47:26,120 --> 00:47:30,040 Speaker 1: the way that you meant, including improving it so 813 00:47:30,080 --> 00:47:34,400 Speaker 1: that it can recognize natural language and not just, you know, 814 00:47:34,640 --> 00:47:38,879 Speaker 1: lists of search terms. Pairing that with the PageRank 815 00:47:38,960 --> 00:47:41,520 Speaker 1: kind of approach would give Google the information it needed 816 00:47:41,520 --> 00:47:45,200 Speaker 1: to really rank its results, and necessitated the search engine 817 00:47:45,239 --> 00:47:49,880 Speaker 1: optimization strategy that became a whole new industry. Ranking 818 00:47:49,880 --> 00:47:52,759 Speaker 1: well in search was a really good way to get 819 00:47:52,800 --> 00:47:56,800 Speaker 1: serious Internet traffic to a site.
People made entire careers 820 00:47:56,800 --> 00:47:58,799 Speaker 1: out of figuring out the best way to rank well 821 00:47:58,840 --> 00:48:03,880 Speaker 1: in search, which honestly mostly involves creating a compelling and 822 00:48:03,920 --> 00:48:06,920 Speaker 1: relevant web page or website that makes people want to 823 00:48:07,000 --> 00:48:11,440 Speaker 1: link to it. Um, it's easier said than done, but 824 00:48:11,480 --> 00:48:15,560 Speaker 1: that was the best way to rank well within Google's search. Occasionally, 825 00:48:15,600 --> 00:48:18,680 Speaker 1: Google would tweak things so that your site, if it 826 00:48:18,719 --> 00:48:21,439 Speaker 1: was particularly good, would just rise to the top, because 827 00:48:21,480 --> 00:48:24,840 Speaker 1: Google recognized that, and they might weight your site more heavily 828 00:48:24,920 --> 00:48:28,120 Speaker 1: than other sites. Um, but it also led to companies 829 00:48:28,560 --> 00:48:32,759 Speaker 1: learning the hard lesson that depending upon search traffic is 830 00:48:32,960 --> 00:48:36,919 Speaker 1: a risky thing to do. Every time Google changes its 831 00:48:37,000 --> 00:48:41,360 Speaker 1: search algorithm, it affects search rankings. So you might be 832 00:48:41,440 --> 00:48:43,960 Speaker 1: doing really well for years, and then suddenly you see 833 00:48:43,960 --> 00:48:47,520 Speaker 1: a massive drop-off in visitor numbers because Google changed 834 00:48:47,560 --> 00:48:50,080 Speaker 1: its algorithm and your page no longer ranks as well 835 00:48:50,120 --> 00:48:52,480 Speaker 1: in search results as it used to.
So in a 836 00:48:52,560 --> 00:48:54,840 Speaker 1: future episode, I plan on getting some SEO 837 00:48:55,080 --> 00:48:58,759 Speaker 1: experts on the show and have them talk about the 838 00:48:58,840 --> 00:49:01,640 Speaker 1: challenges of developing a good strategy to rank well in 839 00:49:01,719 --> 00:49:05,719 Speaker 1: search and what other strategies people might consider if they 840 00:49:05,719 --> 00:49:10,200 Speaker 1: want to promote traffic to their sites and services. You know, 841 00:49:10,320 --> 00:49:14,120 Speaker 1: it's tricky stuff because again, it might work great 842 00:49:14,600 --> 00:49:17,200 Speaker 1: today and then tomorrow it might not work at all. 843 00:49:17,880 --> 00:49:22,160 Speaker 1: So there's a real strong push among web developers 844 00:49:22,160 --> 00:49:27,400 Speaker 1: to try and find alternatives to search engine traffic being 845 00:49:27,480 --> 00:49:31,840 Speaker 1: your main way of getting people into your website. Um, also, 846 00:49:32,280 --> 00:49:36,360 Speaker 1: if people are just searching for content and then popping 847 00:49:36,360 --> 00:49:40,040 Speaker 1: over to your site, uh, and they read one page 848 00:49:40,080 --> 00:49:43,279 Speaker 1: that is relevant to whatever their search engine query was, 849 00:49:43,719 --> 00:49:46,520 Speaker 1: they're not likely to stick around unless they go down 850 00:49:46,680 --> 00:49:50,480 Speaker 1: sort of the Wikipedia rabbit hole. They're more likely to bounce. 851 00:49:50,920 --> 00:49:52,719 Speaker 1: And this was a problem we saw at the How 852 00:49:52,719 --> 00:49:54,839 Speaker 1: Stuff Works website all the time, is that we could 853 00:49:54,840 --> 00:49:59,000 Speaker 1: get great search engine traffic.
People were looking for specific 854 00:49:59,040 --> 00:50:02,520 Speaker 1: answers to questions and we had articles that answered those questions, 855 00:50:02,560 --> 00:50:05,160 Speaker 1: so people would come and read those articles. Now, what 856 00:50:05,200 --> 00:50:07,400 Speaker 1: would be ideal for us is that people say, this 857 00:50:07,480 --> 00:50:09,640 Speaker 1: is a great site, I want to read more articles. 858 00:50:09,760 --> 00:50:12,879 Speaker 1: Let's just see what's here. But the reality was most 859 00:50:12,880 --> 00:50:16,600 Speaker 1: people would come in, read whatever they wanted and then leave. Um, 860 00:50:16,640 --> 00:50:19,120 Speaker 1: they wouldn't stick around to read other stuff. And 861 00:50:19,480 --> 00:50:21,520 Speaker 1: it was a real challenge. One of the things that 862 00:50:21,560 --> 00:50:24,000 Speaker 1: we always tried to do was figure out how to 863 00:50:24,040 --> 00:50:27,319 Speaker 1: create a site that was a destination all of its own. 864 00:50:27,680 --> 00:50:30,600 Speaker 1: That you're not going there because a search engine told 865 00:50:30,640 --> 00:50:33,040 Speaker 1: you to. You're going there because you love the site 866 00:50:33,080 --> 00:50:36,760 Speaker 1: and you want to read more of the stuff on there. Um, 867 00:50:36,800 --> 00:50:39,160 Speaker 1: that was always our goal. It was always very, very 868 00:50:39,280 --> 00:50:41,799 Speaker 1: challenging because there's a ton of websites out there, and 869 00:50:41,840 --> 00:50:45,520 Speaker 1: there's a ton of really great content, so making sure 870 00:50:45,560 --> 00:50:48,719 Speaker 1: that yours can stand up to everybody else's is a 871 00:50:48,719 --> 00:50:50,680 Speaker 1: heck of a challenge. It's a hard thing to do.
872 00:50:50,880 --> 00:50:53,359 Speaker 1: I think the site does a great job of it, um, 873 00:50:53,480 --> 00:50:55,000 Speaker 1: but it was one of those things that we were 874 00:50:55,040 --> 00:50:59,240 Speaker 1: always striving toward. In the end, Google won out because 875 00:50:59,440 --> 00:51:02,680 Speaker 1: it hadn't grown too large before the bubble burst, so 876 00:51:02,800 --> 00:51:05,640 Speaker 1: it hadn't spread its assets out too thin, it wasn't 877 00:51:05,680 --> 00:51:08,680 Speaker 1: in incredible amounts of debt, so it was able to 878 00:51:08,840 --> 00:51:10,960 Speaker 1: weather that storm, and then it was able to build 879 00:51:11,000 --> 00:51:14,400 Speaker 1: on its success, and it had developed a search engine 880 00:51:14,400 --> 00:51:16,680 Speaker 1: tool that people felt returned the best results and they 881 00:51:16,719 --> 00:51:19,959 Speaker 1: put a ton of trust in it. Ultimately, Google would 882 00:51:19,960 --> 00:51:23,120 Speaker 1: become this enormous company that would be able to gather 883 00:51:23,520 --> 00:51:26,960 Speaker 1: huge amounts of data from its users and put that 884 00:51:27,000 --> 00:51:28,919 Speaker 1: to use as well, and that made it a very 885 00:51:29,040 --> 00:51:34,160 Speaker 1: valuable resource for advertisers, and that's kind of how Google 886 00:51:34,239 --> 00:51:38,480 Speaker 1: won the search engine war. Now, we'll talk about other 887 00:51:38,520 --> 00:51:40,920 Speaker 1: stuff related to search engines in the future, but our next 888 00:51:40,960 --> 00:51:44,160 Speaker 1: episode is going to be about something totally different.
Um, 889 00:51:44,200 --> 00:51:46,520 Speaker 1: and I'm just doing a few one-off episodes because 890 00:51:46,520 --> 00:51:51,120 Speaker 1: after doing that arc of seven episodes about the media 891 00:51:51,239 --> 00:51:55,640 Speaker 1: and its relationship to us and technology, I felt 892 00:51:55,640 --> 00:51:58,440 Speaker 1: like we kind of needed to do some one-offs. 893 00:51:58,560 --> 00:52:02,600 Speaker 1: So the next one's gonna be another entertainment-related podcast, 894 00:52:02,680 --> 00:52:04,520 Speaker 1: but it will be another one-off. If you guys 895 00:52:04,560 --> 00:52:07,480 Speaker 1: have suggestions for future topics I should tackle, why not 896 00:52:07,719 --> 00:52:10,239 Speaker 1: send me an email? The address is tech stuff at how 897 00:52:10,280 --> 00:52:12,400 Speaker 1: stuff works dot com, or hop on over to our 898 00:52:12,440 --> 00:52:16,040 Speaker 1: website, that's tech stuff podcast dot com. You will find 899 00:52:16,040 --> 00:52:18,600 Speaker 1: the archive of all of our shows there. You'll find 900 00:52:18,640 --> 00:52:21,399 Speaker 1: links to our social media sites. You'll find a link 901 00:52:21,520 --> 00:52:24,120 Speaker 1: to our online store, where every purchase you make goes 902 00:52:24,160 --> 00:52:26,920 Speaker 1: to help the show, and we greatly appreciate it, and 903 00:52:26,960 --> 00:52:34,680 Speaker 1: I will talk to you again really soon. Tech Stuff 904 00:52:34,680 --> 00:52:37,040 Speaker 1: is a production of I Heart Radio's How Stuff Works. 905 00:52:37,200 --> 00:52:40,000 Speaker 1: For more podcasts from I Heart Radio, visit the I 906 00:52:40,120 --> 00:52:43,360 Speaker 1: Heart Radio app, Apple Podcasts, or wherever you listen to 907 00:52:43,400 --> 00:52:44,320 Speaker 1: your favorite shows.