WEBVTT - The Common Crawl

0:00:00.160 --> 0:00:07.080
<v Speaker 1>Brought to you by Toyota. Let's go places. Welcome to

0:00:07.280 --> 0:00:13.360
<v Speaker 1>Forward Thinking. Hey there, and welcome to Forward Thinking, the

0:00:13.520 --> 0:00:16.639
<v Speaker 1>podcast that looks at the future and says, Kitty McGee's

0:00:16.680 --> 0:00:21.160
<v Speaker 1>in Dublin Town upon the Crawl. I'm Jonathan Strickland and

0:00:21.280 --> 0:00:25.360
<v Speaker 1>I'm Joe McCormick, and today we're gonna be talking about

0:00:25.400 --> 0:00:28.159
<v Speaker 1>the crawl. We are talking about the crawl, not a

0:00:28.200 --> 0:00:30.760
<v Speaker 1>pub crawl. No, sadly, not a pub crawl, which is

0:00:31.240 --> 0:00:33.280
<v Speaker 1>what I was referring to in the lyric. But that's

0:00:33.320 --> 0:00:35.760
<v Speaker 1>not what we're talking about today. The crawl. What is that?

0:00:35.960 --> 0:00:39.360
<v Speaker 1>Is that the name of a movie that was like

0:00:39.080 --> 0:00:41.320
<v Speaker 1>a like a fantasy movie from the eighties, or it

0:00:41.320 --> 0:00:45.600
<v Speaker 1>sounds like a... I'm thinking Krull. Oh yeah, that's a

0:00:45.640 --> 0:00:50.160
<v Speaker 1>science fiction fantasy film with a phenomenal one I might

0:00:50.200 --> 0:00:53.960
<v Speaker 1>add phenomenal science fiction fantasy film. Okay, So why would

0:00:53.960 --> 0:00:56.200
<v Speaker 1>we be talking about a crawl that's not a pub

0:00:56.320 --> 0:00:58.920
<v Speaker 1>crawl and not a sci fi fantasy movie. And it's

0:00:58.960 --> 0:01:02.000
<v Speaker 1>not the future of babies crawling, right? No, it

0:01:02.120 --> 0:01:05.000
<v Speaker 1>has something to do with the Internet. Yes, it has

0:01:05.040 --> 0:01:08.880
<v Speaker 1>everything to do with the Internet, the Web in particular. Actually, uh,

0:01:08.959 --> 0:01:12.480
<v Speaker 1>and here's a funny little little tidbit of information that

0:01:12.520 --> 0:01:14.960
<v Speaker 1>you probably already knew, but you might be a little

0:01:14.959 --> 0:01:17.479
<v Speaker 1>bit fuzzy on. Wait, what's the difference between the Web

0:01:17.520 --> 0:01:20.320
<v Speaker 1>and the Internet? Because when I say the Internet, most

0:01:20.319 --> 0:01:22.600
<v Speaker 1>of the time, what I'm talking about is the place

0:01:22.640 --> 0:01:26.280
<v Speaker 1>where people leave comments and argue about things, which would

0:01:26.280 --> 0:01:29.960
<v Speaker 1>be the Web mostly. Right. So Internet is the network

0:01:29.959 --> 0:01:32.280
<v Speaker 1>of networks of computers. Right. So you've got all these

0:01:32.280 --> 0:01:36.440
<v Speaker 1>different computer networks that then connect to a larger backbone

0:01:36.880 --> 0:01:40.520
<v Speaker 1>that allow all these various networks to interact and communicate

0:01:40.560 --> 0:01:44.040
<v Speaker 1>with one another. That is the Internet. The World Wide Web

0:01:44.240 --> 0:01:47.960
<v Speaker 1>is one thing that sits on top of this network

0:01:48.000 --> 0:01:52.240
<v Speaker 1>of networks, other things being email and FTP servers and

0:01:52.960 --> 0:01:56.440
<v Speaker 1>other stuff that uses the Internet as its method of

0:01:56.520 --> 0:02:00.160
<v Speaker 1>transmitting data to and from different computers. But the World

0:02:00.200 --> 0:02:02.960
<v Speaker 1>Wide Web is often what we think of with the

0:02:03.000 --> 0:02:06.200
<v Speaker 1>Internet because it is a very forward facing part of

0:02:06.240 --> 0:02:08.600
<v Speaker 1>the Web, or the Internet rather. Right. One way to

0:02:08.639 --> 0:02:12.200
<v Speaker 1>think about the Web is that it's a gigantic collection

0:02:12.320 --> 0:02:16.679
<v Speaker 1>of interactive documents. Yeah, exactly. Yeah. Some of those documents

0:02:16.720 --> 0:02:21.200
<v Speaker 1>are very static and they don't change frequently or at all.

0:02:21.680 --> 0:02:25.000
<v Speaker 1>Some of them are more like programs. Yeah, yeah, some

0:02:25.080 --> 0:02:27.360
<v Speaker 1>of them are more like like white boards, where you

0:02:27.400 --> 0:02:29.280
<v Speaker 1>know, stuff is being put up and taken down and

0:02:29.280 --> 0:02:32.440
<v Speaker 1>put up and taken down constantly. So some of them

0:02:32.480 --> 0:02:35.200
<v Speaker 1>link to lots of other documents, some do not. Yep,

0:02:35.320 --> 0:02:39.040
<v Speaker 1>so some are applications. Right, So you've you've got this

0:02:39.400 --> 0:02:43.360
<v Speaker 1>massive number of documents. And when we say massive, uh,

0:02:43.440 --> 0:02:47.680
<v Speaker 1>it's hard to put it all into context. First of all,

0:02:47.720 --> 0:02:52.839
<v Speaker 1>if you talk about all the information that we have created,

0:02:53.320 --> 0:02:56.160
<v Speaker 1>not us, but humanity, humanity itself, you have the three

0:02:56.200 --> 0:03:00.000
<v Speaker 1>of us have done our share, but no humanity overall.

0:02:59.800 --> 0:03:02.080
<v Speaker 1>All the information that has been created, well back in

0:03:02.120 --> 0:03:04.400
<v Speaker 1>two thousand and twelve, that was estimated to be at

0:03:04.480 --> 0:03:09.480
<v Speaker 1>two point eight zettabytes, two point eight trillion gigabytes,

0:03:10.480 --> 0:03:16.760
<v Speaker 1>trillion gigabytes. It's bigger than my hard drive significantly. So yeah,

0:03:17.040 --> 0:03:20.560
<v Speaker 1>if your hard drive can hold two point eight zettabytes,

0:03:21.200 --> 0:03:24.320
<v Speaker 1>I need to see your gaming rig, sir. I think

0:03:24.400 --> 0:03:27.519
<v Speaker 1>I have downloaded two point eight zettabytes of pirated

0:03:27.560 --> 0:03:30.680
<v Speaker 1>anime before. I was gonna say I have two point

0:03:30.680 --> 0:03:34.280
<v Speaker 1>eight zettabytes of Skyrim mods. But so no, not

0:03:34.400 --> 0:03:38.000
<v Speaker 1>all of this data is necessarily available for access on

0:03:38.040 --> 0:03:40.320
<v Speaker 1>the web, right? This is just data that we have created.

0:03:40.720 --> 0:03:43.640
<v Speaker 1>So let's let's narrow it down and look at the

0:03:43.680 --> 0:03:46.920
<v Speaker 1>information that's actually on the web. So the web has

0:03:46.960 --> 0:03:51.520
<v Speaker 1>between ten billion and one trillion documents on it. Now

0:03:51.600 --> 0:03:55.160
<v Speaker 1>that's a huge range, but it tells you that it's

0:03:55.200 --> 0:03:57.760
<v Speaker 1>hard to make an estimate about something that one is

0:03:57.800 --> 0:04:01.520
<v Speaker 1>so big and two is rapidly evolving. Right, there are

0:04:01.520 --> 0:04:04.920
<v Speaker 1>always things being added to it and deleted from it. Yeah,

0:04:05.000 --> 0:04:07.960
<v Speaker 1>you have servers that go offline from the Internet. If

0:04:07.960 --> 0:04:11.440
<v Speaker 1>those servers had web pages on them, those, unless they've

0:04:11.480 --> 0:04:14.720
<v Speaker 1>been mirrored onto other servers, are no longer accessible. They

0:04:14.720 --> 0:04:18.400
<v Speaker 1>have they have left the web. Other people deleting their

0:04:18.400 --> 0:04:23.359
<v Speaker 1>MySpace accounts. Why would you do that? Look, I have

0:04:23.520 --> 0:04:29.000
<v Speaker 1>so few friends there, but so many awesome bands. Uh yeah,

0:04:29.040 --> 0:04:32.560
<v Speaker 1>so seriously, true moment: does MySpace still exist? Yes, yes,

0:04:32.640 --> 0:04:37.640
<v Speaker 1>it's largely How recently did you check? Probably? Probably then

0:04:37.880 --> 0:04:41.479
<v Speaker 1>eight months ago. Let's look it up. Yeah, because it's

0:04:41.480 --> 0:04:44.280
<v Speaker 1>a it's a music discovery site more than anything else. Now,

0:04:44.320 --> 0:04:47.440
<v Speaker 1>Oh yeah, here we go MySpace dot com. Oh oh,

0:04:47.480 --> 0:04:49.760
<v Speaker 1>it's it's breaking my browser home. I was about to say,

0:04:49.760 --> 0:04:51.720
<v Speaker 1>why did you go to that? You realize that

0:04:51.800 --> 0:04:55.480
<v Speaker 1>MySpace is like the home of the auto play music file. Right? No,

0:04:55.880 --> 0:04:58.680
<v Speaker 1>we just talked about how that's like my least favorite thing. Well,

0:04:58.760 --> 0:05:04.800
<v Speaker 1>let's not. Let's not invoke the auto playing music gods. Yeah. So,

0:05:04.800 --> 0:05:07.360
<v Speaker 1>so the reason why we're even talking about how much

0:05:07.400 --> 0:05:10.120
<v Speaker 1>information is on the web and how many documents there

0:05:10.160 --> 0:05:12.800
<v Speaker 1>are out there is that the web. You can think

0:05:12.800 --> 0:05:17.320
<v Speaker 1>of the web as representing the world's largest database of information,

0:05:17.400 --> 0:05:21.839
<v Speaker 1>and that information spans every topic imaginable. Yeah, and there's

0:05:21.960 --> 0:05:24.719
<v Speaker 1>lots of great stuff out there that might be really

0:05:24.760 --> 0:05:28.039
<v Speaker 1>relevant to you, might have answers to questions that you have,

0:05:28.279 --> 0:05:30.680
<v Speaker 1>or it might just be very interesting to you. But

0:05:31.600 --> 0:05:34.400
<v Speaker 1>a strange question that you may never have considered is

0:05:35.760 --> 0:05:38.640
<v Speaker 1>how do I get the stuff that I want to

0:05:38.720 --> 0:05:41.160
<v Speaker 1>get from the web? I mean, you know, how you

0:05:41.200 --> 0:05:43.400
<v Speaker 1>get it in practical terms while you go you sit

0:05:43.440 --> 0:05:47.120
<v Speaker 1>down at Google and you type in terms or Google yeah,

0:05:47.240 --> 0:05:50.080
<v Speaker 1>or you or you maybe have some kind of aggregator,

0:05:50.200 --> 0:05:52.680
<v Speaker 1>like a friend on social media or some kind of

0:05:52.760 --> 0:05:59.400
<v Speaker 1>things content writer, or perhaps you have received the direct

0:05:59.400 --> 0:06:03.159
<v Speaker 1>URL of a website that you wish to visit. Yeah,

0:06:03.600 --> 0:06:05.880
<v Speaker 1>might you might have one in particular in mind that

0:06:05.960 --> 0:06:09.440
<v Speaker 1>you go to frequently, and so like Dinosaur Comics dot

0:06:09.440 --> 0:06:13.560
<v Speaker 1>com. Awesome example. Yeah, so there are all these

0:06:13.600 --> 0:06:16.200
<v Speaker 1>different ways. But let's say that you want to use

0:06:16.240 --> 0:06:19.440
<v Speaker 1>the web to do something beyond just visiting a particular

0:06:19.520 --> 0:06:22.080
<v Speaker 1>web page. If you know the URL, that's

0:06:22.080 --> 0:06:25.920
<v Speaker 1>pretty simple. But what if you're you're just trying to

0:06:25.960 --> 0:06:29.320
<v Speaker 1>find something. Yeah, maybe that you don't even know what

0:06:29.440 --> 0:06:31.520
<v Speaker 1>that thing is, or you know what that thing is,

0:06:31.560 --> 0:06:35.240
<v Speaker 1>but nobody has gathered that and placed it into an

0:06:35.240 --> 0:06:38.760
<v Speaker 1>easily digestible piece of information. So, in other words, let's

0:06:38.800 --> 0:06:42.520
<v Speaker 1>say that you're looking at some sort of statistical uh

0:06:43.160 --> 0:06:45.160
<v Speaker 1>result that you want to know. You want to know

0:06:45.279 --> 0:06:50.520
<v Speaker 1>the percentage of people who drove red cars in two

0:06:50.560 --> 0:06:55.920
<v Speaker 1>thousand and twelve who ended up getting speeding tickets, and

0:06:55.960 --> 0:06:57.680
<v Speaker 1>you know this this sort of thing like, there may

0:06:57.720 --> 0:07:00.599
<v Speaker 1>be a web page out there that has that specific

0:07:00.680 --> 0:07:03.720
<v Speaker 1>answer on it, but there may not be. However, there

0:07:03.839 --> 0:07:07.520
<v Speaker 1>may be the data out there that exists across multiple

0:07:07.560 --> 0:07:11.239
<v Speaker 1>web pages and multiple places that could answer that question

0:07:11.320 --> 0:07:15.440
<v Speaker 1>for you, but there's no easy way for the average

0:07:15.480 --> 0:07:19.200
<v Speaker 1>person to be able to collect and collate all that data,

0:07:19.280 --> 0:07:23.600
<v Speaker 1>analyze it and get to a meaningful answer, especially not quickly,

0:07:23.680 --> 0:07:27.040
<v Speaker 1>because if you wanted to go through the entire Internet

0:07:27.080 --> 0:07:29.720
<v Speaker 1>to try to find that information, it would take you

0:07:29.720 --> 0:07:32.400
<v Speaker 1>a minute. Yeah, it would take quite some time. So

0:07:32.640 --> 0:07:36.680
<v Speaker 1>what we wanted to do is tough sometimes. Well, again,

0:07:36.720 --> 0:07:39.080
<v Speaker 1>depending upon what it is you're looking for, right, because

0:07:39.160 --> 0:07:43.120
<v Speaker 1>in some cases you may have very little information and

0:07:43.280 --> 0:07:45.280
<v Speaker 1>it may take you some time just to make sure

0:07:45.320 --> 0:07:49.240
<v Speaker 1>that the information that you do have is worthy of consideration.

0:07:49.880 --> 0:07:52.640
<v Speaker 1>Or you may have the opposite problem. Let's say you

0:07:52.640 --> 0:07:54.160
<v Speaker 1>want to look at anything that has to do with

0:07:54.200 --> 0:07:59.280
<v Speaker 1>cats. Good grief, you're gonna have so much information

0:07:59.360 --> 0:08:02.280
<v Speaker 1>on the Internet and on the web that relates to

0:08:02.360 --> 0:08:05.560
<v Speaker 1>cats that finding the, you know, separating the signal

0:08:05.680 --> 0:08:08.600
<v Speaker 1>from the noise would take you a really long time.

0:08:08.680 --> 0:08:13.680
<v Speaker 1>So, and uh, this problem already has a solution,

0:08:13.880 --> 0:08:18.360
<v Speaker 1>and that is why we are today talking about web crawlers. Yeah.

0:08:18.400 --> 0:08:20.920
<v Speaker 1>And web crawlers are something that have been around for

0:08:21.000 --> 0:08:23.440
<v Speaker 1>about as long as the web has been around, because

0:08:23.480 --> 0:08:26.720
<v Speaker 1>people realized early on that in order to make the

0:08:26.720 --> 0:08:31.720
<v Speaker 1>web really user friendly, especially once it grew beyond a

0:08:31.760 --> 0:08:36.280
<v Speaker 1>collection of you know, three computers, right, Yeah, three computers

0:08:36.320 --> 0:08:40.360
<v Speaker 1>with twelve web pages altogether, Once you get past all

0:08:40.360 --> 0:08:42.000
<v Speaker 1>of that and you get to a point where it

0:08:42.080 --> 0:08:45.760
<v Speaker 1>really is growing rapidly. You need a way to navigate

0:08:45.840 --> 0:08:48.280
<v Speaker 1>through the web and find the stuff you're interested in.

0:08:48.600 --> 0:08:51.400
<v Speaker 1>You need an index. Yeah, you have to have that

0:08:51.480 --> 0:08:53.840
<v Speaker 1>index because otherwise the only other option you have is

0:08:53.880 --> 0:08:57.440
<v Speaker 1>to know the address of a particular web page and

0:08:57.480 --> 0:08:59.839
<v Speaker 1>then to just follow whatever links that web page happens

0:09:00.120 --> 0:09:02.319
<v Speaker 1>to have, and then once you hit a dead end,

0:09:02.640 --> 0:09:04.520
<v Speaker 1>you've got to backtrack. And you know, it's kind of

0:09:04.520 --> 0:09:06.480
<v Speaker 1>like a choose your own adventure book, And it's a

0:09:06.559 --> 0:09:09.520
<v Speaker 1>choose your own adventure book that's that isn't even connected

0:09:09.559 --> 0:09:13.280
<v Speaker 1>to all the pages that you need. Right, So indexing

0:09:13.440 --> 0:09:17.320
<v Speaker 1>is a way of creating a means to find web

0:09:17.360 --> 0:09:23.360
<v Speaker 1>pages about any given keyword. Right, And again, this is

0:09:23.400 --> 0:09:27.880
<v Speaker 1>a big, big job. You can't expect this to be

0:09:27.960 --> 0:09:31.920
<v Speaker 1>something that only humans are doing under human power. It

0:09:31.960 --> 0:09:35.880
<v Speaker 1>would take way too long and it would be exhausting.

0:09:36.000 --> 0:09:39.600
<v Speaker 1>So there have to be automated ways to index web pages.

0:09:39.640 --> 0:09:43.200
<v Speaker 1>Well yeah, I mean, just consider the ridiculousness of the alternative.

0:09:43.840 --> 0:09:47.439
<v Speaker 1>So let's say you are searching for a term and

0:09:47.520 --> 0:09:53.360
<v Speaker 1>that term is, I don't know, lobster baseball. Somewhere out there,

0:09:53.400 --> 0:09:56.280
<v Speaker 1>there might be a page about lobster baseball, but it

0:09:56.320 --> 0:09:59.680
<v Speaker 1>would not be a good way to find it to say, well,

0:09:59.720 --> 0:10:03.120
<v Speaker 1>I'm going to ping every web server in the world

0:10:03.800 --> 0:10:07.360
<v Speaker 1>and see if it's offering any public pages that say

0:10:07.440 --> 0:10:10.640
<v Speaker 1>lobster baseball on them. Yeah, that would not especially you know,

0:10:10.720 --> 0:10:13.880
<v Speaker 1>as the Web grows and gets larger and larger and larger,

0:10:14.040 --> 0:10:17.240
<v Speaker 1>that task becomes impossible. It would just it would take

0:10:17.280 --> 0:10:21.360
<v Speaker 1>your computer longer than your lifespan to complete the job,

0:10:21.480 --> 0:10:24.320
<v Speaker 1>especially considering that, as we mentioned before, the web is

0:10:24.360 --> 0:10:28.800
<v Speaker 1>constantly changing, so we would have new web servers joining

0:10:28.840 --> 0:10:32.200
<v Speaker 1>while you're still doing this pinging operation, which means you

0:10:32.280 --> 0:10:34.440
<v Speaker 1>just have you know, you've added more that you have

0:10:34.520 --> 0:10:38.240
<v Speaker 1>to ping before you're done. You never finish. So what's

0:10:38.280 --> 0:10:41.960
<v Speaker 1>the solution, Well, web crawlers would be would be the solution, Joe.

0:10:42.240 --> 0:10:45.720
<v Speaker 1>Web crawlers and search engines are our favorite things here

0:10:45.800 --> 0:10:47.960
<v Speaker 1>at HowStuffWorks. I mean, if it weren't

0:10:47.960 --> 0:10:52.760
<v Speaker 1>for them, our jobs would be significantly more difficult. So, uh,

0:10:52.840 --> 0:10:56.200
<v Speaker 1>let's say that you've got all right, So to break

0:10:56.200 --> 0:10:58.520
<v Speaker 1>it down, we've got web servers that have web pages

0:10:58.559 --> 0:11:02.160
<v Speaker 1>on them, right, we have it's a computer somewhere out there.

0:11:02.600 --> 0:11:05.520
<v Speaker 1>It's got a public facing document that it will show

0:11:05.559 --> 0:11:07.960
<v Speaker 1>you if you ask for it, right? And your browser is

0:11:07.960 --> 0:11:10.240
<v Speaker 1>the way that you ask for it. Right, so your

0:11:10.280 --> 0:11:13.960
<v Speaker 1>browser is your conduit to getting the information that's stored

0:11:14.120 --> 0:11:18.080
<v Speaker 1>on other computers that may be on completely different networks, in

0:11:18.120 --> 0:11:22.719
<v Speaker 1>another, in another part of the world even. And the

0:11:22.760 --> 0:11:24.800
<v Speaker 1>fact that you have a browser that is what allows

0:11:24.840 --> 0:11:27.960
<v Speaker 1>you to have the access to that document that exists

0:11:27.960 --> 0:11:31.120
<v Speaker 1>on that other page. But those servers can have really

0:11:31.160 --> 0:11:34.360
<v Speaker 1>funky names. Um, the web pages may not have a

0:11:34.400 --> 0:11:37.800
<v Speaker 1>title that is identical to what it is you're

0:11:37.800 --> 0:11:40.760
<v Speaker 1>looking for, but the information may be in that page.

0:11:41.160 --> 0:11:44.200
<v Speaker 1>Sure. To use my prior example, Dinosaur Comics

0:11:44.240 --> 0:11:47.520
<v Speaker 1>dot com used to be known only as qwantz dot com.

0:11:47.520 --> 0:11:50.560
<v Speaker 1>Perfect. With the QW, the way that you sometimes spell

0:11:50.640 --> 0:11:55.120
<v Speaker 1>words much yes, exactly the way the way words are

0:11:55.120 --> 0:11:59.040
<v Speaker 1>never spelled in English. Uh yeah, I I My example

0:11:59.040 --> 0:12:00.800
<v Speaker 1>of my notes I wrote is that let's say that

0:12:00.800 --> 0:12:03.680
<v Speaker 1>you're looking for funny cat memes. The funniest memes happen

0:12:03.679 --> 0:12:05.680
<v Speaker 1>to be on a page that has the title Things

0:12:05.840 --> 0:12:09.480
<v Speaker 1>FDR Definitely Didn't Say. Well, the title of the page

0:12:09.520 --> 0:12:11.719
<v Speaker 1>wouldn't tell you that there are cat memes on there.

0:12:11.720 --> 0:12:15.160
<v Speaker 1>You would need something to have searched that page to

0:12:15.320 --> 0:12:19.400
<v Speaker 1>understand what actually appears on that page, the context within

0:12:19.880 --> 0:12:22.520
<v Speaker 1>which it appears, and to be able to serve that

0:12:22.640 --> 0:12:25.400
<v Speaker 1>up to you. And that's really where the crawlers come in.

0:12:25.480 --> 0:12:28.280
<v Speaker 1>They they build out these indexes of words and where

0:12:28.320 --> 0:12:32.120
<v Speaker 1>to find those words on the web like uh, they

0:12:32.200 --> 0:12:35.800
<v Speaker 1>use lots of They use well, actually pretty simple software.

0:12:36.720 --> 0:12:39.800
<v Speaker 1>They are often referred to as either robots or spiders,

0:12:40.040 --> 0:12:42.480
<v Speaker 1>and they're called spiders because they crawl the web. That

0:12:42.559 --> 0:12:47.160
<v Speaker 1>is good. Yeah, alright, so here's where we mention that

0:12:47.200 --> 0:12:53.920
<v Speaker 1>most of these terms. So wait, are we the flies? Good?

0:12:54.160 --> 0:12:59.640
<v Speaker 1>Good question? I mean, I think the uh cams, I'm

0:12:59.679 --> 0:13:03.960
<v Speaker 1>not so. So here's where we mention that a lot

0:13:04.000 --> 0:13:07.600
<v Speaker 1>of these terms were all invented around the same time.

0:13:07.720 --> 0:13:10.760
<v Speaker 1>And boy, when we when we go with a metaphor,

0:13:10.880 --> 0:13:15.040
<v Speaker 1>we just go whole spider. So um so all right,

0:13:15.080 --> 0:13:18.599
<v Speaker 1>So spiders typically start by traveling to web servers that

0:13:18.720 --> 0:13:22.040
<v Speaker 1>have lots of traffic, the ones that are the most popular,

0:13:22.600 --> 0:13:25.400
<v Speaker 1>and they explore the most popular web pages and start

0:13:25.440 --> 0:13:28.559
<v Speaker 1>to build up the index of words of those web pages.

0:13:28.600 --> 0:13:31.920
<v Speaker 1>Then all the links that are on those popular web pages,

0:13:32.000 --> 0:13:35.800
<v Speaker 1>the spiders start to follow those links and index those

0:13:35.840 --> 0:13:39.319
<v Speaker 1>pages in turn, and then do the same thing over

0:13:39.400 --> 0:13:41.760
<v Speaker 1>and over and over again, so they just you know,

0:13:41.800 --> 0:13:44.600
<v Speaker 1>it's it is like a spider web or a crack

0:13:44.640 --> 0:13:46.880
<v Speaker 1>in the glass where you see it splintering over and

0:13:46.920 --> 0:13:49.679
<v Speaker 1>over while the glass shatters. The same sort of thing.
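
NOTE
The link-following loop described above can be sketched roughly as follows.
This is a minimal illustration, not how any particular search engine works.
It assumes a made-up in-memory link graph in place of real web servers.
The names TOY_WEB and crawl are hypothetical and chosen for this sketch only.
Real spiders also add politeness rules, parallel fetching, and revisits.
```python
from collections import deque
# Hypothetical toy "web": each page name maps to the pages it links to.
TOY_WEB = {
    "popular.example": ["news.example", "comics.example"],
    "news.example": ["popular.example", "archive.example"],
    "comics.example": ["archive.example"],
    "archive.example": [],
    "island.example": ["popular.example"],  # no page links to this one
}
def crawl(seeds, links):
    """Visit pages breadth-first, starting from the seed pages."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return order
# Start from the "most popular" page and follow links outward.
visited = crawl(["popular.example"], TOY_WEB)
print(visited)
```
A page that nothing links to (island.example here) is never reached this way.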

0:13:49.880 --> 0:13:54.240
<v Speaker 1>It's following all those potential pathways, and they can hold

0:13:54.520 --> 0:13:57.720
<v Speaker 1>hundreds of pages open at a time. We're talking like

0:13:57.920 --> 0:14:04.439
<v Speaker 1>three hundred pages a second. So yes, more than Google

0:14:04.520 --> 0:14:07.000
<v Speaker 1>Chrome will allow me to have open before my computer

0:14:07.080 --> 0:14:10.400
<v Speaker 1>says listen, I give up. Uh. So, depending on the

0:14:10.440 --> 0:14:13.040
<v Speaker 1>crawler, the spiders will index these pages based upon

0:14:13.240 --> 0:14:16.080
<v Speaker 1>which words appear in the page and where those words

0:14:16.120 --> 0:14:20.480
<v Speaker 1>actually appear in that page, like in what context. So
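
NOTE
An index of which words appear on which pages, and where, might look like this.
It is a rough sketch under stated assumptions, not a real search engine's index.
PAGES and build_index are hypothetical names, and the page texts are invented.
Words are split on whitespace only; real indexers do far more text cleanup.
```python
from collections import defaultdict
# Hypothetical page texts, standing in for pages a spider has fetched.
PAGES = {
    "page-a": "funny cat memes and more cat pictures",
    "page-b": "lobster baseball scores",
}
def build_index(pages):
    """Map each word to {page name: [positions where the word appears]}."""
    index = defaultdict(dict)
    for name, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(name, []).append(position)
    return index
index = build_index(PAGES)
# Looking up a keyword returns the pages containing it and the word positions.
print(index["cat"])
print(index["lobster"])
```
Storing positions as well as page names is what lets a ranker consider placement.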

0:14:20.720 --> 0:14:23.080
<v Speaker 1>you may remember in the early days of the web,

0:14:23.160 --> 0:14:27.600
<v Speaker 1>before web search engines got really sophisticated, that some people

0:14:27.720 --> 0:14:30.480
<v Speaker 1>would make a web page and then just litter the

0:14:30.520 --> 0:14:33.240
<v Speaker 1>bottom of the page with tons of random words that

0:14:33.320 --> 0:14:36.760
<v Speaker 1>were doing really well in search, mostly because they had

0:14:36.800 --> 0:14:39.160
<v Speaker 1>ads served on the page of the type that they

0:14:39.240 --> 0:14:42.640
<v Speaker 1>got money from per page view. So I might have

0:14:42.680 --> 0:14:46.920
<v Speaker 1>like, Britney Spears, right? Yeah, it would often be

0:14:47.080 --> 0:14:50.280
<v Speaker 1>celebrity rumors and gossip that kind of stuff, and just

0:14:50.480 --> 0:14:54.800
<v Speaker 1>random recipe. Yeah, yeah, it'd be weird stuff, like totally

0:14:54.960 --> 0:14:57.200
<v Speaker 1>some of it would be disturbing to read. You're like, wow,

0:14:57.240 --> 0:14:59.760
<v Speaker 1>I can't believe that that. To know that this particular

0:14:59.800 --> 0:15:02.040
<v Speaker 1>term is a very popular search term is disturbing

0:15:02.320 --> 0:15:06.640
<v Speaker 1>others would... Yeah. Yeah, I'm more of an AC/DC

0:15:06.800 --> 0:15:09.320
<v Speaker 1>kind of guy myself, so I'm I'm with

0:15:09.360 --> 0:15:12.120
<v Speaker 1>you there. So anyway, Uh, you know, this was a

0:15:12.160 --> 0:15:16.720
<v Speaker 1>way of fooling search engines into into indexing that page

0:15:16.840 --> 0:15:20.720
<v Speaker 1>on multiple indexes so that it would appear no matter

0:15:20.760 --> 0:15:23.000
<v Speaker 1>what search you put in, your page would pop up.

0:15:23.400 --> 0:15:26.320
<v Speaker 1>You as a as saying, if you're assuming you are

0:15:26.320 --> 0:15:28.920
<v Speaker 1>the one who is administering this web page, you have

0:15:28.960 --> 0:15:31.960
<v Speaker 1>no ethics, Like, you don't care if people come to

0:15:32.000 --> 0:15:34.680
<v Speaker 1>your page and are completely disappointed because it has nothing

0:15:34.720 --> 0:15:36.440
<v Speaker 1>to do with the search term they put in there.

0:15:36.480 --> 0:15:39.080
<v Speaker 1>You just want to get those sweet sweet clicks. You

0:15:39.160 --> 0:15:41.160
<v Speaker 1>just need the page views because you need to pay

0:15:41.200 --> 0:15:46.280
<v Speaker 1>the bills, right, So, uh, search engines and spiders got

0:15:46.320 --> 0:15:49.320
<v Speaker 1>more sophisticated, so they were able to look for the

0:15:49.400 --> 0:15:52.480
<v Speaker 1>placement of words where it fell in the page, whether

0:15:52.600 --> 0:15:54.840
<v Speaker 1>or not it appeared more than once within a page,

0:15:54.840 --> 0:15:58.960
<v Speaker 1>to understand if a page really was about that particular

0:15:59.040 --> 0:16:01.720
<v Speaker 1>search term, or if it was just one of those

0:16:01.720 --> 0:16:05.320
<v Speaker 1>things where the word happened to appear once, it may

0:16:05.400 --> 0:16:08.440
<v Speaker 1>be a saying or a quote that has very little

0:16:08.440 --> 0:16:10.480
<v Speaker 1>to do with the actual substance of the rest of

0:16:10.520 --> 0:16:13.360
<v Speaker 1>the page. You know, this would help the search engine

0:16:13.440 --> 0:16:16.880
<v Speaker 1>rank the page in search. Right. Right, so the final

0:16:16.960 --> 0:16:21.600
<v Speaker 1>product of of these spiders doing this indexing is called

0:16:21.600 --> 0:16:26.520
<v Speaker 1>a crawl and and it's essentially a lightweight copy of

0:16:26.720 --> 0:16:30.360
<v Speaker 1>the World Wide Web that's built to be much more easily

0:16:30.360 --> 0:16:33.160
<v Speaker 1>searched than the whole web itself. Uh and and a

0:16:33.240 --> 0:16:36.960
<v Speaker 1>crawl usually consists, therefore, of this huge cache of data

0:16:37.000 --> 0:16:39.880
<v Speaker 1>about the web, including like the text of each page

0:16:39.960 --> 0:16:44.360
<v Speaker 1>its spiders encountered, the code that constructed those pages, HTML

0:16:44.480 --> 0:16:47.800
<v Speaker 1>or et cetera, uh, and a certain amount of

0:16:48.000 --> 0:16:50.560
<v Speaker 1>metadata, um, you know, certainly the page's URL and

0:16:50.640 --> 0:16:53.080
<v Speaker 1>maybe the tags. As we discussed, that's not always as

0:16:53.160 --> 0:16:56.360
<v Speaker 1>useful as it used to be due to uh scammy stuff.
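
NOTE
One stored page in a crawl, as described above, could be modeled like this.
This is only an illustrative sketch; the episode does not specify a format.
CrawlRecord and its field names are hypothetical and invented for this example.
It is not a description of how Common Crawl or any search engine stores data.
```python
from dataclasses import dataclass, field
@dataclass
class CrawlRecord:
    """One page as a spider might store it (field names are illustrative)."""
    url: str   # where the page was found
    html: str  # the markup that constructed the page
    text: str  # the readable text extracted from that markup
    tags: list = field(default_factory=list)  # optional metadata keywords
record = CrawlRecord(
    url="http://example.com/",
    html="<html><body><p>Hello, web.</p></body></html>",
    text="Hello, web.",
)
print(record.url, len(record.html), record.tags)
```
Keeping both the markup and the extracted text lets later analysis use either.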

0:16:56.360 --> 0:17:00.360
<v Speaker 1>But yeah, so so creating a crawl is a huge

0:17:00.440 --> 0:17:03.920
<v Speaker 1>project in terms of time and computer equipment and drive

0:17:03.960 --> 0:17:08.800
<v Speaker 1>space and spider programming and just sheer Internet bandwidth. Right. So,

0:17:09.080 --> 0:17:11.760
<v Speaker 1>for the longest time, this is something that was really

0:17:11.880 --> 0:17:18.640
<v Speaker 1>only accessible by big corporations like Google or Microsoft, Yahoo, that... Yeah,

0:17:18.680 --> 0:17:22.480
<v Speaker 1>we're talking huge companies that have the computer power and

0:17:22.520 --> 0:17:25.000
<v Speaker 1>the bandwidth to pull this sort of stuff off on

0:17:25.040 --> 0:17:28.359
<v Speaker 1>a on a regular basis. And while those are incredibly

0:17:28.440 --> 0:17:31.280
<v Speaker 1>useful for us as consumers, if we are looking for

0:17:31.320 --> 0:17:33.919
<v Speaker 1>a specific piece of information that happens to live somewhere

0:17:33.920 --> 0:17:35.920
<v Speaker 1>on the web page, if we want to do more

0:17:35.960 --> 0:17:39.520
<v Speaker 1>of a big data analysis something where we need to

0:17:40.200 --> 0:17:44.800
<v Speaker 1>collate the information across multiple, perhaps hundreds or thousands of

0:17:44.840 --> 0:17:49.040
<v Speaker 1>web pages, it's not easy, right. We don't have those

0:17:49.080 --> 0:17:51.560
<v Speaker 1>tools for the most part, right right, Because when you

0:17:51.600 --> 0:17:55.000
<v Speaker 1>go to Google, you can't access that level of information.

0:17:55.080 --> 0:17:57.640
<v Speaker 1>Yeah you can, you can ask, uh, you know what

0:17:57.720 --> 0:18:00.239
<v Speaker 1>Hugh Grant was doing last week? Right? Yeah, you can

0:18:00.280 --> 0:18:04.480
<v Speaker 1>get the most popular or the highest ranking search results,

0:18:04.520 --> 0:18:07.760
<v Speaker 1>which could give you at least some useful information. But again,

0:18:07.800 --> 0:18:11.879
<v Speaker 1>if you want to do a wide spread study on

0:18:12.119 --> 0:18:16.560
<v Speaker 1>a specific thing, unless someone's already done it, in which

0:18:16.600 --> 0:18:19.399
<v Speaker 1>case you may just need to replicate their their study

0:18:19.440 --> 0:18:22.600
<v Speaker 1>to make sure that it was correct. Um, you're you're

0:18:22.680 --> 0:18:26.080
<v Speaker 1>kind of out of luck. So where can someone turn.

0:18:26.280 --> 0:18:30.760
<v Speaker 1>Let's say that it's a researcher who's working on something

0:18:30.840 --> 0:18:33.400
<v Speaker 1>and they don't work for one of these big companies.

0:18:33.480 --> 0:18:37.560
<v Speaker 1>Where can they turn to leverage the incredible asset that

0:18:37.760 --> 0:18:41.080
<v Speaker 1>is the World Wide Web? One Gil Elbaz started up

0:18:41.119 --> 0:18:46.320
<v Speaker 1>a nonprofit corporation called the Common Crawl Foundation and it

0:18:46.480 --> 0:18:51.440
<v Speaker 1>has been since then working on providing public, publicly accessible,

0:18:51.960 --> 0:18:56.160
<v Speaker 1>free crawls to anyone who wants to use them. And

0:18:56.720 --> 0:18:59.320
<v Speaker 1>uh, Elbaz is a really interesting dude. A little bit

0:18:59.320 --> 0:19:01.959
<v Speaker 1>of background on him. Um. He co founded a company

0:19:02.000 --> 0:19:05.879
<v Speaker 1>back in the nineties called Applied Semantics, which created software

0:19:05.880 --> 0:19:10.480
<v Speaker 1>that matched ads to web pages like contextually and automatically.

0:19:10.760 --> 0:19:13.960
<v Speaker 1>Oh we know a little bit about that. Yeah, yeah,

0:19:14.000 --> 0:19:16.879
<v Speaker 1>And this prompted Google to acquire them in two thousand

0:19:16.960 --> 0:19:20.800
<v Speaker 1>three for like a hundred two million bucks, So not

0:19:20.920 --> 0:19:24.040
<v Speaker 1>doing too bad for himself. Also, that's that's essentially the

0:19:24.040 --> 0:19:27.600
<v Speaker 1>reason why Google AdSense exists. That's the programming that led

0:19:27.640 --> 0:19:32.760
<v Speaker 1>to Google AdSense so very very practical application of that

0:19:32.840 --> 0:19:37.679
<v Speaker 1>contextual understanding, right right. Um. In two thousand eight, interestingly,

0:19:37.720 --> 0:19:39.439
<v Speaker 1>and kind of a side note, he founded a company

0:19:39.440 --> 0:19:44.600
<v Speaker 1>called Factual, which seeks to gather and analyze global location

0:19:44.840 --> 0:19:49.320
<v Speaker 1>data in order to create a repository of really high quality,

0:19:49.440 --> 0:19:56.239
<v Speaker 1>easily accessible location data that's uh factual um and and

0:19:56.520 --> 0:19:59.760
<v Speaker 1>companies like Bing and Samsung and Yelp all use Factual

0:20:00.119 --> 0:20:05.000
<v Speaker 1>to construct local maps and personalized advertising for mobile consumers.

0:20:05.600 --> 0:20:08.800
<v Speaker 1>So uh so pretty nifty stuff. And what I am

0:20:08.800 --> 0:20:12.439
<v Speaker 1>saying is that Elbaz is passionate about and experienced with

0:20:12.560 --> 0:20:15.919
<v Speaker 1>big data, right, and we've talked about it before on

0:20:15.960 --> 0:20:20.399
<v Speaker 1>this podcast. That big data is. You know, it sounds

0:20:20.400 --> 0:20:22.600
<v Speaker 1>like one of those just buzz industry terms, but it

0:20:22.720 --> 0:20:25.919
<v Speaker 1>really is one of those things that holds a huge

0:20:25.920 --> 0:20:30.119
<v Speaker 1>amount of potential to affect our lives in different ways,

0:20:30.160 --> 0:20:34.639
<v Speaker 1>assuming that we've developed the right means to analyze that

0:20:34.880 --> 0:20:37.520
<v Speaker 1>massive amount of information that's out there, to collect it

0:20:37.520 --> 0:20:40.040
<v Speaker 1>in the first place. Sure, and once you have a

0:20:40.119 --> 0:20:43.320
<v Speaker 1>way of processing, of collecting and being able to access

0:20:43.359 --> 0:20:46.399
<v Speaker 1>and process vast amounts of data, you can do a

0:20:46.400 --> 0:20:50.920
<v Speaker 1>lot of amazing things. Like, big data and the ability

0:20:51.000 --> 0:20:53.960
<v Speaker 1>to process it might be the key to say, for example,

0:20:54.440 --> 0:21:00.000
<v Speaker 1>computational modeling that predicts complex social phenomena by analyzing big

0:21:00.119 --> 0:21:03.480
<v Speaker 1>data coming from social media and from news and from

0:21:03.520 --> 0:21:06.359
<v Speaker 1>weather and from all kinds of sources, it can really

0:21:06.359 --> 0:21:09.840
<v Speaker 1>mean that we are able to actually see elements of

0:21:10.000 --> 0:21:13.760
<v Speaker 1>order in what previously appeared to be a truly chaotic system,

0:21:13.880 --> 0:21:17.320
<v Speaker 1>which is kind of exciting. Sure. And then on the

0:21:17.359 --> 0:21:19.679
<v Speaker 1>other hand, a lot of people think that big data

0:21:19.760 --> 0:21:23.040
<v Speaker 1>could be one of the ways that we finally achieve

0:21:23.240 --> 0:21:27.560
<v Speaker 1>that next level of artificial intelligence by having machines sort

0:21:27.600 --> 0:21:31.720
<v Speaker 1>of plumb the depths of this data with self teaching

0:21:31.800 --> 0:21:35.399
<v Speaker 1>and self learning mechanisms. Right, well, let's get back to

0:21:35.560 --> 0:21:39.640
<v Speaker 1>the Common Crawl. Okay. Right. So the foundation began compiling

0:21:39.800 --> 0:21:42.840
<v Speaker 1>crawls back in two thousand eight. The most recent one

0:21:42.880 --> 0:21:45.159
<v Speaker 1>that they released as of this podcast at the end

0:21:45.200 --> 0:21:50.840
<v Speaker 1>of May, was from April. It was some a hundred

0:21:50.840 --> 0:21:56.600
<v Speaker 1>and sixty eight terabits in size. Bytes in size. Huge.

0:21:56.920 --> 0:22:01.920
<v Speaker 1>That's big, uh, and contains some two point one billion

0:22:02.160 --> 0:22:08.399
<v Speaker 1>web pages. That's not that many, really, I wrote I

0:22:08.440 --> 0:22:13.480
<v Speaker 1>wrote forty seven million web pages before breakfast. No I

0:22:13.560 --> 0:22:19.360
<v Speaker 1>did not. I'm just kidding. No, that's a lot. Yeah, yeah, yeah,

0:22:19.359 --> 0:22:22.480
<v Speaker 1>it's it's a it's a bunch um uh. But so

0:22:22.720 --> 0:22:26.840
<v Speaker 1>they're they're continuously indexing and releasing new crawls, right, It's

0:22:26.880 --> 0:22:31.240
<v Speaker 1>not like it's here's the Internet and now we're done. Yeah. Yeah,

0:22:31.280 --> 0:22:33.960
<v Speaker 1>they've been releasing a new crawl every month since July.

0:22:35.200 --> 0:22:37.600
<v Speaker 1>That's I mean, that's incredible you think about the amount

0:22:37.600 --> 0:22:39.399
<v Speaker 1>of work that does. It also means that you have

0:22:39.560 --> 0:22:44.000
<v Speaker 1>like a a time a timestamp, like photograph of what

0:22:44.119 --> 0:22:48.520
<v Speaker 1>the web was at that moment from these crawls. Yeah. Yeah,

0:22:48.560 --> 0:22:50.520
<v Speaker 1>I hadn't thought about it quite that way before, but yeah,

0:22:50.560 --> 0:22:53.879
<v Speaker 1>that's it's kind of you know, things that may not

0:22:54.040 --> 0:22:56.640
<v Speaker 1>exist from one month to the next, you could actually

0:22:57.320 --> 0:23:01.040
<v Speaker 1>see and watch those trends. Yeah, it's fascinating. Yeah. I'd

0:23:01.040 --> 0:23:03.560
<v Speaker 1>say one of the main ways that I often encounter

0:23:04.600 --> 0:23:08.520
<v Speaker 1>web archives is when I'm trying to find evidence of

0:23:08.640 --> 0:23:14.240
<v Speaker 1>something somebody did in the past that they wanted expunged. Right,

0:23:14.640 --> 0:23:16.840
<v Speaker 1>this makes it sound like I'm some kind of detective, not

0:23:16.960 --> 0:23:19.000
<v Speaker 1>like I'm trying to find but you know what I'm

0:23:19.000 --> 0:23:21.320
<v Speaker 1>talking about. No, I know, I will post something and

0:23:21.359 --> 0:23:23.879
<v Speaker 1>then they'll be like, oh wait a minute, No, that

0:23:23.960 --> 0:23:25.639
<v Speaker 1>was a bad idea. I know if you try to

0:23:25.680 --> 0:23:27.600
<v Speaker 1>delete it. I know, if you use archive dot org,

0:23:27.680 --> 0:23:30.159
<v Speaker 1>you can find one of the web pages I built

0:23:30.240 --> 0:23:35.080
<v Speaker 1>way back when, and I never want anyone to ever

0:23:35.080 --> 0:23:37.960
<v Speaker 1>see it because it was that bad. But they will

0:23:38.080 --> 0:23:40.639
<v Speaker 1>forever be able to Yeah, you shouldn't talk about it

0:23:40.680 --> 0:23:43.240
<v Speaker 1>on podcasts. Pretty sure they're not going to be able

0:23:43.280 --> 0:23:45.480
<v Speaker 1>to find it. Tens of thousands of people. Let me

0:23:45.520 --> 0:23:48.080
<v Speaker 1>guess, you had some... you had a bunch of Rage

0:23:48.119 --> 0:23:53.920
<v Speaker 1>Against the Machine lyrics, and it auto-played MIDIs. You're

0:23:54.080 --> 0:23:58.320
<v Speaker 1>so close closest, No, you're really far away. I wrote,

0:23:58.119 --> 0:24:01.159
<v Speaker 1>I made some web pages for a particular company I

0:24:01.200 --> 0:24:04.479
<v Speaker 1>worked for, which is unless you know the company I

0:24:04.520 --> 0:24:07.000
<v Speaker 1>worked for that I'm specifically referring to. That's why you're

0:24:07.040 --> 0:24:11.240
<v Speaker 1>never going to find that particular web page, and you shouldn't.

0:24:11.320 --> 0:24:13.919
<v Speaker 1>It was terrible. It was. It was about, we're going

0:24:13.960 --> 0:24:16.600
<v Speaker 1>to find these. Go look on his LinkedIn profile. We

0:24:16.640 --> 0:24:18.879
<v Speaker 1>can figure out what company it was. I already found some

0:24:18.920 --> 0:24:22.560
<v Speaker 1>lobster baseball stuff, so yeah we can. That comes from

0:24:22.920 --> 0:24:26.280
<v Speaker 1>you almost got a spit take on, you almost got

0:24:27.440 --> 0:24:31.639
<v Speaker 1>and uh yeah, that comes from a pre episode conversation. No,

0:24:31.800 --> 0:24:34.480
<v Speaker 1>that was actually in the episode, wasn't it. I can't

0:24:34.560 --> 0:24:37.680
<v Speaker 1>keep track anymore, all this Dungeons and Dragons saving throw

0:24:37.720 --> 0:24:40.119
<v Speaker 1>talk we had. OK, hold on, I've got a question,

0:24:41.320 --> 0:24:43.919
<v Speaker 1>So hold on. If the Common Crawl is trying to

0:24:44.600 --> 0:24:52.200
<v Speaker 1>preserve and make accessible continuously updated snapshots of the web

0:24:52.720 --> 0:24:55.680
<v Speaker 1>where on Earth are they going to, like, store that

0:24:55.800 --> 0:24:57.800
<v Speaker 1>and make it available, right, because it's not going to

0:24:57.960 --> 0:25:01.480
<v Speaker 1>fit on, like, a thumb drive. So where... I think

0:25:01.480 --> 0:25:03.840
<v Speaker 1>it's also funny that that sort of becomes part of

0:25:03.840 --> 0:25:07.600
<v Speaker 1>the web, Like the web now incorporates a snapshot of

0:25:07.640 --> 0:25:10.320
<v Speaker 1>the previous web, and so it just gets that much larger.

0:25:10.400 --> 0:25:13.400
<v Speaker 1>So yeah, where's the stuff living? It is all living

0:25:13.640 --> 0:25:17.960
<v Speaker 1>on Amazon Web Services. Uh, specifically, it's it's stored in

0:25:18.040 --> 0:25:21.119
<v Speaker 1>Amazon Simple Storage Service or S three as it is

0:25:21.200 --> 0:25:25.040
<v Speaker 1>sometimes known, and you can analyze it via Amazon's

0:25:25.119 --> 0:25:29.879
<v Speaker 1>Elastic Compute Cloud, or EC two. And this is

0:25:29.920 --> 0:25:34.240
<v Speaker 1>so cool guys, because because okay, given Amazon's web service scope,

0:25:34.280 --> 0:25:38.200
<v Speaker 1>it means that practically anyone in the whole world can

0:25:38.280 --> 0:25:42.400
<v Speaker 1>download entire crawls for free, or can if they don't

0:25:42.440 --> 0:25:46.919
<v Speaker 1>want to, you know, use hundred sixty terabytes of space.

0:25:47.000 --> 0:25:48.879
<v Speaker 1>They can they can just use EC two to

0:25:49.119 --> 0:25:53.600
<v Speaker 1>really easily run simple data crunches for like an hourly charge,

0:25:53.640 --> 0:25:57.280
<v Speaker 1>like anywhere from a few bucks to maybe fifty dollars

0:25:58.000 --> 0:26:02.240
<v Speaker 1>for pretty simple computations. So and of course, more savvy

0:26:02.320 --> 0:26:05.160
<v Speaker 1>users can write their own code to investigate stuff with.

0:26:05.280 --> 0:26:08.280
<v Speaker 1>But but but yeah, for for the common user. This

0:26:08.359 --> 0:26:11.160
<v Speaker 1>is revolutionary. Yeah. To me, this is uh. I mean,

0:26:11.200 --> 0:26:14.800
<v Speaker 1>obviously I would I would recommend doing the approach where

0:26:14.840 --> 0:26:18.879
<v Speaker 1>you're you're searching on Amazon stuff. I can't imagine the

0:26:18.960 --> 0:26:21.680
<v Speaker 1>phone call you would get from your internet service provider.

0:26:22.200 --> 0:26:24.520
<v Speaker 1>I see you're trying to download a hundred and sixty

0:26:24.560 --> 0:26:28.080
<v Speaker 1>eight terabytes' worth of data over our lines. You

0:26:28.160 --> 0:26:34.120
<v Speaker 1>have gone significantly over your bandwidth cap. Yeah. And they

0:26:34.280 --> 0:26:37.520
<v Speaker 1>can do all of this because Amazon has specifically chosen

0:26:37.560 --> 0:26:40.320
<v Speaker 1>to waive their storage fees for for this and a

0:26:40.320 --> 0:26:43.000
<v Speaker 1>handful of other things that they consider to be of

0:26:43.119 --> 0:26:46.160
<v Speaker 1>wide public interest, like like like weather and census data.

0:26:46.320 --> 0:26:49.240
<v Speaker 1>And it's that's incredible, right, I mean, because you are

0:26:49.320 --> 0:26:54.359
<v Speaker 1>talking about a significantly huge amount of data, So to say,

0:26:54.440 --> 0:26:57.800
<v Speaker 1>you know this, this is so important and so potentially

0:26:57.880 --> 0:27:01.880
<v Speaker 1>beneficial to mankind that we're not going to end up,

0:27:02.000 --> 0:27:05.760
<v Speaker 1>you know, charging these the storage fees for it. That's

0:27:05.760 --> 0:27:10.800
<v Speaker 1>that's great, Yeah, very very encouraging. So when we get

0:27:10.840 --> 0:27:12.479
<v Speaker 1>down to all, right, well, what can you actually use

0:27:12.520 --> 0:27:14.520
<v Speaker 1>all that data for? Well, just think about the stuff

0:27:14.520 --> 0:27:16.440
<v Speaker 1>that's on the web and pretty much anything you could

0:27:16.440 --> 0:27:19.040
<v Speaker 1>think of that you know, you have a question you

0:27:19.119 --> 0:27:21.440
<v Speaker 1>might have that could not be answered through a simple

0:27:21.440 --> 0:27:26.680
<v Speaker 1>search query, you could potentially answer by leveraging this information. So, uh,

0:27:27.000 --> 0:27:29.800
<v Speaker 1>you know, we're talking about everything from lots of stuff

0:27:29.800 --> 0:27:32.840
<v Speaker 1>that deals with AI. Actually, like you were mentioning earlier, Joe,

0:27:33.440 --> 0:27:39.760
<v Speaker 1>stuff like developing better natural language algorithms so that the

0:27:39.760 --> 0:27:43.240
<v Speaker 1>the machines of the future can understand a wider variety

0:27:43.480 --> 0:27:47.959
<v Speaker 1>of inputs and make meaningful connections between that input and

0:27:48.080 --> 0:27:51.600
<v Speaker 1>the desired output. So in other words, I could talk

0:27:51.680 --> 0:27:54.160
<v Speaker 1>to my computer or my phone as if it were

0:27:54.200 --> 0:27:56.960
<v Speaker 1>a person, and no matter how I might word things

0:27:56.960 --> 0:28:00.680
<v Speaker 1>in my own quirky way, the machine understands what I mean.

0:28:01.440 --> 0:28:04.680
<v Speaker 1>So it's not it's not responding to what I say

0:28:05.119 --> 0:28:07.520
<v Speaker 1>versus, it's more responding to what I mean, which would

0:28:07.520 --> 0:28:13.879
<v Speaker 1>be awesome. Uh. Also stuff like speech recognition, UM emerging

0:28:13.880 --> 0:28:16.520
<v Speaker 1>global trends. We mentioned that. You know, let's say that

0:28:16.560 --> 0:28:20.520
<v Speaker 1>you wanted to track the outbreak of a disease and

0:28:20.640 --> 0:28:23.359
<v Speaker 1>to try and get at you know, where did this start?

0:28:23.400 --> 0:28:25.680
<v Speaker 1>How can we prevent this from happening again? That kind

0:28:25.720 --> 0:28:28.640
<v Speaker 1>of thing that can be really useful. And sometimes you're

0:28:28.760 --> 0:28:34.400
<v Speaker 1>tracking this through uh, not like official documents, but through

0:28:34.720 --> 0:28:36.919
<v Speaker 1>you know, people on Twitter saying, oh, I have the

0:28:36.920 --> 0:28:39.840
<v Speaker 1>flu right, Yeah, it might be it might be social media,

0:28:39.920 --> 0:28:42.440
<v Speaker 1>it could be news reports. I mean, it could be

0:28:42.480 --> 0:28:47.440
<v Speaker 1>all these different, completely separate pieces that would be way

0:28:47.480 --> 0:28:50.200
<v Speaker 1>too hard for you to put together on your own. Sure,

0:28:50.440 --> 0:28:54.960
<v Speaker 1>we did a whole episode once about financial purposes for

0:28:54.960 --> 0:28:59.840
<v Speaker 1>big data, so playing the stock markets and all that. Yeah, yeah, exactly.

0:29:00.000 --> 0:29:02.239
<v Speaker 1>So this is this is really important stuff, and we're

0:29:02.240 --> 0:29:06.120
<v Speaker 1>gonna see a lot more examples of people leveraging

0:29:06.200 --> 0:29:09.920
<v Speaker 1>big data, especially now that it's outside the realm of

0:29:09.960 --> 0:29:14.040
<v Speaker 1>just mega corporations, right, where we can see people or

0:29:14.440 --> 0:29:18.360
<v Speaker 1>researchers who have interest in all sorts of different fields

0:29:19.360 --> 0:29:22.560
<v Speaker 1>taking advantage of this massive amount of information that we

0:29:22.680 --> 0:29:27.640
<v Speaker 1>continue to accumulate day after day. And uh, that's really exciting.

0:29:27.840 --> 0:29:31.520
<v Speaker 1>It makes me think of having access to the best

0:29:31.640 --> 0:29:37.280
<v Speaker 1>research librarians in the world, all boiled into the largest

0:29:37.480 --> 0:29:41.040
<v Speaker 1>library you can imagine. That's essentially what we're talking about here.

0:29:41.640 --> 0:29:46.120
<v Speaker 1>So very exciting, and the Common Crawl is pretty inspirational.

0:29:46.160 --> 0:29:48.960
<v Speaker 1>It's one of those things where you realize it took

0:29:49.000 --> 0:29:52.440
<v Speaker 1>a lot of determination to make that become a reality

0:29:52.800 --> 0:29:56.040
<v Speaker 1>and the potential benefit, and it also is incredibly forward

0:29:56.080 --> 0:29:59.120
<v Speaker 1>thinking to be all the way back in two thousand

0:29:59.200 --> 0:30:01.960
<v Speaker 1>eight and putting this together. And that was before

0:30:02.000 --> 0:30:06.240
<v Speaker 1>we had really developed sophisticated tools that could leverage it properly.

0:30:06.280 --> 0:30:08.440
<v Speaker 1>Now we're getting to see that as big data has

0:30:08.480 --> 0:30:13.200
<v Speaker 1>become an industry unto itself. I mean, it's really exciting now.

0:30:13.480 --> 0:30:19.000
<v Speaker 1>So thank goodness that the idea was was implemented long enough,

0:30:19.120 --> 0:30:22.520
<v Speaker 1>long ago enough for it to actually have uh you know,

0:30:22.680 --> 0:30:28.480
<v Speaker 1>uh an established presence and now we can really see

0:30:28.520 --> 0:30:32.000
<v Speaker 1>how we can take advantage of it. So Elbaz is

0:30:32.040 --> 0:30:35.400
<v Speaker 1>such a such a fascinating dude. I have found so

0:30:35.440 --> 0:30:37.640
<v Speaker 1>many interesting interviews and stuff with him. I think that

0:30:37.640 --> 0:30:39.640
<v Speaker 1>we should maybe maybe not on the show, but maybe

0:30:39.640 --> 0:30:41.479
<v Speaker 1>if you'd want to do a TechStuff episode kind

0:30:41.480 --> 0:30:43.720
<v Speaker 1>of focused on him, that would be kind of cool. Yeah.

0:30:43.800 --> 0:30:46.120
<v Speaker 1>I love to do episodes where we are able to

0:30:46.160 --> 0:30:50.160
<v Speaker 1>look at influential figures in technology on that show, So

0:30:50.200 --> 0:30:53.160
<v Speaker 1>that would be fantastic. I will add that to the list.

0:30:53.960 --> 0:30:57.880
<v Speaker 1>Uh So, the Common Crawl is really an interesting project.

0:30:57.960 --> 0:31:01.280
<v Speaker 1>If you have not looked into it, go check it out, um,

0:31:01.320 --> 0:31:03.760
<v Speaker 1>you know, because it may be one of those things

0:31:03.760 --> 0:31:06.520
<v Speaker 1>that could come in handy if you're working on a

0:31:06.640 --> 0:31:10.680
<v Speaker 1>research project. If you are just curious about how big

0:31:10.800 --> 0:31:14.200
<v Speaker 1>data is gonna continue to have a huge impact in

0:31:14.240 --> 0:31:17.440
<v Speaker 1>our lives, go go seek that out. Yeah, it's a

0:31:17.640 --> 0:31:19.800
<v Speaker 1>you can find them at common crawl dot org.

0:31:20.640 --> 0:31:25.440
<v Speaker 1>And if you if you're into making donations to nonprofit

0:31:25.560 --> 0:31:28.360
<v Speaker 1>organizations that are tax deductible, you can do that thing too.

0:31:28.480 --> 0:31:32.240
<v Speaker 1>That's pretty cool, all right, guys. That wraps up this discussion.

0:31:32.400 --> 0:31:35.959
<v Speaker 1>If you have any suggestions for future topics for Forward Thinking,

0:31:36.480 --> 0:31:39.000
<v Speaker 1>you should let us know. Send us an email. That

0:31:39.040 --> 0:31:42.200
<v Speaker 1>address is FW thinking at HowStuffWorks dot com, or

0:31:42.280 --> 0:31:46.160
<v Speaker 1>drop us a line on Facebook, Twitter or Google Plus.

0:31:46.200 --> 0:31:48.280
<v Speaker 1>At Twitter and Google Plus, we are f w thinking.

0:31:48.760 --> 0:31:51.880
<v Speaker 1>Just search fw thinking in Facebook and we will pop

0:31:52.040 --> 0:31:54.160
<v Speaker 1>right up. Leave us a message. We read all of them.

0:31:54.160 --> 0:31:56.480
<v Speaker 1>We look forward to hearing from you, and you'll hear

0:31:56.520 --> 0:32:05.040
<v Speaker 1>from us again really soon. For more on this topic

0:32:05.080 --> 0:32:18.040
<v Speaker 1>and the future of technology, visit forward thinking dot com,

0:32:18.080 --> 0:32:20.880
<v Speaker 1>brought to you by Toyota. Let's go places.