WEBVTT - A Small Episode About Big Data

0:00:04.440 --> 0:00:12.360
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio. Hey there,

0:00:12.400 --> 0:00:16.239
<v Speaker 1>and welcome to TechStuff. I'm your host, Jonathan Strickland.

0:00:16.320 --> 0:00:19.959
<v Speaker 1>I'm an executive producer with iHeart Podcasts, and how the

0:00:20.160 --> 0:00:24.360
<v Speaker 1>tech are you? So early on in the days of

0:00:24.360 --> 0:00:26.439
<v Speaker 1>TechStuff, back when I was still a staff writer

0:00:26.560 --> 0:00:30.400
<v Speaker 1>for a little website called HowStuffWorks.com, my boss

0:00:31.000 --> 0:00:34.840
<v Speaker 1>Conal Byrne, who is now a big shot over here

0:00:34.840 --> 0:00:37.560
<v Speaker 1>at iHeart, he came over to me with an assignment.

0:00:37.640 --> 0:00:40.360
<v Speaker 1>He wanted me to do some articles and some episodes

0:00:40.520 --> 0:00:46.680
<v Speaker 1>about this buzzword concept called big data. And I had

0:00:46.760 --> 0:00:50.040
<v Speaker 1>heard the term big data, and obviously there's a pretty

0:00:50.080 --> 0:00:52.839
<v Speaker 1>darn good hint of what big data is all about

0:00:52.920 --> 0:00:55.720
<v Speaker 1>just in the nature of the name itself, but beyond that,

0:00:55.800 --> 0:00:58.680
<v Speaker 1>I didn't really know much, so I jumped to it.

0:00:59.080 --> 0:01:01.960
<v Speaker 1>And the interesting thing is that since that time, the

0:01:02.040 --> 0:01:05.959
<v Speaker 1>discipline of big data has evolved significantly. When I was

0:01:06.000 --> 0:01:08.760
<v Speaker 1>first working on my articles and episodes, we were mostly

0:01:08.800 --> 0:01:13.119
<v Speaker 1>talking about how technological tools made it easier to collect

0:01:13.760 --> 0:01:18.280
<v Speaker 1>vast amounts of information very quickly and to store it.

0:01:18.720 --> 0:01:22.160
<v Speaker 1>But we didn't necessarily have equally sufficient tools to do

0:01:22.200 --> 0:01:25.360
<v Speaker 1>anything useful with all that information, or at

0:01:25.440 --> 0:01:29.680
<v Speaker 1>least those tools weren't widely known and understood beyond a

0:01:29.720 --> 0:01:34.759
<v Speaker 1>certain circle of computer scientists. Flash forward a few years,

0:01:35.080 --> 0:01:39.320
<v Speaker 1>and we'd see companies developing new methods to analyze large

0:01:39.440 --> 0:01:42.160
<v Speaker 1>chunks of data. Oh, by the way, I do the

0:01:42.200 --> 0:01:45.360
<v Speaker 1>weird data data thing, and there's no rhyme or reason

0:01:45.360 --> 0:01:46.680
<v Speaker 1>to it. I don't even know which one I'm going

0:01:46.720 --> 0:01:49.360
<v Speaker 1>to say before I say it, so I apologize because

0:01:49.400 --> 0:01:52.960
<v Speaker 1>I know it's irritating. It irritates me too. Anyway, other

0:01:53.000 --> 0:01:57.480
<v Speaker 1>companies sprung up with products that were meant to help

0:01:57.720 --> 0:02:01.040
<v Speaker 1>with data analysis, and it seemed like we were going

0:02:01.080 --> 0:02:03.800
<v Speaker 1>from an era of well, now I have all this information,

0:02:03.880 --> 0:02:06.200
<v Speaker 1>what do I do now, to an era of I

0:02:06.280 --> 0:02:09.440
<v Speaker 1>have discovered cryptic secrets that were hiding in plain sight

0:02:09.680 --> 0:02:13.679
<v Speaker 1>thanks to data analysis, and that somehow it all happened overnight.

0:02:13.960 --> 0:02:16.639
<v Speaker 1>So today I thought we would actually look back over

0:02:16.680 --> 0:02:21.000
<v Speaker 1>the history of the big data concept, how various systems

0:02:21.000 --> 0:02:25.520
<v Speaker 1>have made it possible to sift through seemingly meaningless information

0:02:25.880 --> 0:02:28.480
<v Speaker 1>in order to find nuggets of wisdom, and why we

0:02:28.560 --> 0:02:31.960
<v Speaker 1>might not always be able to trust the answers that

0:02:32.000 --> 0:02:35.960
<v Speaker 1>we discover. So the history of big data starts in

0:02:36.000 --> 0:02:38.680
<v Speaker 1>the twenty tens, or maybe it starts in two thousand

0:02:38.720 --> 0:02:42.240
<v Speaker 1>and five, or maybe in nineteen ninety, or maybe the

0:02:42.280 --> 0:02:46.239
<v Speaker 1>sixteen hundreds, or maybe nearly twenty thousand years ago. You

0:02:46.520 --> 0:02:48.280
<v Speaker 1>might have already picked up on the fact that folks

0:02:48.280 --> 0:02:51.120
<v Speaker 1>don't quite agree on where we should start when talking

0:02:51.160 --> 0:02:54.640
<v Speaker 1>about big data. But that makes sense. Ever since humans

0:02:54.720 --> 0:02:58.800
<v Speaker 1>have started to write stuff down, we've been pretty darn

0:02:58.880 --> 0:03:03.360
<v Speaker 1>invested in the collection and then the classification of information.

0:03:03.840 --> 0:03:07.079
<v Speaker 1>Whether it's to figure out the best time to sow

0:03:07.320 --> 0:03:10.480
<v Speaker 1>or harvest crops, or keep track of how much we've

0:03:10.520 --> 0:03:13.120
<v Speaker 1>traded with that other band of ne'er-do-wells who live

0:03:13.160 --> 0:03:15.440
<v Speaker 1>on the other side of the holler, or we just

0:03:15.520 --> 0:03:18.160
<v Speaker 1>want to make a record of how great it was

0:03:18.200 --> 0:03:20.640
<v Speaker 1>that we kicked the butt of that mastodon, real good.

0:03:20.840 --> 0:03:26.440
<v Speaker 1>We've been really obsessed with data and collection and retrieval. Now,

0:03:27.120 --> 0:03:30.240
<v Speaker 1>this obsession also means that we had to come up

0:03:30.280 --> 0:03:34.000
<v Speaker 1>with various ways to store and analyze this information. Raw

0:03:34.000 --> 0:03:37.720
<v Speaker 1>information doesn't do anyone much good, and so throughout antiquity

0:03:37.960 --> 0:03:41.520
<v Speaker 1>we came up with means of recording and storing and

0:03:41.560 --> 0:03:45.960
<v Speaker 1>making use of information. Not only did hardworking humans create

0:03:46.120 --> 0:03:49.760
<v Speaker 1>libraries where we could gather all this knowledge and then

0:03:50.000 --> 0:03:52.840
<v Speaker 1>lose some of those libraries along the way due to

0:03:52.880 --> 0:03:55.640
<v Speaker 1>the fact that we humans also are pretty stupid and

0:03:55.680 --> 0:03:59.480
<v Speaker 1>we end up having disputes that involve burning each other's

0:03:59.520 --> 0:04:03.200
<v Speaker 1>stuff to the ground. Yeah, I'm still bitter about certain

0:04:03.240 --> 0:04:07.080
<v Speaker 1>libraries being destroyed over in antiquity, but it means that

0:04:07.120 --> 0:04:10.120
<v Speaker 1>we also had to come up with methodologies to categorize

0:04:10.280 --> 0:04:13.600
<v Speaker 1>and classify information. Otherwise you may as well just have

0:04:13.680 --> 0:04:17.240
<v Speaker 1>a big old pile of scrolls or books or whatever,

0:04:17.600 --> 0:04:20.200
<v Speaker 1>and then people just you know, have to sort through

0:04:20.240 --> 0:04:23.159
<v Speaker 1>them and see if they can find anything, which actually

0:04:23.240 --> 0:04:25.640
<v Speaker 1>sparks two different memories in my head. One is that

0:04:25.839 --> 0:04:28.640
<v Speaker 1>there used to be a used bookstore I would go

0:04:28.680 --> 0:04:33.120
<v Speaker 1>to here in Atlanta, and often the used bookstore was

0:04:33.560 --> 0:04:37.919
<v Speaker 1>completely unorganized, right? Like, you literally could go through a

0:04:37.920 --> 0:04:40.240
<v Speaker 1>bookshelf and it's just going to be books that are

0:04:40.720 --> 0:04:43.440
<v Speaker 1>more or less the same size, but otherwise there's no

0:04:43.680 --> 0:04:45.679
<v Speaker 1>rhyme or reason as to why they were put there,

0:04:46.040 --> 0:04:47.800
<v Speaker 1>and it was like you were on a treasure hunt.

0:04:48.000 --> 0:04:55.000
<v Speaker 1>And then I'm also reminded of a naval museum in Apalachicola, Florida,

0:04:55.000 --> 0:04:57.480
<v Speaker 1>which is on the Panhandle. I went to this little,

0:04:57.880 --> 0:05:02.159
<v Speaker 1>you know, naval museum, like a ship museum, and I

0:05:02.360 --> 0:05:05.480
<v Speaker 1>remember that all the exhibits were kind of in a

0:05:05.520 --> 0:05:08.839
<v Speaker 1>pile on the floor, and you would literally pick things

0:05:08.920 --> 0:05:12.560
<v Speaker 1>up and look at them. And that's kind of what

0:05:12.880 --> 0:05:15.159
<v Speaker 1>it would be like if we didn't have these means

0:05:15.200 --> 0:05:17.880
<v Speaker 1>of classification. Once you get to a certain size, like

0:05:17.920 --> 0:05:21.719
<v Speaker 1>that little museum in Apalachicola wasn't so big as

0:05:21.760 --> 0:05:24.000
<v Speaker 1>to be a problem. But if you're talking about a

0:05:24.040 --> 0:05:27.240
<v Speaker 1>big library, obviously, if you want anything useful, you got

0:05:27.279 --> 0:05:29.120
<v Speaker 1>to come up with a way of classifying all this.

0:05:29.680 --> 0:05:32.760
<v Speaker 1>To that end, ancient folks began to develop a science

0:05:33.160 --> 0:05:37.640
<v Speaker 1>called taxonomy. And this isn't when you stuff dead animals

0:05:37.720 --> 0:05:39.800
<v Speaker 1>so that they look like they might still sort of

0:05:39.839 --> 0:05:44.320
<v Speaker 1>be alive. That's taxidermy. No, taxonomy is the science

0:05:44.320 --> 0:05:47.600
<v Speaker 1>of classification, and it's perhaps best known in the field

0:05:47.640 --> 0:05:51.440
<v Speaker 1>of biology, thanks in large part to a Swedish scientist

0:05:51.440 --> 0:05:55.280
<v Speaker 1>from the eighteenth century named Carl Linnaeus. But there are

0:05:55.480 --> 0:05:59.159
<v Speaker 1>many applications of taxonomy that extend beyond biology. It's just

0:05:59.160 --> 0:06:02.080
<v Speaker 1>the biological taxonomy is the one that I think most

0:06:02.080 --> 0:06:04.400
<v Speaker 1>of us are familiar with because most of us were

0:06:04.440 --> 0:06:08.320
<v Speaker 1>taught it when we were going through basic biology science.

0:06:08.760 --> 0:06:11.840
<v Speaker 1>But the ancient Greeks made some early progress on developing

0:06:11.880 --> 0:06:17.000
<v Speaker 1>systems of classification, and obviously, within modern library science, taxonomy

0:06:17.120 --> 0:06:20.640
<v Speaker 1>is an important discipline, though oddly enough, you could say

0:06:20.680 --> 0:06:24.960
<v Speaker 1>taxonomy in library science is distinct from classification. When I

0:06:25.000 --> 0:06:29.039
<v Speaker 1>was looking this up, I found resources for library science

0:06:29.279 --> 0:06:32.920
<v Speaker 1>that made these two distinct disciplines. Classification was one and

0:06:32.920 --> 0:06:36.080
<v Speaker 1>taxonomy was another. Now, this is because there are various

0:06:36.400 --> 0:06:39.800
<v Speaker 1>methods of classification in library science. The one that I

0:06:39.960 --> 0:06:41.880
<v Speaker 1>was most familiar with when I was growing up was

0:06:41.920 --> 0:06:45.320
<v Speaker 1>the Dewey Decimal System, which I don't even think is

0:06:45.680 --> 0:06:48.760
<v Speaker 1>the dominant form now, but it was when I was

0:06:48.800 --> 0:06:51.680
<v Speaker 1>growing up. And it's meant to connect a specific work

0:06:51.760 --> 0:06:54.919
<v Speaker 1>to a specific physical location in a library for the

0:06:54.920 --> 0:06:57.880
<v Speaker 1>purposes of, you know, tracking down the book, right? But

0:06:58.040 --> 0:07:01.760
<v Speaker 1>taxonomy in library science tends to be more towards metadata

0:07:01.920 --> 0:07:06.839
<v Speaker 1>or data about data. In fact, metadata plays a huge

0:07:06.880 --> 0:07:11.160
<v Speaker 1>part in big data. Oh man, I did it both

0:07:11.200 --> 0:07:14.160
<v Speaker 1>ways in one sentence. I feel awful. Anyway, the information

0:07:14.240 --> 0:07:17.840
<v Speaker 1>about information can be as useful as the information itself

0:07:17.840 --> 0:07:20.280
<v Speaker 1>in some cases. I have often talked about this with

0:07:20.720 --> 0:07:24.720
<v Speaker 1>personal information about how info about info can give you

0:07:24.760 --> 0:07:28.160
<v Speaker 1>a lot of insight into a person. Maybe you don't

0:07:28.200 --> 0:07:30.560
<v Speaker 1>have a person's name, but you have a couple of

0:07:30.560 --> 0:07:34.200
<v Speaker 1>different data points about that person. In some cases, you

0:07:34.240 --> 0:07:37.920
<v Speaker 1>can actually narrow down the identity of the person you're

0:07:37.960 --> 0:07:41.040
<v Speaker 1>thinking of just by looking at this metadata. You don't

0:07:41.040 --> 0:07:43.240
<v Speaker 1>even have to see the information about them, which shows

0:07:43.240 --> 0:07:46.120
<v Speaker 1>you how powerful metadata can be. So you start to

0:07:46.120 --> 0:07:48.680
<v Speaker 1>see a cascading effect here where you slowly realize that

0:07:48.720 --> 0:07:51.400
<v Speaker 1>you actually have access to even more information than you

0:07:51.440 --> 0:07:54.800
<v Speaker 1>first anticipated because you also have information about that information.

0:07:54.880 --> 0:07:58.320
<v Speaker 1>It gets pretty wild. Another important development in the history

0:07:58.360 --> 0:08:01.840
<v Speaker 1>of big data is the creation of statistics. So let's

0:08:01.840 --> 0:08:05.360
<v Speaker 1>give the Merriam-Webster definition of statistics, shall we, just to

0:08:05.800 --> 0:08:10.520
<v Speaker 1>have a baseline. It is, quote, a branch of mathematics

0:08:10.920 --> 0:08:16.440
<v Speaker 1>dealing with the collection, analysis, interpretation, and presentation of masses

0:08:16.480 --> 0:08:21.200
<v Speaker 1>of numerical data, end quote. Now, one famous early example

0:08:21.240 --> 0:08:24.160
<v Speaker 1>of statistics comes to us courtesy of a fellow named

0:08:24.360 --> 0:08:31.080
<v Speaker 1>John Graunt. He was looking at mortality rates in London,

0:08:31.440 --> 0:08:34.160
<v Speaker 1>and that gave him a lot more information and helped

0:08:34.200 --> 0:08:38.000
<v Speaker 1>him analyze the course of the plague. For example, he

0:08:38.040 --> 0:08:41.240
<v Speaker 1>could see when the plague was spiking or receding. Pretty

0:08:41.280 --> 0:08:45.080
<v Speaker 1>cheerful stuff, right, But he also used this information, the

0:08:45.120 --> 0:08:49.520
<v Speaker 1>mortality information, to start drawing some conclusions about the population

0:08:49.600 --> 0:08:52.840
<v Speaker 1>of London as a whole, So counting up everybody, like

0:08:52.920 --> 0:08:56.320
<v Speaker 1>figuring out who lives in London. That would have been

0:08:56.400 --> 0:08:59.600
<v Speaker 1>challenging at the time, to say the least. But Graunt

0:08:59.720 --> 0:09:02.680
<v Speaker 1>took information like the number of funerals and then he

0:09:02.760 --> 0:09:06.240
<v Speaker 1>compared it to things like the average family size in

0:09:06.320 --> 0:09:09.559
<v Speaker 1>London to try and make an estimate of London's population.

0:09:09.679 --> 0:09:11.440
<v Speaker 1>So it gave him kind of a working figure that

0:09:11.960 --> 0:09:16.880
<v Speaker 1>was useful for certain applications, specifically government ones. Statistics as

0:09:17.080 --> 0:09:21.439
<v Speaker 1>a branch of mathematics would mature over the following centuries.

0:09:21.920 --> 0:09:25.520
<v Speaker 1>Often it would be the tool that allowed social scientists

0:09:25.600 --> 0:09:31.080
<v Speaker 1>to draw broad conclusions about large populations, but others found

0:09:31.120 --> 0:09:35.800
<v Speaker 1>plenty of alternative applications of statistics. Anyway, the age of

0:09:35.920 --> 0:09:39.320
<v Speaker 1>data analysis was well and truly in swing at this

0:09:39.440 --> 0:09:42.600
<v Speaker 1>point. In the late nineteenth century, the United States was

0:09:42.679 --> 0:09:45.040
<v Speaker 1>getting in a bit of a pickle. And I know

0:09:45.280 --> 0:09:48.080
<v Speaker 1>we're making jumps of centuries here, but we need to.

0:09:48.240 --> 0:09:52.920
<v Speaker 1>We can't go through every single evolution of data collection

0:09:53.160 --> 0:09:56.600
<v Speaker 1>and data analysis. That would be a podcast series all

0:09:56.640 --> 0:09:59.599
<v Speaker 1>in itself. So we're in the late eighteen hundreds and

0:09:59.640 --> 0:10:02.120
<v Speaker 1>the US is in a bit of a problem. The

0:10:02.160 --> 0:10:06.000
<v Speaker 1>country holds a census every ten years, where they're essentially

0:10:06.080 --> 0:10:08.920
<v Speaker 1>gathering information about all the citizens in the United States.

0:10:09.040 --> 0:10:12.199
<v Speaker 1>This is required by the US Constitution, and there are

0:10:12.240 --> 0:10:15.960
<v Speaker 1>several reasons why the Census Bureau holds a census every

0:10:15.960 --> 0:10:19.280
<v Speaker 1>ten years. But one of those reasons is that the

0:10:19.400 --> 0:10:24.800
<v Speaker 1>US House of Representatives its membership depends upon population. So

0:10:25.679 --> 0:10:29.280
<v Speaker 1>the more populous a state is, the more representatives that

0:10:29.320 --> 0:10:31.920
<v Speaker 1>state has in the House of Representatives. So if your

0:10:31.960 --> 0:10:35.280
<v Speaker 1>state has a big population, there are more representatives that

0:10:35.400 --> 0:10:39.200
<v Speaker 1>go to the House. If you have a relatively small population,

0:10:39.280 --> 0:10:43.000
<v Speaker 1>then you have fewer House representatives, right, That's how that works.

0:10:43.240 --> 0:10:48.480
<v Speaker 1>So by eighteen eighty things were getting to a really

0:10:48.840 --> 0:10:53.400
<v Speaker 1>difficult situation. The process of collecting and then analyzing all

0:10:53.440 --> 0:10:58.240
<v Speaker 1>the information was so cumbersome that it would take nearly

0:10:58.760 --> 0:11:02.040
<v Speaker 1>the whole decade just to get to a result, and

0:11:02.120 --> 0:11:05.880
<v Speaker 1>that means by the time you're drawing conclusions, it's actually

0:11:05.920 --> 0:11:08.120
<v Speaker 1>time for you to administer the next census. In fact,

0:11:08.360 --> 0:11:11.680
<v Speaker 1>they projected that in eighteen ninety working on the same

0:11:11.720 --> 0:11:15.160
<v Speaker 1>process that they were dependent upon previously, it would take

0:11:15.200 --> 0:11:18.480
<v Speaker 1>a whole decade, so literally you'd be holding your next

0:11:18.480 --> 0:11:21.000
<v Speaker 1>census while you were just getting your information from the

0:11:21.080 --> 0:11:23.440
<v Speaker 1>last one. So the Census Bureau needed a way to

0:11:23.440 --> 0:11:26.840
<v Speaker 1>collect and analyze this information in a much more efficient process.

0:11:27.280 --> 0:11:32.360
<v Speaker 1>They tapped a man named Herman Hollerith to accomplish this.

0:11:32.920 --> 0:11:36.720
<v Speaker 1>So Hollerith took a punch card system that had been

0:11:36.800 --> 0:11:40.840
<v Speaker 1>used in weaving, weaving with mechanical looms. I've talked about

0:11:40.840 --> 0:11:43.160
<v Speaker 1>this in the past with the history of punch cards.

0:11:43.400 --> 0:11:47.840
<v Speaker 1>In fact, this also gets into perhaps a somewhat apocryphal

0:11:47.880 --> 0:11:51.080
<v Speaker 1>story of where the word sabotage comes from, but that's

0:11:51.280 --> 0:11:54.360
<v Speaker 1>for another time. So he took this punch card system

0:11:54.400 --> 0:11:58.160
<v Speaker 1>that had been used to set weaving patterns with mechanical looms,

0:11:58.360 --> 0:12:00.760
<v Speaker 1>and then he adapted that to serve as a way

0:12:01.120 --> 0:12:05.040
<v Speaker 1>to record information so that you could feed the card

0:12:05.080 --> 0:12:09.840
<v Speaker 1>to a tabulation machine which then could actually tabulate the results.

0:12:10.080 --> 0:12:13.600
<v Speaker 1>And his invention meant that ten years of labor done

0:12:13.600 --> 0:12:16.320
<v Speaker 1>by clerks who are working at desks would actually boil

0:12:16.400 --> 0:12:21.080
<v Speaker 1>down to about three months of labor using the tabulation machine. Obviously,

0:12:21.640 --> 0:12:24.920
<v Speaker 1>that was a huge improvement. Hollerith formed a company that

0:12:25.080 --> 0:12:28.040
<v Speaker 1>over time would evolve into one of the most famous

0:12:28.040 --> 0:12:32.120
<v Speaker 1>companies in all the world, Kentucky Fried Chicken. I'm just kidding.

0:12:32.200 --> 0:12:36.760
<v Speaker 1>It wasn't KFC. Instead, it was IBM. That's the company

0:12:37.040 --> 0:12:40.200
<v Speaker 1>that would grow out of Hollerith's company that he founded

0:12:40.600 --> 0:12:44.360
<v Speaker 1>in the nineteenth century. Anyway, we're not going to spend

0:12:44.800 --> 0:12:47.520
<v Speaker 1>too much time in all these centuries gone by. We're

0:12:47.520 --> 0:12:50.560
<v Speaker 1>actually going to speed things up and get up to

0:12:50.600 --> 0:12:54.000
<v Speaker 1>the twentieth century. But before we do that, let's take

0:12:54.240 --> 0:13:07.480
<v Speaker 1>a quick break to thank our sponsor. We're back. Okay,

0:13:08.200 --> 0:13:12.439
<v Speaker 1>So the actual term big data is still waiting for us.

0:13:12.480 --> 0:13:14.280
<v Speaker 1>We're not going to really get to that until we

0:13:14.360 --> 0:13:17.200
<v Speaker 1>hit the late nineteen nineties or so. But there are

0:13:17.200 --> 0:13:20.120
<v Speaker 1>a few things to point out before we get up

0:13:20.160 --> 0:13:24.760
<v Speaker 1>to there. Folks were starting to notice that we were generating, collecting,

0:13:24.880 --> 0:13:28.920
<v Speaker 1>and storing an awful lot of information in the twentieth century,

0:13:29.200 --> 0:13:33.280
<v Speaker 1>and that the rate of data generation was on the rise.

0:13:33.440 --> 0:13:35.960
<v Speaker 1>Not only were we generating a whole bunch of information,

0:13:36.360 --> 0:13:39.680
<v Speaker 1>we were doing it in larger amounts year over year.

0:13:39.880 --> 0:13:42.520
<v Speaker 1>In fact, it was rising much faster than our rate

0:13:42.559 --> 0:13:47.040
<v Speaker 1>of consumption of information, meaning that we were making way

0:13:47.120 --> 0:13:50.320
<v Speaker 1>more data than we were actually able to use. And

0:13:50.360 --> 0:13:53.560
<v Speaker 1>a big thanks goes out to Forbes for an article

0:13:53.600 --> 0:13:57.080
<v Speaker 1>that's titled A Very Short History of Big Data by

0:13:57.200 --> 0:14:00.560
<v Speaker 1>Gil Press. A lot of the information that I'm drawing

0:14:00.640 --> 0:14:03.280
<v Speaker 1>upon came from that article. It is fantastic if you

0:14:03.320 --> 0:14:05.320
<v Speaker 1>want to learn more about this. I'm not going to

0:14:05.360 --> 0:14:07.960
<v Speaker 1>cover every element that they do. I mean, that would

0:14:07.960 --> 0:14:10.560
<v Speaker 1>just be me regurgitating their article. You should check it

0:14:10.559 --> 0:14:13.080
<v Speaker 1>out if you're interested in the history of big data.

0:14:13.240 --> 0:14:15.440
<v Speaker 1>We're going to touch on a few of the important points,

0:14:15.760 --> 0:14:18.000
<v Speaker 1>or what I think of as the important points. So

0:14:18.679 --> 0:14:20.400
<v Speaker 1>one of the earliest ones we're going to talk about

0:14:20.440 --> 0:14:24.400
<v Speaker 1>is in nineteen forty four, a librarian named Fremont Rider,

0:14:24.600 --> 0:14:28.640
<v Speaker 1>which is a fantastic name, wrote a work titled The

0:14:28.680 --> 0:14:32.040
<v Speaker 1>Scholar and the Future of the Research Library. So Rider

0:14:32.240 --> 0:14:34.840
<v Speaker 1>made an observation that reminds me a lot of Gordon

0:14:34.920 --> 0:14:39.800
<v Speaker 1>Moore's famous Moore's law, except this involves not silicon chips

0:14:40.280 --> 0:14:44.880
<v Speaker 1>but physical libraries. So Rider said that your typical library

0:14:45.120 --> 0:14:49.240
<v Speaker 1>in your typical American university was doubling in size every

0:14:49.320 --> 0:14:52.720
<v Speaker 1>sixteen years. He projected that this would mean that by

0:14:52.760 --> 0:14:56.440
<v Speaker 1>the year twenty forty, the library at Yale University would

0:14:56.480 --> 0:14:59.280
<v Speaker 1>be so large as to require a staff of more

0:14:59.320 --> 0:15:04.480
<v Speaker 1>than six thousand people to manage it. Of course, this

0:15:04.640 --> 0:15:08.280
<v Speaker 1>was before we had digital storage and digital filing systems

0:15:08.560 --> 0:15:12.600
<v Speaker 1>that have largely mitigated this particular requirement. We don't need

0:15:13.040 --> 0:15:16.960
<v Speaker 1>the physical space necessarily that we would if everything were

0:15:17.000 --> 0:15:20.640
<v Speaker 1>still in hard copy. But the observation showed that data

0:15:20.680 --> 0:15:24.280
<v Speaker 1>accumulation really had a steep trajectory even back in the

0:15:24.360 --> 0:15:28.800
<v Speaker 1>nineteen forties. Similarly, in the early nineteen sixties, a guy

0:15:28.880 --> 0:15:32.080
<v Speaker 1>named Derek Price published a piece explaining that the number

0:15:32.080 --> 0:15:35.400
<v Speaker 1>of scientific journals and papers was on a path of

0:15:35.600 --> 0:15:39.520
<v Speaker 1>exponential growth. It was doubling every fifteen years, so similar

0:15:39.760 --> 0:15:42.880
<v Speaker 1>to the rate at which university libraries were doubling in

0:15:42.960 --> 0:15:45.480
<v Speaker 1>size. Now, part of the reason for this, he said,

0:15:45.840 --> 0:15:50.400
<v Speaker 1>was that scientific discoveries inevitably fuel further discoveries. So you

0:15:50.480 --> 0:15:53.320
<v Speaker 1>find out something new, this inspires other scientists to look

0:15:53.360 --> 0:15:56.160
<v Speaker 1>further into it, they find other new things, and so on.

0:15:56.520 --> 0:15:59.480
<v Speaker 1>In nineteen sixty five, the United States government needed to

0:15:59.480 --> 0:16:02.200
<v Speaker 1>build a place that would store records, including things like

0:16:02.320 --> 0:16:06.600
<v Speaker 1>tax returns and fingerprint sets, and so the plan was

0:16:06.680 --> 0:16:09.800
<v Speaker 1>to take the paper records and then transfer them to

0:16:10.000 --> 0:16:13.360
<v Speaker 1>magnetic tape, and then to store that magnetic tape in

0:16:13.440 --> 0:16:17.800
<v Speaker 1>this so called data center. This project fell through, however,

0:16:18.040 --> 0:16:21.920
<v Speaker 1>because the public got nervous. They felt squicky about this

0:16:22.000 --> 0:16:25.080
<v Speaker 1>idea of the government hoarding vast amounts of information about

0:16:25.080 --> 0:16:28.200
<v Speaker 1>its citizens. They did not fully trust the government. So

0:16:28.240 --> 0:16:30.680
<v Speaker 1>you understand like they're thinking, I don't really feel comfortable

0:16:30.720 --> 0:16:33.520
<v Speaker 1>with you just gathering all this information about us. It

0:16:33.560 --> 0:16:37.240
<v Speaker 1>feels kind of oppressive. Now, what's funny to me is

0:16:37.280 --> 0:16:41.000
<v Speaker 1>that today the average person is more than willing to

0:16:41.080 --> 0:16:44.320
<v Speaker 1>let companies do this to them without even protesting it.

0:16:44.560 --> 0:16:48.840
<v Speaker 1>Because that's how all the online social network companies work, right,

0:16:48.880 --> 0:16:51.440
<v Speaker 1>They work on the basis of gathering information about us

0:16:51.760 --> 0:16:55.480
<v Speaker 1>and then peddling that or hoarding it, however you

0:16:55.560 --> 0:16:58.320
<v Speaker 1>might think of it. And it's very similar to what

0:16:58.360 --> 0:17:00.360
<v Speaker 1>was happening in the nineteen sixties. And back then we

0:17:00.360 --> 0:17:02.480
<v Speaker 1>were like, no, that's not cool, and now we're like,

0:17:02.520 --> 0:17:05.719
<v Speaker 1>that's just how it works. It's wild to me. Anyway,

0:17:05.720 --> 0:17:07.359
<v Speaker 1>I'm going to skip ahead a little bit to the

0:17:07.440 --> 0:17:14.160
<v Speaker 1>nineteen eighties. There was a lecturer, I. A. Tjomsland, and

0:17:14.200 --> 0:17:18.600
<v Speaker 1>I know I butchered his name. I apologize anyway. He

0:17:18.680 --> 0:17:24.159
<v Speaker 1>gave a lecture at the IEEE, or I triple E, symposium in

0:17:24.200 --> 0:17:27.600
<v Speaker 1>which he posits that one reason all this information is

0:17:27.640 --> 0:17:30.560
<v Speaker 1>piling up is that we don't really have a good

0:17:30.560 --> 0:17:34.520
<v Speaker 1>way to determine which information is relevant and which information

0:17:35.119 --> 0:17:39.000
<v Speaker 1>is not. And we can make that determination, but it

0:17:39.040 --> 0:17:42.919
<v Speaker 1>requires work, and meanwhile, we're still accumulating more information. So

0:17:43.280 --> 0:17:45.480
<v Speaker 1>it's the kind of work where you're never done, and

0:17:45.560 --> 0:17:47.840
<v Speaker 1>it feels like you're never making any progress. So most

0:17:47.840 --> 0:17:50.280
<v Speaker 1>of us never bother to do it at all. And

0:17:50.720 --> 0:17:54.200
<v Speaker 1>if our ability to store data is sufficient, in other words,

0:17:54.200 --> 0:17:57.280
<v Speaker 1>if we have ways of storing the information, then we

0:17:57.359 --> 0:18:01.119
<v Speaker 1>have even less incentive to make any determination about the data, right? Like,

0:18:01.160 --> 0:18:03.679
<v Speaker 1>if we've got plenty of storage, well, let's just go

0:18:03.720 --> 0:18:06.639
<v Speaker 1>ahead and keep the information. There's no reason to have

0:18:06.720 --> 0:18:09.080
<v Speaker 1>to worry about it whether it's useful or not. We

0:18:09.119 --> 0:18:12.200
<v Speaker 1>should keep it because it's better for us to keep

0:18:12.840 --> 0:18:17.960
<v Speaker 1>useless information without needing it, rather than accidentally deleting something

0:18:18.160 --> 0:18:21.520
<v Speaker 1>that turned out to be important. Right, And this kind

0:18:21.520 --> 0:18:23.840
<v Speaker 1>of makes sense. I mean, I'm sure a lot of

0:18:23.880 --> 0:18:26.119
<v Speaker 1>you out there can apply that to your lives. I

0:18:26.160 --> 0:18:28.359
<v Speaker 1>certainly can apply it to my life, right? Like, I

0:18:28.480 --> 0:18:31.600
<v Speaker 1>have file folders that are full of stuff that I'm

0:18:31.640 --> 0:18:35.159
<v Speaker 1>never going to touch again, but I still feel reluctant

0:18:35.200 --> 0:18:37.600
<v Speaker 1>to delete it just in case I do need to

0:18:37.600 --> 0:18:39.679
<v Speaker 1>touch it again sometime in the future, even though the

0:18:39.760 --> 0:18:43.399
<v Speaker 1>likelihood of that is very low. So that's anecdotal. I

0:18:43.440 --> 0:18:46.639
<v Speaker 1>can't really call that evidence to prove the point, but

0:18:46.720 --> 0:18:50.240
<v Speaker 1>it feels like the point is relevant. So this is

0:18:50.280 --> 0:18:52.639
<v Speaker 1>also how I play a lot of those big open

0:18:52.640 --> 0:18:56.080
<v Speaker 1>world computer RPGs, by the way, things like Skyrim or whatever,

0:18:56.160 --> 0:18:59.800
<v Speaker 1>because I'll just hoard potions and scrolls and I never

0:19:00.160 --> 0:19:02.679
<v Speaker 1>use them because what if I need it more in

0:19:02.760 --> 0:19:05.600
<v Speaker 1>the future. Baldur's Gate three has really done a number

0:19:05.640 --> 0:19:07.600
<v Speaker 1>on me with this. I got a real problem with

0:19:07.640 --> 0:19:12.120
<v Speaker 1>that. Anyway, the Forbes article details several more entries indicating

0:19:12.160 --> 0:19:15.520
<v Speaker 1>how very smart people were taking note regarding the accumulation

0:19:15.600 --> 0:19:19.320
<v Speaker 1>of information, as well as methods to store the information,

0:19:19.480 --> 0:19:22.520
<v Speaker 1>and increasingly, as time went on, how we can do

0:19:22.640 --> 0:19:25.480
<v Speaker 1>useful things with all this information. So I recommend you

0:19:25.560 --> 0:19:27.480
<v Speaker 1>check out that Forbes article if you want to learn more.

0:19:27.920 --> 0:19:31.240
<v Speaker 1>I think it goes up to about twenty twelve at

0:19:31.240 --> 0:19:34.080
<v Speaker 1>this point. It has been updated numerous times, but obviously

0:19:34.400 --> 0:19:38.760
<v Speaker 1>twenty twelve was quite a long time ago, so it's

0:19:38.840 --> 0:19:41.680
<v Speaker 1>not exactly up to the present day. But it's still

0:19:41.720 --> 0:19:45.440
<v Speaker 1>a really interesting article that gives lots more details about this.

0:19:46.040 --> 0:19:47.960
<v Speaker 1>But I don't want to just regurgitate the article, so

0:19:47.960 --> 0:19:50.840
<v Speaker 1>we're going to hop on ahead. Now, folks in general

0:19:51.000 --> 0:19:54.840
<v Speaker 1>were becoming more aware of this information challenge that was growing.

0:19:55.119 --> 0:19:58.840
<v Speaker 1>But where did the term big data actually come from? Well,

0:19:58.920 --> 0:20:02.760
<v Speaker 1>chances are it sort of arose organically in conversations within

0:20:02.840 --> 0:20:07.160
<v Speaker 1>the computer sector as, you know, hackers and computer scientists

0:20:07.240 --> 0:20:11.080
<v Speaker 1>and programmers and researchers were all wrestling with ways to

0:20:11.200 --> 0:20:14.840
<v Speaker 1>deal with data. Now, by this time, folks had adapted

0:20:14.880 --> 0:20:19.119
<v Speaker 1>an observation made by Cyril Northcote Parkinson to apply to

0:20:19.240 --> 0:20:23.679
<v Speaker 1>computer systems and to information. So Parkinson's original observation was

0:20:23.680 --> 0:20:27.560
<v Speaker 1>that generally speaking, in public administration offices, you know, like

0:20:27.680 --> 0:20:31.960
<v Speaker 1>government offices, work expands to fill the time that was

0:20:32.000 --> 0:20:35.000
<v Speaker 1>allowed for that work. So if you have a project

0:20:35.040 --> 0:20:38.120
<v Speaker 1>that's going to be due in three weeks, but really,

0:20:38.160 --> 0:20:40.760
<v Speaker 1>if you were to be brutally honest, there's only a

0:20:40.800 --> 0:20:44.479
<v Speaker 1>week's worth of work to do for that project. Well,

0:20:44.800 --> 0:20:48.360
<v Speaker 1>that work will almost magically expand so that it actually

0:20:48.400 --> 0:20:51.720
<v Speaker 1>takes three weeks to complete. This gets more nuanced and

0:20:51.760 --> 0:20:54.880
<v Speaker 1>it brings into account elements like bureaucracy. But you get

0:20:54.920 --> 0:20:59.760
<v Speaker 1>the point right that somehow it doesn't matter, you know what,

0:21:00.119 --> 0:21:02.480
<v Speaker 1>who is working the job. It doesn't matter the nature

0:21:02.520 --> 0:21:05.159
<v Speaker 1>of the work. The work will expand to fill the

0:21:05.200 --> 0:21:07.920
<v Speaker 1>amount of time allotted to do that work, which

0:21:07.960 --> 0:21:10.439
<v Speaker 1>meant that if you had said it would take two weeks,

0:21:10.760 --> 0:21:12.879
<v Speaker 1>it would have just expanded to two weeks, not three.

0:21:13.000 --> 0:21:15.960
<v Speaker 1>It's very weird, right? Anyway, folks in the computer biz

0:21:16.000 --> 0:21:19.679
<v Speaker 1>adapted this to say that data will expand to fill

0:21:19.840 --> 0:21:23.199
<v Speaker 1>whatever space you have available for that data. So, in

0:21:23.240 --> 0:21:27.560
<v Speaker 1>other words, you make a bigger storage unit, you're going

0:21:27.640 --> 0:21:30.240
<v Speaker 1>to fill it. Like, that data will just expand to

0:21:30.280 --> 0:21:34.119
<v Speaker 1>fill that even though you thought, oh, I'm future proofing this,

0:21:34.640 --> 0:21:38.640
<v Speaker 1>and again anecdotally, I have observed this in my personal life.

0:21:38.640 --> 0:21:41.320
<v Speaker 1>I remember when hard disk drives first became a thing

0:21:41.400 --> 0:21:44.879
<v Speaker 1>in personal computers, like, they already existed, but personal

0:21:44.920 --> 0:21:47.920
<v Speaker 1>computers didn't have them when they first came out, right,

0:21:47.960 --> 0:21:51.080
<v Speaker 1>you were using external drives like floppy disks and stuff,

0:21:51.320 --> 0:21:54.440
<v Speaker 1>and I remember whenever there would be a dramatic expansion

0:21:54.440 --> 0:21:57.399
<v Speaker 1>of storage space, and it always seemed to be dramatic, right,

0:21:57.480 --> 0:22:00.600
<v Speaker 1>it always seemed like it had doubled since last time.

0:22:00.640 --> 0:22:03.240
<v Speaker 1>And typically that's how it worked. Anyway, I would walk

0:22:03.280 --> 0:22:05.880
<v Speaker 1>away thinking, Wow, I'm never gonna fill all this space.

0:22:06.040 --> 0:22:08.840
<v Speaker 1>I mean, who even needs that much space? Two hundred

0:22:08.840 --> 0:22:12.359
<v Speaker 1>and fifty six megabytes? Who the heck needs that much space?

0:22:12.400 --> 0:22:14.880
<v Speaker 1>That's way too much. I mean, I'll never fill it up.

0:22:15.520 --> 0:22:17.800
<v Speaker 1>But of course I would prove myself wrong, typically in

0:22:17.880 --> 0:22:22.280
<v Speaker 1>record time. But beyond anecdotes, which again don't really count

0:22:22.320 --> 0:22:25.480
<v Speaker 1>as evidence, the observation really pointed out that we will

0:22:25.560 --> 0:22:28.919
<v Speaker 1>eagerly fill up whatever space we're given. You could argue

0:22:29.160 --> 0:22:31.920
<v Speaker 1>this goes back to our tendency to avoid deleting material

0:22:32.000 --> 0:22:35.960
<v Speaker 1>out of concern that it might one day become useful. Anyway,

0:22:36.280 --> 0:22:39.160
<v Speaker 1>By the mid nineteen nineties, there was a computer scientist

0:22:39.240 --> 0:22:43.720
<v Speaker 1>named John Mashey, and he was giving presentations that related

0:22:43.840 --> 0:22:48.120
<v Speaker 1>to this concept of big data. Now, Mashey has dismissed

0:22:48.160 --> 0:22:51.800
<v Speaker 1>the idea that he personally coined the phrase. At most,

0:22:52.160 --> 0:22:55.919
<v Speaker 1>he says that he popularized the term big data in

0:22:55.920 --> 0:22:58.520
<v Speaker 1>his talks. But his point was that he used the

0:22:58.520 --> 0:23:00.720
<v Speaker 1>phrase big data because it was a shorthand way

0:23:00.760 --> 0:23:04.360
<v Speaker 1>to give a nod to several related challenges, ranging from

0:23:04.480 --> 0:23:07.919
<v Speaker 1>storage to analysis. So one could argue that Mashey's use

0:23:07.960 --> 0:23:11.159
<v Speaker 1>of the term approached what we mean by big data today,

0:23:11.320 --> 0:23:13.680
<v Speaker 1>but it wasn't one hundred percent the same thing. And

0:23:14.119 --> 0:23:18.640
<v Speaker 1>the earliest use I've seen cited happened sometime around nineteen

0:23:18.760 --> 0:23:22.679
<v Speaker 1>ninety eight. So we know Mashey didn't invent the phrase,

0:23:23.359 --> 0:23:26.600
<v Speaker 1>and we know that partly because researchers found an instance

0:23:26.680 --> 0:23:29.920
<v Speaker 1>that predates his talks by nearly a decade. Steve Lohr

0:23:30.000 --> 0:23:32.480
<v Speaker 1>wrote a piece for The New York Times titled The

0:23:32.520 --> 0:23:37.119
<v Speaker 1>Origins of Big Data: An Etymological Detective Story. A great,

0:23:37.240 --> 0:23:40.639
<v Speaker 1>great article, by the way. Lohr spoke with an associate

0:23:40.840 --> 0:23:44.640
<v Speaker 1>librarian at Yale Law School named Fred Shapiro, and Fred

0:23:44.680 --> 0:23:47.600
<v Speaker 1>Shapiro did some research and uncovered an instance of the

0:23:47.640 --> 0:23:51.240
<v Speaker 1>phrase big data in a nineteen eighty nine article in

0:23:51.359 --> 0:23:54.640
<v Speaker 1>Harper's Magazine. The author of that piece was Erik Larson,

0:23:54.680 --> 0:23:58.239
<v Speaker 1>who said, quote, the keepers of big data say they

0:23:58.320 --> 0:24:01.119
<v Speaker 1>do it for the consumer's benefit, but data have a

0:24:01.160 --> 0:24:05.280
<v Speaker 1>way of being used for purposes other than originally intended, end quote,

0:24:05.520 --> 0:24:08.600
<v Speaker 1>and boy howdy, we have seen that observation play out

0:24:08.600 --> 0:24:11.960
<v Speaker 1>again and again, haven't we. It's remarkable because nineteen eighty

0:24:12.080 --> 0:24:15.520
<v Speaker 1>nine predates the World Wide Web, certainly predates all the

0:24:15.520 --> 0:24:18.920
<v Speaker 1>social networks that we talk about. But Erik Larson's observation

0:24:19.280 --> 0:24:22.639
<v Speaker 1>is just as relevant, if not more relevant, today than

0:24:22.680 --> 0:24:26.320
<v Speaker 1>it was in nineteen eighty nine. Also, incidentally, Erik Larson

0:24:26.480 --> 0:24:29.720
<v Speaker 1>wrote one of my favorite books of all time. It's

0:24:29.760 --> 0:24:33.280
<v Speaker 1>titled The Devil in the White City. Famous book. I'm

0:24:33.280 --> 0:24:35.119
<v Speaker 1>sure a lot of you have already read it, but

0:24:35.200 --> 0:24:37.840
<v Speaker 1>for those who haven't, it's a book that tells two

0:24:38.000 --> 0:24:43.000
<v Speaker 1>somewhat intertwined stories, the eighteen ninety three World's Columbian Exposition

0:24:43.119 --> 0:24:47.320
<v Speaker 1>in Chicago and the tale behind H. H. Holmes, credited as

0:24:47.359 --> 0:24:50.920
<v Speaker 1>one of America's first serial killers. Now, I originally bought

0:24:50.920 --> 0:24:54.280
<v Speaker 1>the book because I was interested in Holmes's story, but

0:24:54.320 --> 0:24:56.240
<v Speaker 1>I got to be honest, I actually found the chapters

0:24:56.240 --> 0:24:59.560
<v Speaker 1>about the exposition to be far more captivating, and it

0:24:59.640 --> 0:25:01.120
<v Speaker 1>ties into a lot of the stuff we talk

0:25:01.160 --> 0:25:03.520
<v Speaker 1>about on tech stuff. So it's a great book if

0:25:03.560 --> 0:25:05.920
<v Speaker 1>you're looking for something to read. But now let's get

0:25:05.920 --> 0:25:09.520
<v Speaker 1>back to big data. So things continue on their inevitable

0:25:09.560 --> 0:25:14.199
<v Speaker 1>path through time. As it goes, time marches on, we

0:25:14.280 --> 0:25:16.920
<v Speaker 1>get up to the two thousands. By now the Internet

0:25:17.000 --> 0:25:21.679
<v Speaker 1>has greatly exacerbated our data creation and accumulation problem. In

0:25:21.760 --> 0:25:25.879
<v Speaker 1>two thousand, Francis Diebold wrote, quote big data refers to

0:25:25.920 --> 0:25:31.080
<v Speaker 1>the explosion in the quantity and sometimes quality of available

0:25:31.119 --> 0:25:35.239
<v Speaker 1>and potentially relevant data, largely the result of recent and

0:25:35.359 --> 0:25:40.400
<v Speaker 1>unprecedented advancements in data recording and storage technology end quote.

0:25:40.640 --> 0:25:42.679
<v Speaker 1>So we're really starting to close in at this point

0:25:42.720 --> 0:25:45.679
<v Speaker 1>on the concept of big data as we understand it today.

0:25:46.160 --> 0:25:47.960
<v Speaker 1>Then we get up to two thousand and five, and

0:25:48.040 --> 0:25:51.000
<v Speaker 1>a couple actually several important things happened that year in

0:25:51.080 --> 0:25:54.280
<v Speaker 1>the realm of big data. We get Tim O'Reilly and

0:25:54.359 --> 0:25:59.520
<v Speaker 1>his media company, fittingly enough called O'Reilly Media, and this

0:25:59.600 --> 0:26:01.800
<v Speaker 1>is the year that he would publish an article titled

0:26:02.040 --> 0:26:05.679
<v Speaker 1>What Is Web two point oh, a famous or perhaps

0:26:05.760 --> 0:26:09.679
<v Speaker 1>infamous article in tech circles. So the dot com bubble

0:26:09.720 --> 0:26:12.120
<v Speaker 1>had burst several years earlier, around two thousand and two

0:26:12.119 --> 0:26:15.199
<v Speaker 1>thousand and one, and O'Reilly was making observations about the

0:26:15.280 --> 0:26:19.720
<v Speaker 1>qualities that helped the companies that survived that crash versus

0:26:19.720 --> 0:26:22.280
<v Speaker 1>the companies that went under, like what set them apart?

0:26:22.320 --> 0:26:24.480
<v Speaker 1>What are some of the qualities that we can say

0:26:24.840 --> 0:26:27.800
<v Speaker 1>are really valuable on the web. And part of that

0:26:27.880 --> 0:26:31.760
<v Speaker 1>involved how successful web ventures were handling data. Now, that

0:26:31.840 --> 0:26:35.960
<v Speaker 1>same year, he had a guy named Roger Magoulas or

0:26:36.080 --> 0:26:38.480
<v Speaker 1>Mugalas actually I don't know how to say his last name,

0:26:38.520 --> 0:26:40.560
<v Speaker 1>but he was also with O'Reilly, and he argued that

0:26:40.600 --> 0:26:44.000
<v Speaker 1>big data refers to how we now had the capacity

0:26:44.160 --> 0:26:47.879
<v Speaker 1>and the capability to gather and store data sets that

0:26:47.960 --> 0:26:51.640
<v Speaker 1>are so large that our traditional business tools are incapable

0:26:51.680 --> 0:26:56.040
<v Speaker 1>of doing anything useful with that information. It makes me

0:26:56.119 --> 0:26:59.440
<v Speaker 1>think of the Joker in the Dark Knight film where

0:26:59.560 --> 0:27:01.760
<v Speaker 1>he says, as a dog chasing a car, he wouldn't

0:27:01.760 --> 0:27:03.520
<v Speaker 1>know what to do if he caught it. That kind

0:27:03.520 --> 0:27:05.960
<v Speaker 1>of thing. Yeah, we've got all this information, but the

0:27:06.000 --> 0:27:11.119
<v Speaker 1>tools we have aren't sufficient to do anything meaningful with it.

0:27:11.359 --> 0:27:15.280
<v Speaker 1>We were overwhelmed with information. But that same year, because

0:27:15.320 --> 0:27:17.199
<v Speaker 1>an awful lot happened in two thousand and five in

0:27:17.240 --> 0:27:21.640
<v Speaker 1>the big data space, Doug Cutting and Mike Cafarella released

0:27:21.640 --> 0:27:26.720
<v Speaker 1>a tool that would really change things. I'll explain more,

0:27:27.000 --> 0:27:30.040
<v Speaker 1>but first we're going to take another quick break to

0:27:30.119 --> 0:27:42.760
<v Speaker 1>thank our sponsors. Okay, before the break, I teased that

0:27:42.880 --> 0:27:44.760
<v Speaker 1>we were going to talk about a tool made by

0:27:44.800 --> 0:27:48.920
<v Speaker 1>Doug Cutting and Mike Cafarella that would actually change our

0:27:48.960 --> 0:27:52.600
<v Speaker 1>approach to big data and make it possible to do

0:27:52.760 --> 0:27:56.600
<v Speaker 1>meaningful things with it. So these two had read papers

0:27:56.720 --> 0:27:59.800
<v Speaker 1>about Google's file system as well as a tool that

0:27:59.800 --> 0:28:02.960
<v Speaker 1>Google was using called MapReduce. Now, the purpose of

0:28:03.000 --> 0:28:05.960
<v Speaker 1>MapReduce is to take large clusters of data and

0:28:06.080 --> 0:28:09.840
<v Speaker 1>essentially break them down into more manageable chunks, and then

0:28:09.920 --> 0:28:15.000
<v Speaker 1>analyze these chunks in parallel, and this makes the process

0:28:15.040 --> 0:28:17.960
<v Speaker 1>of data analysis faster. It's really just another form of

0:28:18.000 --> 0:28:21.320
<v Speaker 1>parallel processing when you really think about it. Anyway, Cutting

0:28:21.359 --> 0:28:24.800
<v Speaker 1>and Cafarella were inspired to make their own tool that

0:28:24.840 --> 0:28:27.119
<v Speaker 1>could do similar work, but you know, they can make

0:28:27.160 --> 0:28:30.240
<v Speaker 1>it for everybody, and so they created a project called

0:28:30.560 --> 0:28:36.199
<v Speaker 1>Hadoop, and the first version of Hadoop would come

0:28:36.200 --> 0:28:38.080
<v Speaker 1>out in two thousand and six, and it's an open

0:28:38.120 --> 0:28:41.960
<v Speaker 1>source project. It's still around today with thousands of contributors.
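
NOTE
A minimal sketch, in Python, of the split-into-chunks-then-analyze-in-parallel idea described above.
This is not Google's MapReduce or Hadoop's API. The function names (split_into_chunks, map_chunk,
merge_counts, word_count) are made up for illustration, and threads stand in for the many machines
a real cluster would use.
```python
# Hypothetical word-count sketch of the map/reduce pattern; not a real MapReduce or Hadoop API.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
def split_into_chunks(lines, chunk_size):
    # Break the full data set into smaller, more manageable pieces.
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
def map_chunk(chunk):
    # "Map" step: analyze one chunk on its own, producing partial word counts.
    return Counter(word.lower() for line in chunk for word in line.split())
def merge_counts(left, right):
    # "Reduce" step: combine two partial results into one.
    return left + right
def word_count(lines, chunk_size=2, workers=4):
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is handed to a worker; threads are a stand-in for cluster nodes.
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())
if __name__ == "__main__":
    sample = ["big data is big", "data about data", "storage is cheap"]
    print(word_count(sample).most_common(2))
```
Calling word_count on a few lines of text returns a Counter of how often each word appears,
assembled from the partial counts of each chunk.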

0:28:42.160 --> 0:28:44.640
<v Speaker 1>But the important bit is that we were now starting

0:28:44.680 --> 0:28:48.560
<v Speaker 1>to develop new business tools that actually could handle the

0:28:48.600 --> 0:28:52.440
<v Speaker 1>massive amounts of information that we were accumulating. But let's

0:28:52.440 --> 0:28:55.240
<v Speaker 1>take a quick step back. Let's also consider what's going

0:28:55.280 --> 0:28:59.520
<v Speaker 1>on around this same time, the mid to late two thousands,

0:28:59.520 --> 0:29:01.320
<v Speaker 1>and by that I mean the first decade of the

0:29:01.360 --> 0:29:04.800
<v Speaker 1>two thousands. So for the first several years in the

0:29:04.800 --> 0:29:07.920
<v Speaker 1>computer age, it was really computer systems themselves that were

0:29:07.920 --> 0:29:11.160
<v Speaker 1>seen as the genesis of data creation, right like it's

0:29:11.560 --> 0:29:14.280
<v Speaker 1>the computers are the things making all this info. But

0:29:14.400 --> 0:29:16.800
<v Speaker 1>other elements were starting to come into play by this point.

0:29:16.960 --> 0:29:18.600
<v Speaker 1>So when we get up to two thousand and seven,

0:29:18.680 --> 0:29:22.000
<v Speaker 1>we're into the consumer smartphone era, because that was the

0:29:22.040 --> 0:29:25.840
<v Speaker 1>introduction of the Apple iPhone. These consumer smartphones can generate

0:29:26.120 --> 0:29:29.080
<v Speaker 1>enormous amounts of information. You can perform all sorts of

0:29:29.080 --> 0:29:32.640
<v Speaker 1>computational tasks on them. They can track your location, you

0:29:32.680 --> 0:29:36.320
<v Speaker 1>can connect to the internet, et cetera. We also were getting

0:29:36.320 --> 0:29:38.800
<v Speaker 1>into the age of the Internet of Things, so we

0:29:38.800 --> 0:29:42.920
<v Speaker 1>were starting to create millions of these tiny devices, usually

0:29:42.960 --> 0:29:46.320
<v Speaker 1>designed to collect specific bits of information and then zip

0:29:46.400 --> 0:29:49.440
<v Speaker 1>that info off to somewhere else. So it might be

0:29:49.560 --> 0:29:52.880
<v Speaker 1>a speed sensor along a road. It might be a

0:29:52.880 --> 0:29:55.800
<v Speaker 1>thermometer at a weather data collection site. It might be

0:29:55.840 --> 0:29:58.240
<v Speaker 1>a thermostat in your own home. It could be anything.

0:29:58.320 --> 0:30:02.479
<v Speaker 1>Could be a smart speaker. All of these individual little

0:30:02.680 --> 0:30:06.160
<v Speaker 1>components would add to the amount of information we were

0:30:06.600 --> 0:30:10.800
<v Speaker 1>gathering and storing and creating, all in the hopes of

0:30:10.800 --> 0:30:14.160
<v Speaker 1>being able to do something useful with that info. And

0:30:14.520 --> 0:30:17.040
<v Speaker 1>we also had another buzz term that was starting to

0:30:17.040 --> 0:30:19.880
<v Speaker 1>gain traction, just as big data was really beginning to

0:30:19.880 --> 0:30:22.400
<v Speaker 1>transition from a topic that was talked about in a

0:30:22.440 --> 0:30:27.080
<v Speaker 1>relatively small subculture of computer scientists and such into a

0:30:27.160 --> 0:30:31.080
<v Speaker 1>topic that the general public had actually heard about. You know,

0:30:31.280 --> 0:30:36.320
<v Speaker 1>usually we're a few years behind whatever group is really

0:30:36.360 --> 0:30:40.080
<v Speaker 1>focused on the subject matter. So this other buzz term

0:30:40.360 --> 0:30:43.240
<v Speaker 1>was cloud computing, which I also got an assignment to

0:30:43.640 --> 0:30:46.120
<v Speaker 1>work on right around the same time as big data. Now,

0:30:46.160 --> 0:30:48.840
<v Speaker 1>the simplest way to describe cloud computing is that it's

0:30:48.880 --> 0:30:51.840
<v Speaker 1>when you use someone else's computer to do your computational

0:30:51.880 --> 0:30:56.320
<v Speaker 1>tasks because you log in through your computer, but it's

0:30:56.400 --> 0:30:59.080
<v Speaker 1>this other computer that's actually doing the work, or it

0:30:59.160 --> 0:31:02.040
<v Speaker 1>might be a net work of other computers doing that work.

0:31:02.280 --> 0:31:05.680
<v Speaker 1>That work could be that you're storing photos of kitty

0:31:05.720 --> 0:31:11.080
<v Speaker 1>cats on a drive on some cloud storage, or it

0:31:11.160 --> 0:31:13.680
<v Speaker 1>might be that you're using cloud computing to help you

0:31:13.760 --> 0:31:17.280
<v Speaker 1>crunch really big numbers that your computer could not handle

0:31:17.640 --> 0:31:20.920
<v Speaker 1>and you're peeling back the mysteries of quantum mechanics or something.

0:31:21.360 --> 0:31:25.200
<v Speaker 1>So cloud computing would rise at the same time as

0:31:25.200 --> 0:31:27.680
<v Speaker 1>big data, and cloud computing and big data are very

0:31:27.680 --> 0:31:31.440
<v Speaker 1>closely related. They're enablers of one another in a way.

0:31:31.880 --> 0:31:34.720
<v Speaker 1>Organizations and companies feel the need to engage with cloud

0:31:34.760 --> 0:31:38.960
<v Speaker 1>computing services because their data tasks are growing increasingly complex

0:31:39.040 --> 0:31:42.680
<v Speaker 1>and voluminous, and it gets harder and harder to handle

0:31:42.720 --> 0:31:46.080
<v Speaker 1>all of that on your own. Right? Like, most businesses

0:31:46.120 --> 0:31:51.840
<v Speaker 1>these days are not using exclusively on premises computing systems

0:31:52.200 --> 0:31:55.400
<v Speaker 1>to do all their computation and all their storage. It

0:31:55.640 --> 0:31:59.600
<v Speaker 1>just is not practical, right. You would have to continuously

0:32:00.200 --> 0:32:03.600
<v Speaker 1>buy or lease more space just to hold all the

0:32:03.640 --> 0:32:07.360
<v Speaker 1>systems you would need. So instead they engage with cloud

0:32:07.440 --> 0:32:10.800
<v Speaker 1>computing companies that will provide those services for them, and

0:32:10.800 --> 0:32:13.000
<v Speaker 1>then the cloud computing companies will go out and build

0:32:13.000 --> 0:32:15.800
<v Speaker 1>a warehouse and fill it full of computers. Big data

0:32:15.920 --> 0:32:19.320
<v Speaker 1>leans on cloud computing to make it practical to even

0:32:19.360 --> 0:32:21.640
<v Speaker 1>accumulate all that data in the first place, let alone

0:32:21.680 --> 0:32:25.680
<v Speaker 1>analyze it. Now, the lure of big data, the reason

0:32:25.840 --> 0:32:28.320
<v Speaker 1>why we're concerned with it. I mentioned this in the

0:32:28.480 --> 0:32:31.680
<v Speaker 1>very beginning of this episode. The lure is that there

0:32:31.680 --> 0:32:36.360
<v Speaker 1>are nuggets of truth hiding inside vast amounts of possibly

0:32:36.480 --> 0:32:41.160
<v Speaker 1>useless information. There is signal, but there's also an enormous

0:32:41.200 --> 0:32:46.760
<v Speaker 1>amount of noise. If we can identify those little nuggets

0:32:46.760 --> 0:32:50.680
<v Speaker 1>of truth, then we can potentially benefit from them. But

0:32:50.800 --> 0:32:53.760
<v Speaker 1>these huge piles of information are just so vast that

0:32:53.800 --> 0:32:56.920
<v Speaker 1>our ability to zero in on the important stuff is

0:32:57.120 --> 0:32:59.640
<v Speaker 1>just not up to snuff. It is the proverbial needle

0:32:59.640 --> 0:33:03.000
<v Speaker 1>in a haystack problem. So the promise of big

0:33:03.080 --> 0:33:06.200
<v Speaker 1>data in our current age is that when we use

0:33:06.280 --> 0:33:10.320
<v Speaker 1>the right tools, we can sift through the haystack and

0:33:10.360 --> 0:33:12.640
<v Speaker 1>we can find all the needles, which is a really

0:33:12.680 --> 0:33:16.120
<v Speaker 1>tempting concept, because who knows what you might find when

0:33:16.160 --> 0:33:21.240
<v Speaker 1>you analyze large amounts of information. Maybe you identify patterns

0:33:21.640 --> 0:33:24.920
<v Speaker 1>that you can then use to lead you to change

0:33:24.960 --> 0:33:27.560
<v Speaker 1>things so that you can save huge sums of money

0:33:27.840 --> 0:33:30.960
<v Speaker 1>in the way you do business. Or maybe you identify

0:33:31.000 --> 0:33:35.520
<v Speaker 1>a previously unknown opportunity. Or maybe you can spot connections

0:33:35.560 --> 0:33:38.640
<v Speaker 1>between data points that you didn't see before and you

0:33:38.680 --> 0:33:42.960
<v Speaker 1>start to see correlation. Maybe you even determine causation. Maybe

0:33:43.480 --> 0:33:46.680
<v Speaker 1>this leads you to make some incredible scientific progress, and

0:33:46.720 --> 0:33:50.000
<v Speaker 1>it might be on anything from medicine to astronomy. It

0:33:50.040 --> 0:33:54.840
<v Speaker 1>all depends on the type of data, obviously. However, there's

0:33:54.880 --> 0:33:58.040
<v Speaker 1>a big caveat that goes along with this sort of

0:33:58.360 --> 0:34:03.400
<v Speaker 1>beautiful concept, and it's possible that the tools we use

0:34:03.440 --> 0:34:07.640
<v Speaker 1>will make mistakes, that they're going to spot patterns or

0:34:07.760 --> 0:34:12.360
<v Speaker 1>meaning when in reality there isn't anything there. They mistake

0:34:12.760 --> 0:34:17.080
<v Speaker 1>something to be meaningful when in fact it's not. This

0:34:17.200 --> 0:34:19.120
<v Speaker 1>is kind of like when you look up at the

0:34:19.160 --> 0:34:21.600
<v Speaker 1>clouds and you see a pattern that makes you think

0:34:21.600 --> 0:34:25.320
<v Speaker 1>of a specific shape, very like a whale, as Hamlet

0:34:25.400 --> 0:34:28.600
<v Speaker 1>and Polonius would say. So the shape of the cloud

0:34:28.680 --> 0:34:30.800
<v Speaker 1>might remind you of a whale or a dog, or

0:34:30.840 --> 0:34:34.759
<v Speaker 1>a hand or whatever, but you probably are aware that

0:34:34.800 --> 0:34:38.440
<v Speaker 1>the cloud isn't actually a whale or whatever. In fact,

0:34:38.600 --> 0:34:41.719
<v Speaker 1>you might even realize that your point of view is

0:34:41.840 --> 0:34:44.560
<v Speaker 1>part of what is shaping your perception. It's part of

0:34:44.600 --> 0:34:48.239
<v Speaker 1>the reason why it looks like a whale. Maybe if

0:34:48.280 --> 0:34:51.720
<v Speaker 1>you were a mile away to the east or something

0:34:51.760 --> 0:34:54.359
<v Speaker 1>and you were to look at that same cloud, the

0:34:54.400 --> 0:34:56.360
<v Speaker 1>angle you would be at would mean that the cloud

0:34:56.360 --> 0:34:58.480
<v Speaker 1>wouldn't look anything like a whale. Maybe it would look

0:34:58.560 --> 0:35:00.840
<v Speaker 1>like something entirely different, or maybe it wouldn't remind you

0:35:00.880 --> 0:35:03.680
<v Speaker 1>of anything at all. So, from one perspective, the cloud

0:35:03.680 --> 0:35:08.480
<v Speaker 1>shape appears to have some meaning. From other perspectives, it doesn't.

0:35:08.680 --> 0:35:11.399
<v Speaker 1>So it would be a mistake to draw any conclusions

0:35:11.400 --> 0:35:15.160
<v Speaker 1>based on that one perception, because it would just be

0:35:15.360 --> 0:35:19.279
<v Speaker 1>the illusion of meaning, not actual meaning. And that can

0:35:19.360 --> 0:35:22.080
<v Speaker 1>happen when you're looking at huge data sets too. You

0:35:22.160 --> 0:35:25.560
<v Speaker 1>might see something that looks like it's meaningful, that it

0:35:25.600 --> 0:35:28.440
<v Speaker 1>represents a pattern or a connection, when in fact it doesn't.

0:35:28.840 --> 0:35:30.920
<v Speaker 1>That can lead you on a wild goose chase, and

0:35:30.960 --> 0:35:33.719
<v Speaker 1>in a worst case scenario, you might dedicate a lot

0:35:33.800 --> 0:35:37.520
<v Speaker 1>of time and effort and money toward pursuing this perceived

0:35:37.640 --> 0:35:40.600
<v Speaker 1>meaning and you only find out much later that there

0:35:40.680 --> 0:35:43.040
<v Speaker 1>was nothing there at all. Now that's not to say

0:35:43.040 --> 0:35:46.839
<v Speaker 1>that we can't trust the outcomes of big data analysis,

0:35:47.200 --> 0:35:49.320
<v Speaker 1>but it does mean that we have to make sure

0:35:49.440 --> 0:35:53.279
<v Speaker 1>that we have tests to ensure the validity of those analyses.
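
NOTE
A rough illustration, in Python, of the caveat above: search enough unrelated data and something
will look like a pattern. Every series here is random noise by construction. The names pearson and
strongest_chance_correlation, and the default sizes, are assumptions made up for this sketch.
```python
# Hypothetical illustration: pure noise still yields "patterns" if you compare enough series.
import random
from itertools import combinations
def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length lists.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
def strongest_chance_correlation(num_series=200, length=8, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    series = [[rng.random() for _ in range(length)] for _ in range(num_series)]
    # Every series is unrelated noise, yet we keep only the best-looking pair.
    return max(abs(pearson(a, b)) for a, b in combinations(series, 2))
if __name__ == "__main__":
    print(round(strongest_chance_correlation(), 3))
```
With short series and thousands of pairs to compare, the strongest pair tends to look strikingly
correlated even though nothing connects them, which is why a pattern found by searching a large
data set should be checked against fresh data before anyone acts on it.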

0:35:53.640 --> 0:35:57.160
<v Speaker 1>We need to take a scientific approach toward big data,

0:35:57.280 --> 0:35:59.280
<v Speaker 1>or else we run the risk of chasing a dream

0:35:59.680 --> 0:36:02.919
<v Speaker 1>rather than learning more about reality. And anytime there's

0:36:03.080 --> 0:36:06.600
<v Speaker 1>any uncertainty, there will be people who move in to

0:36:06.800 --> 0:36:12.200
<v Speaker 1>exploit that uncertainty: hucksters, scam artists, snake oil salesmen. So

0:36:12.239 --> 0:36:13.880
<v Speaker 1>as an example of this, I would point to the

0:36:13.920 --> 0:36:17.680
<v Speaker 1>explosion we are seeing in artificial intelligence right now. AI

0:36:18.320 --> 0:36:22.120
<v Speaker 1>has tons of applications, including in the analysis of big data,

0:36:23.400 --> 0:36:26.480
<v Speaker 1>and that means that there is also opportunity there to

0:36:26.560 --> 0:36:29.239
<v Speaker 1>take advantage of people. So it doesn't take much imagination

0:36:29.280 --> 0:36:31.680
<v Speaker 1>to think of a company that actually uses cheap

0:36:31.760 --> 0:36:34.680
<v Speaker 1>human labor, passes itself off as a true

0:36:34.800 --> 0:36:38.279
<v Speaker 1>AI company, and markets its services to big

0:36:38.320 --> 0:36:41.719
<v Speaker 1>businesses that may not know any better. And really you're

0:36:41.800 --> 0:36:46.399
<v Speaker 1>just exploiting people in poorer countries and passing it off

0:36:46.440 --> 0:36:50.360
<v Speaker 1>as being this really high tech business. As it stands,

0:36:50.760 --> 0:36:53.320
<v Speaker 1>even if you're not doing that, human labor is already

0:36:53.360 --> 0:36:56.120
<v Speaker 1>the backbone of the AI industry, like it or not.

0:36:56.480 --> 0:36:59.400
<v Speaker 1>People in countries that have low wages and have very

0:36:59.440 --> 0:37:03.440
<v Speaker 1>little protection in place for working citizens, they're spending countless

0:37:03.440 --> 0:37:07.200
<v Speaker 1>hours tagging data so that AI can actually make use

0:37:07.239 --> 0:37:10.440
<v Speaker 1>of it. So as we marvel at how clever AI

0:37:10.520 --> 0:37:13.640
<v Speaker 1>tools appear to be, there are folks out there on

0:37:13.680 --> 0:37:17.560
<v Speaker 1>the margins who are the ones labeling images and applying

0:37:17.600 --> 0:37:20.319
<v Speaker 1>metadata to text so that the AI can grab the

0:37:20.400 --> 0:37:23.680
<v Speaker 1>right stuff based upon a query. Anyway, I think it's

0:37:23.680 --> 0:37:27.800
<v Speaker 1>important to remember that big data can, with the right tools,

0:37:28.120 --> 0:37:31.360
<v Speaker 1>provide us insights that we might not otherwise make because

0:37:31.360 --> 0:37:33.759
<v Speaker 1>the amount of information is just too large for us

0:37:33.760 --> 0:37:37.000
<v Speaker 1>to handle. Those insights might mean we can do things

0:37:37.040 --> 0:37:41.040
<v Speaker 1>like streamline supply chains, or identify a market for a specific product,

0:37:41.400 --> 0:37:43.680
<v Speaker 1>or find a new way to treat an illness. Big

0:37:43.760 --> 0:37:46.880
<v Speaker 1>data can also lead us to some darker outcomes. Companies

0:37:46.960 --> 0:37:49.799
<v Speaker 1>will scrape as much of your personal information as they

0:37:49.960 --> 0:37:53.439
<v Speaker 1>possibly can. They will sell it to other companies. These

0:37:53.480 --> 0:37:57.279
<v Speaker 1>other companies will market you to yet more companies in

0:37:57.360 --> 0:37:59.640
<v Speaker 1>an effort to serve you ads or to lure you

0:37:59.640 --> 0:38:04.840
<v Speaker 1>into doing something foolish like downloading malware or consuming misinformation.

0:38:05.280 --> 0:38:08.400
<v Speaker 1>Because behind every silver lining is a big, scary cloud.

0:38:09.040 --> 0:38:11.960
<v Speaker 1>Maybe it's in the shape of a whale. That is

0:38:12.160 --> 0:38:16.120
<v Speaker 1>a brief history of big data. It is a history

0:38:16.120 --> 0:38:19.640
<v Speaker 1>that is ongoing. I'm sure we're going to see some incredible,

0:38:19.800 --> 0:38:24.239
<v Speaker 1>incredible discoveries thanks to analysis of big data. I'm sure

0:38:24.239 --> 0:38:27.440
<v Speaker 1>we're also going to see some pretty scary stuff as

0:38:27.440 --> 0:38:30.360
<v Speaker 1>a result of it as well. Such is life, but

0:38:30.480 --> 0:38:35.520
<v Speaker 1>it is fascinating to see how we have arrived at

0:38:35.520 --> 0:38:38.160
<v Speaker 1>this point, like first from the point of how do

0:38:38.239 --> 0:38:40.799
<v Speaker 1>we collect all this information? And then what do we

0:38:40.840 --> 0:38:44.040
<v Speaker 1>do with it? I hope you enjoyed this episode. I

0:38:44.080 --> 0:38:46.880
<v Speaker 1>hope you are all well, and I will talk to

0:38:46.880 --> 0:38:57.240
<v Speaker 1>you again really soon. Tech Stuff is an iHeartRadio production.

0:38:57.560 --> 0:39:02.560
<v Speaker 1>For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts,

0:39:02.680 --> 0:39:04.680
<v Speaker 1>or wherever you listen to your favorite shows.