Speaker 1: Welcome to TechStuff, a production from iHeartRadio.

Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeart Podcasts, and how the tech are you?

So early on in the days of TechStuff, back when I was still a staff writer for a little website called HowStuffWorks.com, my boss Conal Byrne, who is now a big shot over here at iHeart, came over to me with an assignment. He wanted me to do some articles and some episodes about this buzzword concept called big data. And I had heard the term big data, and obviously there's a pretty darn good hint about what big data is all about just in the nature of the name itself, but beyond that I didn't really know much. So I jumped to it.

And the interesting thing is that since that time, the discipline of big data has evolved significantly. When I was first working on my articles and episodes, we were mostly talking about how technological tools made it easier to collect vast amounts of information very quickly and to store it. But we didn't necessarily have equally sufficient tools to do anything useful with all that information, or at least those tools weren't widely known and understood beyond a certain circle of computer scientists. Flash forward a few years, and we'd see companies developing new methods to analyze large chunks of data. Oh, by the way, I do the weird "data"/"data" pronunciation thing, and there's no rhyme or reason to it. I don't even know which one I'm going to say before I say it, so I apologize, because I know it's irritating. It irritates me too. Anyway, other companies sprung up with products that were meant to help with data analysis, and it seemed like we were going from an era of "well, now I have all this information, what do I do now?" to an era of "I have discovered cryptic secrets that were hiding in plain sight thanks to data analysis," and that somehow it all happened overnight.
So today I thought we would actually look back over the history of the big data concept, how various systems have made it possible to sift through seemingly meaningless information in order to find nuggets of wisdom, and why we might not always be able to trust the answers that we discover.

So the history of big data starts in the twenty tens. Or maybe it starts in two thousand and five, or maybe in nineteen ninety, or maybe the sixteen hundreds, or maybe nearly twenty thousand years ago. You might have already picked up on the fact that folks don't quite agree on where we should start when talking about big data. But that makes sense. Ever since humans started to write stuff down, we've been pretty darn invested in the collection and then the classification of information. Whether it's to figure out the best time to sow or harvest crops, or keep track of how much we've traded with that other band of ne'er-do-wells who live on the other side of the holler, or we just want to make a record of how great it was that we kicked the butt of that mastodon real good, we've been really obsessed with data collection and retrieval.

Now, this obsession also means that we had to come up with various ways to store and analyze this information. Raw information doesn't do anyone much good, and so throughout antiquity we came up with means of recording and storing and making use of information. Hardworking humans created libraries where we could gather all this knowledge, and then we lost some of those libraries along the way, due to the fact that we humans are also pretty stupid and end up having disputes that involve burning each other's stuff to the ground. Yeah, I'm still bitter about certain libraries being destroyed back in antiquity. But it also means we had to come up with methodologies to categorize and classify information.
Otherwise you may as well just have a big old pile of scrolls or books or whatever, and then people just, you know, have to sort through them and see if they can find anything. Which actually sparks two different memories in my head. One is that there used to be a used bookstore I would go to here in Atlanta, and often the used bookstore was completely unorganized. Like, you literally could go through a bookshelf and it's just going to be books that are more or less the same size, but otherwise there's no rhyme or reason as to why they were put there, and it was like you were on a treasure hunt.

And then I'm also reminded of a naval museum in Apalachicola, Florida, which is on the Panhandle. I went to this little, you know, naval museum, like a ship museum, and I remember that all the exhibits were kind of in a pile on the floor, and you would literally pick things up and look at them. And that's kind of what it would be like if we didn't have these means of classification, once you get to a certain size. Like, that little museum in Apalachicola wasn't so big as to be a problem. But if you're talking about a big library, obviously, if you want anything useful, you've got to come up with a way of classifying all this.

To that end, ancient folks began to develop a science called taxonomy. And this isn't when you stuff dead animals so that they look like they might still sort of be alive; that's taxidermy. No, taxonomy is the science of classification, and it's perhaps best known in the field of biology, thanks in large part to a Swedish scientist from the eighteenth century named Carl Linnaeus. But there are many applications of taxonomy that extend beyond biology. It's just that biological taxonomy is the one I think most of us are familiar with, because most of us were taught it when we were going through basic biology classes.
But the ancient Greeks made some early progress on developing systems of classification. And obviously, within modern library science, taxonomy is an important discipline, though oddly enough, you could say taxonomy in library science is distinct from classification. When I was looking this up, I found resources for library science that treated these as two distinct disciplines: classification was one and taxonomy was another. Now, this is because there are various methods of classification in library science. The one that I was most familiar with when I was growing up was the Dewey Decimal System, which I don't even think is the dominant form now, but it was when I was growing up. And it's meant to connect a specific work to a specific physical location in a library for the purposes of, you know, tracking down the book. But taxonomy in library science tends to lean more toward metadata, or data about data.

In fact, metadata plays a huge part in big data. Oh man, I did it both ways in one sentence. I feel awful. Anyway, the information about information can be as useful as the information itself in some cases. I have often talked about this with personal information, about how info about info can give you a lot of insight into a person. Maybe you don't have a person's name, but you have a couple of different data points about that person. In some cases, you can actually narrow down the identity of the person you're thinking of just by looking at this metadata. You don't even have to see the information about them, which shows you how powerful metadata can be. So you start to see a cascading effect here, where you slowly realize that you actually have access to even more information than you first anticipated, because you also have information about that information. It gets pretty wild.
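To make that concrete, here's a toy sketch in Python. The records and field names here are invented for illustration; the point is just that a handful of metadata fields, with no name attached, can narrow a pool of people down to a single person:

```python
# Toy sketch: narrowing down identity from metadata alone.
# The records and field names below are invented for illustration.
records = [
    {"name": "Alice", "zip": "30301", "birth_year": 1980, "gender": "F"},
    {"name": "Bob",   "zip": "30301", "birth_year": 1980, "gender": "M"},
    {"name": "Carol", "zip": "30310", "birth_year": 1975, "gender": "F"},
]

def match(records, **quasi_identifiers):
    """Return every record consistent with the given metadata points."""
    return [r for r in records
            if all(r.get(k) == v for k, v in quasi_identifiers.items())]

# We never ask for a name, yet three innocuous data points
# can leave exactly one candidate.
candidates = match(records, zip="30301", birth_year=1980, gender="F")
print([r["name"] for r in candidates])  # ['Alice']
```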
Another important development in the history of big data is the creation of statistics. So let's give the Merriam-Webster definition of statistics, shall we, just to have a baseline. It is, quote, "a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data," end quote.

Now, one famous early example of statistics comes to us courtesy of a fellow named John Graunt. He was looking at mortality rates in London, and that gave him a lot more information and helped him analyze the course of the plague. For example, he could see when the plague was spiking or receding. Pretty cheerful stuff, right? But he also used this information, the mortality information, to start drawing some conclusions about the population of London as a whole. Counting up everybody, like figuring out exactly who lives in London, would have been challenging at the time, to say the least. But Graunt took information like the number of funerals and then compared it to things like the average family size in London to try and make an estimate of London's population. So it gave him kind of a working figure that was useful for certain applications, specifically government ones.
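Here's a back-of-the-envelope sketch of that style of reasoning. The numbers below are invented placeholders, not Graunt's actual figures, but the arithmetic shows how mortality records plus an assumed family size can yield a population estimate:

```python
# Back-of-the-envelope sketch of Graunt-style estimation.
# All numbers are invented for illustration, not Graunt's actual figures.
burials_per_year = 12_000            # counted from the bills of mortality
deaths_per_family_per_year = 0.3     # assumed: ~3 deaths per 10 families
avg_family_size = 8                  # assumed people per household

families = burials_per_year / deaths_per_family_per_year
population_estimate = families * avg_family_size
print(f"~{population_estimate:,.0f} people")  # ~320,000 people
```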
Statistics as a branch of mathematics would mature over the following centuries. Often it would be the tool that allowed social scientists to draw broad conclusions about large populations, but others found plenty of alternative applications of statistics. Anyway, the age of data analysis was well and truly in swing at this point.

In the late nineteenth century, the United States was getting into a bit of a pickle. And I know we're making jumps of centuries here, but we have to. We can't go through every single evolution of data collection and data analysis; that would be a podcast series all in itself. So we're in the late eighteen hundreds, and the US is facing a bit of a problem. The country holds a census every ten years, where it's essentially gathering information about all the citizens in the United States. This is required by the US Constitution, and there are several reasons why the Census Bureau holds a census every ten years. But one of those reasons is that the membership of the US House of Representatives depends upon population. The more populous a state is, the more representatives that state has in the House of Representatives. So if your state has a big population, more representatives go to the House. If you have a relatively small population, then you have fewer House representatives. That's how that works.

So by eighteen eighty, things were getting to a really difficult situation. The process of collecting and then analyzing all the information was so cumbersome that it would take nearly the whole decade just to get to a result, and that means by the time you're drawing conclusions, it's actually time for you to administer the next census. In fact, they projected that in eighteen ninety, working with the same process they had depended upon previously, it would take a whole decade, so literally you'd be holding your next census while you were just getting your information from the last one. The Census Bureau needed a way to collect and analyze this information in a much more efficient way, and they tapped a man named Herman Hollerith to accomplish this.

So Hollerith took a punch card system that had been used in weaving, weaving with mechanical looms. I've talked about this in the past with the history of punch cards. In fact, this also gets into a perhaps somewhat apocryphal story of where the word sabotage comes from, but that's for another time. He took this punch card system that had been used to set weaving patterns with mechanical looms and adapted it to serve as a way to record information, so that you could feed the card to a tabulation machine, which then could actually tabulate the results.
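As a rough illustration of what tabulation amounts to, here's a tiny Python sketch. The cards and fields are hypothetical, not the actual 1890 census schema; the machine's essential job was counting cards by the values punched into them:

```python
# Toy sketch of punch-card-style tabulation (fields invented for illustration).
from collections import Counter

# Each "card" encodes one person's census answers as fixed fields.
cards = [
    {"state": "GA", "sex": "F", "age_band": "20-29"},
    {"state": "GA", "sex": "M", "age_band": "30-39"},
    {"state": "NY", "sex": "F", "age_band": "20-29"},
]

# The tabulating machine's job, in essence: tally cards by field value.
by_state = Counter(card["state"] for card in cards)
print(by_state)  # Counter({'GA': 2, 'NY': 1})
```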
And his invention meant that ten years of labor done by clerks working at desks would boil down to about three months of labor using the tabulation machine. Obviously, that was a huge improvement. Hollerith formed a company that over time would evolve into one of the most famous companies in all the world: Kentucky Fried Chicken. I'm just kidding. It wasn't KFC. Instead, it was IBM. That's the company that would grow out of the company Hollerith founded in the nineteenth century. Anyway, we're not going to spend too much time in all these centuries gone by. We're actually going to speed things up and get up to the twentieth century. But before we do that, let's take a quick break to thank our sponsor.

Speaker 1: We're back. Okay, so the actual term big data is still waiting for us. We're not going to really get to that until we hit the late nineteen nineties or so. But there are a few things to point out before we get up to there. Folks were starting to notice that we were generating, collecting, and storing an awful lot of information in the twentieth century, and that the rate of data generation was on the rise. Not only were we generating a whole bunch of information, we were doing it in larger amounts year over year. In fact, it was rising much faster than our rate of consumption of information, meaning that we were making way more data than we were actually able to use.

And a big thanks goes out to Forbes for an article titled "A Very Short History of Big Data" by Gil Press. A lot of the information that I'm drawing upon came from that article. It is fantastic if you want to learn more about this. I'm not going to cover every element that it does; I mean, that would just be me regurgitating the article. You should check it out if you're interested in the history of big data. We're going to touch on a few of the important points, or what I think of as the important points.
So one of the earliest ones we're going to talk about is from nineteen forty-four, when a librarian named Fremont Rider, which is a fantastic name, wrote a work titled The Scholar and the Future of the Research Library. Rider made an observation that reminds me a lot of Gordon Moore's famous Moore's Law, except it involves not silicon chips but physical libraries. Rider said that the typical library at the typical American university was doubling in size every sixteen years. He projected that this would mean that by the year twenty forty, the library at Yale University would be so large as to require a staff of more than six thousand people to manage it. Of course, this was before we had digital storage and digital filing systems, which have largely mitigated this particular requirement. We don't necessarily need the physical space that we would if everything were still in hard copy. But the observation showed that data accumulation really had a steep trajectory even back in the nineteen forties.

Similarly, in the early nineteen sixties, a guy named Derek Price published a piece explaining that the number of scientific journals and papers was on a path of exponential growth, doubling every fifteen years, so similar to the rate at which university libraries were doubling in size. Part of the reason for this, he said, was that scientific discoveries inevitably fuel further discoveries. You find out something new, this inspires other scientists to look further into it, they find other new things, and so on.
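If you want to see how quickly doubling every sixteen years compounds, here's a quick sketch. The starting size, a hypothetical one-million-volume library, is invented for illustration, not a figure from Rider:

```python
# Quick check on how Rider-style doubling compounds over a century.
# Hypothetical starting point: a 1-million-volume library in 1944.
volumes = 1_000_000
for year in range(1944, 2041, 16):
    print(year, f"{volumes:,}")
    volumes *= 2
# By 2040 (six doublings), 1M volumes has become 64M. Growth like
# this is what made Rider predict a staff of thousands.
```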
In nineteen sixty-five, the United States government needed to build a place that would store records, including things like tax returns and fingerprint sets. The plan was to take the paper records, transfer them to magnetic tape, and then store that magnetic tape in this so-called data center. This project fell through, however, because the public got nervous. They felt squicky about this idea of the government hoarding vast amounts of information about its citizens. They did not fully trust the government. So you understand their thinking: I don't really feel comfortable with you just gathering all this information about us; it feels kind of oppressive. Now, what's funny to me is that today the average person is more than willing to let companies do this to them without even protesting it, because that's how all the online social network companies work, right? They work on the basis of gathering information about us and then peddling it, or hoarding it, however you might think of it. And it's very similar to what was happening in the nineteen sixties. Back then we were like, "No, that's not cool," and now we're like, "That's just how it works." It's wild to me.

Anyway, I'm going to skip ahead a little bit to the nineteen eighties. There was a lecturer, I. A. Tjomsland, and I know I butchered his name, I apologize. Anyway, he gave a talk at an IEEE symposium in which he posits that one reason all this information is piling up is that we don't really have a good way to determine which information is relevant and which information is not. We can make that determination, but it requires work, and meanwhile we're still accumulating more information. So it's the kind of work where you're never done and it feels like you're never making any progress, so most of us never bother to do it at all. And if our ability to store data is sufficient, in other words, if we have ways of storing the information, then we have even less incentive to make any determination about the data. Like, if we've got plenty of storage, well, let's just go ahead and keep the information. There's no reason to worry about whether it's useful or not.
We should keep it, because it's better for us to keep useless information without needing it than to accidentally delete something that turns out to be important. And this kind of makes sense. I mean, I'm sure a lot of you out there can apply that to your lives. I certainly can apply it to mine. I have file folders that are full of stuff that I'm never going to touch again, but I still feel reluctant to delete it, just in case I do need to touch it again sometime in the future, even though the likelihood of that is very low. So that's anecdotal. I can't really call that evidence to prove the point, but it feels like the point is relevant. This is also how I play a lot of those big open-world computer RPGs, by the way, things like Skyrim or whatever, because I'll just hoard potions and scrolls and never use them, because what if I need them more in the future? Baldur's Gate 3 has really done a number on me with this. I've got a real problem with that.

Anyway, the Forbes article details several more entries indicating how very smart people were taking note regarding the accumulation of information, as well as methods to store the information and, increasingly as time went on, how we can do useful things with all of it. So I recommend you check out that Forbes article if you want to learn more. I think it goes up to about twenty twelve at this point. It has been updated numerous times, but obviously twenty twelve was quite a long time ago, so it's not exactly up to the present day. But it's still a really interesting article that gives lots more detail about this. I don't want to just regurgitate the article, though, so we're going to hop on ahead.

Now, folks in general were becoming more aware of this growing information challenge. But where did the term big data actually come from?
Well, chances are it sort of rose organically in conversations within the computer sector, as, you know, hackers and computer scientists and programmers and researchers were all wrestling with ways to deal with data. Now, by this time, folks had adapted an observation made by Cyril Northcote Parkinson to apply to computer systems and to information. Parkinson's original observation was that, generally speaking, in public administration offices, you know, like government offices, work expands to fill the time allowed for that work. So if you have a project that's going to be due in three weeks, but really, if you were to be brutally honest, there's only a week's worth of work to do for that project, well, that work will almost magically expand so that it actually takes three weeks to complete. This gets more nuanced, and it brings into account elements like bureaucracy, but you get the point, right? Somehow it doesn't matter who is working the job, it doesn't matter the nature of the work; the work will expand to fill the amount of time allotted to it. Which means that if you had said it would take two weeks, it would have just expanded to two weeks, not three. It's very weird, right?

Anyway, folks in the computer biz adapted this to say that data will expand to fill whatever space you have available for that data. In other words, you make a bigger storage unit, you're going to fill it. The data will just expand to fill it, even though you thought, oh, I'm future-proofing this. And again, anecdotally, I have observed this in my personal life.
I remember when hard disk drives first became a thing in personal computers. Like, they already existed, but personal computers didn't have them when they first came out, right? You were using external drives like floppy disks and stuff. And I remember whenever there would be a dramatic expansion of storage space, and it always seemed to be dramatic, right? It always seemed like it had doubled since last time, and typically that's how it worked. Anyway, I would walk away thinking, wow, I'm never gonna fill all this space. I mean, who even needs that much space? Two hundred and fifty-six megabytes? Who the heck needs that much space? That's way too much. I mean, I'll never fill it up. But of course I would prove myself wrong, typically in record time. But beyond anecdotes, which again don't really count as evidence, the observation really pointed out that we will eagerly fill up whatever space we're given. You could argue this goes back to our tendency to avoid deleting material out of concern that it might one day become useful.

Anyway, by the mid nineteen nineties, there was a computer scientist named John Mashey, and he was giving presentations that related to this concept of big data. Now, Mashey has dismissed the idea that he personally coined the phrase. At most, he says, he popularized the term big data in his talks. But his point was that he used the phrase big data because it was a shorthand way to give a nod to several related challenges, ranging from storage to analysis. So one could argue that Mashey's use of the term approached what we mean by big data today, but it wasn't one hundred percent the same thing. And the earliest use of the term by Mashey that I've seen cited happened sometime around nineteen ninety-eight. So we know Mashey didn't invent the phrase, and we know that partly because researchers found an instance that predates his talks by nearly a decade.
Steve Lohr wrote a piece for The New York Times titled "The Origins of 'Big Data': An Etymological Detective Story." A great, great article, by the way. Lohr spoke with an associate librarian at Yale Law School named Fred Shapiro, and Shapiro did some research and uncovered an instance of the phrase big data in a nineteen eighty-nine article in Harper's Magazine. The author of that piece was Erik Larson, who said, quote, "The keepers of big data say they do it for the consumer's benefit, but data have a way of being used for purposes other than originally intended," end quote. And boy howdy, have we seen that observation play out again and again, haven't we? It's remarkable, because nineteen eighty-nine predates the World Wide Web, and certainly predates all the social networks that we talk about. But Erik Larson's observation is just as relevant today, if not more relevant, than it was in nineteen eighty-nine.

Also, incidentally, Erik Larson wrote one of my favorite books of all time. It's titled The Devil in the White City. Famous book. I'm sure a lot of you have already read it, but for those who haven't, it's a book that tells two somewhat intertwined stories: the eighteen ninety-three World's Columbian Exposition in Chicago, and the tale of H. H. Holmes, credited as one of America's first serial killers. Now, I originally bought the book because I was interested in Holmes's story, but I've got to be honest, I actually found the chapters about the exposition to be far more captivating, and it ties into a lot of the stuff we talk about on TechStuff. So it's a great book if you're looking for something to read.

But now let's get back to big data. So things continue on their inevitable path through time, as it goes. Time marches on, and we get up to the two thousands. By now the Internet has greatly exacerbated our data creation and accumulation problem.
In two thousand, Francis Diebold wrote, quote, "Big data refers to the explosion in the quantity, and sometimes quality, of available and potentially relevant data, largely the result of recent and unprecedented advancements in data recording and storage technology," end quote. So we're really starting to close in at this point on the concept of big data as we understand it today.

Then we get up to two thousand and five, and a couple, actually several, important things happened that year in the realm of big data. We get Tim O'Reilly and his media company, fittingly enough called O'Reilly Media, and this is the year that he would publish an article titled "What Is Web 2.0," a famous, or perhaps infamous, article in tech circles. So the dot-com bubble had burst several years earlier, around two thousand, two thousand and one, and O'Reilly was making observations about the qualities that helped the companies that survived that crash versus the companies that went under. Like, what set them apart? What are some of the qualities that we can say are really valuable on the web? And part of that involved how successful web ventures were handling data.

Now, that same year, a guy named Roger Magoulas, and actually I don't know how to say his last name, but he was also with O'Reilly, argued that big data refers to how we now had the capacity and the capability to gather and store data sets that are so large that our traditional business tools are incapable of doing anything useful with that information. It makes me think of the Joker in The Dark Knight, where he says he's like a dog chasing a car; he wouldn't know what to do if he caught it. That kind of thing. Yeah, we've got all this information, but the tools we have aren't sufficient to do anything meaningful with it. We were overwhelmed with information.
But that same year, because an awful lot happened in two thousand and five in the big data space, Doug Cutting and Mike Cafarella released a tool that would really change things. I'll explain more, but first we're going to take another quick break to thank our sponsors.

Speaker 1: Okay, before the break, I teased that we were going to talk about a tool made by Doug Cutting and Mike Cafarella that would actually change our approach to big data and make it possible to do meaningful things with it. So these two had read papers about Google's file system, as well as a tool that Google was using called MapReduce. Now, the purpose of MapReduce is to take large clusters of data and essentially break them down into more manageable chunks, and then analyze those chunks in parallel, and this makes the process of data analysis faster. It's really just another form of parallel processing when you think about it. Anyway, Cutting and Cafarella were inspired to make their own tool that could do similar work, but, you know, they could make it for everybody, and so they created a project called Hadoop. The first version of Hadoop would come out in two thousand and six, and it's an open source project. It's still around today, with thousands of contributors. But the important bit is that we were now starting to develop new business tools that actually could handle the massive amounts of information we were accumulating.
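To make the map-and-reduce pattern concrete, here's a minimal single-machine sketch in Python. This is a toy illustration of the idea, not Hadoop's actual API: the map step turns each chunk of data into key-value pairs in parallel, and the reduce step merges those pairs into totals:

```python
# A minimal sketch of the MapReduce idea: word counting.
# Toy, single-machine illustration, not Hadoop's actual API.
from collections import defaultdict
from multiprocessing import Pool

def map_phase(chunk):
    # Map: emit (word, 1) pairs for one chunk of the data.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    chunks = ["big data big tools", "big data everywhere"]
    with Pool() as pool:                      # chunks processed in parallel
        mapped = pool.map(map_phase, chunks)
    counts = reduce_phase(p for m in mapped for p in m)
    print(counts)  # {'big': 3, 'data': 2, 'tools': 1, 'everywhere': 1}
```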
496 00:29:16,960 --> 00:29:18,600 Speaker 1: So when we get up to two thousand and seven, 497 00:29:18,680 --> 00:29:22,000 Speaker 1: we're into the consumer smartphone era, because that was the 498 00:29:22,040 --> 00:29:25,840 Speaker 1: introduction of the Apple iPhone. These consumer smartphones can generate 499 00:29:26,120 --> 00:29:29,080 Speaker 1: enormous amounts of information. You can perform all sorts of 500 00:29:29,080 --> 00:29:32,640 Speaker 1: computational tasks on them. They can track your location, you 501 00:29:32,680 --> 00:29:36,320 Speaker 1: can connect the internet, et cetera. We also were getting 502 00:29:36,320 --> 00:29:38,800 Speaker 1: into the age of the Internet of Things, so we 503 00:29:38,800 --> 00:29:42,920 Speaker 1: were starting to create millions of these tiny devices, usually 504 00:29:42,960 --> 00:29:46,320 Speaker 1: designed to collect specific bits of information and then zip 505 00:29:46,400 --> 00:29:49,440 Speaker 1: that info off to somewhere else. So it might be 506 00:29:49,560 --> 00:29:52,880 Speaker 1: a speed sensor along a road. It might be a 507 00:29:52,880 --> 00:29:55,800 Speaker 1: thermometer at a weather data collection site. It might be 508 00:29:55,840 --> 00:29:58,240 Speaker 1: a thermostat in your own home. It could be anything. 509 00:29:58,320 --> 00:30:02,479 Speaker 1: Could be a smart speaker. All of these individual little 510 00:30:02,680 --> 00:30:06,160 Speaker 1: components would add to the amount of information we were 511 00:30:06,600 --> 00:30:10,800 Speaker 1: gathering and storing and creating, all in the hopes of 512 00:30:10,800 --> 00:30:14,160 Speaker 1: being able to do something useful with that info. And 513 00:30:14,520 --> 00:30:17,040 Speaker 1: we also had another buzz term that was starting to 514 00:30:17,040 --> 00:30:19,880 Speaker 1: gain traction, just as big data was really beginning to 515 00:30:19,880 --> 00:30:22,400 Speaker 1: transition from a topic that was talked about in a 516 00:30:22,440 --> 00:30:27,080 Speaker 1: relatively small subculture of computer scientists and such into a 517 00:30:27,160 --> 00:30:31,080 Speaker 1: topic that the general public had actually heard about. You know, 518 00:30:31,280 --> 00:30:36,320 Speaker 1: usually we're a few years behind whatever group is really 519 00:30:36,360 --> 00:30:40,080 Speaker 1: focused on the subject matter. So this other buzz term 520 00:30:40,360 --> 00:30:43,240 Speaker 1: was cloud computing, which I also got an assignment to 521 00:30:43,640 --> 00:30:46,120 Speaker 1: work on right around the same time as big data. Now, 522 00:30:46,160 --> 00:30:48,840 Speaker 1: the simplest way to describe cloud computing is that it's 523 00:30:48,880 --> 00:30:51,840 Speaker 1: when you use someone else's computer to do your computational 524 00:30:51,880 --> 00:30:56,320 Speaker 1: tasks because you log in through your computer, but it's 525 00:30:56,400 --> 00:30:59,080 Speaker 1: this other computer that's actually doing the work, or it 526 00:30:59,160 --> 00:31:02,040 Speaker 1: might be a net work of other computers doing that work. 
That work could be that you're storing photos of kitty cats on a drive in some cloud storage, or it might be that you're using cloud computing to help you crunch really big numbers that your computer could not handle, and you're peeling back the mysteries of quantum mechanics or something. So cloud computing would rise at the same time as big data, and cloud computing and big data are very closely related. They're enablers of one another, in a way. Organizations and companies feel the need to engage with cloud computing services because their data tasks are growing increasingly complex and voluminous, and it gets harder and harder to handle all of that on your own. Most businesses these days are not using exclusively on-premises computing systems to do all their computation and all their storage. It just is not practical. You would have to continuously buy or lease more space just to hold all the systems you would need. So instead they engage with cloud computing companies that will provide those services for them, and then the cloud computing companies will go out and build a warehouse and fill it full of computers. Big data leans on cloud computing to make it practical to even accumulate all that data in the first place, let alone analyze it.

Now, the lure of big data, the reason why we're concerned with it, I mentioned at the very beginning of this episode. The lure is that there are nuggets of truth hiding inside vast amounts of possibly useless information. There is signal, but there's also an enormous amount of noise. If we can identify those little nuggets of truth, then we can potentially benefit from them. But these huge piles of information are just so vast that our ability to zero in on the important stuff is just not up to snuff. It is the proverbial needle-in-a-haystack problem.
So the promise of big data in our current age is that, when we use the right tools, we can sift through the haystack and find all the needles. Which is a really tempting concept, because who knows what you might find when you analyze large amounts of information? Maybe you identify patterns that you can then use to change things so that you save huge sums of money in the way you do business. Or maybe you identify a previously unknown opportunity. Or maybe you can spot connections between data points that you didn't see before, and you start to see correlation. Maybe you even determine causation. Maybe this leads you to make some incredible scientific progress, and it might be on anything from medicine to astronomy. It all depends on the type of data, obviously.

However, there's a big caveat that goes along with this beautiful concept: it's possible that the tools we use will make mistakes, that they're going to spot patterns or meaning when in reality there isn't anything there. They mistake something to be meaningful when in fact it's not. This is kind of like when you look up at the clouds and you see a pattern that makes you think of a specific shape, "very like a whale," as Hamlet and Polonius would say. So the shape of the cloud might remind you of a whale or a dog or a hand or whatever, but you're probably aware that the cloud isn't actually a whale. In fact, you might even realize that your point of view is part of what is shaping your perception; it's part of the reason why it looks like a whale. Maybe if you were a mile away to the east or something and you were to look at that same cloud, the angle you'd be at would mean the cloud wouldn't look anything like a whale. Maybe it would look like something entirely different, or maybe it wouldn't remind you of anything at all.

So, from one perspective, the cloud shape appears to have some meaning. From other perspectives, it doesn't. It would be a mistake to draw any conclusions based on that one perception, because it would just be the illusion of meaning, not actual meaning. And that can happen when you're looking at huge data sets too. You might see something that looks like it's meaningful, that represents a pattern or a connection, when in fact it doesn't. That can lead you on a wild goose chase, and in a worst-case scenario, you might dedicate a lot of time and effort and money toward pursuing this perceived meaning, only to find out much later that there was nothing there at all.
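You can simulate that illusion of meaning in a few lines of Python. This toy example (it assumes Python 3.10+ for statistics.correlation) generates nothing but random noise, then goes hunting for the strongest-looking pattern:

```python
# Toy simulation of finding "patterns" in pure noise (Python 3.10+
# for statistics.correlation). None of these series mean anything.
import random
import statistics

random.seed(1)
n_points, n_series = 20, 200
series = [[random.gauss(0, 1) for _ in range(n_points)]
          for _ in range(n_series)]

# Search every pair of series for the strongest correlation.
best = max(
    (abs(statistics.correlation(a, b)), i, j)
    for i, a in enumerate(series)
    for j, b in enumerate(series[i + 1:], start=i + 1)
)
r, i, j = best
print(f"strongest apparent link: |r| = {r:.2f} between series {i} and {j}")
# With ~20,000 pairs to choose from, a strong-looking correlation is
# practically guaranteed even though every series is random noise.
```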
So, from one perspective, the cloud 595 00:35:03,680 --> 00:35:08,480 Speaker 1: shape appears to have some meaning. From other perspectives, it doesn't. 596 00:35:08,680 --> 00:35:11,399 Speaker 1: So it would be a mistake to draw any conclusions 597 00:35:11,400 --> 00:35:15,160 Speaker 1: based on that one perception, because it would just be 598 00:35:15,360 --> 00:35:19,279 Speaker 1: the illusion of meaning, not actual meaning. And that can 599 00:35:19,360 --> 00:35:22,080 Speaker 1: happen when you're looking at huge data sets too. You 600 00:35:22,160 --> 00:35:25,560 Speaker 1: might see something that looks like it's meaningful, that it 601 00:35:25,600 --> 00:35:28,440 Speaker 1: represents a pattern or a connection, when in fact it doesn't. 602 00:35:28,840 --> 00:35:30,920 Speaker 1: That can lead you on a wild goose chase, and 603 00:35:30,960 --> 00:35:33,719 Speaker 1: in a worst-case scenario, you might dedicate a lot 604 00:35:33,800 --> 00:35:37,520 Speaker 1: of time and effort and money toward pursuing this perceived 605 00:35:37,640 --> 00:35:40,600 Speaker 1: meaning and you only find out much later that there 606 00:35:40,680 --> 00:35:43,040 Speaker 1: was nothing there at all. Now that's not to say 607 00:35:43,040 --> 00:35:46,839 Speaker 1: that we can't trust the outcomes of big data analysis, 608 00:35:47,200 --> 00:35:49,320 Speaker 1: but it does mean that we have to make sure 609 00:35:49,440 --> 00:35:53,279 Speaker 1: that we have tests to ensure the validity of those analyses. 610 00:35:53,640 --> 00:35:57,160 Speaker 1: We need to take a scientific approach toward big data, 611 00:35:57,280 --> 00:35:59,280 Speaker 1: or else we run the risk of chasing a dream 612 00:35:59,680 --> 00:36:02,919 Speaker 1: rather than learning more about reality.
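To make that concrete, here is a minimal sketch in Python of how the illusion of meaning shows up in data and how a simple validity test catches it. Everything in it is an assumption for illustration: the data is pure synthetic noise and the only dependency is NumPy. Sift enough random series for the one that best matches a random target and you will find an impressive-looking correlation; re-check that one discovery on data that was held out of the search and it evaporates.

```python
import numpy as np

# The "illusion of meaning" problem in miniature.
# Everything below is synthetic noise; there is no real signal anywhere.
rng = np.random.default_rng(0)

n_obs = 100      # observations per series
n_series = 1000  # candidate "signals" to sift through

target = rng.normal(size=n_obs)                  # pure noise
candidates = rng.normal(size=(n_series, n_obs))  # more pure noise

# Split into a discovery half and a holdout half: the kind of
# validity test a scientific approach to big data calls for.
half = n_obs // 2

# Discovery: correlate every candidate with the target, keep the best.
disc = np.array([np.corrcoef(c[:half], target[:half])[0, 1]
                 for c in candidates])
best = int(np.argmax(np.abs(disc)))
print(f"best discovery-half correlation: {disc[best]:+.2f}")  # often ~0.4+

# Validation: re-measure that one "discovery" on the holdout half.
hold = np.corrcoef(candidates[best, half:], target[half:])[0, 1]
print(f"same series on the holdout half: {hold:+.2f}")  # typically near 0
```

The holdout split is only the simplest such test, but it makes the point: the search itself manufactures apparent meaning, so any pattern has to be confirmed on data that played no part in finding it.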
And anytime there's 613 00:36:03,080 --> 00:36:06,600 Speaker 1: any uncertainty, there will be people who move in to 614 00:36:06,800 --> 00:36:12,200 Speaker 1: exploit that uncertainty: hucksters, scam artists, snake oil salesmen. So 615 00:36:12,239 --> 00:36:13,880 Speaker 1: as an example of this, I would point to the 616 00:36:13,920 --> 00:36:17,680 Speaker 1: explosion we are seeing in artificial intelligence right now. AI 617 00:36:18,320 --> 00:36:22,120 Speaker 1: has tons of applications, including in the analysis of big data, 618 00:36:23,400 --> 00:36:26,480 Speaker 1: and that means that there is also opportunity there to 619 00:36:26,560 --> 00:36:29,239 Speaker 1: take advantage of people. So it doesn't take much imagination 620 00:36:29,280 --> 00:36:31,680 Speaker 1: to think of a company that actually uses cheap 621 00:36:31,760 --> 00:36:34,680 Speaker 1: human labor and passes itself off as a truly 622 00:36:34,800 --> 00:36:38,279 Speaker 1: AI-driven company, and markets its services to big 623 00:36:38,320 --> 00:36:41,719 Speaker 1: businesses that may not know any better. And really you're 624 00:36:41,800 --> 00:36:46,399 Speaker 1: just exploiting people in poorer countries and passing it off 625 00:36:46,440 --> 00:36:50,360 Speaker 1: as being this really high-tech business. As it stands, 626 00:36:50,760 --> 00:36:53,320 Speaker 1: even if you're not doing that, human labor is already 627 00:36:53,360 --> 00:36:56,120 Speaker 1: the backbone of the AI industry, like it or not. 628 00:36:56,480 --> 00:36:59,400 Speaker 1: People in countries that have low wages and have very 629 00:36:59,440 --> 00:37:03,440 Speaker 1: few protections in place for workers, they're spending countless 630 00:37:03,440 --> 00:37:07,200 Speaker 1: hours tagging data so that AI can actually make use 631 00:37:07,239 --> 00:37:10,440 Speaker 1: of it. So as we marvel at how clever AI 632 00:37:10,520 --> 00:37:13,640 Speaker 1: tools appear to be, there are folks out there on 633 00:37:13,680 --> 00:37:17,560 Speaker 1: the margins who are the ones labeling images and applying 634 00:37:17,600 --> 00:37:20,319 Speaker 1: metadata to text so that the AI can grab the 635 00:37:20,400 --> 00:37:23,680 Speaker 1: right stuff based upon a query. Anyway, I think it's 636 00:37:23,680 --> 00:37:27,800 Speaker 1: important to remember that big data can, with the right tools, 637 00:37:28,120 --> 00:37:31,360 Speaker 1: provide us insights that we might not otherwise gain because 638 00:37:31,360 --> 00:37:33,759 Speaker 1: the amount of information is just too large for us 639 00:37:33,760 --> 00:37:37,000 Speaker 1: to handle. Those insights might mean we can do things 640 00:37:37,040 --> 00:37:41,040 Speaker 1: like streamline supply chains, or identify a market for a specific product, 641 00:37:41,400 --> 00:37:43,680 Speaker 1: or find a new way to treat an illness. Big 642 00:37:43,760 --> 00:37:46,880 Speaker 1: data can also lead us to some darker outcomes. Companies 643 00:37:46,960 --> 00:37:49,799 Speaker 1: will scrape as much of your personal information as they 644 00:37:49,960 --> 00:37:53,439 Speaker 1: possibly can. They will sell it to other companies. These 645 00:37:53,480 --> 00:37:57,279 Speaker 1: other companies will market you to yet more companies in 646 00:37:57,360 --> 00:37:59,640 Speaker 1: an effort to serve you ads or to lure you 647 00:37:59,640 --> 00:38:04,840 Speaker 1: into doing something foolish like downloading malware or consuming misinformation. 648 00:38:05,280 --> 00:38:08,400 Speaker 1: Because behind every silver lining is a big, scary cloud. 649 00:38:09,040 --> 00:38:11,960 Speaker 1: Maybe it's in the shape of a whale. That is 650 00:38:12,160 --> 00:38:16,120 Speaker 1: a brief history of big data. It is a history 651 00:38:16,120 --> 00:38:19,640 Speaker 1: that is ongoing. I'm sure we're going to see some incredible, 652 00:38:19,800 --> 00:38:24,239 Speaker 1: incredible discoveries thanks to analysis of big data. I'm sure 653 00:38:24,239 --> 00:38:27,440 Speaker 1: we're also going to see some pretty scary stuff as 654 00:38:27,440 --> 00:38:30,360 Speaker 1: a result of it as well. Such is life. But 655 00:38:30,480 --> 00:38:35,520 Speaker 1: it is fascinating to see how we have arrived at 656 00:38:35,520 --> 00:38:38,160 Speaker 1: this point, first from the point of how do 657 00:38:38,239 --> 00:38:40,799 Speaker 1: we collect all this information, and then, what do we 658 00:38:40,840 --> 00:38:44,040 Speaker 1: do with it? I hope you enjoyed this episode. I 659 00:38:44,080 --> 00:38:46,880 Speaker 1: hope you are all well, and I will talk to 660 00:38:46,880 --> 00:38:57,240 Speaker 1: you again really soon. Tech Stuff is an iHeartRadio production. 661 00:38:57,560 --> 00:39:02,560 Speaker 1: For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, 662 00:39:02,680 --> 00:39:04,680 Speaker 1: or wherever you listen to your favorite shows.