Thursday, February 13, 2014

Big Data Is a Small Part of the Real Data Science Revolution

"Big Data" continues to be the biggest buzz word in business and yet I feel that most people still don't see it in the right light. As someone who has analyzed data for years, I find it very strange that there is so much focus on the size of the data, methods for storing it and transforming it, and so little on the data-analysis process itself where the real revolution is occurring. In this article, I'll argue that the buzz should be about the rising productivity of the data-scientist. The causes are many and data size and Big Data tools have little to do with it. 


Getting a handle on data and the activity of data-analysis 


Data-analysis is not so simple; it requires a thorough understanding of the experimental method and statistics. Unfortunately, most people don't understand statistical analysis or the data-analysis process. It isn't about mechanically churning data into information. It is a more abstract process that involves much thinking and creativity.

Data by itself is meaningless. It's just ones and zeros. It actually isn't even that. Binary data is really just a sequence of things that can be separated into two classes and kept there with high fidelity. That's it. Data only becomes useful when we also have knowledge of the process from which it was produced and a notion of how we can compute on it to reveal something else that is interpretable and useful to us. 

The algorithm used for computing and the method of interpreting the results are both inseparably linked to the data itself. The algorithm is itself an abstraction. What we really have is a program, which is itself stored as binary data. So computing with data is really a process of taking two pieces of data, the data and the program, and letting them interact in a way that we control to create other data that humans can interpret and use.

But what is this program that manipulates the data? Humans create it as a way of representing a model of the world, the outside world that a computer knows nothing about. A model is a set of constraints that we place on a set of abstractions (mental concepts), and the data is the physical representation of those abstractions. The program is the physical representation of those constraints.

For example, the place where you actually live is the same as the place I will end up if I follow Google's directions based on the text of your address that I have stored somewhere as data. Those things are constrained to be the same in our model. That assumption is what gives meaning to the written address.

If I go to Amazon.com and browse a particular set of books, it is almost always the case that I am interested in those books and am considering buying one or more of them. I know the traces I leave behind in Amazon's log files are not independent of my preference for books. I have a model, from introspection of my own mind, that these things are related and that this is probably the case for most people. Thus, if I am tasked with writing a recommendation system for Amazon, I have a model of how to relate people's book preferences to the traces they leave behind in the data.
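
To make that concrete, here is a minimal sketch of such a model, assuming browsing logs have already been grouped into per-session sets of book ids (the session format, book names, and recommend function are my own invention, not anything Amazon actually does): books that co-occur in a session with the one you are looking at are taken as evidence of related preferences.

```python
from collections import Counter

def recommend(sessions, target_book, top_n=5):
    """Toy item-to-item recommender: books that co-occur with the target
    book in the same browsing session are taken as evidence of related
    preferences -- the 'model' linking log-file traces to interest."""
    co_counts = Counter()
    for viewed in sessions:              # each session is a set of book ids
        if target_book in viewed:
            for book in viewed:
                if book != target_book:
                    co_counts[book] += 1
    return [book for book, _ in co_counts.most_common(top_n)]

# Hypothetical browsing traces reconstructed from log files
sessions = [
    {"bayes_book", "stats_book", "ml_book"},
    {"bayes_book", "stats_book"},
    {"cookbook", "gardening_book"},
]
print(recommend(sessions, "bayes_book"))  # ['stats_book', 'ml_book']
```

The machinery is trivial; the substance is the assumption, made in my head and not stored anywhere in the data, that co-browsing reflects shared interest.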

Thus every computation is based on some model and that model is not encoded anywhere in the data nor is it encoded in the program. It is encoded in the human mind. Computation, the economical kind that we actually do, is thus an interaction between the human mind and a machine.

The machine is needed only because humans are bad at storing and retrieving some kinds of data and are more error-prone in carrying out logical operations. These are really the only things that computers do, and they do them very well.

What is information? A good enough definition is that information is the part of the data that is relevant or related to the abstraction being computed. The data that is not related can be called noise. Note that the split between information and noise is not intrinsic to the data. Data is information only as it pertains to the model we have. These 100 bits of data are related to where you live. The next 100 bits are not. They are, however, relevant to where someone else lives, but for my model and my task, they are noise.

So data analysis starts with a human need (I need to ship you your new book) and a model for how I can do this (write the address on an envelope and drop it in the blue box). Implicit in that model is that the bits that I turned into text on my screen are the necessary information that I need to write on the envelope to accomplish this goal. The computer fetches that information as requested even though it has no idea what you intend to do with it or what it means to you.

More complicated models are really no different from this. They just involve more data, perhaps a more complicated model of the world, and perhaps more computation. Consider predicting the weather. This is a complicated model. It involves physics, measurement devices, and the relation between the abstraction of weather and the traces left behind on those devices. Physics tells us which observables are relevant to changes in the weather. Engineering tells us how to create data closely related to those variables. Computers just help us keep a record of all that data and help us perform a complicated calculation.

A problem like this has the added complication that the data trace left behind is slightly ambiguous. There is measurement noise that is impossible to separate from the data. We don't really measure the density and pressure at every point in space. We measure estimates of these at a few points and make assumptions about the level of noise in the measurements and about how to interpolate between the sampled points.
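
As a toy illustration of that last step, here is a sketch in Python (the station positions, readings, and the choice of simple linear interpolation are all invented assumptions, not how a real forecasting system works):

```python
import numpy as np

# Hypothetical pressure readings (hPa) at a few station positions (km)
station_x = np.array([0.0, 50.0, 120.0, 300.0])
pressure = np.array([1012.0, 1008.5, 1003.2, 998.7])

# Fill in the gaps between stations by linear interpolation onto a
# regular grid; a real forecast model does something far more
# sophisticated, but the basic move -- estimating unmeasured points
# from a few noisy samples -- is the same.
grid_x = np.linspace(0.0, 300.0, 31)
pressure_on_grid = np.interp(grid_x, station_x, pressure)
print(pressure_on_grid[:5])
```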

Despite this ambiguity, we have excellent mathematical tools for dealing with it. We know that there are many different possibilities for how the weather will evolve. We know that we are not going to make correct predictions all the time. However, we have also learned from experience that these predictions are much better than randomly guessing the weather or eyeballing the sky. That fact suggests that our assumptions are likely correct.

We conceptualize the reason why this works by using statistical reasoning. Many different scenarios are possible, but we choose the scenario that is most consistent with the data. That really means that we choose the ones for which, if they were true, the data that we measured would not be terribly unlikely. We assume that likely things will happen rather than unlikely things, and this has proven to be an effective strategy for operating in the world.
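
A minimal sketch of that reasoning, assuming Gaussian measurement noise and two invented candidate scenarios (none of the numbers come from real weather data): score each scenario by how unlikely it would make the measurements, and keep the least unlikely one.

```python
import numpy as np

# Two candidate "scenarios" for the true pressure drop (hPa) and a few
# noisy measurements of it; all numbers are invented for illustration.
scenarios = {"calm": 2.0, "storm": 15.0}
measurements = np.array([13.8, 15.9, 14.4])
sigma = 1.5  # assumed measurement noise

def log_likelihood(true_value):
    # Gaussian noise model: how (un)likely would these data be
    # if this scenario were the truth?
    return -0.5 * np.sum(((measurements - true_value) / sigma) ** 2)

best = max(scenarios, key=lambda s: log_likelihood(scenarios[s]))
print(best)  # 'storm': the scenario under which the data are least unlikely
```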

For example, yes, there could be a hurricane tomorrow, but that would be unlikely because there has never been a hurricane that did not show up on the radar the day before. Our model assumes that the physics of the future is not different from that of the past. Couldn't the random fluctuations in the measurement devices, by some unlucky chance, result in a radar image that doesn't look like a hurricane? Yes, that is theoretically possible, but it would require a very unlikely coincidence. We assume that the measurement errors are random, uncorrelated with each other, and also uncorrelated with whatever is actually going on with the weather.

Now that we have spent some time talking about what data is and what it is not, and about the importance of the human activity of modeling, we can better understand what is wrong with so much that is being said about the Big Data movement.

Big Data is a small part of the larger data-science revolution


Many people seem to be making the case that large amounts of data are somehow going to change everything. Others are a little more careful but still seem to say that big data plus all these great tools for managing it are going to change everything. Some have even asked whether the role of data scientist will soon be replaced by the algorithms themselves.

Is big data plus Hadoop plus the Mahout machine-learning libraries going to change everything? Absolutely not. Refuting this idea is not difficult once you realize that many excellent data-science projects can be done with small amounts of data on your laptop. The fact that this has been true for decades and did not result in much data-science being done in the business world means that the limiting reagent in this reaction has not been the existence of more data or the tools for storing and transforming it.

It has much more to do with the lack of mathematical and statistical training among people working in the business world. The thing missing has been the skill to model the world in a mathematical way. People with those skills have remained in science and engineering, or perhaps been lured into finance, but have largely not been hired into typical businesses to perform advanced statistical modeling.

So why now? What is different recently that has caused data-science to rise to prominence? I don't believe it is that, all of a sudden, we have big data, nor do I think it is because of Big Data technology tools like Hadoop and NoSQL.

The data-science productivity boom


I believe it is mostly driven by the higher productivity of the kind of worker we now call a data-scientist. I know this because I have been doing this work for a while, and I know that I am much more productive than I used to be, and it's not just the result of more experience. This productivity can be attributed to many things, and most of them have nothing to do with Big Data.

One huge driver is the availability of open-source data-analysis platforms like R and Python's scientific stack. When I first started doing analytics in astronomy in the 90s, we worked in Tcl or Perl and then relied on low-level languages like C and Fortran when we needed better performance. For plotting, some people used gnuplot, pgplot, or SuperMongo. This was all duct-taped together with scripts, often using file IO as the medium for communicating between programs.

When you needed some advanced algorithm, you would pull out a book like Numerical Recipes or Knuth and copy the code into your editor or translate it to your preferred language. This took a long time, was error-prone, and you were still limited to whichever books you had on your desk.

So what's different today? The internet, obviously, is the biggest difference. We can search for mathematical ideas, algorithms, and libraries using Google. Wikipedia and Google are fantastic tools for learning math and mathematical methods, especially when you are looking for connections between ideas.

Open-source repositories like those on GitHub are an enormous boon for data-science productivity. Consolidation in the tools being used and the development of communities around data analytics help enormously with sharing code and ideas.

The continual advance of computing hardware technology has of course been a wind at our back as well. But before "Big Data" tools we had other big-data tools. HDFS is not the first distributed filesystem; we used GFS at Fermilab in the 90s. Hadoop is a useful framework for doing batch processing with data-locality awareness, but it isn't that much different from what we had built in the past for astronomical data analysis with Perl scripts and NFS-mounted disks. It's a more general framework and can help avoid repeated work on subsequent projects of a similar nature, but it doesn't truly enable anything new.

To sum up: what is different today is that I can learn faster by utilizing all the resources on the internet. I can avoid reinventing the wheel by downloading well-tested libraries for analytics. I can spend more time working in a high-productivity language like Python or R rather than chasing memory leaks in low-level C code. I can communicate with other data scientists online, notably at sites like Stack Overflow and Math Overflow, if I can't answer my question simply by searching. There is much more communication between the related fields of statistics, computer science, math, and the physical sciences. This emerging interdisciplinary activity, and its application to problems outside of academia, is what is being called data-science.

So productivity is the key. I'd estimate that my productivity is three times higher than it was 10 years ago. That means that 10 years ago, a company would have had to pay roughly three times as much to get the same task accomplished. While I'd love to think that I was worth that kind of pay, I suspect that it is more likely that I was not.

While all of these things developed gradually over the past 15 years or so, there comes a point where productivity is high enough to warrant the creation of new jobs, such as the data-scientist. Crossing that point probably happened only a few years back for most of today's data-scientists, and productivity increases will continue. As with any emerging industry, momentum gathers, creating feedback loops. VCs start funding new analytics startups, and we get things like MongoDB and Tableau. The media gets involved and talks up how data-science and Big Data are about to change everything. All of this helps drive more activity in creating more productivity-enhancing tools and services. Pay and stature for data-scientists rise, attracting people who are unhappy in academic appointments. All is self-reinforcing ... at least for a while.

So what is Big anyhow?


Where does this leave "Big Data"? Is it all a farce? Certainly not, if you apply the label to the right things. Some companies, like Facebook, Google, and Netflix, really, really do have big data. By that I mean that the problem of working at those scales is nothing like doing data analysis on your laptop. Many of the tools for working with data of that size really are extremely important to them. Still, the fact is that most companies are nothing like that. With some prudent selection, sampling, compression, and other tricks, you can usually still fit a company's main data-set on your laptop. We have terabyte hard drives now and 16 GB of memory. If not, you can spin up a few servers in the cloud and work there. This really isn't any different from the past. It is easier and cheaper, but not really much different.

The most important advances in Big Data research, in my opinion, are those happening in the areas of processing data streams, data compression, and dimensionality reduction. Hadoop by itself is really just a tool or framework for doing simple operations on large data-sets in batch mode. Complex calculations are still quite complex and time-consuming to code. The productivity of working in Hadoop, even with higher-level interfaces, is still nowhere near that of working with a smaller data-set locally. And the fact is that for 99% of analyses it is not the right tool, or at least not the first tool you should reach for.

Advances in machine learning are certainly major drivers of data-science productivity, though this too isn't just applicable to big data. Machine learning's main use case is problems of high data-richness exhibiting very complex structure, such as automated handwriting recognition and facial recognition.

Many have claimed that the real Big Data is coming in the form of sensor data, the internet of things, and so on, and this certainly seems to be the case. For this we should look to fields such as astronomy and physics, which were dealing with large amounts of sensor data for years before the arrival of Hadoop and the Big Data toolkit. The key, as it has been before, will likely be better algorithms and smart filtering and triggering mechanisms, not the brute-force storage and transformation of enormous, information-poor data-sets, which seems to be the modus operandi of current Big Data platforms.

Data reduction and information extraction 


Big Data is indeed a big deal, but I think it is less of a big deal than the emergence of data-science as a profession, and I don't think these things should be conflated. Furthermore, as hinted at above, extracting more information from more and more data requires, foremost, the ability to construct more expressive and accurate mathematical models, along with the skills and tools to quickly turn these into functioning programs.

The fact that you only need a small subset of the data for most statistical analyses is not well understood by people without much background in statistics and experimental science. The amount of data required to constrain a given model is in fact calculable before you even see the data. When I worked in astronomy and helped plan new NASA satellites, this was much of my occupation. That's because gathering data by building and launching a satellite is expensive, so you only propose a mission expensive enough to gather the minimal amount of data needed to answer your question. The math for doing this is Bayesian statistics, and without a basic understanding of it you can't reason about the amount of data required for a given model or know when your data set is large enough to benefit from more detailed modeling.
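
A back-of-the-envelope version of that calculation, assuming a single quantity measured with independent Gaussian noise of known size (a deliberately simplified sketch, not the actual mission-planning machinery), might look like this:

```python
import math

def samples_needed(noise_sigma, target_sigma):
    """For N independent Gaussian measurements of a single quantity, the
    posterior width shrinks roughly as noise_sigma / sqrt(N), so the N
    needed for a target accuracy can be solved for before any data exist."""
    return math.ceil((noise_sigma / target_sigma) ** 2)

# Hypothetical: each measurement is good to 5%, we want 0.5% on the parameter
print(samples_needed(0.05, 0.005))  # 100
```

Real planning exercises involve more parameters and more careful priors, but the principle is the same: the data budget falls out of the model before a single bit is collected.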

When you do have your hands on a big data set, your goal as a data-scientist is to reduce the data. The term "data reduction", while ubiquitous in scientific settings, is something I rarely hear mentioned by business people. In short, it means compressing the data down by extracting the information and casting aside the rest. A physics experiment like the LHC in Geneva takes in petabytes of data per second, quickly looks for interesting bits to keep, and immediately deletes the rest. It is only simplifying a little to say that at the end of the day they are only interested in about a single byte of data. Does the Higgs boson exist? That's a single bit of data. What is its mass to 10% accuracy? That's another few bits of information; data reduction par excellence.
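
At its heart, a trigger of that kind is just an online filter over a stream. Here is a toy sketch (the event format and energy threshold are invented for illustration):

```python
def trigger(events, energy_threshold=100.0):
    """Toy online 'trigger': scan a stream of events, keep the interesting
    ones, and let the rest disappear -- data reduction at the source."""
    for event in events:
        if event["energy"] > energy_threshold:
            yield event

# Hypothetical event stream; only the energetic events survive.
raw_stream = ({"id": i, "energy": e} for i, e in enumerate([3.2, 140.5, 7.8, 250.1]))
kept = list(trigger(raw_stream))
print(len(kept), "of 4 events kept")  # 2 of 4 events kept
```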

Now, if data is cheap and just lying around the company anyway, gathering more than you need might not be that big of a deal. But if you get 10x more data than you need, you're just slowing down the computation and creating more work for yourself. The key thing to understand is that for a fixed model, you saturate its usefulness rather quickly. To simplify a bit, the statistical error usually scales as the inverse square root of N. If your one-parameter model is only expected to be 10% accurate, you probably get there around N=100. That terabyte of data on HDFS is just not going to help.
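
A quick simulation makes the saturation visible; this is just an illustrative sketch of estimating one number from noisy samples, with all values invented:

```python
import numpy as np

np.random.seed(0)
true_value = 1.0

# The error on a simple average shrinks like 1/sqrt(N); once it drops
# well below the ~10% accuracy of the model itself, more data stops helping.
for n in [10, 100, 10000, 1000000]:
    sample = np.random.normal(true_value, 1.0, size=n)
    print(n, abs(sample.mean() - true_value))
```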

So extracting more and more useful information from more and more data requires a more accurate model. That can mean more variables in a linear model, or a completely different model with much higher complexity and expressivity. The Big Data tools don't help you here. You simply have to think. You have to learn advanced statistics. Fortunately, thinking about these problems, learning useful methods, and borrowing tools for dealing with them is much easier today than it used to be. Again, this shows that cultural cohesion, more effective learning and communication channels, and tools for dealing with complexity rather than size are more important than tools for doing simple manipulations on enormous data-sets.
