Monday, February 17, 2014

Best practices for data-science projects

In the following, I'll explain a way in which a data-scientist can effectively work with an agile software delivery team. The ideals we want to promote include:

  • Keeping the team aware of what the DS is doing at a high level
  • Making sure the devs are involved with the DS, are doing some pair programming with him/her and are taking on much of the data-science related software development tasks for which they may be better suited anyway. 
  • Breaking data-science stories up into small pieces and managing them in a normal agile fashion
  • Keeping to normal development practices of automated testing and validation
  • Avoiding the situation where the DS's work becomes a black box that no one else on the team or the client understands. 
  • Avoiding the situation where the DS becomes a bottleneck for the team (i.e., a bus factor of one). 
  • Enabling the BAs to communicate the process and progress effectively to the client. 

We will assume that a project has been through the normal inception process and the goals of the project are well established and agreed upon by all. The whole team knows the problem that the DS will be trying to solve as well as problems of their own.

Data-science 101:


Before doing any data-science, it is important to explain the data-science process and terminology to the entire team. This document can be thought of as a step in that direction. Let's assume the project involves building either a predictive or prescriptive model for the client to better accomplish some goal. The whole team needs to understand terms such as the FOM (figure of merit, defined below), the validation and testing methodology, and how to write stories for analytics. Some time should be devoted to explaining how data-science is a less linear process than standard development, as many decisions can only be made after some analysis of the data has begun. It will often involve more backtracking and changes of direction.

The Figure of Merit 


First, everyone needs to understand the goal and how progress towards that goal can be made quantitative. We can call this the Figure of Merit (FOM), a term I'll borrow from the language of scientific collaborations. An example might be the accuracy of a prediction, or a measure such as precision, recall or the F-measure. It is also often necessary to start with a preliminary FOM and adapt it as we better understand the goals of the client or learn about other constraints. Regardless, at any point we should have a formal definition of the FOM, and stories should only be written with the goal of improving it. The client should understand the FOM and take an active part in creating and adapting it. It should not be merely a technical term understood by the developers.
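To make this concrete, here is a minimal sketch (not from any real project) of computing precision, recall and the F-measure from binary predictions; the helper function and sample labels are purely illustrative.

def precision_recall_f1(y_true, y_pred):
    # count true positives, false positives and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))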

Validation versus testing


All successful software developers now make testing a ubiquitous part of their coding practices. Unit tests are written for every function or method, and integration tests are used for making sure parts of the codebase interact properly. Testing analytics code can and should be done as well; however, there are some key differences that need to be understood by all. There is also a related idea called validation, which should not be confused with testing.

To reduce confusion, we should reserve the word testing to indicate the normal kind of testing that developers do. You test whether a program works as expected by giving it inputs for which the desired outputs are known. It passes the test when it delivers these expected results.
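For example, a regular test might look like the following sketch; the parse_amount helper is hypothetical and stands in for any ordinary piece of code the team writes.

def parse_amount(text):
    # hypothetical helper: strip a leading currency symbol and convert to a number
    return float(text.lstrip("$"))

def test_parse_amount():
    # known input, known expected output: the test passes only when they match
    assert parse_amount("$12.50") == 12.5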

Validation is the word we should use for assessing the accuracy of a model. Unlike regular tests, a validation does not necessarily result in a pass or fail. An example would be checking whether a model, say logistic regression, makes accurate predictions of our target variable. More important than any single value is the trend over time as you continue to refactor and improve the model or add other models. You want the validation result to be stable or improve as you work on the model or introduce new ones. A bug in the code may manifest itself as a drop in the validation score. If you introduce a more sophisticated model and find that it has a lower validation score, this may indicate that you have implemented the model incorrectly, or that you do not understand the problem as well as you thought. It might be wise, especially further along in the project, to require validation scores above a certain threshold for a successful build, or at least for a deployment.
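As a sketch of what such a threshold check might look like (the data-set, model choice and threshold here are all illustrative assumptions, not any project's actual pipeline):

# validate a candidate model on held-out data and fail the build if the
# score drops below an agreed threshold
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
score = accuracy_score(y_val, model.predict(X_val))
print("validation accuracy: %.3f" % score)
assert score > 0.7, "validation score fell below the agreed threshold"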

Keeping a chart of validation score versus time is another way of showing steady progress to the client. Running these validation tests, recording the results and recreating the chart should be an automated process triggered by a new check-in or build, just as automated testing is. It may be useful to keep other metrics as well that might indicate a sudden change in the program's behavior. In many cases the validation score will be the same as the FOM; in other cases the model may be a subcomponent of the overall solution, and so its validation score may be different.
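One simple (hypothetical) way to automate the record keeping is to append each run's score to a history file that the chart is rebuilt from; the file name and fields below are illustrative.

# append a timestamped validation score so the score-versus-time chart
# can be regenerated after every check-in or build
import csv
import datetime

def record_score(score, path="validation_history.csv"):
    with open(path, "a") as f:
        csv.writer(f).writerow([datetime.datetime.now().isoformat(), score])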

Different methods of testing in the exploratory phase


Much of the early code written for analytics is of an exploratory nature. The point of such code is to learn something. The code itself is not to be thought of as a deliverable. For this reason, writing it in TDD fashion is probably not advisable as it is important to move this phase along as fast as possible. 

In addition, data-scientists have usually developed other testing methodologies that may seem foreign to software developers, especially those used to working in a language like Java. As data-science code is usually written in a rapid-development, high-productivity language such as Python or R, it lends itself to something we may call REPL-driven development. For example, a Python developer may test the following lines in the REPL or the shell to see if they work.

x = range(100)
y = [i - 10 for i in x]
z = [log(i) for i in y if i > 0]
at which point they will get an error:
NameError: name 'log' is not defined
reminding them to add the following to the top of the program:
from math import log
after which the code works as expected. Trying code in the REPL as you write it is a fairly effective way of eliminating bugs. A Java dev would likely discover this kind of bug when they try to compile, and occasionally not until runtime. If they are practicing TDD, they might catch it from a test failure.

Similarly, a DS working in these kinds of systems will often test code using visual tools. After writing code that is supposed to create an array with a saw-tooth pattern, the DS can paste a few lines into the REPL and plot the array, which will confirm whether or not it looks as expected. Another example would be code that opens a csv file, parses it and streams it line by line. Once this short script is written, it can be run in the REPL and a few lines can be printed out to check that they look as expected.
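A sketch of both kinds of quick visual check (the saw-tooth construction and the csv file name are illustrative, and matplotlib is assumed to be available):

# build the array, plot it, and eyeball whether it looks saw-tooth-like
import matplotlib.pyplot as plt

sawtooth = [i % 10 for i in range(100)]
plt.plot(sawtooth)
plt.show()

# stream a csv file and print the first few rows to see if they look right
import csv
with open("purchases.csv") as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i >= 4:
            break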

These techniques can be used to write code rapidly, testing as you go to make sure the code runs and behaves as expected. While these techniques are not a replacement for unit tests, and won't catch every bug, the speed of development often offsets the lack of quality checks for exploratory code development. The ability to move as quickly as possible and try out many ideas is often the best way to approach data-science especially in the early phases.

It is common for such scripts to evolve into a more standard piece of utility code, and when this happens, an effort should be made to add unit tests to give us more confidence in the reliability of the tool. More often than not, however, the script will teach us something and lead us off in a different direction. For example, the streaming program may show us that the data will not be useful for our purposes, and so the script will no longer be needed.

While there are many cases where this style of programming makes sense, developers should watch out for cases where more standard TDD should be applied. One example is when a unit test would be simple to write, so writing it would not cause much of a slowdown. Another is when such a script has become a dependency for many other subsequent programs. Another is when the code is complex, or the team has low confidence in its correctness and it might require refactoring; true refactoring without unit tests is not really possible.

The validation pipeline


Once the team understands the FOM and other validation scores, there is no reason why the code for implementing them can't be completed by developers rather than the DS. Having the developers take on this part has many advantages. One is that it frees up the data-scientist to concentrate on researching and developing the model. Since these tasks can happen in parallel, the validation pipeline may be completed by the time the data-scientist has a model to test.

If the validation framework is completed before a model is ready, there are test models that can be created to exercise the functioning and performance of the validation pipeline. One is a model that makes a random guess at the results. Such a model is typically easy to write and, once it exists, it can be validated using the validation code. One would expect it to show a low validation score. Another model can be created for the opposite extreme: a model we might call the cheat model. This model uses the known results from the validation data to cheat and give the right answers. Obviously, one would expect it to perform as well as possible. Even if such models seem useless, there are good reasons to create and validate them. If one finds that the cheat model does not actually perform well, we know there must be a bug somewhere in the validation pipeline, and so these models test the validation code itself. The exercise of validating both models can often lead to insights about changing the FOM. In addition, it may expose performance problems that can be dealt with before a real model has to be tested.
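A minimal sketch of the two extreme test models (the predict interface here is hypothetical and stands in for whatever the team's validation pipeline actually expects):

import random

class RandomModel(object):
    # guesses at random; its validation score should sit near the floor
    def predict(self, record_id):
        return random.choice([0, 1])

class CheatModel(object):
    # reads the answer straight from the validation labels; its score should
    # sit near the ceiling, otherwise the validation pipeline itself has a bug
    def __init__(self, validation_labels):
        self.labels = validation_labels

    def predict(self, record_id):
        return self.labels[record_id]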

The iterative process of creating models 


While the rest of the team is creating the validation pipeline, the DS is doing research on the problem and developing a real model. Having the validation pipeline finished first is similar to TDD and puts pressure on the DS to deliver a simple model as quickly as possible before trying to create a better one. Once a first model is delivered, it can be validated. It should perform better than the random model and less well than the cheat model. At this point, the team has a result to show to the client. They can demo the validation pipeline and show how a first simple model performs. In some cases they may realize that the simple model works better than expected and is good enough for the client. Often there are other sources of known error such that the usefulness of a model saturates before its validation score does.

At this point, we also have more insight into the FOM that was chosen and may find that a high FOM actually misses some other important constraints or desires. For example, we might be building a personalized recommendation system and have chosen the FOM to be simply the accuracy of predicting future purchases. We might find that our model gets a decent FOM while also noticing that everyone has been recommended the same item, a particularly popular one. At this point the team and client may realize that the FOM should include some measure of diversity in the recommendations. This change in the FOM will drive a change in the model. Now the team and DS either move on to another, more pressing problem or try to develop a better model (the usual case).
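One illustrative (and entirely hypothetical) way to fold diversity into such a FOM would be to blend prediction accuracy with the fraction of the catalogue that actually gets recommended; the weighting below is arbitrary and would have to be agreed with the client.

def recommendation_fom(accuracy, recommended_items, catalogue_size, weight=0.2):
    # diversity = fraction of distinct items that appear in the recommendations
    diversity = len(set(recommended_items)) / float(catalogue_size)
    return (1 - weight) * accuracy + weight * diversity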

Feature selection and computation


Most models require the construction of features as inputs. A feature can be thought of as a data element that is calculated from other raw-data measurements. Different models may utilize different features. 

For example, the raw data may be a file containing each individual purchase by each customer. The model may require the total spent by each customer in five different categories. These five numbers for each customer are the features that need to be computed. Choosing good features is often as important as, or more important than, choosing a good model. The work to read the raw data and compute the features is another task that the devs on the team can take on. For very large data-sets, this might involve using MapReduce in Hadoop, Spark or some aggregation feature provided by a database. The need to create features may impact the technology choices made by the team. 
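A sketch of that feature computation with pandas (the file and column names are illustrative assumptions):

# one row per individual purchase in, one row per customer out,
# with a column of total spend for each category
import pandas as pd

purchases = pd.read_csv("purchases.csv")
features = (purchases
            .groupby(["customer_id", "category"])["amount"]
            .sum()
            .unstack(fill_value=0))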

Learning a Model versus Scoring


Developing a model involves a process referred to as fitting or learning, where we tune some parameters until they make good predictions on a set that we call the training sample. A simple model, like a linear model with a fixed number of parameters, is usually easy to fit for its free parameters and usually generalizes well to other data.
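For instance, a minimal sketch of fitting such a fixed-parameter linear model with numpy (the synthetic data is illustrative):

# two free parameters, slope and intercept, fit by least squares
import numpy as np

x = np.arange(50)
y = 3.0 * x + 7.0 + np.random.normal(scale=2.0, size=x.shape)
slope, intercept = np.polyfit(x, y, 1)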

Some models are very general and have the ability to increase the number of parameters as the amount of data increases. These are referred to as non-parametric models; an example is the Support Vector Machine. Other models are parametric in the sense that the number of parameters is chosen ahead of time by the modeler, but this still leaves an ambiguity in how many parameters to choose. An example is the Random Forest algorithm, where you have to at least specify the number of trees.

In either case, the modeler wants to avoid a situation called over-fitting. Over-fitting is when the model fits the training data too well or, in the extreme, exactly. Though this may seem like a good thing, it usually results in a model that generalizes poorly, so that it does not work well on new data.

Over-fitting is avoided by using another sample called the validation sample. While the parameters are still chosen to best fit the training data, the number of parameters or, equivalently, the expressive freedom of the model is limited by ensuring that it still works well on this validation sample. There is always some trade-off between fitting the training data better and generalizing better to new data. This is called the bias-variance tradeoff.
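A sketch of how this check plays out in practice (the data-set and parameter grid are illustrative): compare training and validation scores as the model is given more expressive freedom; a growing gap between the two is the signature of over-fitting.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

for depth in (2, 5, 10, None):
    model = RandomForestClassifier(n_estimators=50, max_depth=depth,
                                   random_state=1).fit(X_train, y_train)
    # training score keeps climbing with depth while the validation score stalls
    print(depth, model.score(X_train, y_train), model.score(X_val, y_val))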

Balancing Trade-offs in data-science 


The bias-variance tradeoff described above is just one of many trade-offs involved in data-science. Others pit model performance against more practical matters, including development time, computational cost, ease of maintenance, ease of integration, ease of understanding, and flexibility or robustness in cases of data-set shift. All of these issues need to be considered by the DS and the rest of the team in close communication with the client.

Thursday, February 13, 2014

Big Data Is a Small Part of the Real Data Science Revolution

"Big Data" continues to be the biggest buzz word in business and yet I feel that most people still don't see it in the right light. As someone who has analyzed data for years, I find it very strange that there is so much focus on the size of the data, methods for storing it and transforming it, and so little on the data-analysis process itself where the real revolution is occurring. In this article, I'll argue that the buzz should be about the rising productivity of the data-scientist. The causes are many and data size and Big Data tools have little to do with it. 


Getting a handle on data and the activity of data-analysis 


Data-analysis is not so simple and requires a thorough understanding of the experimental method and statistics. Unfortunately, most people don't understand statistical analysis and the data analysis process. It isn't about mechanically churning data into information. It is a more abstract process and involves much thinking and creativity. 

Data by itself is meaningless. It's just ones and zeros. It actually isn't even that. Binary data is really just a sequence of things that can be separated into two classes and kept there with high fidelity. That's it. Data only becomes useful when we also have knowledge of the process from which it was produced and a notion of how we can compute on it to reveal something else that is interpretable and useful to us. 

The algorithm used for computing and the method of interpreting the results are both inseparably linked to the data itself. The algorithm is itself an abstraction. What we really have is a program, which is itself stored as binary data. So really computing with data is a process of taking two pieces of data, the data and the program, and letting them interact in a way that we control to create other data that humans can interpret and use. 

But what is this program that manipulates the data? Humans create it as a way of representing a model of the world, the outside world that a computer knows nothing about. A model is a set of constraints that we place on a set of abstractions (mental concepts); the data is the physical representation of those abstractions, and the program is the physical representation of those constraints.

For example, the place where you actually live is the same as the place I will end up if I follow Google's directions based on the text of your address that I have stored somewhere as data. Those things are constrained to be the same in our model. That assumption is what gives meaning to the written address.

If I go to Amazon.com and browse a particular set of books, it is almost always the case that I am interested in those books and considering buying one or more of them. I know the traces that I leave behind in Amazon's log files are not independent of my preference in books. I have a model, from introspection of my own mind, that these things are related and that this is probably the case for most people. Thus, if I am tasked with writing a recommendation system for Amazon, I have a model of how to relate people's book preferences to the traces in the data that they leave behind.

Thus every computation is based on some model and that model is not encoded anywhere in the data nor is it encoded in the program. It is encoded in the human mind. Computation, the economical kind that we actually do, is thus an interaction between the human mind and a machine.

The machine is needed only because humans are bad at storing and retrieving some kinds of data and are more error prone in carrying out logical operations. This is really the only thing that computers do and they do it very well.

What is information? A good enough definition is that information is the part of the data that is relevant or related to the abstraction being computed. The data that is not related can be called noise. Note that the split between information and noise is not intrinsic to the data: data is information only as it pertains to the model we have. These 100 bits of data are related to where you live. The next 100 bits are not. They are, however, relevant to where someone else lives, but for my model and my task, they are noise.

So data analysis starts with a human need (I need to ship you your new book) and a model for how I can do this (write the address on an envelope and drop it in the blue box). Implicit in that model is that the bits I turned into text on my screen are the necessary information that I need to write on the envelope to accomplish this goal. The computer fetches that information as requested even though it has no idea what you intend to do with it or what it means to you.

More complicated models are really no different from this. They just involve more data, perhaps a more complicated model of the world and perhaps more computation. Consider predicting the weather. This is a complicated model. It involves physics, measurement devices and the relation between the abstraction of weather and the traces left behind on these devices. Physics tells us which observables are relevant to changes in the weather. Engineering tells us how to create data closely related to those variables. Computers just help us keep a record of all that data and help us perform a complicated calculation.

A problem like this has the added complication that the data trace left behind is slightly ambiguous. There is measurement noise that is impossible to separate from the data. We don't really measure the density and pressure of every point in space. We measure estimates of these at a few points and make assumptions about the level of noise in the measurements and also about how to interpolate between the sampled points.

Despite this ambiguity, we have an excellent mathematical tool for dealing with it. We know that there are many different possibilities for how the weather will evolve. We know that we are not going to make correct predictions all the time. However, we have also learned from experience that these predictions are much better than randomly guessing the weather or eyeballing the sky. That fact validates that our assumptions are likely correct.

We conceptualize the reason why this works by using statistical reasoning. Many different scenarios are  possible but we will choose the scenario that is most consistent with the data. That really means that we choose the ones for which, if they were true, the data that we measured would not be terribly unlikely. We assume that likely things will happen rather than unlikely things and this has proven to be an effective strategy for operating in the world.

For example, yes, there could be a hurricane tomorrow, but that would be unlikely because there has never been a hurricane that did not show up on the radar the day before. Our model assumes that the physics in the future is not different from the past. Couldn't the random fluctuations in the measurement devices, by some unlucky chance, result in a radar image that doesn't look like a hurricane? Yes, that is theoretically possible but requires a very unlikely coincidence. We assume that the measurement errors are random and uncorrelated with each other and also uncorrelated with whatever is actually going on with the weather.

Now that we have spent some time talking about what data is and what it is not and the importance of the human activity of modeling we can better understand what is wrong with so much that is being said about the Big Data movement.

Big Data is a small part of the larger data-science revolution


Many people seem to be making the case that big amounts of data somehow are going to change everything. Others are a little more careful but still seem to say that big data plus all these great tools for managing it are going to change everything. Some have even asked whether the role of data scientist will soon be replaced by the algorithms themselves.

Is big data plus Hadoop plus Mahout machine learning libraries going to change everything? Absolutely not. A refutation of this idea is not difficult once you realize that many excellent data science projects can be done with small amounts of data on your laptop. The fact that this has been true for decades and did not result in much data-science being done in the business world means that the limiting reagent in this reaction has not been the existence of more data or the tools for storing and transforming it.

It has much more to do with the lack of mathematical and statistical training among people working in the business world. The thing missing has been the skill to model the world in a mathematical way. People with those skills have remained in science and engineering, or perhaps been lured into finance, but largely have not been hired into typical businesses to perform advanced statistical modeling.

So why now? What is different recently that has caused data-science to rise to prominence? I don't believe it is that, all of a sudden, we have big data, nor do I think it is because of Big Data technology tools like Hadoop and NoSQL.

The data-science productivity boom


I believe it is mostly driven by the higher productivity of the kind of workers we now call data-scientists. I know this because I have done this work for a while, and I know that I am much more productive than I used to be, and it's not just the result of more experience. This productivity can be attributed to many things, and most of them have nothing to do with Big Data.

One huge driver is the availability of open source data analysis platforms like R and Python's scientific stack. When I first started doing analytics in astronomy in the 90s, we worked in Tcl or Perl and then relied on low-level languages like C and Fortran when we needed better performance. For plotting some people used gnuplot or pgplot or super mongo. This was all duct-taped together with scripts, often using IO as a medium for communicating between programs.

When you needed some advanced algorithm, you would pull out a book like Numerical Recipes or Knuth and copy the code into your editor or translate it to your preferred language. This took a long time, was error prone, and people were limited to whichever books they had on their desks.

So what's different today? The internet, obviously, is the biggest difference. We can search for mathematical ideas, algorithms and libraries using Google. Wikipedia and Google are fantastic tools for learning math and mathematical methods, especially when you are looking for connections between ideas.

Open-source repositories like those on GitHub are an enormous boon for data-science productivity. Consolidation in the tools being used and the development of communities around data analytics help enormously with sharing code and ideas.

The continual advance of computing hardware technology has of course been a wind at our back as well.  But before "Big Data" tools we had other big data tools. HDFS is not the first distributed filesystem. We used GFS at Fermilab in the 90s. Hadoop is a useful framework for doing batch processing with data locality awareness but it isn't that much different from what we had built in the past for astronomical data analysis with perl scripts and NFS mounted disks. It's a more general framework and can help avoid repeated work on subsequent projects of a similar nature but it doesn't truly enable anything new.

To sum up: what is different today is that I can learn faster by utilizing all the resources on the internet. I can avoid reinventing the wheel by downloading well-tested libraries for analytics. I can spend more time working in a high-productivity language like Python or R rather than chasing memory leaks in low-level C code. I can communicate with other data scientists online, notably at sites like Stack Overflow and Math Overflow, if I can't answer my question simply by searching. There is much more communication between the related fields of statistics, computer science, math and the physical sciences. This emerging interdisciplinary activity, and its application to problems outside of academia, is being called data-science.

So productivity is the key. I'd estimate that my productivity is three times higher than 10 years ago. That means that 10 years ago, a company would have to pay me three times more money to accomplish the same task. While I'd love to think that I was worth that kind of pay, I suspect that it is more likely that I was not.

While all of these things developed gradually over the past 15 years or so, there comes a point where productivity is high enough to warrant the creation of new jobs, such as the data-scientist. Crossing that point probably only happened a few years back for most of today's data-scientists, and productivity increases will continue. As with any emerging industry, momentum gathers, creating feedback loops. VCs start funding new analytics startups and we get things like MongoDB and Tableau. The media gets involved and talks up how data-science and Big Data are about to change everything. All of this helps to drive more activity in creating more productivity-enhancing tools and services. Pay and stature for data-scientists rise, attracting people who are unhappy in academic appointments. All of this is self-reinforcing ... at least for a while.

So what is Big anyhow?


Where does this leave "Big Data"? Is it all a farce? Certainly not, if you apply the label to the right things. Some companies like Facebook, Google and Netflix really, really do have big data. By that I mean that the problem of working at those scales is nothing like doing data analysis on your laptop. Many of the tools for working with data of that size really are extremely important to them. Still, the fact is that most companies are nothing like that. With some prudent selection, sampling, compression and other tricks, you can still usually fit a company's main data-set on your laptop. We have terabyte hard drives now and 16 GB of memory. If not, you can spin up a few servers in the cloud and work there. This really isn't any different from the past. It is easier and cheaper but not really much different.

The most important advances in Big Data research, in my opinion, are those happening in the areas of processing data streams, data compression and dimensionality reduction. Hadoop by itself is really just a tool or framework for doing simple operations on large data-sets in batch mode. Complex calculations are still quite complex and time consuming to code. The productivity of working in Hadoop, even with higher-level interfaces, is still nowhere near that of working with a smaller data-set locally. And the fact is that for 99% of analyses it is not the right tool, or at least not the first tool you should reach for.

Advances in machine learning are certainly major drivers of data-science productivity, though this too isn't just applicable to big data. Machine learning's main use case is problems of high data-richness exhibiting very complex structure, such as automated handwriting recognition and facial recognition.

Many have claimed that real Big Data is coming in the form of sensor data, the internet of things and the like, and this certainly seems to be the case. For this we should look to fields such as astronomy and physics, which had been dealing with large amounts of sensor data for years before the arrival of Hadoop and the Big Data toolkit. The key, as it has been before, will likely be better algorithms, smart filtering and triggering mechanisms, and not the brute-force storage and transformation of enormous, information-poor data-sets, which seems to be the modus operandi of current Big Data platforms.

Data reduction and information extraction 


Big Data is indeed a big deal but I think it is less a big deal than the emergence of data-science as a profession and I don't think these things should be conflated. Furthermore, as hinted at above, extracting more information from more and more data requires foremost the ability to construct more expressive and accurate mathematical models and the skills and tools to quickly turn these into functioning programs.

The fact that you only need a small subset of the data for most statistical analyses is not well understood by people without much background in statistics and experimental science. The amount of data required to constrain a given model is in fact calculable before you even see the data. When I worked in astronomy and helped plan new NASA satellites, this was much of my occupation. That's because gathering data by building and launching a satellite is expensive, so you only propose a mission expensive enough to gather the minimal amount of data needed to answer your question. The math to do this is Bayesian statistics, and without a basic understanding of it you can't reason about the amount of data required for a given model or know when your data-set is large enough to benefit from more detailed modeling.

When you do have your hands on a big data-set, your goal as a data-scientist is to reduce the data. The term data reduction, while ubiquitous in scientific settings, is something I rarely hear mentioned by business people. In short, it means compressing down the data size by extracting the information and casting aside the rest. A physics experiment like the LHC in Geneva takes petabytes of data per second, quickly looks for interesting bits to keep and immediately deletes the rest. It is only simplifying a little to say that at the end of the day they are only interested in about a single byte of data. Does the Higgs boson exist? That's a single bit of data. What is its mass to 10% accuracy? That's another few bits of information; data reduction par excellence.

Now if data is cheap and just lying around the company anyway, gathering more than you need might not be that big of a deal. But if you gather 10x more data than you need, you're just slowing down the computation and creating more work for yourself. The key thing to understand is that for a fixed model, you saturate its usefulness rather quickly. To simplify a bit, the statistical error usually scales as the inverse of the square root of N. If your one-parameter model is only expected to be 10% accurate, you probably get there around N = 100. That terabyte of data on HDFS is just not going to help.
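A quick sketch of that scaling (simulating the error on a simple sample mean; the sample sizes and number of trials are arbitrary): the average error tracks 1/sqrt(N), so piling on more data quickly stops paying off for a fixed, simple model.

import math
import random

for n in (10, 100, 1000, 10000):
    # average absolute error of the sample mean over 200 repeated experiments
    trials = [abs(sum(random.gauss(0, 1) for _ in range(n)) / n)
              for _ in range(200)]
    print(n, sum(trials) / len(trials), 1 / math.sqrt(n))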

So extracting more and more useful information from more and more data requires a more accurate model. That can mean more variables in a linear model, or a completely different model with much higher complexity and expressivity. The Big Data tools don't help you here. You simply have to think. You have to learn advanced statistics. Fortunately, thinking about these problems, learning useful methods and borrowing tools for dealing with them is much easier today than it used to be. Again, this shows that cultural cohesion, more effective learning and communication channels, and tools for dealing with complexity rather than size are more important than tools for doing simple manipulations on enormous data-sets.

This new blog

This blog is a place to write about data science.