Monday, February 17, 2014

Best practices for data-science projects

In the following, I'll explain a way in which a data scientist (DS) can work effectively with an agile software delivery team. The ideals we want to promote include:

  • Keeping the team aware of what the DS is doing at a high level
  • Making sure the devs are involved with the DS's work, do some pair programming with him/her, and take on much of the data-science-related software development tasks for which they may be better suited anyway.
  • Breaking data-science stories up into small pieces and managing them in a normal agile fashion
  • Keeping to normal development practices of automated testing and validation
  • Avoiding the situation where the DS's work becomes a black box that no one else on the team or the client understands.
  • Avoiding the situation where the DS becomes a bottleneck for the team (i.e. a bus factor of one).
  • Enabling the BAs to communicate the process and progress effectively to the client.

We will assume that the project has been through the normal inception process and that its goals are well established and agreed upon by all. The whole team knows the problem the DS will be trying to solve, as well as the problems they will be tackling themselves.

Data-science 101:


Before doing any data-science, it is important to explain the data-science process and terminology to the entire team. This document can be thought of as a step in that direction. Let's assume the project involves building either a predictive or prescriptive model for the client to better accomplish some goal. The whole team needs to understand terms such as the FOM (figure of merit, defined below), the validation and testing methodology, and how to write stories for analytics. Some time should be devoted to explaining how data-science is a less linear process than standard development, as many decisions can only be made after some analysis of the data has begun. It will often involve more backtracking and changes of direction.

The Figure of Merit 


First, everyone needs to understand the goal and how progress towards that goal can be made quantitative. We can call this the Figure of Merit (FOM), a term I'll borrow from the language of scientific collaborations. An example might be the accuracy of a prediction, or a measure such as precision, recall or the F-measure. It is also often necessary to start with a preliminary FOM and adapt it as we better understand the goals of the client or learn about other constraints. Regardless, at any point we should have a formal definition of the FOM, and stories should only be written with the goal of improving it. The client should understand the FOM and take an active part in creating and adapting it. It should not be merely a technical term understood by the developers.
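As a concrete illustration (a minimal sketch only; the real FOM is whatever the team and client agree on), precision, recall and the F-measure can be computed from simple prediction counts:

# A minimal sketch of one possible FOM: the F-measure computed from
# true-positive, false-positive and false-negative counts. Illustrative only.
def f_measure(true_pos, false_pos, false_neg):
    precision = true_pos / float(true_pos + false_pos)
    recall = true_pos / float(true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Example: 80 correct positives, 20 false alarms, 10 misses
print(f_measure(80, 20, 10))  # ~0.84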

Validation versus testing


All successful software developers now make testing a ubiquitous part of their coding practices. Unit tests are written for every function or method, and integration tests are used to make sure parts of the codebase interact properly. Testing analytics code can and should be done as well; however, there are some key differences that need to be understood by all. There is also a related idea called validation which should not be confused with testing.

To reduce confusion, we should reserve the word testing to indicate the normal kind of testing that developers do. You test whether a program works as expected by giving it inputs for which the desired outputs are known. It passes the test when it delivers these expected results.

Validation is the word we should use for assessing the accuracy of a model. Unlike regular tests, this is not necessarily something that results in a pass or fail. An example would be checking whether a model, say logistic regression, makes accurate predictions of our target variable. More important than the actual value is its trend over time as you continue to refactor and improve the model or add other models. You want this validation result to be stable or improve as you work on the model or introduce new ones. A bug in the code may manifest itself as a drop in the validation score. If you introduce a more sophisticated model and find that it has a lower validation score, this may indicate that you have implemented the model incorrectly, or that you do not understand the problem as well as you thought. It might be wise, especially further along in the project, to require validation scores above a certain threshold for a build to succeed, or at least for a deployment to proceed.
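As a rough sketch of that last idea (the threshold, the score_model function and its return value are all hypothetical placeholders), a build step could compute the validation score and fail when it falls below the agreed threshold:

# Hypothetical validation gate for a build. score_model() stands in for
# whatever runs the current model against the validation sample.
import sys

THRESHOLD = 0.75  # agreed-upon minimum validation score (illustrative)

def score_model():
    return 0.81  # placeholder for the real validation run

score = score_model()
if score < THRESHOLD:
    sys.exit("Validation score %.3f is below threshold %.3f" % (score, THRESHOLD))
print("Validation score %.3f - OK" % score)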

Keeping a chart of validation score versus time is another way of showing steady progress to the client. Running these validation tests, recording the results and recreating this chart should be an automated process triggered by a new check-in or build, just as automated testing is. It may be useful to keep other metrics as well that might indicate a sudden change in the behavior of the program. In many cases the validation score will be the same as the FOM; in other cases the model may be a subcomponent of the entire system, and so its validation score may be different.
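One simple way to keep that history (a sketch only; the file name and format are arbitrary) is to append each run's score with a timestamp so the chart can be regenerated automatically:

# Append each validation run to a small csv so the score-versus-time chart
# can be rebuilt after every check-in. File name and format are illustrative.
import datetime

def record_score(score, path="validation_history.csv"):
    with open(path, "a") as f:
        f.write("%s,%f\n" % (datetime.datetime.now().isoformat(), score))

record_score(0.81)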

Different methods of testing in the exploratory phase


Much of the early code written for analytics is of an exploratory nature. The point of such code is to learn something. The code itself is not to be thought of as a deliverable. For this reason, writing it in TDD fashion is probably not advisable as it is important to move this phase along as fast as possible. 

In addition, data scientists have usually developed some other testing methodologies that may seem foreign to software developers, especially those used to working in a language like Java. As data-science code is usually written in a rapid-development, high-productivity language such as Python or R, it lends itself to something we may call REPL-driven development. For example, a Python developer may try the following lines in the REPL or the shell to see if they work:

x=range(100)
y=[i-10 for i in x]
z=[log(i) for i in y if i > 0]
at which point they will get an error:
NameError: name 'log' is not defined
reminding them to add the following to the top of the program:
from math import log
after which the code works as expected. This method of trying code in the REPL as you write is a fairly effective way of eliminating bugs as you go. A Java dev would likely discover these kinds of bugs when they try to compile, and occasionally will not discover them until runtime. If they are practicing TDD, they might catch them from a test failure.

Similarly, a DS working in these kinds of systems will often test code using visual tools. After writing code that is supposed to create an array with a saw-tooth-like pattern, the DS can usually paste a few lines into the REPL and plot the array, which will confirm whether or not it looks as expected. Another example would be code that opens a csv file, parses it and streams it line by line. Once this short script is written, it can be run in the REPL and a few lines can be printed out to check whether they look as expected.
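A minimal sketch of those two checks (the file name and column layout here are invented for illustration):

# Quick visual and REPL checks of the kind described above.
import matplotlib.pyplot as plt

saw = [i % 10 for i in range(100)]  # should look like a saw-tooth when plotted
plt.plot(saw)
plt.show()

# Stream the first few lines of a csv to see whether the parsing looks sane.
# The file name is a hypothetical example.
import csv
with open("purchases.csv") as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i >= 4:
            break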

These techniques can be used to write code rapidly, testing as you go to make sure the code runs and behaves as expected. While these techniques are not a replacement for unit tests, and won't catch every bug, the speed of development often offsets the lack of quality checks for exploratory code development. The ability to move as quickly as possible and try out many ideas is often the best way to approach data-science especially in the early phases.

It is common for such scripts to evolve into more standard pieces of utility code, and when this happens, an effort should be made to add unit tests to give us more confidence in the reliability of the tool. More often than not, however, the script will teach us something and lead us off in a different direction. For example, the streaming program may show us that the data will not be useful for our purposes, and so the script will no longer be needed.

While there are many cases where this style of programming makes sense, developers should watch out for cases where more standard TDD should be applied. One example is when a unit test would be simple to write, so that adding it would not slow things down much. Another is when a script has become a dependency for many other programs. Another is when the code is complex, or the team has low confidence in its correctness and it might require refactoring. True refactoring without unit tests is not really possible.
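When a script does graduate into shared utility code or needs refactoring, even a small test adds confidence. A minimal sketch (the parse_row helper here is hypothetical):

# Sketch of a unit test for a small parsing helper once the script has become
# a shared utility. parse_row is a hypothetical function that turns a raw csv
# row into a (customer_id, amount) pair.
import unittest

def parse_row(row):
    return row[0], float(row[1])

class ParseRowTest(unittest.TestCase):
    def test_parses_amount_as_float(self):
        self.assertEqual(parse_row(["c42", "19.99"]), ("c42", 19.99))

if __name__ == "__main__":
    unittest.main()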

The validation pipeline


Once the team understands the FOM and other validation scores, there is no reason why the code implementing them can't be written by developers rather than the DS. Having the developers take on this part has many advantages. One is that it frees up the data scientist to concentrate on researching and developing the model. Since these tasks can happen in parallel, the validation pipeline may be completed by the time the data scientist has a model to test.

If the validation framework is completed before a model is ready, there are test models that can be created to exercise the functioning and performance of the validation pipeline. One such model simply makes a random guess at the results. It is typically easy to write, and once it exists it can be validated using the validation code; one would expect it to show a low validation score. Another model can be created for the opposite extreme, which we might call the cheat model. This is one where we use the known result from the validation data to cheat and give the right answer. Obviously, one would expect this model to perform as well as possible. Even if such models seem useless, there are good reasons to create and validate them. If one learns that the cheat model doesn't actually perform well, we know there must be a bug somewhere in the validation pipeline, and thus we have tested the validation code. This exercise of validating both models can often lead to insights about changing the FOM. In addition, it may point out performance problems that can be dealt with before having to test a real model.
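A sketch of what those two baseline models might look like (the interface here, plain functions fed to the validation pipeline, is an assumption):

# Sketch of the two baseline models described above. The interface is an
# assumption; the validation pipeline would call these like any real model.
import random

def random_model(features, possible_labels):
    # guesses a label at random; should produce a low validation score
    return [random.choice(possible_labels) for _ in features]

def cheat_model(features, known_answers):
    # "predicts" the known validation answers; should score as well as
    # possible, otherwise the validation pipeline itself has a bug
    return list(known_answers)

features = [[0.2], [0.9], [0.4]]
truth = [0, 1, 1]
print(random_model(features, [0, 1]))  # expect a low validation score
print(cheat_model(features, truth))    # expect the best possible score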

The iterative process of creating models 


While the rest of the team is creating the validation pipeline, the DS is doing research on the problem and developing a real model. Having the validation pipeline finished first is similar to TDD and puts pressure on the DS to deliver a simple model as quickly as possible before trying to create a better one. Once a first model is delivered, it can be validated. It should perform better than the random model and less well than the cheat model. At this point, the team has some result to show to the client. They can demo the validation pipeline and show how a first simple model performs. In some cases they may realize that the simple model works better than expected and is good enough for the client. Often there are other known sources of error such that the usefulness of a model saturates before its validation score does.

At this point, we also have more insight into the FOM that was chosen and may find that a high FOM actually misses some other important constraints or desires. For example, we might be building a personalized recommendation system and have chosen the FOM to be simply the accuracy of predicting future purchases. We might find that our model gets a decent FOM while realizing that everyone has been recommended the same item, a particularly popular one. At this point the team and client may realize that the FOM should include some measure of diversity in the recommendations. This change in the FOM will drive a change in the model. Now the team and DS either move on to another, more pressing problem or try to develop a better model (the usual case).
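One illustrative way to quantify such a diversity term (a sketch, not a standard definition) is the fraction of distinct items among all recommendations made:

# Illustrative diversity measure for a recommender: the fraction of distinct
# items among all recommendations. If everyone is recommended the same
# popular item, this drops towards zero.
def recommendation_diversity(recommendations):
    all_items = [item for recs in recommendations for item in recs]
    return len(set(all_items)) / float(len(all_items))

print(recommendation_diversity([["a", "b"], ["a", "c"], ["a", "b"]]))  # 0.5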

Feature selection and computation


Most models involve the need to construct features as inputs to the model. Features can be thought of as data elements calculated from other, raw-data measurements. Different models may utilize different features.

For example, the raw data may be a file containing each individual purchase by each customer. The model may require the total spent by each customer in 5 different categories. These five numbers for each customer are the features that need to be computed. Choosing good features is often as important as, or more important than, choosing a good model. The work to read the raw data and compute the features is another task that the devs on the team can be involved in. For very large data-sets, this might involve using MapReduce in Hadoop, Spark or some aggregation feature provided by a database. The need to create features may impact technology choices by the team.
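A sketch of that feature computation (the column names, categories and the use of pandas are all illustrative choices):

# Sketch of computing per-customer spend-by-category features from raw
# purchase records. Column names and categories are invented.
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c2"],
    "category": ["food", "books", "food", "food", "toys"],
    "amount":   [12.0, 30.0, 8.0, 4.0, 25.0],
})

features = purchases.pivot_table(index="customer", columns="category",
                                 values="amount", aggfunc="sum", fill_value=0)
print(features)  # one row per customer, one column of total spend per category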

Learning a Model versus Scoring


Developing a model involves a process referred to as fitting or learning, where we tune some parameters until the model makes good predictions on a set that we call the training sample. A simple model, such as a linear model with a fixed number of parameters, is usually easy to fit and usually generalizes well to other data.

Some models are very general and have the ability to increase the number of parameters as the amount of data increases. These are referred to as non-parametric models; Support Vector Machines are an example. Other models may be parametric in the sense that the number of parameters is chosen ahead of time by the modeler, but this still leaves an ambiguity in how many parameters to choose. An example is the Random Forest algorithm, where you have to at least specify the number of trees.

In either case, the modeler wants to avoid a situation called over-fitting. Over-fitting is the situation where the model fits the training data too well or, in the extreme, exactly. Though this may seem like a good thing, it usually means the model generalizes poorly, so that it does not work well on new data.

Over-fitting is avoided by using another sample called the validation sample. While the parameters are still chosen to best fit the training data, the number of parameters or, equivalently, the expressive freedom of the model is limited by ensuring that it still works well on this validation sample. There is always some trade-off between fitting the training data better and generalizing better to new data. This is called the bias-variance tradeoff.
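A sketch of using a held-out validation sample to watch for over-fitting (the data here is random noise purely for illustration, and scikit-learn is just one convenient choice of library):

# Sketch of fitting on a training sample and checking a held-out validation
# sample. A large gap between the two scores suggests over-fitting; the data
# below is random noise for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(500, 5)
y = np.random.randint(0, 2, 500)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)
model = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
print("training accuracy:  ", model.score(X_train, y_train))
print("validation accuracy:", model.score(X_val, y_val))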

Balancing Trade-offs in data-science 


The bias-variance tradeoff described above is just one of many trade-offs involved in data-science. Others pit model performance against more practical matters: time to develop, computational cost, ease of maintenance, ease of integration, ease of understanding, and flexibility or robustness in cases of data-set shift. All of these issues need to be considered by the DS and the rest of the team in close communication with the client.
