Saturday, January 3, 2015

Granville regression?



granville.R

davej — Jan 3, 2015, 7:38 PM

#Granville regression
library(huge)
library(MASS)
library(ggplot2)

create.test.data.random= function(){
  #normally distributed data with non-trivial correlation structure
  dimension = 50
  num.observations = 10000
  prob.correlated = 0.2
  L = huge.generator(n=num.observations,d=dimension,graph="random",prob=prob.correlated,v=5.0)
  plot(L)
  par(mfrow=c(1,1))
  df=data.frame(L$data)
  #normalize
  df=scale(df)
  df=data.frame(df)
  return(df)
}

granville.coeffs.raw = function(df){
  #one-at-a-time "Granville" coefficients: cov(X1, Xi) / var(Xi) for each
  #predictor Xi, ignoring correlations between the predictors
  cov.matrix=cov(df)
  coeffs=c()
  n=ncol(df)
  for (i in 1:n) {
    coeffs[i]=cov.matrix[1,i]/cov.matrix[i,i]
  }
  coeffs=coeffs[2:n]
  return(coeffs)
}

lm.coeffs=function(df){
  n=ncol(df)
  linear.model=lm(X1~.,data=df)
  coeffs=linear.model$coefficients
  #ignore the intercept
  coeffs=coeffs[2:n]
  return(coeffs)  
}

ridge.coeffs=function(df,lambda=0.1){
  n=ncol(df)
  ridge.model=lm.ridge(X1~.,data=df,lambda=lambda)
  #$coef already excludes the intercept, so keep all n-1 slope coefficients
  coeffs=ridge.model$coef
  coeffs=coeffs[1:(n-1)]
  return(coeffs)
}

test.granville = function(){
  df=create.test.data.random()
  truth=lm.coeffs(df)
  granville.raw = granville.coeffs.raw(df)
  ridge.coeff=ridge.coeffs(df)
  par(mfrow=c(1,2))
  plot(truth,granville.raw,xlab='True Coefficients')
  title('Granville')
  plot(truth,ridge.coeff,xlab='True Coefficients')
  title('Ridge')
}

Thursday, May 8, 2014

Big Data is not a noun

The Economist has an article out, "The backlash against big data", pointing out recent reports and comments criticizing the concept.

“BOLLOCKS”, says a Cambridge professor. “Hubris,” write researchers at Harvard. “Big data is bullshit,” proclaims Obama’s reelection chief number-cruncher. A few years ago almost no one had heard of “big data”. Today it’s hard to avoid—and as a result, the digerati love to condemn it. Wired, Time, Harvard Business Review and other publications are falling over themselves to dance on its grave. “Big data: are we making a big mistake?,” asks the Financial Times. “Eight (No, Nine!) Problems with Big Data,” says the New York Times. What explains the big-data backlash?
The article more or less gets it right in agreeing with the specifics of the criticism while maintaining that there is still lots to love about Big Data. I appreciate the shout-out to astronomical surveys as the place where the term originated, though it came from journalists, not us scientists.

One thing that unquestionably does amount to Big Data is all the nonsense written about it by writers and media outlets riding the hype cycle. And yes, hopefully that is starting to crest. If you were planning on writing a book full of vague cheerleading for Big Data revolutionizing the world, you might be out of luck.

I think most of the confusion and mis-information could be avoided if people just stopped using Big Data as a noun. It's not a noun. It's an adjective. There are Big Data technologies and Big Data developers and Big Data architectures. But there is really nothing called Big Data. Big Data isn't going to change the world because an adjective can't do anything. People do things with data. That's called either computation or statistics.

I'm fine with the term Data Science to describe the emerging occupation of applying statistics, data visualization and machine learning to business problems. That's actually a noun and can in principle do great things. Big Data technologies like Hadoop, Spark, the cloud and NoSQL databases have also been very successful and help us do data science faster, cheaper and better on data sets large enough to require distributed storage and processing. But these technologies are just about handling data so that people with good ideas about how to use the data can work more productively.



Wednesday, May 7, 2014

The Shark in the data lake

The concept of the data warehouse being the central repository of data for the enterprise has been around for decades. Relational databases and normalized star schemas are almost as ingrained in IT as concrete and rebar are to the construction business. These data stores do make sense for many purposes but have an inherent weakness. The careful data modeling that allows them to slice and dice efficiently also comes at the price of strong interdependency. Changing a column or an index in a table could have ripple effects all over the data warehouse and affect many different applications.

IT workers have a good solution to this problem. Just stop requesting changes! For a mature company with a rather static model, this might suffice, but for most companies, whose business model and strategy are constantly in flux, the tight coupling of the data warehouse is an impediment to growth. In fact, the model of the data warehouse envisioned as a final state, while perhaps applicable to billing and basic reporting, is not well suited for the analytics of business intelligence that inform executives how to navigate the waters of business.

The idea of the "data lake" is emerging as an alternative model for many companies. The idea is somewhat of a throwback to earlier times before databases when working with flat files was the norm. What is different is now we can store everything in a scalable distributed filesystem and use technologies such as Hadoop or Spark to make this practical. The data lake need not be the only resting place of data. It can be a base for feeding the data warehouse or perhaps other NoSQL databases. What's important however is that it need not just be landing ground for data. There are many use cases where building apps on top of the data lake is a better solution.

One of the major benefits of Hadoop or Spark is that they are schema-on-read. That is, once you have written a file reader, you can decide what to keep and how to structure it, or decide not to structure it at all. In effect, it gives enormous freedom to developers, data scientists and analysts. If you want the data in a different form, you don't have to send a change request to the data modeling team for review, you just change a few lines of code. This means that analysts can get at any data source quickly, test to see if it drives business value and even build prototype applications without having to go through a heavy change process.
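To make schema-on-read concrete, here is a minimal PySpark sketch. The file path and column layout are hypothetical; the point is that the "schema" lives in the reading code, so changing it is just changing a few lines.

from pyspark import SparkContext

sc = SparkContext(appName="schema-on-read-sketch")

# raw, unmodeled event data sitting in the data lake (hypothetical path)
lines = sc.textFile("hdfs:///datalake/raw/events/*.csv")

# the "schema" is decided here, in code: pick the fields you care about today
def parse(line):
    fields = line.split(",")
    return (fields[0], float(fields[2]))   # (customer_id, amount), by assumption

# total amount per customer, computed straight off the raw files
totals = lines.map(parse).reduceByKey(lambda a, b: a + b)
print(totals.take(5))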

Apache Spark seems to be taking over as the next generation Hadoop. Its Shark interface is a SQL-like query system similar to Hive. What you get is better use of in-memory data structures and caching to increase speeds by another factor of 10-100, and it allows for more general operations than MapReduce. It's also a lot more developer friendly with better APIs in Java, Scala and Python. If the data lake was exciting you before, now it has a Shark in it!

While going from a highly regulated data warehouse environment to a free-for-all data lake may seem a little wild-west for conservative companies, most concerns can be put aside. First, it doesn't mean giving access to all of your 3000 employees. At most medium to large corporations that we work with, we find that there are only a handful of people doing advanced analytics. Most will still access data through special purpose applications.

I'm sold, you may say, but how do we get from here to there? Certainly the data warehouse is going to be around for a while as it supports most of your apps. Introducing a data lake into a data warehousing environment can be done gently and gradually. At the first stage it can be used as a data landing zone. You are probably already doing that using some unix file system on a few big servers. Now you're just going to use HDFS as a file system, which is not a big change, and once you are here, you have immediate ad-hoc access via Spark. The next stage could be replacing your ETL scripts with Spark scripts or using commercial ETL products built on top of Spark. The final state is actually a system that allows for rapid change while keeping the data warehouse for the parts of your business that really don't require constant change. Now your data scientists, your executives and your IT guys are all getting what they want.









Tuesday, March 25, 2014

Transitioning to Data-science from physics, astronomy and other sciences


This document is to help scientists transition into the field of data-science. At ThoughtWorks, I interview a lot of DS candidates and get a lot of applications from people graduating with science PhDs or from postdocs. These people are usually very smart, know quite a bit about analyzing data and usually have sufficient math skills for the job. However, they are typically lacking in most of the other areas that are required for the job. They will still find job opportunities in industry and may very well become data-scientists but probably won't attract the interest of ThoughtWorks and other highly competitive places.

The good news is that these scientists can typically pick up most of these skills in less than a year. So if you want to get a data-science job at very competitive places, you can, but you gotta prepare yourself. The time to get working on these things is while you still have some time left in your present position, not after your postdoc has finished. Plus, learning this stuff is awesome and will make you way more productive.

What you might look like


If you're like most scientists that we interview or consider interviewing, you can program in C, C++, perl, perhaps Fortran and maybe some Python. You might also use IDL, Matlab or Mathematica. You are pretty good at the unix toolkit (grep, awk, sed etc). You never took a programming class and probably never took a stats class. You just taught yourself. This is about the minimal level of what we would consider "programming skills" and unless you bring some other fantastic skills, you probably wouldn't make it through many interviews (you'll get about 8 interviews at TW if you get hired).

What we like to see


We like to see a lot more indicating that you are interested in programming languages. Python and R are the two most important languages for data-science by far. Becoming fairly expert in one of these languages is probably the most important thing you can do. That said, we like to see that people have written web-apps in Ruby or Django or even PHP or have used Java and possibly some of the newer languages for the JVM. I don't like Java. Nobody seems to like Java these days but it's very hard to avoid it in industry and even with the cool languages like Clojure or Scala, you need to know something about the JVM ecosystem.

What we don't expect to see


We don't expect you to be a professional developer and know much about testing and build frameworks and all the professional dev practices. It would be great but is not expected. We have another term for people similar to data-scientists which is closer to a developer or database admin: the Data-engineer. Such a person is an expert at Hadoop, enterprise scale Big Data tools etc. These people were usually developers in the past not scientists. As a data-scientist, you should know something about these toolsets but you don't need to be an expert (though that would help immensely).

The modern data-science toolkit

The following are the skills that a scientist needs to get some practice with.


Programming Languages


A data scientist needs to be very good at either Python or R and should at least know a little about the other one. You should be able to read in a csv file of data and make histograms and plots very quickly. You should be able to fit various models to data. You can use Google to look up forgotten commands (we all do) but it should not take you 30 minutes of searching. If you're a scientist, you are more likely to know Python but at least give R a look. You can do many powerful things in R with just a little knowledge and it is a language written to be used by scientists, so you're in luck. Python is a much more general language and generally feels like it is written to be used by people who can program fairly well.
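As a rough illustration of that baseline level, here is a minimal Python sketch, assuming a hypothetical data.csv with numeric columns x and y:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")        # hypothetical file with columns x and y

df["y"].hist(bins=30)               # quick histogram of one column
plt.show()

plt.scatter(df["x"], df["y"])       # quick scatter plot
plt.show()

slope, intercept = np.polyfit(df["x"], df["y"], 1)   # simple linear fit
print(slope, intercept)

If pulling together something like this takes an afternoon of searching rather than a few minutes, that's the gap to close first.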

Vital Python packages to know: Numpy, Scipy, Matplotlib
Good Python packages to know: Pandas, numba

Vital R packages to know: data.table, ggplot2. RStudio IDE is very useful.
Good R packages to know: RMySQL, anything written by Hadley Wickham, Shiny.

Other data-sciency languages


There are a few other languages that are up-and-coming in data-science. Some places will be using them already. At ThoughtWorks they are certainly bonuses. 
  • Clojure - A lisp, probably the coolest thing on the JVM
  • Scala - Some love it, some think it is disastrously over-complicated. Used by Twitter. Scala is actually very easy to get going with. Probably very difficult to master.    
  • Julia - The Julia language is awesome but not very mature and probably not yet ready to take over but has a decent chance of being the next big thing. Plays well with Python. 
Knowing other languages like C/C++ and especially Java is very useful when you need speed and can't find a library to call from Python or R.

Databases


You should be pretty familiar with SQL and the relational databases that use it. MySQL and Postgres seem to be the most popular open source ones.  You should be able to connect to these programmatically from R/Python and do some useful things with them. 
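The Python DB-API makes this look roughly the same across relational databases. Here is a minimal sketch using the built-in sqlite3 module; for MySQL or Postgres you would swap in a driver such as MySQLdb or psycopg2, and the table here is hypothetical.

import sqlite3

conn = sqlite3.connect("example.db")   # a MySQL/Postgres driver works the same way
cur = conn.cursor()

# pull a small aggregate out of a hypothetical purchases table
cur.execute("SELECT customer_id, SUM(amount) FROM purchases GROUP BY customer_id")
for customer_id, total in cur.fetchall():
    print(customer_id, total)

conn.close()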

But SQL is not the end of the story any more. So-called NoSQL databases have become very important and make horizontal scaling, fault tolerance and interaction with massive data-sets possible. I recommend people buy and study Seven Databases in Seven Weeks and NoSQL Distilled, the latter by our own Martin Fowler. The following is a list of what I think are the most important for data-science.

  • MongoDB - A document database that can also be used as a key-value store
  • Hbase - Industrial strength columnar database often used with Hadoop
  • Cassandra - Similar to Hbase but with some different characteristics. 
  • Redis - An in-memory database for fast key-value access. Supports many data-structures as values.
  • Neo4J - A graph database. Less commonly used but still worth knowing about. Is great for the right kind of problems.


Hadoop and its ecosystem


Hadoop is not a database but rather a framework for doing large batch-style computations with MapReduce using a cluster of distributed nodes. Hadoop gets so much attention that you'd think Big Data and data-science are mostly about Hadoop. I am sure there are plenty of jobs where the data-scientists do nothing but write Hadoop code. But you don't want those jobs. Again, we reserve the title Data-Engineer for people who are Hadoop experts.

Hadoop in short is a system for shipping the code to the data instead of the other way around, which has obvious benefits. Java is good for this because you can ship the compiled classes and they will just run on the JVM of each node, and it's also pretty fast. Hadoop being written in Java and running on the JVM is the main reason why data-scientists can't avoid the JVM. You can write Java MapReduce programs if you're learning Hadoop or otherwise want to decrease your productivity, but most people choose some higher level way of interfacing with Hadoop. The following is a list of ways of doing this that seem popular.

  • Hadoop streaming - Use the streaming interface, which allows you to write your code in any language: Python, perl, etc. You just need to read from stdin and write to stdout, like you do with unix tools (a small sketch follows this list). This is an easier, gentler way to approach Hadoop but has some performance tradeoffs. 
  • Cascading - This is a Java interface that provides a much nicer API. Used by other methods as well. 
  • Cascalog - Clojure library built on cascading. Great if you like Clojure. 
  • Scalding - Twitter's open source interface built on cascading in Scala. Great if you like Scala. 
  • Pig - A new language for making MapReduce queries. Perhaps falling out of favor.  
  • Hive - Allowing you to write SQL on Hadoop with some additions and subtractions of features. 
  • Hue - An app that bundles many of these Hadoop tools with a nice GUI interface.
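To give a sense of the streaming interface, here is a minimal word-count sketch in Python. Hadoop streaming pipes the input splits into the mapper's stdin and the sorted mapper output into the reducer's stdin; everything below is just ordinary stdin/stdout handling.

# mapper.py - read raw text from stdin, emit "word<TAB>1" per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - input arrives sorted by key, so counts can be summed per run of equal keys
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))

The job is then launched with the hadoop-streaming jar, passing something like -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py plus input and output paths; the exact invocation depends on your installation.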
There are plenty of other pieces to the Hadoop puzzle. The best reference is Hadoop: The Definitive Guide. Hadoop is a big ecosystem and you can't learn it quickly and might find it rather boring. I think most people learn Hadoop when they really need to. I wouldn't suggest trying to become a Hadoop expert. We look for this expertise more in our Data Engineer job title.

Machine Learning


Machine learning is quite important in data-science. It's more important than in science because in science, especially physics or astronomy, we know what the model is because we have an equation for it. You don't get to just make up a model (unless you're a particle theorist!). In industry, you will rarely come across a model based on physical laws and seldom care about the model in its own right. We usually just want to predict things well and any model which does that well is good enough for us.

Machine learning can be divided into three sections which you should probably learn in this order. The first is just a large set of mostly-disconnected tools for modeling data and making predictions. In this I'd include
  • Linear models including regularized versions like ridge-regression and lasso
  • Decision trees and random forests
  • Boosting and bagging
  • Standard feed-forward neural networks
These are all quite standard and are available in either R or Python (scikit-learn, for example) and lots of other places.
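For example, a random forest in scikit-learn takes only a few lines; the data below is synthetic, just to show the fit/predict pattern:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# synthetic data: five features, binary label driven by the first two
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# hold out the last 200 rows for a quick accuracy check
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X[:800], y[:800])
print(clf.score(X[800:], y[800:]))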

The second group is methods based on kernels. Kernels are probably not what you think. Kernel methods like support vector machines are some of the most useful tools and have wide applications for many different kinds of data. They are great for both classification and regression. As an aside, regression is a term that I did not see much in science but it's a useful term. It just means fitting a curve and does not imply using linear methods like I had assumed. Physicists do this all the time, but don't use that term as far as I know.

The math of kernel methods is quite beautiful and, luckily for you, has a lot of overlap with physics. They even use Lagrangians! I learned about kernel methods from a book with an unusual title Machine Learning Methods in the Environmental Sciences. The majority of the book is about machine learning not environmental sciences. 

The third group of machine learning techniques is graphical models. This includes Bayesian networks, log-linear models and Gaussian graphical models. I like the book Graphical Models with R because it also tells you about the R toolkits for using these methods. Most of the techniques of machine learning can be described in a graphical setting and if you really want to get into this, you might look at the text Probabilistic Graphical Models but that should probably wait.

My favorite ML book overall is the new encyclopedic tome by Kevin Murphy, Machine Learning: A Probabilistic Perspective. This thick book seems to cover just about everything in machine learning and is quite up to date. However, it moves quickly and you may want to start with a gentler introduction such as Bayesian Reasoning and Machine Learning by David Barber. Still, I think you should buy Murphy's book.

You will likely enjoy learning machine learning a lot more than learning Hadoop. However you can do just fine knowing the basics and can probably get by without knowing how everything actually works in detail. Mostly, you want to try out some of the tools which you will readily find in R or Python and have an idea about which methods work best with which problems. Decision trees, random forests and SVMs will probably be all you need in practice. Knowing how to generalize these algorithms, combine them and come up with new ones is master-class stuff.

Sketching and streaming algorithms and advanced data-structures


One of my favorite computing quotes is by Linus Torvalds: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships". Often I find that what a client really needs is not a fleet of Hadoop nodes on the cloud but a better data structure or better algorithms, and I can often get this "Big Data" project up and running just fine on my laptop. Personally I think Hadoop and parallel computing should come into play after optimizing algorithms and data-structures. The book Mining Massive Datasets, which is free and online, is a great place to start. I also like the course notes from Jelani Nelson, possibly Harvard's coolest professor (and someone I'd like to meet). Streaming algorithms are ones that can be computed with only one pass through the data. You should also know the term online learning, which refers to machine learning algorithms that work in a streaming way rather than a batch way from stored data. An example is stochastic gradient descent.
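Here is a minimal sketch of the online-learning idea: fitting a one-variable linear model with stochastic gradient descent in a single pass over a simulated stream. No library is assumed; in practice the stream would come from a file or a socket.

import random

# simulated stream of (x, y) pairs with y roughly 3x plus noise
def stream(n=100000):
    for _ in range(n):
        x = random.uniform(-1, 1)
        yield x, 3.0 * x + random.gauss(0, 0.1)

w, b = 0.0, 0.0          # model parameters
lr = 0.01                # learning rate

for x, y in stream():    # one pass; nothing is stored
    err = (w * x + b) - y
    w -= lr * err * x    # gradient step for squared error
    b -= lr * err

print(w, b)              # should end up near 3.0 and 0.0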

There are plenty of classic books on algorithms that you probably don't need to read but you should try to become familiar with many data-structures, particularly Bloom filters and tries. Information retrieval is a closely related field to machine learning. I highly recommend the following webpage on Non-standard data-structures in Python. I have used quite a few of these and found huge performance gains.
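To give a flavor of these data-structures, here is a toy Bloom filter in pure Python; real implementations choose the bit-array size and number of hashes from the expected item count and target false-positive rate.

import hashlib

class BloomFilter(object):
    """Toy Bloom filter: set membership with no false negatives,
    a tunable false-positive rate, and tiny memory use."""

    def __init__(self, size=10000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # derive several hash positions from one md5 digest per seed
        for i in range(self.num_hashes):
            h = hashlib.md5(("%d:%s" % (i, item)).encode("utf8")).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user-123")
print("user-123" in bf)   # True
print("user-999" in bf)   # almost certainly False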

Don't freak out


It might seem like learning all of this is going to take you 10 years and to master it all, it might. However, you don't need mastery of all of these things to become a productive data-scientist. You're probably a quick learner and can learn what you need in time to do what you need to do. That said, if you want to get a job at ThoughtWorks or Twitter or other highly competitive places, you should have a moderate understanding of much of this material. Having a PhD in physics is just not enough. 





  

Monday, February 17, 2014

Best practices for data-science projects

In the following, I'll explain a way in which a data-scientist can effectively work with an agile software delivery team. The ideals we want to promote include:

  • Keeping the team aware of what the DS is doing at a high level
  • Making sure the devs are involved with the DS, are doing some pair programming with him/her and are taking on much of the data-science related software development tasks for which they may be better suited anyway. 
  • Breaking data-science stories up into small pieces and managing them in a normal agile fashion
  • Keeping with normal development practices of automated testing and validation
  • Avoiding the situation of the DS's work becoming a black box that no one else on the team or the client understands. 
  • Avoiding the situation of the DS becoming a bottleneck for the team (i.e. Bus-factor One). 
  • Enabling the BAs to communicate the process and progress effectively to the client. 

We will assume that a project has been through the normal inception process and the goals of the project are well established and agreed upon by all. The whole team knows the problem that the DS will be trying to solve as well as problems of their own.

Data-science 101:


Before doing any data-science, it is important to explain the data-science process and terminology to the entire team. This document can be thought of as a step in that direction. Let's assume the project involves building either a predictive or prescriptive model for the client to better accomplish some goal. The whole team needs to understand terms such as the FOM, validation and testing methodology and how to write stories for analytics. Some time should be devoted to explaining how data-science is a less-linear process than standard development as many decisions can only be made after some analysis of the data has begun. It will often involve more backtracking and change of direction.

The Figure of Merit 


First, everyone needs to understand the goal and how progress towards that goal can be made quantitative. We can call this the Figure of Merit (FOM), a term I'll borrow from the language of scientific collaborations. An example might be the accuracy of a prediction. It might be terms such as precision and recall or the F-measure. It is also often necessary to start with a preliminary FOM and adapt it as we better understand the goals of the client or learn about other constraints. Regardless, at any point, we should have a formal definition of the FOM and stories should only be written with the goal of improving it. The client should understand the FOM and take an active part in creating and adapting it. It should not be merely a technical term understood by the developers.
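As one concrete example, if the FOM were the F-measure on a binary prediction task, it is nothing more mysterious than the following sketch; y_true and y_pred are hypothetical lists of known and predicted labels.

def f_measure(y_true, y_pred):
    """F1 score for binary labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.8

Having the FOM written down as a small, shared function like this is what lets everyone, including the client, agree on exactly what "better" means.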

Validation versus testing


All successful software developers now make testing a ubiquitous part of their coding practices. Unit tests are made for every function or method and integration tests are used for making sure parts of the codebase interact properly. Testing analytics code can and should be done as well; however, there are some key differences that need to be understood by all. There is also another similar idea called validation which should not be confused with testing.

To reduce confusion, we should reserve the word testing to indicate the normal kind of testing that developers do. You test whether a program works as expected by giving it inputs for which the desired outputs are known. It passes the test when it delivers these expected results.

Validation is the word that we should use to test the accuracy of a model. Unlike regular tests, this is not necessarily something that results in pass or fail. An example would be testing to see whether a model, say logistic-regression, makes accurate predictions of our target variable. More important than the actual value output is the trend in time as you continue to refactor and improve the model or add other models. You want this validation result to be stable or improve as you work on the model or introduce new ones. A bug in the code may manifest itself as a drop in the validation score. If you introduce a more sophisticated model and find that it has a lower validation score, this may indicate that you have incorrectly implemented the model or may indicate that you do not understand the problem as well as you thought. It might be wise, especially further along in the project, to require validation scores above a certain threshold in order to result in a successful build or at least a deployment.

Keeping a chart of validation score versus time is also another way of showing steady progress to the client. Running these validation tests, recording the results and recreating this chart should be an automated process triggered by a new checkin or build, just as automated testing is done. It may be useful to keep other metrics as well that might indicate a sudden change in behavior of the program. In many cases the validation score will be the same as the FOM; however, there are other cases where the model may be thought of as a subcomponent of the entire system and so the validation score may be different.

Different methods of testing in the exploratory phase


Much of the early code written for analytics is of an exploratory nature. The point of such code is to learn something. The code itself is not to be thought of as a deliverable. For this reason, writing it in TDD fashion is probably not advisable as it is important to move this phase along as fast as possible. 

In addition, data-scientists have usually developed some other testing methodologies that may seem foreign to software developers, especially those used to working in a language like Java. As data science code is usually written in a rapid-development, high-productivity language such as Python or R, it lends itself to something we may call REPL-driven development. For example, a python developer may test the following line in the REPL or the shell to see if it works.

x=range(100)
y=[i-10 for i in x]
z=[log(i) for i in y if i > 0]
at which point, they will get an error
NameError: name 'log' is not defined
reminding them to add the following to the top of the program
from math import log
after which the code works as expected. This method of trying code in the REPL as you write is a fairly effective way of eliminating bugs as you write code. A Java dev would likely discover these kinds of bugs when they try to compile and occasionally will not discover them until runtime. If they are practicing TDD, they might catch them from a test failure.

Similarly, a DS working in these kinds of systems will often test code using visual tools. After writing code that is supposed to create an array with a saw-tooth-like pattern, the DS can usually paste a few lines in the REPL and plot the array, which will confirm whether it looks as expected or otherwise. Another example would be code that opens a csv file, parses it and streams it line by line. Once this short script is written, it can be run in the REPL and a few lines can be printed out to check to see if it looks as expected.

These techniques can be used to write code rapidly, testing as you go to make sure the code runs and behaves as expected. While these techniques are not a replacement for unit tests, and won't catch every bug, the speed of development often offsets the lack of quality checks for exploratory code development. The ability to move as quickly as possible and try out many ideas is often the best way to approach data-science especially in the early phases.

It is common for such scripts to evolve into a more standard piece of utility code and when this happens, an effort should be made to unit test it to give us more confidence in the reliability of this tool. More often than not, however, the script will teach us something and lead us off in a different direction. For example, the streaming program shows us that the data will not be useful for our purposes and so the script will no longer be needed.

While there are many cases where this style of programming may make sense, developers should watch out for cases where more standard TDD should be applied. One example would be cases for which a unit test would be simple to write and so not doing so would not result in much of a slowdown. Other examples would be when such a script has become a dependency for many other subsequent programs. Another would be when the code is complex or they have low confidence in its correctness and it might require refactoring. True refactoring without unit tests is not really possible.

The validation pipeline


Once the team understands the FOM and other validation scores, there is no reason why the code for implementing this can't be completed by developers rather than the DS. Having the developers take on this part has many advantages. One is that it frees up the data-scientist to concentrate on researching and developing the model. Since these tasks can happen in parallel, the validation pipeline may be completed by the time the data-scientist has a model to test.

If the validation framework is completed before a model is ready, there are other test models that can be created to test the functioning and performance of the validation pipeline. One test model is one that makes a random guess at the results. Such a model is typically easy to write and once it exists, it can be validated using the validation code. One would expect it to show a low validation score. Another model can be created for the opposite extreme. This is a model we might call the cheat model. This model is one where we use the known result from the validation data to cheat and give the right answer. Obviously, one would expect this model to perform as well as possible. Even if such models seem useless, there are reasons to create and validate them. If one learns that the cheat model doesn't actually perform well, we know there must be a bug somewhere in the validation pipeline, and thus we test the validation code. This exercise of testing both of these models can often lead to insights about changing the FOM. In addition, it may point out performance problems that can be dealt with before having to test a real model.
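A sketch of the idea: both baseline models plug into the same validation function that the real model will eventually use, so the pipeline itself gets exercised before any real model exists. The validate function and data here are hypothetical stand-ins for the team's actual pipeline.

import random

def random_model(features, true_labels):
    # baseline: guess a class at random, ignoring the labels; should score near chance
    return [random.choice([0, 1]) for _ in features]

def cheat_model(features, true_labels):
    # opposite extreme: peek at the answers; should score near perfect
    return list(true_labels)

def validate(model, features, true_labels):
    # stand-in for the validation pipeline: fraction of correct predictions
    predictions = model(features, true_labels)
    correct = sum(1 for p, t in zip(predictions, true_labels) if p == t)
    return correct / float(len(true_labels))

features = [[0.1], [0.4], [0.9], [0.7]]   # made-up validation data
labels = [0, 0, 1, 1]
print("random:", validate(random_model, features, labels))  # around 0.5
print("cheat: ", validate(cheat_model, features, labels))   # 1.0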

The iterative process of creating models 


While the rest of the team is creating the validation pipeline, the DS is doing research on the problem and developing a real model. Having the validation pipeline finished first is similar to TDD and puts pressure on the DS to deliver a simple model as quickly as possible before trying to create a better one. Once a first model is delivered, it can be validated. It should perform better than the random model and less well than the cheat model. At this point, the team has some result to show to the client. They can demo the validation pipeline and show how a first simple model performs. In some cases they may realize that the simple model works better than expected and is good enough for the client. Often there are other sources of known error such that the usefulness of a model saturates before its validation score does.

At this point, we also have more insight into the FOM that was chosen and may find that a high FOM actually misses some other important constraints or desires. For example, we might be making a  personalized recommendation system and we chose the FOM to be simply the accuracy of predicting future purchases. We might find that our model gets a decent FOM at the same time as realizing that everyone has been recommended the same item, a particularly popular one. At this point the team and client may realize that the FOM should include some measure of diversity in the recommendations. This change in the FOM will drive a change in the model.  Now the team and DS either move on to another more pressing problem or try to develop a better model (the usual case).

Feature selection and computation


Most models involve the need to construct features as inputs to the model. Features can be thought of as data elements that are calculated from other raw-data measurements. Different models may utilize different features.

For example, the raw data may be a file containing each individual purchase by each customer. The model may require the total spent by each customer in 5 different categories. These five numbers for each customer are the features that need to be computed. Choosing good features is often as important as or more important than choosing a good model. The work to read the raw data and compute the features is another task that the devs on the team can be involved in. For very large data-sets, this might involve using MapReduce in Hadoop, Spark or some aggregation feature provided by a database. The need to create features may impact technology choices by the team. 
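For a data-set that fits in memory, that particular feature computation might look like the following pandas sketch; the file name and column names are hypothetical.

import pandas as pd

# raw data: one row per purchase, with customer_id, category and amount columns
purchases = pd.read_csv("purchases.csv")

# features: total spent by each customer in each category
features = purchases.pivot_table(index="customer_id",
                                 columns="category",
                                 values="amount",
                                 aggfunc="sum").fillna(0)
print(features.head())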

Learning a Model versus Scoring


Developing a model involves a process referred to as fitting or learning, where we tune some parameters until they make good predictions on a set that we call the training sample. A simple model, like a linear model with a fixed number of parameters, is usually easy to fit and usually generalizes well to other data.

Some models are very general and have the ability to increase the number of parameters as the amount of data increases. These are referred to as non-parametric models. Examples are Support Vector Machines. Other models may be parametric in the sense that the number of parameters is chosen ahead of time by the modeler, which still leaves an ambiguity in how many parameters to choose. An example is the Random Forest algorithm, where you have to at least specify the number of trees.

In either case, the modeler wants to avoid a situation called over-fitting. Over-fitting is the situation where the model fits the training data too well or, in the extreme, exactly. Though this may seem like a good thing, it usually results in a situation where the model generalizes poorly, so that it does not work well on new data.

Over-fitting is avoided by using another sample called the validation sample. While the parameters are still chosen to best fit the training data, the number of parameters or, equivalently, the expressive freedom of the model is limited by ensuring that it still works well on this validation sample. There is always some trade-off between fitting the training data better and generalizing better to new data. This is called the bias-variance tradeoff.
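A small sketch of how this shows up in practice, using synthetic data and polynomial fits of increasing degree: training error falls as the model gets more expressive, while validation error typically bottoms out and then gets worse.

import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.3, 60)    # noisy synthetic data

x_train, y_train = x[:30], y[:30]             # training sample
x_val, y_val = x[30:], y[30:]                 # validation sample

for degree in (1, 3, 7, 11):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(degree, round(train_err, 3), round(val_err, 3))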

Balancing Trade-offs in data-science 


The bias-variance tradeoff described above is just one of many trade-offs involved in data-science. Other trade-offs are model performance versus more practical matters. Such practical matters include time to development, computational difficulty, ease of maintenance, ease of integration, ease of understanding and flexibility or robustness in cases of data-set shift.   All of these other issues need to be considered by the DS and the rest of the team in close communication with the client.

Thursday, February 13, 2014

Big Data Is a Small Part of the Real Data Science Revolution

"Big Data" continues to be the biggest buzz word in business and yet I feel that most people still don't see it in the right light. As someone who has analyzed data for years, I find it very strange that there is so much focus on the size of the data, methods for storing it and transforming it, and so little on the data-analysis process itself where the real revolution is occurring. In this article, I'll argue that the buzz should be about the rising productivity of the data-scientist. The causes are many and data size and Big Data tools have little to do with it. 


Getting a handle on data and the activity of data-analysis 


Data-analysis is not so simple and requires a thorough understanding of the experimental method and statistics. Unfortunately, most people don't understand statistical analysis and the data analysis process. It isn't about mechanically churning data into information. It is a more abstract process and involves much thinking and creativity. 

Data by itself is meaningless. It's just ones and zeros. It actually isn't even that. Binary data is really just a sequence of things that can be separated into two classes and kept there with high fidelity. That's it. Data only becomes useful when we also have knowledge of the process from which it was produced and a notion of how we can compute on it to reveal something else that is interpretable and useful to us. 

The algorithm used for computing and the method of interpreting the results are both inseparably linked to the data itself. The algorithm is itself an abstraction. What we really have is a program which is itself stored as binary data. So really computing with data is a process of taking two pieces of data, the data and the program, and letting them interact in a way that we control to create other data that humans can interpret and use.

But what is this program to manipulate the data? Humans create this as a way of representing a model of the world; the outside world that a computer knows nothing about. A model is a set of constraints that we set on a set of abstractions (mental concepts) and the data is the physical representation of those abstractions. The program is the physical representation of those constraints.

For example, the place where you actually live is the same as the place I will end up if I follow Google's directions based on the text of your address that I have stored somewhere as data. Those things are constrained to be the same in our model. That assumption is what gives meaning to the written address.

If I go to Amazon.com and browse a particular set of books, it is almost always the case that I am interested in those books and considering buying one or more of them. I know the traces that I leave behind in Amazon's log files are not independent from my preference of books. I have a model, from introspection of my own mind, that these things are related and that this is probably the case for most people. Thus, if I am tasked with writing a recommendation system for Amazon, I have a model of how  to relate people's book preferences to the traces in the data that they leave behind.

Thus every computation is based on some model and that model is not encoded anywhere in the data nor is it encoded in the program. It is encoded in the human mind. Computation, the economical kind that we actually do, is thus an interaction between the human mind and a machine.

The machine is needed only because humans are bad at storing and retrieving some kinds of data and are more error prone in carrying out logical operations. This is really the only thing that computers do and they do it very well.

What is information? A good enough definition is that the information is the part of the data that is relevant or related to the abstraction being computed. The data that is not related can be called noise. Note that data versus noise is not intrinsic to the data. Data is information only as it pertains to the model we have. These 100 bits of data are related to where you live. The next 100 bits of data are not. They are however relevant to where someone else lives but for my model and my task, they are noise.

So data analysis starts with a human need (I need to ship you your new book) and a model for how I can do this (write the address on an envelope and drop in the blue box). Implicit in that model is that the bits that I turned into text on my screen is the necessary information that I need to write on the envelope to accomplish this goal. The computer fetches that information as requested even though it has no idea what you intend to do with it or what it means to you.

More complicated models are really no different from this. They just involve more data, perhaps a more complicated model of the world and perhaps more computation. Consider predicting the weather. This is a complicated model. It involves physics, measurement devices and the relation between the abstraction of weather and the traces left behind on these devices. Physics tells us which observables are relevant to changes in the weather. Engineering tells us how to create data closely related to those variables. Computers just help us keep a record of all that data and help us perform a complicated calculation.

A problem like this also has the complication that the data trace left behind is slightly ambiguous. There is measurement noise that is impossible to separate from the data. We don't really measure the density and pressure of every point in space. We measure estimates of this at a few points and make assumptions about the level of noise in the measurements and also about how to interpolate between the sampled points.

Despite this ambiguity, we have an excellent mathematical tool for dealing with it. We know that there are many different possibilities for how the weather will evolve. We know that we are not going to make correct predictions all the time. However, we have also learned from experience that these predictions are much better than randomly guessing the weather or eyeballing the sky. That fact validates that our assumptions are likely correct.

We conceptualize the reason why this works by using statistical reasoning. Many different scenarios are  possible but we will choose the scenario that is most consistent with the data. That really means that we choose the ones for which, if they were true, the data that we measured would not be terribly unlikely. We assume that likely things will happen rather than unlikely things and this has proven to be an effective strategy for operating in the world.

For example, yes, there could be a hurricane tomorrow but that would be unlikely because there has never been a hurricane that did not show up on the radar the day before. Our model assumes that the physics in the future is not different from the past. Couldn't the random fluctuations in the measurement devices, by some unlucky chance, result in a radar image that doesn't look like a hurricane? Yes, that is theoretically possible but requires a very unlikely coincidence. We assume that the measurement errors are random and uncorrelated with each other and also uncorrelated with whatever is actually going on with the weather.

Now that we have spent some time talking about what data is and what it is not and the importance of the human activity of modeling we can better understand what is wrong with so much that is being said about the Big Data movement.

Big Data is a small part of the larger data-science revolution


Many people seem to be making the case that big amounts of data somehow are going to change everything. Others are a little more careful but still seem to say that big data plus all these great tools for managing it are going to change everything. Some have even asked whether the role of data scientist will soon be replaced by the algorithms themselves.

Is big data plus Hadoop plus Mahout machine learning libraries going to change everything? Absolutely not. A refutation of this idea is not difficult once you realize that many excellent data science projects can be done with small amounts of data on your laptop. The fact that this has been true for decades and did not result in much data-science being done in the business world means that the limiting reagent in this reaction has not been the existence of more data or the tools for storing and transforming it.

It has much more to do with the lack of mathematical and statistical training of people working in the business world. The thing missing has been the skill to model the world in a mathematical way. People with those skills have remained in science and engineering or perhaps been lured into finance but largely had not been widely hired into typical businesses to perform advanced statistical modeling.

So why now? What is different recently that has caused data-science to rise to prominence? I don't believe it is that, all of a sudden, we have big data, nor do I think it is because of Big Data technology tools like Hadoop and NoSQL.

The data-science productivity boom


I believe it is mostly driven by the higher productivity of the kind of workers that we now call data-scientists. I know this because I have done this for a while and I know that I am much more productive than I used to be, and it's not just the result of more experience. This productivity can be attributed to many things and most of them have nothing to do with Big Data.

One huge driver is the availability of open source data analysis platforms like R and Python's scientific stack. When I first started doing analytics in astronomy in the 90s, we worked in Tcl or Perl and then relied on low-level languages like C and Fortran when we needed better performance. For plotting some people used gnuplot or pgplot or super mongo. This was all duct-taped together with scripts, often using IO as a medium for communicating between programs.

When needing some advanced algorithm, you would pull out a book like Numerical Recipes or Knuth and copy the code into your editor or translate it to your preferred language. This took a long time, was error prone and people were still limited to the books they had on their desks.

So what's different today? The internet, obviously, is the biggest difference. We can search for mathematical ideas and algorithms and libraries using Google. Wikipedia and Google are fantastic tools for learning math and mathematical methods, especially when you are looking for connections between ideas.

Open source repositories like GitHub are an enormous boon for data-science productivity. Consolidation in the tools being used and the development of communities around data analytics help enormously with sharing code and ideas.

The continual advance of computing hardware technology has of course been a wind at our back as well.  But before "Big Data" tools we had other big data tools. HDFS is not the first distributed filesystem. We used GFS at Fermilab in the 90s. Hadoop is a useful framework for doing batch processing with data locality awareness but it isn't that much different from what we had built in the past for astronomical data analysis with perl scripts and NFS mounted disks. It's a more general framework and can help avoid repeated work on subsequent projects of a similar nature but it doesn't truly enable anything new.

To sum up. What is different today is that I can learn faster by utilizing all the resources on the internet. I can avoid reinventing the wheel by downloading well tested libraries for analytics. I can spend more time working in a higher-level productivity language like Python or R rather than chasing memory leaks in low level C code. I can communicate with other data scientists online, notably at sites like Stack Overflow, Math Overflow etc if I can't answer my question simply by search. There is much more communication between the related fields of statistics, computer science, math and the physical sciences. This emerging interdisciplinary activity and applications to problems outside of academia is being called data-science.

So productivity is the key. I'd estimate that my productivity is three times higher than 10 years ago. That means that 10 years ago, a company would have to pay me three times more money to accomplish the same task. While I'd love to think that I was worth that kind of pay, I suspect that it is more likely that I was not.

While all of these things developed gradually over the past 15 years or so, there comes a point where productivity is high enough to warrant the creation of new jobs, such as the data-scientist. Crossing that point probably only happened a few years back for most of today's data-scientists and productivity increases will continue. With any emerging industry, momentum gathers, creating feedback loops. VCs start funding new analytics startups and we get things like MongoDB and Tableau. The media gets involved and talks up how data-science and Big Data are about to change everything. All of this helps to drive more activity in creating more productivity enhancing tools and services. Pay and stature for data-scientists rise, attracting people that are unhappy in academic appointments. All is self-reinforcing ... at least for a while.

So what is Big anyhow?


Where does this leave "Big Data"? Is it all a farce? Certainly not if you apply the label to the right things. Some companies like Facebook, Google, Netflix etc really, really do have big data. By that I mean that the problem of working at those scales really is nothing like doing data analysis on your laptop. Many of the tools for working with data of that size really are extremely important to them. Still, the fact is that most companies are nothing like that. With some prudent selection, sampling, compression and other tricks, you can still usually fit a company's main data-set on your laptop. We have terabyte hard drives now and 16 GB of memory. If not, you can spin up a few servers on the cloud and work there. This really isn't any different from the past. It is easier and cheaper but not really much different.

The most important advances in Big Data research, in my opinion, are the advances happening in the areas of processing data streams, data compression and dimensionality reduction. Hadoop by itself is really just a tool or framework for doing simple operations on large data-sets in a batch mode. Complex calculations are still quite complex and time consuming to code. The productivity of working in Hadoop, even with higher level interfaces, is still nowhere near that of working with a smaller data-set locally. And the fact is that for 99% of analyses it is not the right tool, or at least not the first tool you should reach for.

Advances in machine learning are certainly major drivers of data-science productivity, though this too isn't just applicable to big data. Machine learning's main use case is problems of high data-richness exhibiting very complex structure, such as automated handwriting recognition and facial recognition.

Many have claimed that real Big Data is coming in the form of sensor data, the internet of things etc, and this certainly seems to be the case. For this we should look to fields such as astronomy and physics, which have been dealing with large amounts of sensor data for years, long before the arrival of Hadoop and the Big Data toolkit. The key to this, as it has been before, will likely be better algorithms, smart filtering and triggering mechanisms, and not the brute force storage and transforming of enormous, information-poor data-sets which seems to be the modus operandi of current Big Data platforms.

Data reduction and information extraction 


Big Data is indeed a big deal but I think it is less a big deal than the emergence of data-science as a profession and I don't think these things should be conflated. Furthermore, as hinted at above, extracting more information from more and more data requires foremost the ability to construct more expressive and accurate mathematical models and the skills and tools to quickly turn these into functioning programs.

The fact that you only need a small subset of the data for most statistical analyses is not well understood by people without much background in statistics and experimental science. The amount of data required to constrain a given model is in fact calculable before you even see the data. When I worked in astronomy and helped plan new NASA satellites, this was much of my occupation. That's because gathering data by building and launching a satellite is expensive and so you only propose a mission expensive enough to gather the minimal amount of data to answer your question. The math to do this is called Bayesian statistics and without a basic understanding of this you can't reason about the amount of data required for a given model or know when your data set is large enough to benefit from more detailed modeling.

When you do have your hands on a big data set, your goal as a data-scientist is to reduce the data. The term data reduction, while ubiquitous in scientific settings, is something I rarely hear mentioned by business people. In short, it means compressing down the data size by extracting the information and casting aside the rest. A physics experiment like the LHC in Geneva takes petabytes of data per second, quickly looks for interesting bits to keep and immediately deletes the rest. It is only simplifying a little to say that at the end of the day they are only interested in about a single byte of data. Does the Higgs Boson exist? That's a single bit of data. What is its mass to 10% accuracy? That's another few bits of information; data reduction par excellence.

Now if data is cheap and just lying around the company anyway, gathering more than you need might not be that big of a deal. But if you get 10x more data than you need, you're just slowing down the computation and creating more work for yourself. The key thing to understand is that for a fixed model, you saturate its usefulness rather quickly. To simplify a bit, the inverse of the square root of N is the usual scaling of statistical error. If your one-parameter model is only expected to be 10% accurate, you probably get there around N=100. That terabyte of data on HDFS is just not going to help.
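A quick numerical illustration of that 1/sqrt(N) scaling, estimating a single mean from increasingly large synthetic samples:

import numpy as np

rng = np.random.RandomState(0)
true_mean = 1.0

for n in (100, 10000, 1000000):
    sample = rng.normal(true_mean, 1.0, n)
    # the standard error of the mean shrinks like 1/sqrt(n):
    # 100x more data only buys about one extra decimal place of accuracy
    print(n, abs(sample.mean() - true_mean), 1.0 / np.sqrt(n))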

So extracting more and more useful information from more and more data requires a more accurate model. That can mean more variables in a linear model or a completely different model with much higher complexity and expressivity. The Big Data tools don't help you here. You simply have to think. You have to learn advanced statistics. Fortunately, thinking about these problems, learning useful methods and borrowing tools for dealing with them is much easier today than it used to be. Again, this shows that cultural cohesion, more effective learning and communication channels and tools for dealing with complexity rather than size are more important than tools for doing simple manipulations on enormous data-sets.

This new blog

This blog is a place to write about data science.