Tuesday, March 25, 2014

Transitioning to Data-science from physics, astronomy and other sciences


This document is meant to help scientists transition into the field of data-science. At ThoughtWorks, I interview a lot of data-science candidates and get a lot of applications from people graduating with science PhDs or finishing postdocs. These people are usually very smart, know quite a bit about analyzing data and usually have sufficient math skills for the job. However, they typically have gaps in several other areas the job requires. They will still find job opportunities in industry and may very well become data-scientists, but they probably won't attract the interest of ThoughtWorks and other highly competitive places.

The good news is that these scientists can typically pick up most of these skills in less than a year. So if you want to get a data-science job at a very competitive place, you can, but you gotta prepare yourself. The time to get working on these things is while you still have some time left in your present position, not after your postdoc has finished. Plus, learning this stuff is awesome and will make you way more productive.

What you might look like


If you're like most scientists that we interview or consider interviewing, you can program in C, C++, Perl, perhaps Fortran and maybe some Python. You might also use IDL, Matlab or Mathematica. You are pretty good with the Unix toolkit (grep, awk, sed etc.). You never took a programming class and probably never took a stats class; you just taught yourself. This is about the minimum of what we would consider "programming skills", and unless you bring some other fantastic skills, you probably wouldn't make it through many interviews (candidates who get hired at TW go through about eight of them).

What we like to see


We like to see a lot more than that, and real evidence that you are interested in programming and programming languages. Python and R are by far the two most important languages for data-science. Becoming fairly expert in one of these languages is probably the most important thing you can do. That said, we like to see that people have written web-apps in Ruby, in Python with Django, or even in PHP, or have used Java and possibly some of the newer languages for the JVM. I don't like Java. Nobody seems to like Java these days, but it's very hard to avoid in industry, and even with the cool languages like Clojure or Scala, you need to know something about the JVM ecosystem.

What we don't expect to see


We don't expect you to be a professional developer or to know much about testing, build frameworks and the other professional dev practices. It would be great, but it is not expected. We have another title for people similar to data-scientists but closer to developers or database admins: the Data-engineer. Such a person is an expert at Hadoop, enterprise-scale Big Data tools and the like. These people were usually developers in the past, not scientists. As a data-scientist, you should know something about these toolsets, but you don't need to be an expert (though that would help immensely).

The modern data-science toolkit

The following are the skills that a scientist needs to get some practice with.


Programming Languages


A data scientist needs to be very good at either Python or R and should at least know a little about the other. You should be able to read in a csv file of data and make histograms and plots very quickly. You should be able to fit various models to data. You can use Google to look up forgotten commands (we all do), but it should not take you 30 minutes of searching. If you're a scientist, you are more likely to know Python, but at least give R a look. You can do many powerful things in R with just a little knowledge, and it is a language written to be used by scientists, so you're in luck. Python is a much more general language and generally feels like it was written to be used by people who can already program fairly well.
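To make that baseline concrete, here is roughly the level of fluency I mean, as a minimal Python sketch. The file name and column names are made up for illustration:

    # Read a csv, plot a histogram, fit a simple model -- the everyday basics.
    # "measurements.csv" and its columns "x" and "y" are hypothetical.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("measurements.csv")

    # Quick histogram of one column
    df["x"].hist(bins=50)
    plt.xlabel("x")
    plt.savefig("x_hist.png")

    # Fit a least-squares line to y vs. x
    slope, intercept = np.polyfit(df["x"], df["y"], deg=1)
    print("fit: y = %.3f * x + %.3f" % (slope, intercept))

If producing something like this takes you more than a few minutes, that is the gap to close first.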

Vital Python packages to know: NumPy, SciPy, matplotlib
Good Python packages to know: pandas, Numba

Vital R packages to know: data.table, ggplot2. RStudio IDE is very useful.
Good R packages to know: RMySQL, anything written by Hadley Wickham, Shiny.

Other data-sciency languages


There are a few other languages that are up-and-coming in data-science. Some places will already be using them. At ThoughtWorks they are certainly bonuses.
  • Clojure - A Lisp, probably the coolest thing on the JVM
  • Scala - Some love it, some think it is disastrously over-complicated. Used by Twitter. Scala is actually very easy to get going with, but probably very difficult to master.
  • Julia - The Julia language is awesome but not very mature; it's probably not yet ready to take over, but it has a decent chance of being the next big thing. Plays well with Python.
Knowing other languages like C/C++ and especially Java is very useful when you need speed and can't find a library to call from Python or R.

Databases


You should be pretty familiar with SQL and the relational databases that use it. MySQL and Postgres seem to be the most popular open source ones.  You should be able to connect to these programmatically from R/Python and do some useful things with them. 
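To be concrete, here is a minimal sketch of the connect-query-fetch pattern from Python. It uses the standard-library sqlite3 module so it runs without a server; the MySQL and Postgres drivers (MySQLdb, psycopg2) follow the same Python DB-API shape:

    # Create a table, insert a row, query it back. The table and values
    # are made up; swap the connect() call for your real database.
    import sqlite3

    conn = sqlite3.connect("example.db")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS runs (id INTEGER, value REAL)")
    cur.execute("INSERT INTO runs VALUES (?, ?)", (1, 3.14))
    conn.commit()

    cur.execute("SELECT value FROM runs WHERE id = ?", (1,))
    print(cur.fetchall())
    conn.close()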

But SQL is not the end of the story any more. So-called NoSQL databases have become very important; they make horizontal scaling, fault tolerance and interaction with massive data-sets possible. I recommend people buy and study Seven Databases in Seven Weeks and NoSQL Distilled, the latter co-authored by our own Martin Fowler. The following is a list of what I think are the most important for data-science.

  • MongoDB - A document database that can also be used as a key-value store
  • HBase - An industrial-strength columnar database, often used with Hadoop
  • Cassandra - Similar to HBase but with some different characteristics.
  • Redis - An in-memory database for fast key-value access. Supports many data-structures as values.
  • Neo4j - A graph database. Less commonly used, but still worth knowing about and great for the right kind of problems.
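As a taste of how simple the key-value model is, here is a minimal Redis session, assuming a Redis server running on localhost and the redis-py client (the key names are made up):

    # Store and fetch values by key; Redis also supports lists, hashes,
    # sets and sorted sets as values. Assumes redis-py and a local server.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.set("user:42:name", "ada")            # plain key-value
    r.lpush("recent_logins", "user:42")     # a list as the value
    print(r.get("user:42:name"))
    print(r.lrange("recent_logins", 0, -1))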


Hadoop and its ecosystem


Hadoop is not a database but rather a framework for doing large batch-style computations with MapReduce on a cluster of distributed nodes. Hadoop gets so much attention that you'd think Big Data and data-science were mostly about Hadoop. I am sure there are plenty of jobs where the data-scientists do nothing but write Hadoop code, but you don't want those jobs. Again, we reserve the title Data-Engineer for people who are Hadoop experts.

Hadoop, in short, is a system for shipping the code to the data instead of the other way around, which has obvious benefits. Java is good for this because you can ship the compiled classes and they will just run on the JVM of each node, and it's also pretty fast. Hadoop being written in Java and running on the JVM is the main reason why data-scientists can't avoid the JVM. You can write Java MapReduce programs if you're learning Hadoop or otherwise want to decrease your productivity, but most people choose some higher-level way of interfacing with Hadoop. The following is a list of ways of doing this that seem popular.

  • Hadoop streaming - The streaming interface allows you to write your code in any language: Python, Perl etc. You just read from stdin and write to stdout, like you do with unix tools (see the word-count sketch after this list). This is an easier, gentler way to approach Hadoop but has some performance tradeoffs.
  • Cascading - This is a Java interface that provides a much nicer API. Used by other methods as well. 
  • Cascalog - Clojure library built on cascading. Great if you like Clojure. 
  • Scalding - Twitter's open source interface built on cascading in Scala. Great if you like Scala. 
  • Pig - A new language for making MapReduce queries. Perhaps falling out of favor.  
  • Hive - Lets you write SQL on Hadoop, with some features added and some missing.
  • Hue - An app that bundles many of these Hadoop tools with a nice GUI interface.
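Here is the classic word-count example for Hadoop streaming, as promised above: two small Python scripts that read stdin and write stdout. You can test them locally with an ordinary unix pipeline before going near a cluster:

    # mapper.py -- emit "word<TAB>1" for every word on stdin.
    # Test locally: cat input.txt | python mapper.py | sort | python reducer.py
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # reducer.py -- sum the counts for each word. Hadoop streaming sorts
    # the mapper output by key, so equal words arrive on consecutive lines.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        if not line.strip():
            continue
        word, n = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(n)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))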
There are plenty of other pieces to the Hadoop puzzle. The best reference is Hadoop: The Definitive Guide. Hadoop is a big ecosystem; you can't learn it quickly, and you might find it rather boring. I think most people learn Hadoop when they really need to. I wouldn't suggest trying to become a Hadoop expert; we look for that expertise more under our Data-Engineer job title.

Machine Learning


Machine learning is quite important in data-science. It's more important than in science because in science, especially physics or astronomy, we know what the model is: we have an equation for it. You don't get to just make up a model (unless you're a particle theorist!). In industry, you will rarely come across a model based on physical laws and will seldom care about the model in its own right. We usually just want to predict things well, and any model which does that is good enough for us.

Machine learning can be divided into three sections, which you should probably learn in this order. The first is just a large set of mostly-disconnected tools for modeling data and making predictions. In this I'd include:
  • Linear models including regularized versions like ridge-regression and lasso
  • Decision trees and random forests
  • Boosting and bagging
  • Standard feed-forward neural networks
These are all quite standard and are available in both R and Python (scikit-learn, for example) and lots of other places.
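To show how little code this takes, here is a minimal scikit-learn sketch that fits a random forest to synthetic data and checks held-out accuracy. Every scikit-learn model follows the same fit/predict interface, so swapping in another method from the list above is essentially a one-line change:

    # Fit a random forest on a synthetic classification problem and
    # report accuracy on a held-out test set.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("held-out accuracy: %.3f" % model.score(X_test, y_test))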

The second group is methods based on kernels. Kernels are probably not what you think. Kernel methods like support vector machines are some of the most useful tools and have wide applications to many different kinds of data. They are great for both classification and regression. As an aside, "regression" is a term that I did not see much in science, but it's a useful one: it just means fitting a curve and does not imply using linear methods, as I had assumed. Physicists do this all the time but, as far as I know, don't use the term.

The math of kernel methods is quite beautiful and, luckily for you, has a lot of overlap with physics. They even use Lagrangians! I learned about kernel methods from a book with an unusual title, Machine Learning Methods in the Environmental Sciences. The majority of the book is about machine learning, not environmental sciences.
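In code, though, the Lagrangian is hidden away. Here is a minimal sketch of an RBF-kernel support vector machine via scikit-learn, on a synthetic two-class problem; C and gamma are the knobs you would normally tune by cross-validation:

    # The RBF kernel lets a linear algorithm fit a nonlinear boundary.
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X, y)
    print("training accuracy: %.3f" % clf.score(X, y))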

The third group of machine-learning techniques is graphical models. This includes Bayesian networks, log-linear models and Gaussian graphical models. I like the book Graphical Models with R because it also tells you about the R toolkits for using these methods. Most of the techniques of machine learning can be described in a graphical setting, and if you really want to get into this, you might look at the text Probabilistic Graphical Models, but that should probably wait.

My favorite ML book overall is the new encyclopedic tome by Kevin Murphy, Machine Learning: A Probabilistic Perspective. This thick book seems to cover just about everything in machine learning and is quite up to date. However, it moves quickly, and you may want to start with a gentler introduction such as Bayesian Reasoning and Machine Learning by David Barber. Still, I think you should buy Murphy's book.

You will likely enjoy learning machine learning a lot more than learning Hadoop. However, you can do just fine knowing the basics and can probably get by without knowing how everything actually works in detail. Mostly, you want to try out some of the tools, which you will readily find in R or Python, and have an idea about which methods work best with which problems. Decision trees, random forests and SVMs will probably be all you need in practice. Knowing how to generalize these algorithms, combine them and come up with new ones is master-class stuff.

Sketching and streaming algorithms and advanced data-structures


One of my favorite computing quotes is from Linus Torvalds: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships." Often I find that what a client really needs is not a fleet of Hadoop nodes in the cloud but better data-structures or better algorithms, and I can often get their "Big Data" project up and running just fine on my laptop. Personally, I think Hadoop and parallel computing should come into play only after optimizing algorithms and data-structures. The book Mining Massive Datasets, which is free and online, is a great place to start. I also like the course notes from Jelani Nelson, possibly Harvard's coolest professor (and someone I'd like to meet). Streaming algorithms are ones that can be computed with only one pass through the data. You should also know the term online learning, which refers to machine-learning algorithms that work in a streaming way rather than in a batch over stored data. An example is stochastic gradient descent.
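To make "online" concrete, here is a minimal sketch using scikit-learn's SGDClassifier, whose partial_fit method consumes data one chunk at a time, so only one chunk ever sits in memory. The chunks below are synthetic stand-ins for reads off disk:

    # Train a linear classifier by stochastic gradient descent, one
    # chunk at a time, as if streaming a file too big to load at once.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    model = SGDClassifier()
    classes = np.array([0, 1])

    for _ in range(100):  # pretend each chunk came off disk
        X = rng.randn(200, 10)
        y = (X[:, 0] + 0.1 * rng.randn(200) > 0).astype(int)
        model.partial_fit(X, y, classes=classes)

    print("weight on the informative feature: %.2f" % model.coef_[0, 0])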

There are plenty of classic books on algorithms that you probably don't need to read, but you should try to become familiar with many data-structures, particularly Bloom filters and tries. Information retrieval is a field closely related to machine learning. I highly recommend the following webpage on non-standard data-structures in Python. I have used quite a few of these and found huge performance gains.
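To show why Bloom filters earn their keep, here is a toy implementation in a few lines of Python. Real libraries are more careful about hashing and sizing, but the idea is exactly this: a bit array plus a few hash functions gives set membership in constant memory, at the cost of occasional false positives (never false negatives):

    # A toy Bloom filter: add() sets k bit positions per item; a lookup
    # says "maybe present" only if all k bits are set.
    import hashlib

    class BloomFilter(object):
        def __init__(self, size=10000, num_hashes=4):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = [False] * size

        def _positions(self, item):
            for i in range(self.num_hashes):
                h = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
                yield int(h, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def __contains__(self, item):
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("hadoop")
    print("hadoop" in bf)  # True
    print("julia" in bf)   # almost certainly False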

Don't freak out


It might seem like learning all of this is going to take you 10 years, and to master it all, it might. However, you don't need mastery of all of these things to become a productive data-scientist. You're probably a quick learner and can learn what you need in time to do what you need to do. That said, if you want to get a job at ThoughtWorks or Twitter or other highly competitive places, you should have a moderate understanding of much of this material. Having a PhD in physics is just not enough.