Data Science Musings: May 2014

The Economist has an article out, "The backlash against big data," pointing out recent reports and comments criticizing the concept.

“BOLLOCKS”, says a Cambridge professor. “Hubris,” write researchers at Harvard. “Big data is bullshit,” proclaims Obama’s reelection chief number-cruncher. A few years ago almost no one had heard of “big data”. Today it’s hard to avoid—and as a result, the digerati love to condemn it. Wired, Time, Harvard Business Review and other publications are falling over themselves to dance on its grave. “Big data: are we making a big mistake?,” asks the Financial Times. “Eight (No, Nine!) Problems with Big Data,” says the New York Times. What explains the big-data backlash? The

The article more or less gets it right: it agrees with the specifics of the criticism while maintaining that there is still a lot to love about Big Data. I appreciate the shout-out to astronomical surveys as the place where the term originated, though it came from journalists, not us scientists.

One thing that unquestionably does amount to Big Data is all the nonsense written about it by writers and media outlets riding the hype cycle. And yes, hopefully that is starting to crest. If you were planning on writing a book full of vague cheerleading for Big Data revolutionizing the world, you might be out of luck.

I think most of the confusion and misinformation could be avoided if people just stopped using Big Data as a noun. It's not a noun. It's an adjective. There are Big Data technologies and Big Data developers and Big Data architectures. But there is really nothing called Big Data. Big Data isn't going to change the world because an adjective can't do anything. People do things with data. That's called either computation or statistics.

I'm fine with the term Data Science to describe the emerging occupation of applying statistics, data visualization and machine learning to business problems. That's actually a noun and can in principle do great things. Big Data technologies like Hadoop, Spark, the cloud and NoSQL databases have also been very successful, and they help us do data science faster, cheaper and better on data sets large enough to require distributed storage and processing. But these technologies are just about handling data so that people with good ideas about how to use the data can work more productively.

The concept of the data warehouse as the central repository of data for the enterprise has been around for decades. Relational databases and normalized star schemas are almost as ingrained in IT as concrete and rebar are in the construction business. These data stores do make sense for many purposes but have an inherent weakness. The careful data modeling that allows them to slice and dice efficiently also comes at the price of strong interdependency. Changing a column or an index in a table could have ripple effects all over the data warehouse and affect many different applications.
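To make the ripple effect concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse. The `sales` table, its columns, and the report function are all hypothetical names invented for illustration. The point is only that a downstream query written against one column name stops working when the model changes underneath it.

```python
import sqlite3

def build_warehouse():
    """Create a tiny in-memory 'warehouse' with one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("west", 5.0), ("east", 2.5)],
    )
    return conn

def revenue_by_region(conn):
    """Hypothetical downstream report that depends on the column name 'amount'."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()

def rename_amount_column(conn):
    """Rebuild the table with 'amount' renamed to 'net_amount' (a modeling change)."""
    conn.execute("CREATE TABLE sales_new (region TEXT, net_amount REAL)")
    conn.execute("INSERT INTO sales_new SELECT region, amount FROM sales")
    conn.execute("DROP TABLE sales")
    conn.execute("ALTER TABLE sales_new RENAME TO sales")

def report_still_works(conn):
    """Return True if the downstream report runs, False if the schema change broke it."""
    try:
        revenue_by_region(conn)
        return True
    except sqlite3.OperationalError:
        return False
```

Calling `report_still_works` before and after `rename_amount_column` shows the report going from working to broken, even though no one touched the report itself. Multiply that by every application reading the table and you get the change-request bottleneck.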

IT workers have a good solution to this problem. Just stop requesting changes! For a mature company with a rather static model, this might suffice, but for most companies, whose business models and strategies are constantly in flux, the tight coupling of the data warehouse is an impediment to growth. In fact, the model of the data warehouse envisioned as a final state, while perhaps applicable to billing and basic reporting, is not well suited to the business intelligence analytics that tell executives how to navigate the waters of business.

The idea of the "data lake" is emerging as an alternative model for many companies. It is something of a throwback to the days before databases, when working with flat files was the norm. What is different now is that we can store everything in a scalable distributed filesystem and use technologies such as Hadoop or Spark to make this practical. The data lake need not be the only resting place for data. It can be a base for feeding the data warehouse or perhaps NoSQL databases. What's important, however, is that it need not be just a landing ground for data. There are many use cases where building apps on top of the data lake is a better solution.

One of the major benefits of Hadoop and Spark is that they are schema-on-read. That is, once you have written a file reader, you can decide what to keep and how to structure it, or decide not to structure it at all. In effect, this gives enormous freedom to developers, data scientists and analysts. If you want the data in a different form, you don't have to send a change request to the data modeling team for review. You just change a few lines of code. This means that analysts can get at any data source quickly, test to see if it drives business value and even build prototype applications without having to go through a heavy change process.
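Here is a rough sketch of what schema-on-read means in practice. It uses plain Python rather than Hadoop or Spark so that it runs anywhere, and the sample data, field order, and function names are all made up for illustration. The raw text is stored untouched, and each reader imposes only the structure it needs at the moment it reads.

```python
import csv
import io

# Hypothetical raw file as it might sit in the data lake.
# Nothing was validated or typed when it was written.
RAW_EVENTS = """\
2014-05-07,alice,purchase,20.00,US
2014-05-07,bob,view,,DE
2014-05-08,alice,purchase,5.00,US
"""

def read_raw(text):
    """Yield each raw line as a list of strings; no names or types imposed yet."""
    return csv.reader(io.StringIO(text))

def purchases_by_user(text):
    """One analyst's view: only user and amount, and only for purchase events."""
    totals = {}
    for _day, user, event, amount, _country in read_raw(text):
        if event == "purchase":
            totals[user] = totals.get(user, 0.0) + float(amount)
    return totals

def events_per_country(text):
    """Another analyst's view of the same bytes: only the country field matters."""
    counts = {}
    for row in read_raw(text):
        country = row[4]
        counts[country] = counts.get(country, 0) + 1
    return counts
```

Both functions read the same raw data, and neither required a change to how it was stored. Wanting a third view means writing a third small function, not filing a change request.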

Apache Spark seems to be taking over as the next-generation Hadoop. Its Shark interface is a SQL-like query system similar to Hive. What you get is better use of in-memory data structures and caching, which increases speeds by another factor of 10-100, and support for more general operations than MapReduce. It's also a lot more developer-friendly, with better APIs in Java, Scala and Python. If the data lake was exciting to you before, now it has a Shark in it!

While going from a highly regulated data warehouse environment to a free-for-all data lake may seem a little wild-west for conservative companies, most concerns can be put aside. First, it doesn't mean giving access to all of your 3000 employees. At most medium to large corporations that we work with, we find that there are only a handful of people doing advanced analytics. Most will still access data through special purpose applications.

"I'm sold," you may say, "but how do we get from here to there?" Certainly the data warehouse is going to be around for a while, as it supports most of your apps. Introducing a data lake into a data warehousing environment can be done gently and gradually. At the first stage it can be used as a data landing zone. You are probably already doing that using some Unix file system on a few big servers. Now you're just going to use HDFS as the file system, which is not a big change, and once you are there, you have immediate ad-hoc access via Spark. The next stage could be replacing your ETL scripts with Spark scripts or using commercial ETL products built on top of Spark. The final state is a system that allows for rapid change while keeping the data warehouse for the parts of your business that really don't require constant change. Now your data scientists, your executives and your IT guys are all getting what they want.
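The first two stages can be sketched in a few lines. This is only a toy under loud assumptions: a temporary local directory stands in for HDFS, an in-memory SQLite database stands in for the data warehouse, and plain Python stands in for the Spark ETL script. The file names, the `daily_sales` table, and the field layout are hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def land_raw_file(landing_dir, name, text):
    """Stage 1: drop the source file into the landing zone unchanged."""
    path = Path(landing_dir) / name
    path.write_text(text)
    return path

def etl_to_warehouse(landing_dir, conn):
    """Stage 2: read raw files from the landing zone and load only what the warehouse needs."""
    conn.execute("CREATE TABLE IF NOT EXISTS daily_sales (day TEXT, amount REAL)")
    for path in sorted(Path(landing_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            for day, _user, event, amount in csv.reader(handle):
                if event == "purchase":
                    conn.execute(
                        "INSERT INTO daily_sales VALUES (?, ?)", (day, float(amount))
                    )
    conn.commit()

def run_demo():
    """Land two hypothetical raw files, run the ETL step, and query the warehouse."""
    with tempfile.TemporaryDirectory() as landing_dir:
        land_raw_file(
            landing_dir,
            "events_2014-05-07.csv",
            "2014-05-07,alice,purchase,20.00\n2014-05-07,bob,view,\n",
        )
        land_raw_file(
            landing_dir,
            "events_2014-05-08.csv",
            "2014-05-08,alice,purchase,5.00\n",
        )
        conn = sqlite3.connect(":memory:")
        etl_to_warehouse(landing_dir, conn)
        rows = conn.execute(
            "SELECT day, SUM(amount) FROM daily_sales GROUP BY day ORDER BY day"
        ).fetchall()
        conn.close()
        return rows
```

The raw files stay in the landing zone untouched, so analysts can still read the `view` events that the warehouse table never sees. The warehouse keeps serving its stable reports from `daily_sales`.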


Thursday, May 8, 2014

Big Data is not a noun

Wednesday, May 7, 2014

The Shark in the data lake