Demystifying the Data Scientist

Wed 22 January 2014

Tom Tunguz wrote an interesting piece yesterday on the diversity within the Data Science discipline. He pointed to the results of a survey conducted by the Data Community DC team in which they classified 250 data scientists into four distinct groups based on the types of activities they were most involved in. The four segments were Data Businessperson, Data Creative, Data Engineer and Data Researcher. This is bound to change as the role matures and becomes better defined in the industry, but it's a great starting point for understanding what exactly a data scientist does, and it helps shed some light on this relatively new discipline (new at least in title).

Here's the breakdown in visual form -

demystifying data scientist

I'd like to extend this classification by suggesting two more activities that I believe almost every data scientist spends her time on and then propose a new classification scheme that builds on the one suggested.

Data Identification & Aggregation

A large part of the job of a data scientist is to identify novel sources and sets of data that could be analyzed to provide insights into the problems or questions at hand. This is an iterative process and goes hand in hand with thinking about the business problems that need to be addressed. This could involve bringing in external data sources, identifying how data from different business units could be combined or working with the product team to build instrumentation to collect the relevant data.

Data Cleaning

DJ Patil, in his recent talk, said that data scientists spend almost 80% of their time cleaning the data (Leo Polovets has a great summary of his talk). I can concur that he's not exaggerating. Data rarely comes in the form of neatly formatted CSVs, and much of the daily rigamarole that a data scientist goes through involves getting it into a form where it can be analyzed for insights.

The Data Evangelist

Adding these two activities to the mix and re-calibrating some of the other ones, here's what my classification looks like -

demystifying data scientist - part deux

As you'll notice, I've added a new segment to the classification - a Data Evangelist (for lack of a better term, suggestions welcome!). This person could be considered a full-stack data scientist and spends most of his time experimenting. For these experiments to be efficient and worthwhile, he must have a firm grounding in business but also be able to tackle the technical challenges that come with the role. He spends a good amount of time exploring, aggregating and cleaning data and needs to have the programming chops to get the job done. At the end of the day, he's the driving force behind the data strategy for the company.

Of course, this is not based on any survey data (as the original classification was) but rather my understanding of the role and experience through my daily activities as a product manager/data scientist/businessperson hybrid -- a Data Evangelist of sorts. I'm curious to hear what you think of this.