How do Data Scientists fit into an organization? How is the role of a Data Scientist different from the role of a Data Engineer? How should the Data Science team be organized to create the most value for the business?
Those are questions companies are struggling with as they begin to incorporate Big Data, Analytics, and Data Science into their technology organizations. The Enterprise Data World 2016 Conference provided insight into the topic at a panel discussion, Organizing for Enterprise Data Science. The panel included Chris Bergh from DataKitchen, John Akred from Silicon Valley Data Science, and Tim Berglund from DataStax, and was moderated by April Reeve from Reeve Consulting LLC.
Defining a Data Scientist
The panel began with the most basic questions: What does a Data Scientist do? What skills are needed? Where can you find people with those skills? Reeve started the discussion by presenting the common Venn diagram of Data Scientist skills, showing they need hacking ability, subject matter expertise, and statistical skills. These skills support the roles of Data Scientists in researching the possibility of predicting behavior, introducing new data sources, performing analysis, proposing and validating predictive models, and developing prototypes of predictive solutions.
While performing those functions requires development skills, math and statistical knowledge, and business expertise, Bergh pointed out that it’s difficult to find people with all these skills. On the other hand, there is plenty of incentive for people to develop these skills now: “It’s the sexiest job of the 21st century. The alpha nerds nowadays are Data Scientists.”
When Reeve suggested that the definition was too broad and that these skills couldn’t usually be found in a single person, Bergh agreed, adding, “I don’t think it’s too broad; I think it’s aspirational. I think your team should have all these skills as opposed to one person.”
Distinguishing Between Data Scientists and Data Engineers
Another role that’s common on Data Science teams is a Data Engineer. The panelists were in agreement that the roles of Data Scientists and Data Engineers are distinct.
“They are fundamentally different sides of the same coin. One is a mindset about improvisation, agility, and responding to current inputs and the other is about spending a lot of time very thoughtfully interpreting something,” Akred said.
Akred pointed out that the Venn diagram description of the Data Scientist job is missing an important skill that Data Engineers bring: familiarity with enterprise data systems. The challenges of “how do I find out what the data is; how do I find out what matters to this business” are issues that data engineers can address, he said. These are crucial skills, as understanding enterprise and operational systems is necessary to surface the data so the Data Scientists can do their analysis. Since Data Scientists often come from academic backgrounds where they haven’t interacted with enterprise systems, they lack those skills.
In fact, Data Engineers and Data Scientists have complementary skills. “Those two groups need to get along and work well together,” Akred added. If they don’t, the analytics team can lose a lot of productivity.
Packaging Data Science for Production
When the Data Scientists and Data Engineers work well together, the result is ultimately an insight that needs to be shared or a model that needs to be migrated into production. Typically the Data Scientists hand off that task to Data Engineers, Solution Architects, or some kind of software development team.
Bergh pointed out that deploying the work of Data Scientists doesn’t always result in implementing an application. Sometimes the result of the model is “deployed” in a PowerPoint shared with senior business management, and a software team isn’t required.
When it comes to developing a production application and building data pipelines, Reeve said:
“If the question is, ‘Are data engineers different from my regular enterprise programming group?’ maybe, maybe not. What matters is, are you trying to integrate these insights, the results of the Data Scientist, with your enterprise transaction processing systems? If that’s the case, that’s frequently where you need to bring in your current IT staff who know the systems to do that.”
But Akred pointed out there are some differences between analytics applications and traditional applications that conventional software teams may not be prepared to handle. In particular, analytics applications are unlike other applications in that “Most Data Science capabilities are non-deterministic, which is to say given exactly the same inputs you can get different outputs.” That’s all right for the analytic application, but it’s not a situation most testing methodologies address.
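One way to picture the testing problem Akred describes is a check that accepts any result inside a tolerance band instead of demanding an exact value. The following is only a minimal sketch under assumed names: `estimate_mean`, `within_tolerance`, the sample size, and the tolerance are hypothetical illustrations, not anything the panel presented.

```python
import random


def estimate_mean(data, sample_size, rng):
    """Estimate the mean of `data` from a random sample.

    A hypothetical stand-in for a stochastic analytics step: the result
    depends on which items the random generator happens to pick.
    """
    sample = [rng.choice(data) for _ in range(sample_size)]
    return sum(sample) / sample_size


def within_tolerance(value, expected, tolerance):
    """Return True when `value` is within `tolerance` of `expected`."""
    return abs(value - expected) <= tolerance


data = list(range(1, 101))  # the true mean is 50.5

# Two unseeded runs over exactly the same input will usually differ,
# so an exact-equality assertion between them would be unreliable.
run_a = estimate_mean(data, 2000, random.Random())
run_b = estimate_mean(data, 2000, random.Random())

# A tolerance-based check passes for both runs without pinning
# the output to a single value.
print(within_tolerance(run_a, 50.5, 5.0), within_tolerance(run_b, 50.5, 5.0))
```

The tolerance here is deliberately wide relative to the sampling noise. In practice the acceptable band would have to be chosen for each model and agreed with the people who own the business metric.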
Keeping Analytics Applications Running in Production
Once the application makes it into production, another role may be required: the Data Administrator. Reeve presented a slide identifying the Data Administrator’s responsibilities as the provisioning of clusters, supporting the Hadoop infrastructure, integrating monitoring tools, integrating enterprise security, and analyzing workloads to perform capacity planning.
Reeve asked which personality types fit these responsibilities and where those skills can be found: can the roles be filled by current DBAs and other enterprise administrators, or do they call for people from the Big Data space? Berglund remarked, “That ends up being a horse of a fairly different color from what we knew as operational DBAs and then the people who will administer these systems going forward.”
Akred commented that an integrated operational and development team that has strong communication is necessary to achieve an infrastructure that functions properly. He also pointed out that today’s technology allows you to program and automate a lot of the support and operational tasks you used to have to hire staff to do. In fact, the panel agreed that DevOps is a fit for the operational support role.
It’s important to manage analytic systems and model deployments with many of the same procedures that are used to manage code deployments. Reeve pointed out that Data Scientists rarely get involved once the model is in production.
Bergh added, “There’s a difference between training a model and deploying a model. Data Scientists are really interested in finding and training cool models. That’s their job but once it gets done, you need to deploy it.” In fact, even if the model is simply included in a PowerPoint rather than built into an application, it should be managed as if it were a code deployment, because Data Scientists share and reuse models.
The issue of versioning is especially challenging for analytic models, Akred pointed out. “There’s one important gotcha. There’s a notion of versioning particular to Data Science applications that traditional things like Git can’t manage for you.” Some Data Science algorithms start with a random seed, and versioning these “model artifacts” is needed along with versioning the code and the data.
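The idea of versioning a model artifact alongside the code and the data can be sketched as a small manifest that records all three together with the random seed. This is an illustrative sketch only: the toy `train` function, the manifest fields, and the `code_version` string are assumptions made for the example, not a tool or format the panel described.

```python
import hashlib
import json
import random


def train(data, seed):
    """Toy 'training' step: seeded random weights scaled by the data mean."""
    rng = random.Random(seed)
    mean = sum(data) / len(data)
    return [round(rng.uniform(-1, 1) * mean, 6) for _ in range(3)]


def fingerprint(obj):
    """Return a SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_manifest(data, seed, code_version):
    """Record what is needed to identify one trained model artifact."""
    weights = train(data, seed)
    return {
        "code_version": code_version,  # hypothetical, e.g. a commit id
        "data_sha256": fingerprint(data),
        "seed": seed,
        "model_sha256": fingerprint(weights),
    }


data = [1.0, 2.0, 3.0, 4.0]
first = build_manifest(data, seed=42, code_version="abc123")
repeat = build_manifest(data, seed=42, code_version="abc123")
reseeded = build_manifest(data, seed=7, code_version="abc123")

# Same code, data, and seed reproduce the same artifact; changing only
# the seed yields a different model even though code and data match.
print(first == repeat, first["model_sha256"] == reseeded["model_sha256"])
```

Because the seed changes the artifact while the code and data stay the same, tracking code and data alone would not distinguish the two models, which is the gap Akred points to.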
Organizing a Data Science Team
With all those roles identified, the panel turned their discussion to the question of how those professionals should be organized within the enterprise. Options range from every business unit’s having its own Data Scientists to having a centralized team from which every business unit has to request predictive analytical services. The centralized team helps ensure that the necessary human resources are fully utilized; it also has the advantage that, as Akred said, “Data Scientists like to hang out with other Data Scientists.”
That doesn’t necessarily mean co-locating all your Data Scientists, but if they aren’t located together, an organization should have quarterly or more frequent meetings that bring them together to talk Data Science. Akred also pointed out that, whether they’re managed as a centralized group or not, the Data Scientists need to be out with the business in order to understand the business problem. Generally, trying to hire people with domain expertise reduces the pool too much; hire a smart Data Scientist and they’ll figure out your domain.
It’s also important not to have too many layers of management in between the Data Scientists and the business. “The more fluid the conversation between the business and the Data Scientists the more likely you are to get really good results out of those Data Scientists,” Akred added.