Contemporary Data Scientists: Working Machine Learning at Scale

A recent Gartner Magic Quadrant for Data Science and Machine Learning Platforms report stated that the Data Science and Machine Learning platform market will be in a state of flux over the next few years. Among the drivers of change are the need to give Data Scientists the ability to manage models and collaborate at an enterprise level, and the availability of free and open-source options that let users begin Data Science and Machine Learning projects in an easy-to-access, low-investment way.

Anaconda debuted this year among the vendors that Gartner evaluated in the report, claiming a coveted spot in the niche category. Noted among Anaconda’s strengths was “its ability to federate and provide a central access point for a very large number of Python developers who build machine-learning capabilities continuously.” Ninety percent of respondents to a recent Anaconda survey use Anaconda for Python, and 14 percent of respondents consider Machine Learning to be a key application for it.

As Gartner also noted, the platform is aimed at and suitable for the expert Data Scientist community familiar with Python and the interactive notebook concept, not for business-oriented audiences.

Scaling Up

Anaconda recently introduced the latest version of its commercial platform, Anaconda Enterprise 5.2, to go along with its open source Anaconda Distribution, which has over 6 million users performing Python and R Data Science and Machine Learning on Linux, Windows, and Mac OS X.

According to Matthew Lodge, SVP of Products and Marketing at Anaconda Inc., the open source version is typically attractive to Data Scientists as they work on their own, exploring different data models, running visualizations, and trying different approaches with a subset of data that can fit into the memory of their laptops. The Enterprise version accommodates development of Data Science and Artificial Intelligence model pipelines from a Data Scientist’s laptop all the way through to production.

“When they need to do a training run on a full data set, they need scale,” Lodge said. “They can take the environment they built with the open source tooling on their laptop and transfer it to Anaconda Enterprise, and we can guarantee that they’ll get the same result.”

He explained that Anaconda Enterprise is like an enablement layer, helping Data Scientists collaborate with other Data Scientists to deploy at scale.

Greater scalability is a major feature of the latest Enterprise release, which adds NVIDIA GPU-accelerated, scalable Machine Learning to the Artificial Intelligence enablement platform. Data Scientists, the company says, can go from model development on a laptop, to training on a 1,000-node GPU cluster, to production deployment with full governance.

GPU stands for “graphics processing unit,” the name NVIDIA popularized years ago for processors designed to handle intensive graphics rendering tasks. It turned out that the computations GPUs use to accelerate graphics worked for Machine Learning, too, Lodge noted.

“GPUs are essential to doing massively parallel computation,” he said. When used for Machine Learning, though, they can be an expensive resource if IT has to give every scientist his or her own GPU to use with a laptop. But when GPUs are arranged in a large central cluster shared by every Data Scientist in an organization, as supported by Anaconda Enterprise 5.2, it becomes economically possible to train models on full data sets at scale.

Anaconda Enterprise 5.2, whether deployed in the Cloud or in a data center (as it might be for highly regulated industries such as financial services), leverages Cloud-native model management. Data Scientists can train models on a full data set at scale, including scheduling to make effective use of GPUs, and then deploy to production with one click, according to the company. They can do it all without having to become experts in containers, DevOps, or Kubernetes.

“As you scale up containers, you need to automate the process of running vast fleets of containers,” Lodge noted. As organizations build out their central clusters of GPUs, “We take care of managing that so that clusters are managed appropriately. It just works as far as the Data Scientist is concerned. We do the work under the cover with Kubernetes.”

One thing that differentiates Anaconda’s platform from competitors like IBM Watson Studio and Cloudera Data Science Workbench is that “we are the only one that is using Cloud-native technology of containers and Kubernetes to scale out Data Science deployments,” Lodge said.

In its 2018 State of Data Science Report, Anaconda asked respondents what technologies they use for scaling out their Data Science. The company found:

“Docker makes a strong showing at 19 percent, beating out Hadoop/Spark with 15 percent, followed by Kubernetes at 5.8 percent. This result suggests modern Cloud-native style architectures like Docker and Kubernetes are in the ascendancy, at the expense of traditional Hadoop ‘Big Data’ and Apache Mesos (0.85 percent).”

As Lodge puts it, the environment has reached the point where Cloud-native approaches offer lower costs and more flexibility. Hadoop MapReduce doesn’t work well for Machine Learning, as it doesn’t enable computations to talk to one another, he said. Additionally, “HDFS storage in Hadoop is about $100 per terabyte, while using Google or Amazon S3 is $20 per terabyte,” he said.
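To make the storage comparison concrete, the following minimal sketch applies the two per-terabyte prices Lodge quoted to a data set of a given size. The prices come from the quote above; the function names and the 50-terabyte example are illustrative assumptions, and the quote does not state a billing period, so the figures are best read as relative costs rather than a real bill.

```python
# Per-terabyte prices in USD, as quoted by Lodge in the article.
# The billing period is not stated in the source, so none is assumed here.
HDFS_USD_PER_TB = 100.0
OBJECT_STORE_USD_PER_TB = 20.0


def storage_cost(terabytes: float, usd_per_tb: float) -> float:
    """Return the cost in USD of storing `terabytes` at a flat per-terabyte price."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * usd_per_tb


def compare_storage_costs(terabytes: float) -> dict:
    """Compare the quoted HDFS and Cloud object storage prices for one data set size."""
    hdfs = storage_cost(terabytes, HDFS_USD_PER_TB)
    object_store = storage_cost(terabytes, OBJECT_STORE_USD_PER_TB)
    return {
        "hdfs_usd": hdfs,
        "object_store_usd": object_store,
        "savings_usd": hdfs - object_store,
        # The ratio depends only on the two quoted prices, not on the data size.
        "cost_ratio": HDFS_USD_PER_TB / OBJECT_STORE_USD_PER_TB,
    }


if __name__ == "__main__":
    # 50 TB is a hypothetical data set size chosen only for illustration.
    for key, value in compare_storage_costs(50).items():
        print(f"{key}: {value}")
```

At the quoted prices, HDFS storage costs five times as much per terabyte, whatever the size of the data set.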

Anaconda in the Enterprise Ecosystem

Among the organizations using Anaconda is the electricity and gas company National Grid, which wanted to use Data Science to develop a risk-based monitoring and maintenance system for its electricity assets. Using Anaconda Enterprise, it built Machine Learning models for a more cost-effective approach to asset management.

Helicopter inspections of physical assets produced videos of transmission equipment along with everything else the helicopter flew over, Lodge pointed out. Someone had to watch the videos and annotate any transmission equipment that might need service, a process that took hours.

“Now it’s automatic. By using Machine Learning to process the video and cut out the bits where the helicopter is just flying along and not looking at pieces of equipment, it saves 86% of inspection time,” Lodge said.
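The idea of cutting out the stretches of video that show no equipment can be sketched in a few lines. This is not National Grid's system: it assumes a hypothetical upstream model has already produced one true/false flag per frame saying whether equipment is visible, and the function names and sample flags are invented for illustration. The sketch only shows how such flags could be turned into segments worth reviewing and a fraction of footage skipped.

```python
from typing import List, Tuple


def equipment_segments(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges where consecutive flags are True."""
    segments = []
    start = None
    for index, has_equipment in enumerate(flags):
        if has_equipment and start is None:
            start = index  # a new segment of interest begins
        elif not has_equipment and start is not None:
            segments.append((start, index))  # the segment ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(flags)))  # video ended while equipment was visible
    return segments


def fraction_skipped(flags: List[bool]) -> float:
    """Return the share of frames a reviewer no longer needs to watch."""
    if not flags:
        return 0.0
    skipped = sum(1 for has_equipment in flags if not has_equipment)
    return skipped / len(flags)


if __name__ == "__main__":
    # Hypothetical per-frame flags; a real pipeline would get these from a trained model.
    flags = [False, False, True, True, False, True, False, False, False, False]
    print(equipment_segments(flags))
    print(fraction_skipped(flags))
```

The sample flags are made up, so the skipped fraction they produce has no connection to the 86% figure Lodge cites.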

The Defense Advanced Research Projects Agency (DARPA) is using Anaconda for pattern recognition that indicates human trafficking, work that had been a manual process. Now, the agency can use Machine Learning to automate the search for patterns in data, looking for clusters of activity, such as the movement patterns of individuals, that help identify human trafficking rings, he said. “It’s making them more effective at finding them at scale.”

Those are just a few examples of where Anaconda and its Data Science platform are transforming the way people are doing business. The industry is still young, but Anaconda is helping push the envelope so that new use cases are being developed all the time.

Image used under license from Shutterstock.com
