The world’s most valuable resource is no longer oil but data. As a result, the skills required to convert data into insights that generate revenues are in high demand. Unfortunately, one of the biggest hurdles companies face when trying to capitalize on their data sets is a lack of data scientists and machine learning practitioners.
These professions require a wide array of specialized skills ranging from statistics, probability, and linear algebra to computer science. In an interview with DATAVERSITY®, Andras Palfi, data scientist at Bigstep, talked about strategies for companies dealing with a lack of skilled personnel.
Where is Data Science Today?
Palfi started by saying he uses Data Science as an umbrella term that encompasses any activity relating to descriptive and prescriptive analytics. He sees a divide among companies that are doing Data Science: those investing in the future and those still rooted in the past. In some organizations, Data Science has progressed from being a simple back-office role to being an important building block of the business. These companies are developing new technologies and driving the entire industry forward. Netflix is an example of this type of company – as much a data company as it is a streaming service.
On the other side of the divide, some organizations have started to invest in Data Science while still keeping a foot in the past, said Palfi. They are held back by aging infrastructure, a lack of leadership, or a lack of sophistication.
Challenges to Leveraging Data Science
Palfi remarked that the biggest hurdle standing in the way of wider Data Science adoption is a lack of understanding of what Data Science is and how it can impact the bottom line. He quoted a recent article, Data Science and the Art of Persuasion by Scott Berinato, in the Harvard Business Review:
“Data teams know they’re sitting on valuable insights, but can’t sell them. They say decision makers misunderstand or oversimplify their analysis and expect them to do magic, to provide the right answers to all their questions. Executives, meanwhile, complain about how much money they invest in Data Science operations that don’t provide the guidance they hoped for.”
This mirrors Palfi’s experience. All too often executives don’t fully grasp either the potential or the limitations of Data Science, which leads to misguided expectations and conflicts, he said. On the other hand, a data scientist can get lost in the details, develop models just for the love of it, and forget to think about the business aspects of their work. This fuels Palfi’s belief that people who can bridge the gap between technical and business talent are and will continue to be in great demand.
The Shortage of Data Scientists
In August 2018, LinkedIn’s Workforce Report concluded that data analysts and scientists will be the most sought-after professionals for the next five years. Data Science skills shortages are present in almost every large U.S. city: “As more industries rely on Big Data to make decisions, Data Science has become increasingly important across all industries, not just tech and finance.” In the October 2018 Workforce Report, LinkedIn reported that the trend was continuing and that employer demand for data scientists is “off the charts nationally,” with shortages increasing locally in growth areas such as Nashville, Charlotte, and Las Vegas.
Companies struggling to find solutions to this shortage can take a multi-faceted approach. Palfi recommends focusing on developing in-house talent as a way to increase the talent pool. “With the rise of great quality online courses, it’s now easier than ever to invest in employees,” he said. The only issue is that there are so many resources available, making it quite challenging to get started. Looking up “Introduction to Python” results in a list of hundreds of courses and tutorials, he said.
To counter this problem, companies should invest in developing in-house career tracks through which employees can progress from one role to another. For example, companies can build an in-house track for data analysts who want to learn and develop Data Science skills, or a track through which database administrators can progress into data engineering roles. “It makes a lot of sense. These are people who, with the right support and motivation, can easily pick up additional skills and knowledge.” It’s also a great way to keep employees motivated, loyal, and happy, he said.
Data Science Workflow Automation
As markets struggle to supply the necessary talent, Palfi noted, a few companies are working on a new paradigm: automated Data Science. The idea is that many or all aspects of Data Science and analytics can be automated, lessening the need for hands-on attention. Algorithm selection, for example, no longer requires that an employee understand the difference between logistic regression and random forest – a machine can simply figure out which type of algorithm fits the data best.
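As a rough illustration of automated algorithm selection, the sketch below scores each candidate model with k-fold cross-validation and keeps the one with the best average accuracy. It is a minimal standard-library example, not how any particular vendor's product works. The two toy classifiers (`MajorityClass` and `NearestCentroid`), the helper names, and the synthetic data are all assumptions made for illustration; in practice the candidates would be real algorithms such as logistic regression and random forest.

```python
import random


class MajorityClass:
    """Toy baseline (hypothetical): always predicts the most common training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


class NearestCentroid:
    """Toy model (hypothetical): predicts the label whose training mean is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            # Column-wise mean of all rows that carry this label.
            self.centroids[label] = [sum(col) / len(col) for col in zip(*rows)]
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda c: sq_dist(x, self.centroids[c]))
            for x in X
        ]


def cross_val_accuracy(make_model, X, y, k=5):
    """Mean accuracy over k folds; row i belongs to fold i % k."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in test])
        hits = sum(p == y[i] for p, i in zip(preds, test))
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


def select_model(candidates, X, y):
    """Score every candidate and return the best name plus all scores."""
    scores = {name: cross_val_accuracy(factory, X, y) for name, factory in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic two-class data (an assumption for the demo): two separated clusters.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(40)]
y = [0] * 60 + [1] * 40

candidates = {"majority_class": MajorityClass, "nearest_centroid": NearestCentroid}
best_name, scores = select_model(candidates, X, y)
print(best_name, scores)
```

The person running this never has to know how the candidates differ internally; the selection is driven entirely by held-out accuracy, which is the basic idea behind the automation Palfi describes.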
In most cases, the Data Science workflow can be broken down into three parts: data processing, model building, and deployment. There are existing tools that can automate parts of the workflow, but currently, there is no single tool that automates the entirety of the workflow, he said. DataRobot has fully automated the model building and deployment parts of the workflow, with a one-click, data-in, model-out platform. Other companies, such as Alteryx, have succeeded in streamlining data pre-processing, which Palfi said is the most tedious and time-consuming part of the Data Science workflow. A truly fully automated data platform, however, would be able to handle every step of the Data Science process. “You could connect to a database, select a type of model for it to build and then let it do its magic. I would say we are still far away from this happening, though.”
Data Scientists and Automation
In the meantime, he suggests that data scientists need not be afraid of becoming obsolete anytime soon. “Just be prepared for your job to be transformed from one that’s primarily focused on pushing data around and optimizing models to one that is more high level.”
Palfi cited Randy Olson, developer of the automated machine learning library TPOT, as quoted in the article Automated Machine Learning vs. Automated Data Science by Matthew Mayo. Olson believes that:
“Such automation tools should not be seen as replacements to data scientists, but as Data Science assistants. Such tools eliminate repetitive tasks such as running experiments on vast combinations of model hyperparameters and selected features and instead allow the humans in the process to be able to focus on more important and guiding issues.”
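The kind of repetitive experiment Olson describes can be sketched as an exhaustive grid search over feature subsets and one hyperparameter. The example below is a simplified standard-library illustration, not TPOT's actual method: the k-nearest-neighbors classifier, the function names, the holdout scheme, and the synthetic data (one informative feature plus two noise features) are all assumptions chosen to keep the sketch short.

```python
import random
from itertools import combinations, product


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


def holdout_accuracy(X, y, features, k):
    """Train on three quarters of the rows, test on every 4th row, using only `features`."""
    projected = [[row[f] for f in features] for row in X]
    train = [i for i in range(len(X)) if i % 4 != 0]
    test = [i for i in range(len(X)) if i % 4 == 0]
    train_X = [projected[i] for i in train]
    train_y = [y[i] for i in train]
    hits = sum(knn_predict(train_X, train_y, projected[i], k) == y[i] for i in test)
    return hits / len(test)


def grid_search(X, y, ks, n_features):
    """Try every non-empty feature subset with every k; return the best pair and all scores."""
    feature_sets = [
        subset
        for size in range(1, n_features + 1)
        for subset in combinations(range(n_features), size)
    ]
    results = {
        (subset, k): holdout_accuracy(X, y, subset, k)
        for subset, k in product(feature_sets, ks)
    }
    best = max(results, key=results.get)
    return best, results


# Synthetic data (an assumption for the demo): feature 0 tracks the label,
# features 1 and 2 are pure noise.
rng = random.Random(42)
y = [rng.randint(0, 1) for _ in range(80)]
X = [[label * 6 + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]

best, results = grid_search(X, y, ks=(1, 3, 5), n_features=3)
print("best (features, k):", best, "accuracy:", results[best])
print("configurations tried:", len(results))
```

Even this tiny problem produces 21 configurations to evaluate, and the count grows quickly with more features and hyperparameters. That is the tedious work such tools take over, leaving the choice of what to model, and why, to people.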
There will always be a need for people who can understand and explain machine learning models, Palfi said. “If I’m refused a loan based on a decision made by an algorithm, I want to know why,” and consumers won’t accept “The algorithm decided it” as an answer. Data scientists are needed to provide accountability for the models they maintain, he said.
Citizen Data Scientists
Another way companies are addressing the data scientist shortage is by cultivating citizen data scientists. “Citizen data scientists are people who don’t possess all the skills a data scientist has but are still extremely interested in analyzing data.” Examples might be data engineers who aren’t necessarily good at visualizing data or data analysts who aren’t proficient at building statistical models. Palfi said that automation tools could “democratize” Data Science and help citizen data scientists perform tasks previously reserved for more experienced data scientists.
Palfi said that Lentiq is also in the business of automation. It is a cutting-edge collaboration environment designed for analytics and machine learning teams. Focusing on building data lakes that enable freedom and flexibility, Lentiq has moved away from a centralized data repository to a fully distributed architecture that allows organizations to unify departments through data and knowledge-sharing mechanisms.
Current data lake technology promises advanced analytics for digital transformation. However, the 2018 McKinsey AI Index found that only 8 out of 100 pilot projects reached their intended goals. Creeping complexity, excessive time and resources needed to engineer the environment, and a lack of flexibility have created a “one size fits all” scenario. Instead, Lentiq offers an agile, scaled solution, allowing companies to reduce staffing pressure and distribute data to locations where it’s needed using a micro-data lake called a “data pool.”
Lentiq is a multi-cloud, production-scale data-lake-as-a-service that provides data teams with the tools, the collaboration mechanisms, and the freedom they need to innovate close to the data sources and the teams that use the data. Lentiq supports independent budgeting and resources for each pool and is flexible enough to work with various tools for different business use cases. As Palfi explained:
“Our goal is to allow as many teams as possible to access data and have a friendly environment for analytics and machine learning projects. We strongly believe transformative innovation can only be achieved through a human-centric machine learning approach for all data projects that are being developed in an organization.”