Click to learn more about author Daniel Jebaraj.
Data Science is now being used as a competitive weapon. As with other technologies and processes that can transform the way companies operate, there’s a lot of contradictory information about it that’s causing considerable confusion.
Most of today’s business leaders have heard that Data Science can improve operational efficiency and customer relationships, but it isn’t always clear how Data Science should be implemented or what the specific business benefits might be.
This blog post addresses some of the misunderstandings individuals and organizations have about Data Science. It also includes tips developers can use to enable Data Science capabilities in their organizations.
What Is Data Science?
Data Science is an umbrella
term comprising some of the today’s hottest topics such as Machine Learning,
Analytics, Modeling, and Data Visualization. In practice, Data Science is a
process. It starts with a hypothesis, and then data is gathered with the hope
of producing valuable insights. Once the data is collected, it is used to test
and build models. Finally, the results are analyzed and presented to decision makers as reports or dashboards.
The models, which tend to approximate events or behaviors in the real world, are used to make important decisions. For instance, churn detection models are often used to predict which customers are at the highest risk of defecting to a competitor so the business can take preventative action. Depending on the circumstances, the preventative action may take the form of a phone call from a manager, a discounted subscription renewal rate, or coupon.
Unfortunately, there is no single definition of Data Science, but many data scientists and vendors describe it as a process, similar to the definition and workflow presented above. Some consider Data Science synonymous with Statistical Modeling or Analytics (identifying patterns in data and presenting the results via a dashboard), which only adds to the confusion. Modeling and Analytics are subsets of the Data Science process.
The good news is that businesses can choose how they implement Data Science in their organizations because there’s no “right” way to do it. How Data Science is implemented depends on many things, including the expertise, tools, and data available to the organization. The most effective implementations of Data Science tend to start with, and align with, business goals.
Seasoned data scientists understand such nuances. Such understanding promotes clarity. Unfortunately, there are many myths around Data Science that serve as roadblocks in the path toward clarity. By confronting these myths, it is our hope that more organizations, especially organizations with development teams, will implement Data Science.
Myth #1: It’s Hard to Find Data Scientists
The shortage of data scientists is well documented in the media. In fact, Fast Company and others cited a report from McKinsey that predicts a shortfall of 250,000 data scientists in the U.S. alone by 2024. Many of today’s companies are competing for “real” data scientists or “unicorns.” Unicorns are rare creatures who have a graduate degree in math or statistics (Ph.D. preferred), strong programming skills, and solid domain expertise. Few candidates have deep expertise in all three areas, which is why there is a shortage of data scientists. To overcome that obstacle, some organizations are trying to develop a Data Science practice that combines the expertise of several people.
A common mistake is to hire specialized expertise, such as a Ph.D.-level statistician or data scientist, before it’s necessary. Company decision makers believe the company needs such a person to gain a competitive advantage, but it is unclear what that person should do and for whom. Lacking a mission and purpose, the statistician or data scientist who longs to make a positive impact on the business but can’t will likely resign with a better offer in hand from another employer. That’s why it’s often easier to hire specialized talent than to keep it.
Most organizations can start reaping the benefits of Data Science without highly specialized expertise or expensive software, but quite often they don’t know where to start. We recommend looking inwards and starting with software development teams. It is our experience that software development teams can be trained to take on Data Science tasks.
Myth #2: Data Science is Suited Only for Large Organizations
Large organizations typically have the financial resources necessary to build a formal Data Science practice. However, that does not mean their Data Science practice will succeed.
When those large organizations are successful, the media likes to use them as examples of what companies can achieve, such as competing more effectively, improving operational efficiency, and even disrupting an entire industry. Because large, brand-name companies are often positioned as the leaders of their industries, small and medium businesses (SMBs) may believe that Data Science requires hefty investments in expensive software and the expertise needed to use that software.
In fact, Data Science requires neither of those things. In this domain, vast resources do not guarantee success. Smart resources do. Organizations of all sizes can succeed in their Data Science activities if they are implemented correctly by a competent team.
Myth #3: Data Science is Just a Buzzword
Business leaders, journalists, and industry analysts are quick to use the latest jargon. The resulting noise can make it difficult to discern between industry hype and technologies or processes that can stand the test of time. Given the extreme hype about Data Science these days, it’s not surprising that some consider it just another buzzword or fad.
Data Science isn’t a buzzword or fad, however. It’s a confluence of time-tested disciplines, including statistics and forecasting, that have existed in some form for centuries. For example, actuaries and meteorologists have long used models to predict risks and weather, respectively. Now, businesses in virtually every industry are trying to use data to improve their performance.
A few things that distinguish Data Science from its predecessors, including actuarial science and statistics, are access to massive amounts of data that can be stored cheaply, robust computing power, and quick access to predefined models. Compared to yesteryear, organizations can learn more about themselves, their markets, and their customers than ever before because the data they need is plentiful, easily duplicated, easy to share, and relatively easy to process. Those capabilities, coupled with today’s powerful programming environments, give developers considerable control over how data is manipulated, cleansed, preprocessed, analyzed, and visualized.
Myth #4: Complex Models are Better Than Simple Models
Decision trees, statistical regression, and linear regression are not new, so the media pays less attention to them than deep learning and neural networks. Deep Learning and Neural Networks use complex models that are considerably more sophisticated than the models used to solve simpler problems because they are attempting to emulate arbitrarily complex functions.
Complex models are not necessarily better than simpler models for a few reasons. First, a complex model can be less efficient than a simpler model if the problem is relatively simple. Second, complex models can be costly in terms of processing power. Finally, complex models can lead to black-box approaches that are difficult or impossible to explain. While the results of a black-box solution may be “good,” black-box solutions don’t allow users to explore how a result was derived. If users can’t explore how a result was derived, they can’t understand what went into it. If they can’t understand what led to the result, they can’t explain the details, which is not good, particularly in an audit scenario.
Simpler models are easier to understand and explain. For example, a relatively simple logistic regression model can be used to predict which of your prospects will likely buy your product.
A common mistake is to think that complex models necessarily yield better results than simple models in all situations. However, unnecessary complexity can result in diminishing returns. When that’s the case, it’s better to spend less time tweaking the model and more time understanding and cleansing the data.
Myth #5: Data Science Requires a Deep Understanding of Statistics and Statistical Methods
While it’s true that Data Science requires an understanding of statistics, businesses can take advantage of Data Science without having a statistician on staff. Most developers have a basic understanding of statistics because they took at least one course in college.
If you’re a developer who has been tasked with building Data Science capabilities in your organization, or you want to start building the capability yourself, it’s wise to refresh or augment your knowledge of statistics so that you can understand the fundamentals commonly used to develop models.
You do not have to take a formal course. You do not have to pursue a graduate degree. The e-books and other resources referenced at the end of this white paper will help you understand the basics. Armed with that knowledge, you’ll be able to build models that are meaningful to your organization.
If you want to modify the model later, you may need to learn a little bit more so you can understand how particular assumptions affect what you’re doing.
Myth #6: Regulated Companies Can’t Take Advantage of Data Science
Regulated companies have to be careful about the information they use and how they use it. However, those limitations do not mean regulated companies cannot take advantage of Data Science or build models.
For example, hospitals are using Data Science to improve patient care, emergency triage, and cost control. Similarly, companies in other regulated industries such as financial services, oil and gas, and pharmaceuticals are also benefitting from Data Science without using information that is prohibited by law.
Be mindful of inference, however. Your company may be prohibited from using certain types of information, such as personally identifiable information (PII), for specific purposes. It is nevertheless possible to infer sensitive information by combining other data points that are not restricted. Such uses could expose your company to regulatory fines and damages.
You can minimize the likelihood of such risks by avoiding unnecessary attributes that allow personal information to be inferred, which could be prohibited by law. For example, if it were illegal to use income as the basis for discrimination, one could nevertheless infer a person’s approximate income level from her zip code, car make and model, etc.
Even if certain types of personal information are not prohibited by law, their use can be brand-damaging. For example, Forbes reported that Target inferred a teenage girl’s pregnancy based on her purchasing habits. Based on that insight, Target sent relevant coupons to the girl’s home address where they were discovered by her unsuspecting father.
Because inference can open the door to legal and other risks, organizations should understand what can be inferred by their data and what the associated risks are.
Myth #7: Data Science Tools are Too Expensive
Some of the most sophisticated Data Science products are extremely costly to buy and difficult to use. However, it is not necessary to invest millions of dollars in software in order to benefit from Data Science.
For one thing, there are many open-source tools, such as R and Apache Spark, that are not difficult to set up and use. There are also a lot of commercial support options available for such tools, given their popularity.
There are also commercial products available that are far less expensive than traditional solutions.
You do not have to budget for expensive tools to take advantage of Data Science.
Myth #8: Data Science Requires Massive Computing Power
The Big Data and AI hype have created the impression that Data Science requires massively parallel GPU-accelerated machines or huge clusters. While large Deep Learning and Neural Networks do sometimes require that kind of computing power, many use cases do not.
Problems that can be solved with simple models may only require a PC with 64 GB or 128 GB of RAM. If that’s not enough, two or three hours spent on a cloud may be all that’s necessary to build and test a model. A Cloud environment, such as AWS or Microsoft Azure, may also be necessary if the data processing or data cleansing requirements exceed the capacity of a single node.
Essentially, it’s more cost effective to scale computing resources as necessary than to over-engineer a computing environment that is more complex and costly than the problem requires.
Myth #9: Data Cannot be Monetized Because it’s in a Hard-to-Use Format
Data-first companies such as Google and Facebook are masters at monetizing data. They have collected treasure troves of information that are sold to various parties at a handsome profit.
Some small to medium businesses think that data monetization is some- thing only industry giants can do because they are data-first companies. However, most businesses have valuable customer data that could be used to improve company operations and perhaps drive new revenue streams. For example, most companies have transactional information, whether it’s customer orders or credit card sales. They probably also have customer service records from their website or call center, and support tickets. Yet, many businesses aren’t able to leverage that data effectively, let alone monetize it.
In fact, it’s unclear what might be discerned from the data by modeling or analyzing it. Worse, the data may not be readily accessible because it’s stored in various databases, on paper, or in business systems that have not been interconnected yet.
Part of the problem can be solved using a data integration platform. Using an integration platform, organizations are able to connect the dots, which means their insights transcend the data stored in any one system. Using that approach, organizations are in a better position to optimize business processes and customer journeys. Common connections include sales, marketing, and customer support, although that information can also be connected with supply chain information and information from other systems, as appropriate.
Trend information, such as weather, traffic, and customer buying patterns are commonly bought and sold to improve sales, marketing, or operational effectiveness. The companies monetizing such data typically transform it so it can be consumed easily by other applications (which is part of what Data Integration platforms do). The data is then made available to third parties via APIs.
In short, Data Integration platforms lower barriers to information sharing and monetization.
Myth #10: Data Science is Hard to Adopt Because it’s Complicated
Data Science can be a very complex undertaking, but it doesn’t have to be. In fact, it’s better to start simply, drive success with that, and then expand your capabilities.
Many organizations start by aggregating data they think is valuable, gleaning some insights from it, and pushing those insights out to decision-makers via reports and dashboards. Later, they start building models on top of the data to drive new and finer-grained insights.
Although there is no single “right” path to Data Science adoption, the wrong path is inevitably overcomplicating the problem when a simpler solution is more elegant, effective, and cost-efficient.
Data Science doesn’t have to be a complex and expensive undertaking that requires a formidable staff of Ph.Ds. The software development capabilities you have today can produce valuable insights that you once considered impossible without heavy investments in additional resources.
One way you can overcome your organization’s obstacles is to supplement your computer science and business domain expertise with a basic under- standing of statistics so that you can start building models that benefit your organization. As the needs of your business grow, you can expand your knowledge to help move your company down a successful Data Science path.