Messy Data Shouldn’t Stop Machine Learning in Its Tracks


Click to learn more about author Jon Reilly.

Businesses are creating data at an incredible pace that will only accelerate. In fact, data storage company Seagate predicts that the volume of data created each year will reach “163 zettabytes (ZB) by 2025. That’s ten times the amount of data produced in 2017.”

Moore’s Law – the observation that the number of transistors on a chip, and with it computing capability, doubles roughly every two years – partly drives the rate of data generation. The shrinking size, cost, and power requirements of transistors make it possible to embed networked computing devices in more products, which stream usage and performance data back to manufacturers. Couple that with the increasing digitization of our interactions, and businesses find themselves capturing and storing massive amounts of prospect and customer data.

And by and large, the data is a mess. It comes in many formats, including open-ended text, categories, dates/times, numbers, unique IDs, emails, and addresses. It goes on and on. Within this mess are the most important stories of your business – the patterns that shape your outcomes, the user journeys that successfully build revenue, and the trajectories that will impact your future KPIs.
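To make the variety of formats concrete, here is a minimal sketch of how one might profile a single column before deciding what to do with it. It is an illustration only: the function names `kind_of` and `profile`, the deliberately loose email pattern, and the sample values are all assumptions, not part of any particular product.

```python
import re
from collections import Counter
from datetime import datetime

# Deliberately loose pattern: "something@something.something".
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def kind_of(value):
    """Guess a coarse kind for one raw value (hypothetical helper)."""
    text = str(value).strip() if value is not None else ""
    if not text:
        return "missing"
    if EMAIL.match(text):
        return "email"
    try:
        float(text)
        return "number"
    except ValueError:
        pass
    try:
        # Only ISO-style dates are recognized in this sketch.
        datetime.fromisoformat(text)
        return "date/time"
    except ValueError:
        pass
    return "text"


def profile(values):
    """Count how many values of each kind a column contains."""
    return Counter(kind_of(v) for v in values)


# Hypothetical messy column mixing several formats.
column = ["a@b.co", "42", "3.5", "2021-03-04", "hello world", "", None]
summary = profile(column)
```

A profile like this does not clean anything; it only shows how mixed a column is, which helps decide whether cleaning is worth the effort.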

It’s practically impossible for people to find the patterns in big data on their own, so we use software tools to hunt through the data for stories about the key factors underpinning our businesses. Software and data teams slice data, clean it up, merge different databases, and eliminate duplicated records. Then we try to tease out the stories.  
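The merge-and-deduplicate step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: records are plain dictionaries, the email address is assumed to be the shared key, and the `crm` and `billing` sources and the `merge_records` helper are hypothetical.

```python
def normalize_email(raw):
    """Lowercase and trim an email so trivially different spellings match."""
    return raw.strip().lower()


def merge_records(*sources):
    """Merge lists of dict records keyed on normalized email.

    Later sources fill in fields that earlier ones left empty; they do not
    overwrite values that are already present.
    """
    merged = {}
    for source in sources:
        for record in source:
            key = normalize_email(record["email"])
            target = merged.setdefault(key, {"email": key})
            for field, value in record.items():
                if field == "email":
                    continue
                if value not in (None, "") and target.get(field) in (None, ""):
                    target[field] = value
    return list(merged.values())


# Hypothetical sample data: the same customers appear in both systems.
crm = [
    {"email": "Ana@Example.com ", "name": "Ana", "plan": ""},
    {"email": "bo@example.com", "name": "Bo", "plan": "pro"},
]
billing = [
    {"email": "ana@example.com", "name": "", "plan": "basic"},
    {"email": "bo@example.com", "name": "Bo", "plan": "pro"},
]

customers = merge_records(crm, billing)
```

Real merges are rarely this tidy, since keys are often missing or inconsistent, which is part of why this stage consumes so much of a data team's time.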

Another factor always inherent in messy data is human bias. It’s no coincidence that data almost always seems to support the position of the person presenting the analysis. Even with the best intentions, it’s challenging to eliminate bias – you see the patterns you are looking for and miss those you are not. That’s why scientific papers are peer-reviewed and experiments retested by third parties – to ensure that results are valid and not rosily interpreted by the authors. 

Machine Learning Is Changing Data Analytics

Machine learning (ML) replaces the Excel-driven, human-interpreted analytics exercise. With ML, businesses can use computer programs to identify the underlying patterns in their data. Emerging ML-driven analytics solutions make it easy to surface the variables that contribute to key business outcomes. Machine learning identifies the patterns in your data without an analyst’s preconceptions, but it’s important to remember that the data itself can contain biases, such as gathering bias and time-delay bias, and that it will generally reflect any underlying biases in the sample set.

But What About Messy Data?

Most businesses approach a big data machine learning project in the following manner:

  • Clean your data
  • Train a model
  • Undertake a business-wide data hygiene exercise to clean all new data
  • Run new data against your model to predict key outcomes


And that’s great. There is no doubt that the more work you do to clean data, the better the model can identify patterns. However, this large-scope effort also prevents many ML projects from ever getting started, stopping them long before they can be deployed and drive business value. The reality is that it’s incredibly hard to generate clean data on an ongoing basis. Increasingly, ML providers are building solutions that are functional using messy data. It’s now possible to plug your data into a model, surface which variables (and variable combinations) drive key business results, and then view the patterns – which examples are most reflective of success and which most match failure – without having to go through expensive, time-consuming data cleansing first. 
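As a rough illustration of surfacing which variables track an outcome without cleaning the data first, the sketch below treats missing values as their own bucket rather than repairing them, then ranks fields by how much the success rate differs across buckets. This is a simple heuristic chosen for illustration, not how any particular ML provider works; the field names, the `rank_fields` helper, and the sample rows are assumptions.

```python
from collections import defaultdict

MISSING = "<missing>"


def bucket(value):
    """Map a raw value to a coarse bucket instead of fixing it."""
    if value is None or str(value).strip() == "":
        return MISSING
    return str(value).strip().lower()


def rank_fields(rows, outcome, min_support=2):
    """Rank fields by the spread in success rate across their buckets."""
    scores = {}
    fields = {f for row in rows for f in row if f != outcome}
    for field in fields:
        counts = defaultdict(lambda: [0, 0])  # bucket -> [successes, total]
        for row in rows:
            b = bucket(row.get(field))
            counts[b][0] += 1 if row[outcome] else 0
            counts[b][1] += 1
        # Ignore buckets with too few rows to say anything about.
        rates = [s / n for s, n in counts.values() if n >= min_support]
        scores[field] = max(rates) - min(rates) if len(rates) > 1 else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical uncleaned rows: inconsistent casing, blanks, and None values.
rows = [
    {"plan": "Pro", "region": "EU", "renewed": True},
    {"plan": "pro ", "region": "US", "renewed": True},
    {"plan": "PRO", "region": None, "renewed": True},
    {"plan": "basic", "region": "eu", "renewed": False},
    {"plan": "Basic", "region": "us", "renewed": False},
    {"plan": "", "region": "US", "renewed": True},
    {"plan": None, "region": "EU", "renewed": False},
    {"plan": "basic", "region": "", "renewed": False},
]

ranking = rank_fields(rows, outcome="renewed")
```

In this made-up sample, `plan` ranks above `region` even though a quarter of its values are blank, which is the point: the signal can survive the mess.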

Take the 80/20 Approach to ML Adoption 

Even with messy data, it’s often easy to find the signal in the noise, and if the model can find the signal, it can function effectively in your business. The payoff to be captured is huge. As McKinsey notes, it has the “potential to create between $3.5 trillion and $5.8 trillion in value annually across nine business functions in 19 industries.”

Businesses would be smart to take the 80/20 approach to ML adoption. You do not need to execute complicated data hygiene programs to capture most of machine learning’s business value. By using your data as is, rather than waiting until your data is perfect, you’ll be able to reap the many benefits of machine learning while continuing to make improvements to your data and processes.
