Are Black Swans in Your Data Lakes Obstructing AI Progress?


Click to learn more about author Kim Kaluba.

It seems every company is anxious to jump on the boat and launch an AI program to improve organizational performance. But do they know what’s swimming in the data lake alongside their boat? According to a recent Infosys survey, half of organizations reported that they will not be able to deploy AI because of data challenges, and 37% of respondents cited data integrity as a barrier to getting AI projects off the ground. According to the Digital Banking Report, companies also name Data Quality and reliability (66%) and data availability (61%) as barriers to the successful deployment of AI.



What are the root causes of these data challenges? What role do data lakes have in creating a stream to AI development? And what the heck does a black swan have to do with anything?

The Black Swan of Data and AI

In the world of Data Science, a significant and unexpected outlier in your data analysis is often called a “black swan” or black swan event: an incident that occurs randomly and unexpectedly, with widespread ramifications. Back in 16th-century London, the notion of ‘seeing a black swan’ was used as a metaphor for improbability, before it was discovered that black swans are native to Australia. Today, a black swan in data analytics can lead to broad generalizations based on incomplete, limited, or bad data.
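One way to make the idea of a “significant and unexpected outlier” concrete is to score each value against the rest of the series. The sketch below is only an illustration, not a method the article prescribes: it assumes a modified z-score based on the median absolute deviation (MAD) with the commonly used cutoff of 3.5, and the `daily_orders` figures are made up.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    The score uses the median absolute deviation (MAD), which is less
    distorted by extreme values than a mean/standard-deviation test.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical daily order counts with one extreme reading.
daily_orders = [102, 98, 105, 99, 101, 97, 103, 100, 940]
print(flag_outliers(daily_orders))  # prints [940]
```

A flagged value is a prompt to investigate, not proof of an error: it may be a genuine rare event or simply a bad record that slipped into the lake.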

So, what causes a black swan in your data, ultimately impeding AI implementation? At the core, the problem is often the lack of a proper Data Strategy and Data Governance plan, coupled with data haphazardly dumped into data lakes. Data lakes are rarely well managed or supported, and the information inside them can be messy, low quality, unreliable, duplicated, and immense in volume.

An Adequate Amount of High-Quality Data in the Lake

According to an Experian study, bad data costs the U.S. over $3 trillion a year, and poor data quality is a main reason AI deployments fail or never deliver their expected value. Bad data can also put a black swan in your lake, stalling or halting progress. However, addressing the quality of the data used for AI is easier said than done. Because of the complexity of the data landscape across organizations, and the volume and velocity of the data, managing and maintaining good data is not easy.

Data availability and integration struggles can also lead to a black swan. According to Forrester, data integration is the number one challenge organizations face today. If you think about it, data is entering the organization 24 hours a day, 7 days a week. This information enters different systems, which are managed by different departments, have different functional requirements, and are updated at different intervals. The data is moved and copied across the organization, and changes are made that are not shared back with the systems of record. If the data in your lake entering the AI process is inaccurate, disjointed, inconsistent, and untrusted, then the decisions made by AI will also be inaccurate, inconsistent, and untrusted by the data community.
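The symptoms described above, such as copied records and incomplete fields, can be measured before data reaches an AI process. The following is a minimal sketch under stated assumptions: records are plain Python dictionaries, and the field names (`customer_id`, `email`, `region`) and the sample rows are hypothetical.

```python
from collections import Counter

def profile_records(records, key_field, required_fields):
    """Count duplicate keys and missing required values in a list of dicts."""
    key_counts = Counter(r.get(key_field) for r in records)
    duplicate_keys = sorted(
        k for k, n in key_counts.items() if k is not None and n > 1
    )
    # A value counts as missing when it is absent, None, or an empty string.
    missing = {
        field: sum(1 for r in records if r.get(field) in (None, ""))
        for field in required_fields
    }
    return {"rows": len(records), "duplicate_keys": duplicate_keys, "missing": missing}

# Hypothetical customer rows copied from two source systems.
customers = [
    {"customer_id": "C1", "email": "a@example.com", "region": "EU"},
    {"customer_id": "C2", "email": "", "region": "US"},
    {"customer_id": "C1", "email": "a@example.com", "region": None},
]
report = profile_records(customers, "customer_id", ["email", "region"])
print(report)
# prints {'rows': 3, 'duplicate_keys': ['C1'], 'missing': {'email': 1, 'region': 1}}
```

Tracking counts like these over time gives a rough signal of whether the lake is getting cleaner or messier.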

Keeping Black Swans Out of the Data Lake

So, how can a company overcome these challenges and black swans caused by poor Data Quality and data unavailability? How can you know what is the right data to use to achieve the promises that AI offers? How can you make sure the information you are using is timely, relevant and unbiased?  

Having a Data Strategy will help prevent black swans and set you up for successful AI implementation. A Data Strategy should be designed to improve the way the organization acquires, stores, manages, shares, and uses data for AI. It should be supported by a strong Data Governance program that establishes, manages, and communicates data policies, definitions, and standards for effective data usage for AI. This ensures that once data is decoupled from its source environments, the rules and details of the data are known and respected wherever AI uses it. It is important to point out that the goal of the Data Strategy is not to limit access to data, but to make the data easier for the data community to reach and to ensure it is the best data for the AI process being carried out.

Once the Data Strategy program has been put in place, proper Data Management will enable the strategy. Data Management, whether in a warehouse or a lake, provides the technical foundation for the Data Strategy and Data Governance programs by ensuring that the data contributing to AI adheres to the data standards and is reliable for AI decision-making. This ensures visibility and transparency into the AI process, building trust with the data users and decision makers, and keeping those black swans at bay.
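To illustrate what “adhering to the data standards” can look like in practice, the sketch below checks records against a small set of declared rules and separates compliant rows from rejected ones. It is an assumption-laden example rather than a governance framework: the `STANDARDS` rules, field names, and sample rows are all hypothetical.

```python
# Hypothetical standards a governance program might publish for one dataset.
STANDARDS = {
    "country": lambda v: v in {"US", "CA", "GB"},
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
}

def violations(record, standards=STANDARDS):
    """Return the names of the fields in `record` that break a standard."""
    return [
        field for field, is_valid in standards.items()
        if not is_valid(record.get(field))
    ]

def split_by_compliance(records, standards=STANDARDS):
    """Separate records that meet every standard from those that do not."""
    accepted, rejected = [], []
    for record in records:
        broken = violations(record, standards)
        if broken:
            rejected.append((record, broken))  # keep the reasons for review
        else:
            accepted.append(record)
    return accepted, rejected

rows = [
    {"country": "US", "age": 34},
    {"country": "usa", "age": 34},   # non-standard country code
    {"country": "CA", "age": 340},   # implausible age
]
accepted, rejected = split_by_compliance(rows)
for record, broken in rejected:
    print(record, "fails:", broken)
```

Keeping the reason alongside each rejected record supports the visibility and transparency described above, because data users can see why a row was held back.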
