A Step Ahead: Data Quality – The Foundation of Reliable and Responsible AI

Artificial intelligence (AI) is advancing at an unprecedented pace and is rapidly integrating into nearly every aspect of modern life. Adoption of AI increased by 23% between 2023 and 2024 (TechJury), reflecting its growing influence across sectors such as medicine, banking, transportation, retail, and personal productivity.

In healthcare, AI is used to pre-analyze MRI scans to identify potential anomalies prior to radiologist review, while in banking it monitors transaction patterns to detect fraudulent activity. Transportation systems increasingly rely on AI for assisted driving and autonomous vehicle technologies. Retail platforms use AI to generate personalized product recommendations, and individuals commonly use AI-powered tools such as Siri, Amazon Alexa, and ChatGPT to enhance productivity and communication. As enthusiasm for AI’s potential to improve efficiency, accuracy, and innovation continues to grow, a critical reality is often overlooked: AI systems are only as reliable as the data on which they are trained. Understanding the relationship between AI performance and data quality is therefore essential as organizations increasingly rely on AI to support high-stakes decisions.

AI adoption and use are expected to keep growing rapidly for the foreseeable future. It is imperative to remember that the foundation of any AI application is data. As AI becomes more widespread, data quality that is fit for use is not optional; it is required to ensure accurate and consistent results. The excitement surrounding AI’s possibilities has given rise to a common misconception that AI can solve data quality issues. However, the relationship is circular: AI relies on quality data to make useful data quality recommendations.

AI uses data to train its models. Poor-quality data leads to incorrect predictions, reinforced stereotypes, and overstated confidence in results. The speed and scale of AI mean that a data issue is quickly amplified.

Risks of Poor Data Quality in AI Systems

Poor data quality can lead to systemic discrimination, hallucinations, incorrect predictions at scale, and loss of stakeholder trust. Three of the most common and consequential issues in AI systems are bias, hallucinations, and incorrect predictions.

Bias occurs when AI systems produce skewed or unfair outcomes due to unbalanced or non-representative training data or flawed data collection processes. One common form, measurement bias, arises when certain groups or categories are underrepresented or inaccurately captured in the training dataset. For example, commercial facial analysis systems that classified gender were found to misclassify darker-skinned women at much higher rates than other groups, a disparity attributed to datasets composed primarily of lighter-skinned subjects (Buolamwini & Gebru). Such biases can reinforce existing inequalities and undermine the ethical use of AI.
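
One way to surface this kind of bias is to report how each group is represented in the data and how often the model errs for each group, rather than relying on a single aggregate figure. The following is a minimal sketch, assuming hypothetical evaluation records with `group`, `label`, and `predicted` fields; the field names and sample values are illustrative and are not taken from the cited study.

```python
from collections import Counter, defaultdict

def group_shares(records):
    """Return {group: share of all records belonging to that group}."""
    counts = Counter(rec["group"] for rec in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def error_rate_by_group(records):
    """Return {group: fraction of records where predicted != label}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        totals[rec["group"]] += 1
        if rec["predicted"] != rec["label"]:
            errors[rec["group"]] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Hypothetical evaluation records (illustrative values only).
records = [
    {"group": "A", "label": 1, "predicted": 1},
    {"group": "A", "label": 0, "predicted": 0},
    {"group": "A", "label": 1, "predicted": 1},
    {"group": "A", "label": 0, "predicted": 0},
    {"group": "B", "label": 1, "predicted": 0},
    {"group": "B", "label": 0, "predicted": 0},
]

shares = group_shares(records)
rates = error_rate_by_group(records)
for group in sorted(rates):
    print(f"group {group}: share {shares[group]:.2f}, error rate {rates[group]:.2f}")
```

A group with a small share and a high error rate is a signal to review how the data was collected before trusting the aggregate accuracy.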

Hallucinations refer to instances in which AI models generate outputs that are fabricated or factually incorrect but presented with high confidence. These errors often occur when training datasets are incomplete, insufficiently diverse, or lack contextual depth, causing the model to infer or “fill in” missing information. Hallucinations are particularly problematic in applications where accuracy and trust are critical, such as healthcare, finance, or decision-support systems. AI legal research tools are one example. In a Stanford study, roughly 17% of queries produced hallucinated responses, often inventing case law or misrepresenting existing statutes (Stanford).

Incorrect predictions, including false positives and false negatives, frequently result from poor data labeling, inconsistent classification, or ambiguous training data. Inaccurate predictions can have serious consequences; for example, a financial institution may approve a high-risk loan applicant if an AI model incorrectly labels the individual as low risk. When such errors occur at scale, they can lead to financial losses, regulatory exposure, and diminished stakeholder trust. The Consumer Financial Protection Bureau proceeding against Hello Digit LLC illustrates such a loss. The Bureau concluded that Digit was aware of errors in its algorithm that caused consumers to incur overdrafts and transaction fees. The error resulted in both financial damages and a loss of trust (CFPB).
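
False positives and false negatives can be counted directly by comparing predictions with known outcomes. The sketch below uses the loan scenario above and treats `high_risk` as the positive class, so a false negative is a high-risk applicant labeled low risk. The labels and sample data are hypothetical.

```python
def confusion_counts(actual, predicted, positive="high_risk"):
    """Count true/false positives and negatives for one positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def error_rates(counts):
    """Return (false positive rate, false negative rate); 0.0 if undefined."""
    negatives = counts["fp"] + counts["tn"]
    positives = counts["fn"] + counts["tp"]
    fpr = counts["fp"] / negatives if negatives else 0.0
    fnr = counts["fn"] / positives if positives else 0.0
    return fpr, fnr

# Hypothetical outcomes and model predictions (illustrative values only).
actual = ["high_risk", "high_risk", "low_risk", "low_risk", "low_risk", "high_risk"]
predicted = ["high_risk", "low_risk", "low_risk", "high_risk", "low_risk", "high_risk"]

counts = confusion_counts(actual, predicted)
fpr, fnr = error_rates(counts)
print(counts)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

In this scenario the false negative rate is the costlier figure to watch, because each false negative corresponds to a high-risk loan that could be approved.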

Data Quality Dimensions in AI Systems

The dimensions used in traditional data quality frameworks offer a way to categorize these issues and the techniques that mitigate them.

Completeness measures whether data or files are missing. With unstructured data, this could mean missing or incomplete files or metadata. Incompleteness can result in errors: an agent may try to fill in the blanks, leading to hallucinations, or may produce biased results if components of the whole are not represented. Anomaly detection and inventory analysis can help identify gaps in unstructured data sets.
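
A simple inventory check along these lines can be sketched as follows. It assumes a hypothetical set of required metadata fields and treats `None`, absent keys, and blank strings as missing; the field names and records are illustrative only.

```python
# Hypothetical required metadata fields for each document record.
REQUIRED_FIELDS = ("doc_id", "title", "source", "created_at")

def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())

def completeness_report(records, required=REQUIRED_FIELDS):
    """Return (per-field completeness ratio, ids of incomplete records)."""
    if not records:
        return {}, []
    total = len(records)
    by_field = {
        field: sum(not is_missing(rec.get(field)) for rec in records) / total
        for field in required
    }
    incomplete = [
        rec.get("doc_id")
        for rec in records
        if any(is_missing(rec.get(field)) for field in required)
    ]
    return by_field, incomplete

# Hypothetical metadata records (illustrative values only).
records = [
    {"doc_id": "d1", "title": "Intake form", "source": "scanner", "created_at": "2025-01-10"},
    {"doc_id": "d2", "title": "", "source": "email", "created_at": "2025-02-03"},
    {"doc_id": "d3", "title": "Lab report", "source": None},
    {"doc_id": "d4", "title": "Invoice", "source": "upload", "created_at": "2025-03-21"},
]

by_field, incomplete = completeness_report(records)
print(by_field)
print("incomplete records:", incomplete)
```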

Timeliness indicates whether the data and files are up to date. Within an AI model, using out-of-date data can reduce relevance and trust and can amplify biases. Data shift occurs when data changes over time, leaving a model working from out-of-date data; AI algorithms should account for it to mitigate errors and bias. Ensuring that metadata includes updated timestamps and is available to agents can help prevent such bias and errors.
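
When metadata carries an update timestamp, a staleness check becomes straightforward. The sketch below assumes a hypothetical `updated_at` field in ISO 8601 format with an explicit UTC offset and a 90-day freshness window; both are assumptions chosen for illustration, not recommendations.

```python
from datetime import datetime, timedelta, timezone

def find_stale(records, max_age, now=None):
    """Return ids of records whose updated_at is older than max_age."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for rec in records:
        updated = datetime.fromisoformat(rec["updated_at"])
        if now - updated > max_age:
            stale.append(rec["doc_id"])
    return stale

# Hypothetical metadata records (illustrative values only).
records = [
    {"doc_id": "d1", "updated_at": "2025-06-01T00:00:00+00:00"},
    {"doc_id": "d2", "updated_at": "2024-01-15T00:00:00+00:00"},
    {"doc_id": "d3", "updated_at": "2025-05-20T00:00:00+00:00"},
]

# A fixed reference time keeps the example reproducible.
reference = datetime(2025, 6, 30, tzinfo=timezone.utc)
stale = find_stale(records, max_age=timedelta(days=90), now=reference)
print("stale records:", stale)
```

Passing the reference time explicitly, rather than always reading the clock, makes the check easy to test and to rerun against historical snapshots.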

Uniqueness indicates whether data is duplicated within the models. Duplicate data gives some information disproportionate weight and produces incorrect results. In addition to master data management mechanisms, AI agents can use natural language processing (NLP) to help detect and reduce duplication.
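
As a lightweight stand-in for the NLP techniques mentioned above, the sketch below normalizes text and flags near-duplicates by token overlap (Jaccard similarity). This is a deliberately simple substitute, not a full NLP approach; the 0.6 threshold and the sample names are assumptions chosen for illustration.

```python
import re
from itertools import combinations

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return " ".join(text.split())

def jaccard(a, b):
    """Token-set overlap of two strings after normalization (0.0 to 1.0)."""
    tokens_a = set(normalize(a).split())
    tokens_b = set(normalize(b).split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def near_duplicates(values, threshold=0.6):
    """Return pairs of values whose similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(values, 2)
        if jaccard(a, b) >= threshold
    ]

# Hypothetical organization names (illustrative values only).
names = ["Acme Corporation", "ACME  Corporation.", "Acme Corporation Ltd", "Globex Inc"]

pairs = near_duplicates(names)
for a, b in pairs:
    print(f"possible duplicate: {a!r} ~ {b!r}")
```

Flagged pairs are candidates for review rather than automatic merges, since token overlap alone cannot tell whether two similar names refer to the same entity.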

Integrity and validity issues within AI often indicate schema drift, an unexpected alteration of schema structures (DQOps). Mitigation includes governance and technical methods. Schema registries can help monitor and manage changes: “Using a schema registry helps ensure that every event or record is in the right format” (Snowplow Analytics). Advanced AI-driven pipelines can also implement intelligent schema evolution so that they adapt automatically (Xplenty).
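
The core idea of checking records against an agreed schema can be sketched without any particular registry product. The example below compares an incoming record with a hypothetical expected schema and reports missing fields, unexpected fields, and type changes. It is an illustration of the check only and does not represent how any specific schema registry works.

```python
# Hypothetical expected schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"event_id": str, "user_id": int, "amount": float}

def detect_schema_drift(record, expected=EXPECTED_SCHEMA):
    """Compare one record with the expected schema and describe differences."""
    missing = sorted(set(expected) - set(record))
    unexpected = sorted(set(record) - set(expected))
    type_changes = {
        field: (expected[field].__name__, type(record[field]).__name__)
        for field in expected
        if field in record and not isinstance(record[field], expected[field])
    }
    return {"missing": missing, "unexpected": unexpected, "type_changes": type_changes}

# Hypothetical incoming records (illustrative values only).
drifted = {"event_id": "e-1", "user_id": "42", "amount": 19.99, "currency": "USD"}
truncated = {"event_id": "e-2", "user_id": 7}

report_drifted = detect_schema_drift(drifted)
report_truncated = detect_schema_drift(truncated)
print(report_drifted)
print(report_truncated)
```

A pipeline could reject or quarantine records with a non-empty report, or route them for review before the schema is formally updated.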

As AI continues to scale in speed, complexity, and influence, maintaining strong data quality practices is no longer optional; it is foundational to responsible and effective AI deployment. Poor data quality can lead to real-world financial and legal consequences. By applying established data quality dimensions such as completeness, timeliness, uniqueness, integrity, and validity, and by supporting them with advanced monitoring pipelines and governance mechanisms, organizations can mitigate AI-related risks. Organizations that apply AI to data without first addressing its quality will increasingly experience how AI magnifies data errors. This makes robust data quality management a prerequisite for trustworthy and ethical AI systems.

References

Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research, 81, 1–15. https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf

Chapman University. (n.d.). Bias in AI. https://www.chapman.edu/

Consalvo, R. (2023). Three ways to rethink AI hallucinations. H1. https://h1.co/blog/the-power-of-data-three-ways-to-rethink-ai-hallucinations/

Consumer Financial Protection Bureau. (2022). 2022-CFPB-0007-Hello Digit LLC – Consent Order. Retrieved from https://www.consumerfinance.gov/

DQOps. (n.d.). What are schema changes? Definition, examples, best practices. https://www.dqops.com/blog/what-are-schema-changes/

Grother, P., Ngan, M., & Hanaoka, K. (2019). Face recognition vendor test (FRVT) Part 3: Demographic effects (NIST IR 8280). National Institute of Standards and Technology. https://nvlpubs.nist.gov/nistpubs/ir/2019/NIST.IR.8280.pdf

Hiniduma, K., Byna, S., Bez, J. L., & Madduri, R. (2024). AI Data Readiness Inspector (AIDRIN) for Quantitative Assessment of Data Readiness for AI. In Proceedings of the 36th International Conference on Scientific and Statistical Database Management (ISBN 9798400710209). SSDBM: Scientific and Statistical Database Management. https://doi.org/10.1145/3676288.3676296

IoT For All. (n.d.). How poor data annotation leads to AI model failures. https://www.iotforall.com/how-poor-data-annotation-leads-to-ai-model-failures

Monte Carlo & O’Reilly. (n.d.). Ensuring data + AI reliability through observability [Report]. Retrieved January 7, 2026, from https://info.montecarlodata.com/get-resources/oreilly-report-ensuring-data-and-ai-reliability-through-observability

Snowplow Analytics. (n.d.). Data pipeline architecture for AI: Why traditional approaches fall short. https://snowplow.io/blog/data-pipeline-architecture-for-ai/

Stanford Institute for Human-Centered Artificial Intelligence. (2023). Hallucinating law: Legal mistakes in large language models are pervasive. https://hai.stanford.edu/news/hallucinating-law-legal-mistakes-large-language-models-are-pervasive

TechJury. (n.d.). 88+ artificial intelligence statistics for 2026. https://techjury.net/blog/artificial-intelligence-statistics/

Xplenty. (n.d.). EAI in data integration: How AI transforms ETL & ELT pipelines by 2026. https://www.xplenty.com/blog/eai-in-data-integration/

Author: Anne C. Kling

Anne Kling is a principal information systems engineer. She is a seasoned data professional with experience across the data lifecycle. She has helped enable data and analytic transformation in both the healthcare and defense sectors.

Approved for Public Release; Distribution Unlimited. Public Release Case Number PR_25-00398-1. The author’s affiliation with The MITRE Corporation is provided for identification purposes only and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author. ©2026 THE MITRE CORPORATION. ALL RIGHTS RESERVED.

Approved for Public Release; Distribution Unlimited. Public Release Case Number 26-0071
