Navigating Data Readiness for Generative AI


Just about everyone in tech agrees that generative AI is poised to hit the business world like a tidal wave. There’s less agreement about when the transformation will occur, and how generative AI will be harnessed to boost profitability.

A survey of chief data officers conducted in the second half of 2023 by Amazon Web Services (AWS) found that 80% believe generative AI will fundamentally alter the way their organization does business. However, they aren’t yet ready to ditch their current data initiatives for the technology. Why? According to 46% of the survey respondents, what’s stopping them is poor data quality and difficulty finding optimal use cases.

The Harvard Business Review points out that generative AI won’t “generate” value for companies until they’re able to customize the language models and image models of AI vendors so the models can use the organization’s own proprietary data. That won’t happen until organizations have updated their internal processes to ensure their data is ready and fit for use by generative AI systems. 

Yet the AWS survey reports that 57% of respondents have not yet changed their company’s data strategies to prepare for generative AI, even though 93% agree that doing so is crucial to realizing value from the technology. Whatever the reasons for the slow pivot away from traditional analytics and machine learning applications toward generative AI, there’s no denying that the sooner firms get ready for the technology, the quicker they’ll realize its benefits.

Getting Data Ready for Use in Large Language Models

Generative AI relies on unstructured data, which refers to any large collections of files, or datasets, that aren’t stored in a structured database. This includes images, video, audio, and sensor data, but also text data and other standard forms. The process of curating unstructured data for use in generative AI language models remains a human-powered effort:

  • The Harvard Business Review reports that the brokerage Morgan Stanley hired a group of about 20 knowledge workers in the Philippines to score documents used by its large language models (LLMs).

Fine-tuning LLMs for specific purposes improves the models’ accuracy and makes their interactions with users more timely than those of LLMs that haven’t been trained for a specialty. Several steps are involved in training more accurate models:

  • Start with an LLM that has been trained with a broad range of text data on different topics and in various styles.
  • Select the target domain and define its scope and tasks, such as analysis of investigation documents in legal and medical cases, or answering natural-language questions on a specific topic.
  • Ensure that the dataset represents the domain in terms of language, context, and identification of relevant content in historic data sources.
  • Clean the data to remove irrelevant and corrupt data, anonymize it, and tokenize text to break it into meaningful words and phrases.
  • Concentrate training on the chosen domain, adjusting the model’s weights based on test results. Guard against overfitting by evaluating the model on new datasets rather than trusting the results you get when using the training data.
  • Evaluate, test, and iterate in a continual process leading to deployment for general use by the intended audience.
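The anonymizing part of the cleaning step can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions: it treats e-mail addresses and US-style phone numbers as the only identifiers that matter, and the function name `anonymize` and the placeholder tags are hypothetical. Real projects would need much broader coverage, typically from dedicated PII-detection tooling.

```python
import re

# Assumed patterns for illustration only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def anonymize(text: str) -> str:
    """Replace e-mail addresses and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

record = "Contact Jane at jane.doe@example.com or 555-123-4567."
print(anonymize(record))
```

Note that the personal name in the sample record is left untouched: pattern matching alone does not catch every identifier, which is one reason this step still tends to need human review.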

Domain-specific LLMs are more effective in highly regulated fields requiring specialized knowledge and a greater demand for accuracy, such as legal, medical, and financial services. Potential applications for generative AI in legal cases include compliance and regulatory monitoring, contract analysis and negotiation, drafting and reviewing documents, due diligence in corporate transactions, intellectual property management, and legal research.

Among the uses of generative AI in healthcare are gathering routine patient information, enhancing diagnostic procedures, and post-treatment monitoring to take advantage of advances in wearable diagnostic devices. However, the models are not likely to be used in medical treatments because of concerns about accountability and liability, as well as the risk to patients’ trust in healthcare providers and their need for human caregivers.

The value of generative AI for the financial services industry lies in the technology’s ability to help businesses mitigate risk and improve efficiency. Use cases for domain-specific LLMs in financial services are fraud detection and prevention, risk assessment and credit scoring, and personalizing customer interactions. Generative AI helps investment managers with asset allocation decisions and market and trend analysis.

Other areas expected to benefit from domain-specific models for generative AI are sales, in the form of AI-powered chatbots that know the shopper’s history and preferences, and software development, where code generation includes code autocomplete, code comments and suggestions, and code reviews.

Data Ingestion for Large Language Models

LLMs require huge amounts of data, which is collected using a variety of methods, such as web scraping for text data, and then put through preprocessing and feature engineering to prepare the raw data for use in training machine learning models. Data ingestion for LLMs must accommodate a variety of data sources and data types, all of which have to be conditioned before being ingested to ensure they are digestible by the models.

Each of the four stages of data ingestion – collection, preprocessing, feature engineering, and storage – presents challenges for teams developing LLMs, from ensuring the data’s relevance at the point of collection to storing it in a format that is easy for the models to access.

Data collection: Before targeting the data to collect, developers have to determine what types of data they need to train the LLMs to achieve their intended purpose. For example, a model being trained for sentiment analysis requires data from reviews, comments, and social media posts. 

Once the model’s data requirements have been defined, developers use web scraping to extract data from websites automatically. Web-scraping tools for LLMs include the BeautifulSoup and Requests libraries for Python, the Scrapy framework, Selenium, the lxml Python library, and LangChain.
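The extraction half of web scraping can be illustrated without any network access. The sketch below uses only Python’s standard-library `html.parser` so that it runs without third-party packages; libraries such as BeautifulSoup or lxml offer richer ways to do the same job. The class name `ParagraphExtractor`, the helper `extract_paragraphs`, and the sample page are all hypothetical, and the HTML string stands in for a page a scraper would already have downloaded.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements, skipping all other markup."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace and drop paragraphs that end up empty.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)

def extract_paragraphs(html: str) -> list[str]:
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

# Stand-in for a page body that a scraper would have downloaded.
SAMPLE_HTML = """
<html><body>
  <nav>Home | About</nav>
  <p>Great product, <b>fast</b> shipping.</p>
  <p>   </p>
  <p>Stopped working after a week.</p>
</body></html>
"""

reviews = extract_paragraphs(SAMPLE_HTML)
print(reviews)
```

Navigation text and the empty paragraph are discarded, which leaves only the two review sentences a sentiment-analysis dataset would want.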

Preprocessing: This step prepares the data you’ve collected for use in training the model. It entails three operations: cleaning, normalization, and tokenization.

  • Data cleaning identifies data that’s inaccurate, incomplete, or irrelevant, and then either corrects the data or removes it. In addition to eliminating duplicate data, the process fills in missing information, updates incorrect values, and excludes outliers.
  • Normalization converts the data to a standard format to facilitate comparison and analysis by the model. Normalizing text data, for example by replacing uppercase letters with lowercase and removing punctuation, reduces data dimensionality. This improves the model’s ability to work with the data.
  • Tokenization atomizes the text into the words and phrases that serve as the LLM’s vocabulary. This promotes natural language processing (NLP) applications by identifying meaningful vocabulary elements at the word level, character level, or sub-word level.
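The three preprocessing operations can be sketched as three small functions. This is a simplified illustration: the function names and sample records are hypothetical, cleaning is reduced to dropping blanks and exact duplicates, and tokenization is done at the word level by splitting on whitespace. Production LLMs generally use sub-word tokenizers rather than a whitespace split.

```python
import string

def clean(records):
    """Drop empty entries and exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for record in records:
        text = record.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)

def tokenize(text):
    """Word-level tokenization on whitespace."""
    return text.split()

raw = ["Great phone!", "  ", "Great phone!", "Battery died, twice."]
tokens = [tokenize(normalize(r)) for r in clean(raw)]
print(tokens)  # [['great', 'phone'], ['battery', 'died', 'twice']]
```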

Feature engineering: Once the data has been preprocessed, it is used to create features, which are numerical representations of text that the model understands. One form of feature engineering is word embeddings, which create a dense vector of real numbers that captures the meaning of the words that the numbers represent.

The three stages of feature engineering are split, augment, and encode:

  • Split separates data into a training set that’s used for teaching the LLM, and validation and testing sets that are applied to evaluate the model’s performance.
  • Augment adds new examples and data, and transforms existing data, to make it more diverse and increase the amount of data available to the model.
  • Encode embeds the data into vectors, which represent the data in a form the model understands, and tokens, which are the basic units of data processed by the LLM.
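A rough sketch of the split and encode stages follows; augmentation is omitted because it depends heavily on the data type. The 80/10/10 proportions, the `<unk>` placeholder for unseen tokens, and all function names are assumptions made for illustration. Encoding here means mapping tokens to integer IDs, the form in which a model’s embedding layer usually receives them.

```python
import random

def split(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of the examples and cut it into train/validation/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def build_vocab(token_lists):
    """Assign an integer ID to every token; ID 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Map tokens to their IDs, falling back to the <unk> ID for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

# Ten tiny tokenized examples, purely for demonstration.
examples = [[f"word{i}", "review"] for i in range(10)]
train, val, test = split(examples)

# The vocabulary is built from the training set only, so tokens that appear
# solely in validation or test data are encoded as <unk>.
vocab = build_vocab(train)
print(len(train), len(val), len(test))
print(encode(["review", "never-seen"], vocab))
```

Building the vocabulary from the training set alone keeps the validation and test sets honest measures of how the model handles text it has not seen.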

Storage: After the model’s features have been created, they need to be stored in a format that LLMs can easily access for training. This is typically a vector database, which can be queried with ultra-low latency and supports both structured and unstructured data.
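To make the idea of querying stored vectors concrete, here is a toy in-memory stand-in for a vector database. It is a sketch only: the class name, the document IDs, and the three-dimensional vectors are invented for illustration, and the search is a brute-force cosine-similarity scan. Real vector databases typically rely on approximate nearest-neighbor indexes to keep latency low at scale.

```python
import math

class InMemoryVectorStore:
    """Toy stand-in for a vector database: stores (id, vector) pairs and
    answers nearest-neighbor queries by brute-force cosine similarity."""

    def __init__(self):
        self._items = {}

    def add(self, item_id, vector):
        self._items[item_id] = list(vector)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, vector, top_k=1):
        """Return the IDs of the top_k stored vectors most similar to `vector`."""
        scored = [(self._cosine(vector, v), item_id)
                  for item_id, v in self._items.items()]
        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:top_k]]

store = InMemoryVectorStore()
store.add("refund-policy", [0.9, 0.1, 0.0])
store.add("shipping-times", [0.1, 0.8, 0.2])
store.add("warranty", [0.7, 0.0, 0.6])
print(store.query([1.0, 0.0, 0.1], top_k=2))  # ['refund-policy', 'warranty']
```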

Obstacles to Data Readiness

Data professionals focus on the technical and operational challenges of implementing generative AI systems, but organizations must also address the ethical and societal implications of the technology, as well as the regulatory and legal issues it presents. Potential solutions include adoption of ethical AI frameworks, federated learning and differential privacy, and committing to open-source projects and collaboration.

Technical challenges posed by data readiness for generative AI are data preparation, LLM size, retrieval-augmented generation, and breaking down data silos.

  • Insufficient data preparation: Anonymizing data is especially important for health and finance applications, but it also reduces an organization’s liability and helps it meet compliance requirements. Labeling the data is a form of annotation that identifies its context, sentiment, and other features for NLP and other uses. Normalizing applies to all forms of data, including image sizes and resolutions to enhance the model’s performance and reduce storage requirements.
  • Finding the right size LLMs: Smaller models help companies reduce their resource consumption while making the models more efficient, more accurate, and easier to deploy. Organizations may start with large models for proof of concept and then gradually reduce their size while testing to ensure the model’s results remain accurate. The need for a larger model can be reduced by writing detailed yet concise prompts, and by adding examples within the prompt (few-shot prompting) to provide guidance to the model.
  • Retrieval-augmented generation: This AI framework supplements the LLM’s internal representation of information with external sources of knowledge. This helps the model stay up to date and allows users to access its sources to ensure the accuracy of its results. A further goal of retrieval-augmented generation is to get models to say “I don’t know” when they cannot answer a question from the available sources.
  • Overcoming data silos: Data silos prevent data the model needs from being discovered, introduce incomplete datasets, and result in inaccurate reports while also driving up data-management costs. Preventing data silos entails identifying disconnected data, creating a data governance framework, promoting collaboration across teams, and establishing data ownership.
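The retrieval-augmented generation pattern described in the list above can be sketched without calling any model. In this hypothetical example, `retrieve` ranks documents by simple word overlap with the question, a stand-in for the embedding similarity a real system would use, and `build_prompt` either grounds the prompt in numbered sources or returns `None` so the calling code can answer “I don’t know” itself. The function names, the overlap threshold, and the sample documents are assumptions.

```python
def retrieve(question, documents, min_overlap=2):
    """Return documents sharing at least `min_overlap` words with the question,
    best match first. Word overlap is a stand-in for embedding similarity."""
    q_words = set(question.lower().split())
    scored = []
    for doc in documents:
        overlap = len(q_words & set(doc.lower().split()))
        if overlap >= min_overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

def build_prompt(question, documents):
    """Assemble a prompt grounded in retrieved sources, or return None so the
    caller can answer "I don't know" without calling the model."""
    sources = retrieve(question, documents)
    if not sources:
        return None
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(sources, start=1))
    return (
        "Answer using only the sources below and cite them by number. "
        "If they do not contain the answer, say \"I don't know\".\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )

docs = [
    "The warranty period for laptops is two years",
    "Shipping to Europe takes five business days",
]
grounded = build_prompt("How long is the warranty period for laptops", docs)
ungrounded = build_prompt("Who founded the company", docs)
print(grounded)
print(ungrounded)  # None: no source is relevant enough
```

Numbering the sources in the prompt is what lets users trace an answer back to the documents it came from.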

Several technologies are available to help companies fill the gaps in their data infrastructure to enhance their generative-AI efforts. These include APIs such as Anthropic and Mistral for text generation, as well as image generator APIs such as Amazon Titan, and voice generator APIs such as ElevenLabs. Other tools for generative AI infrastructure are webhooks, data cubes, data modeling, and data scoring.

Organizations continue to make progress as they transition their generative AI projects from proof of concept to full-scale deployment across the enterprise. In addition to preparing their staff and internal stakeholders for the impact of AI, they must also redouble their efforts to ensure their data is ready to serve as the fuel that drives the reinvention of business.