Running Generative AI in Production – What Issues Will You Find?

By Dom Couldwell

As your data projects evolve, you will face new challenges. For new technology like generative AI, some challenges may just be variations on those found in traditional IT projects, such as availability or distributed computing deployment problems. However, generative AI projects are also going through what Donald Rumsfeld once called the “unknown unknowns” phase, where we are discovering potential issues that we did not consider before.

To solve these issues, we can apply some of the best practices for IT and data management. However, we also must look for ways to overcome the problems faced by generative AI. This may also feed back into our data preparation approaches.

Where Data Management Issues Exist at Scale

We’ve all heard the old joke about code working on a developer laptop, and so that code is put into production. After all, the logic goes, if it works on one machine, it should work on larger instances. However, while testing and implementing proof of concept instances can yield a certain level of success, these results are not representative of operating at scale when supporting hundreds of thousands, if not millions, of requests. 

Supporting your infrastructure over multiple locations delivers resiliency and availability for your application or service. By planning your approach to continuity, you can survive failure in one or more of your components and still function effectively. As generative AI applications move into production, you will have to think about availability and resiliency for the data that this service uses.

Generative AI applications rely on vector data. As we scale up generative AI applications, we increase the amount of data stored as vectors as well. Data sources like product catalogs, customer records, and historical data sets can all be turned into vector data that can be used with generative AI through retrieval augmented generation (RAG). RAG helps improve the quality of the responses that AI systems provide by leveraging a company’s data alongside any other relevant data sets. 

For production applications, vector data will be in use all the time, so consider how to make it available and how to protect it against failure. If you run generative AI in your own data center, back up that data and use resiliency and high-availability technologies that spread it across multiple sites. For cloud deployments, running in multiple locations is simpler, as you can use different cloud regions to host and replicate copies of your vector data. Running across multiple sites has other benefits as well, including delivering responses from the site that is closest to the user. This reduces latency and makes it easier to meet data residency requirements when data has to stay in a specific region for compliance purposes.
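As a rough illustration of serving from the closest available site, the sketch below picks the lowest-latency healthy region and can restrict the choice to regions permitted for compliance. The region names, latency figures, and the pick_region helper are all hypothetical and not tied to any cloud provider.

```python
def pick_region(latencies_ms, healthy, allowed=None):
    """Return the lowest-latency healthy region, optionally limited to allowed regions."""
    candidates = [
        region for region in latencies_ms
        if region in healthy and (allowed is None or region in allowed)
    ]
    if not candidates:
        raise RuntimeError("no healthy region satisfies the residency constraint")
    return min(candidates, key=lambda region: latencies_ms[region])


# Hypothetical round-trip latencies (in ms) measured from one user's location.
latencies = {"eu-west": 18.0, "us-east": 95.0, "ap-south": 160.0}

# Normal operation: the closest region serves the request.
print(pick_region(latencies, healthy={"eu-west", "us-east", "ap-south"}))

# The closest region has failed: fall back to the next closest replica.
print(pick_region(latencies, healthy={"us-east", "ap-south"}))

# Compliance rule: this user's data may only be served from us-east.
print(pick_region(latencies, healthy={"eu-west", "us-east"}, allowed={"us-east"}))
```

A real deployment would usually leave this decision to a load balancer or to the database driver; the point here is only that replication gives you a fallback when one site is unavailable.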

Scaling up data involves services like databases that can operate across more than one physical data center or cloud location. Typically, these databases must cope with large volumes of transactions, or with more data than a single instance can hold. The approach depends on the database involved: Some databases shard data into smaller collections so the database can run logically across multiple instances or machines, while other databases run in clusters based on a primary server supported by multiple secondary machines.
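To make the sharding idea concrete, here is a minimal sketch of hash-based sharding, one common way to split data across instances. The key format and the shard_for and route helpers are assumptions for illustration; real databases each use their own partitioning scheme.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(keys, num_shards):
    """Group keys by the shard that would store them."""
    shards = {shard_id: [] for shard_id in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Hypothetical customer record keys spread over three shards.
shards = route([f"customer-{i}" for i in range(12)], num_shards=3)
for shard_id, members in shards.items():
    print(shard_id, members)
```

One drawback of plain modulo sharding is that changing the number of shards remaps most keys, which is part of why some distributed databases use other placement schemes.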

Alternatively, databases like Apache Cassandra spread data across multiple nodes and use a “shared-nothing” approach that scales up by adding more nodes. This enables it to better support availability requirements and more geographical distribution of data. Today, we are accustomed to distributed computing environments – that is, systems that are designed to run across multiple locations at the same time. This distributed approach helps keep data closer to users rather than having to make longer round-trip transactions to a central location.
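Scaling by adding nodes is commonly built on consistent hashing, where each node owns ranges of a token ring. The sketch below is a simplified illustration of that idea rather than Cassandra's actual partitioner or replication logic; the Ring class, node names, and virtual-node count are hypothetical.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    """Hash a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._tokens = []  # sorted list of (token, node) pairs
        self._vnodes = vnodes
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node claims several positions so load spreads evenly.
        for i in range(self._vnodes):
            bisect.insort(self._tokens, (_token(f"{node}#{i}"), node))

    def owner(self, key):
        # The owner is the first token at or after the key, wrapping around.
        idx = bisect.bisect_right(self._tokens, (_token(key), ""))
        return self._tokens[idx % len(self._tokens)][1]


ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"sku-{i}" for i in range(1000)]
before = {key: ring.owner(key) for key in keys}

ring.add_node("node-d")  # scale out by one node
after = {key: ring.owner(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys moved")
print("all moved keys now live on node-d:", all(after[key] == "node-d" for key in moved))
```

Only the keys that the new node takes over change owner; the rest stay where they are, which is what makes adding nodes relatively cheap.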

Generative AI and Data – Dealing with Round Trips and Latency

When a user carries out a search, that query is converted into a vector and then matched against the data within a vector database. As the vector database grows, latency can increase. Creating a vector from the original request and matching it against entries in the vector database can account for up to 40% of any retrieval augmented generation (RAG) transaction, so any performance improvement here can have a significant impact.
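The matching step can be sketched as a similarity search over stored vectors. This minimal example uses cosine similarity with a brute-force scan; the three-dimensional vectors and document IDs are made up for illustration, and a real system would get its vectors from an embedding model and typically use an approximate index rather than scanning every entry.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


# Hypothetical pre-computed document vectors.
index = {
    "returns-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "warranty-terms": [0.7, 0.2, 0.3],
}

# Hypothetical vector for the user's query.
query_vec = [0.8, 0.1, 0.1]
print(top_k(query_vec, index))  # ['returns-policy', 'warranty-terms']
```

Because this scan touches every stored vector, its cost grows in step with the size of the data set, which is one way that growth turns into latency.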

Data set growth and an increase in transactions can significantly impact your overall performance. A one percent reduction in performance is negligible when testing hundreds or thousands of interactions, but it becomes noticeable once you scale up to millions of transactions.
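A quick worked example shows how the same percentage plays out at different volumes. The 200 ms baseline latency and the request counts are hypothetical figures chosen for illustration, not measurements.

```python
def cumulative_delay_seconds(requests, latency_ms, slowdown_percent):
    """Total extra time added across all requests by a given percentage slowdown."""
    return requests * latency_ms * slowdown_percent / 100 / 1000


# 1% slower on a 200 ms request adds 2 ms per request.
print(cumulative_delay_seconds(1_000, 200, 1))       # 2.0 seconds in total
print(cumulative_delay_seconds(10_000_000, 200, 1))  # 20000.0 seconds, about 5.6 hours
```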

Vector data sets also grow as companies identify more sources that can be used to improve the accuracy of responses, and as they expand their own data over time. For example, a product catalog with a thousand different stock-keeping unit (SKU) codes evolves and changes over time. When customers ask a question about products, generative AI should reference the most up-to-date entries rather than older versions or versions that are no longer stocked. It is easier to update your vector database and use RAG to provide accurate data to your large language model (LLM) than to retrain your LLM every time there is an update. 
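Keeping retrieval data current can be as simple as overwriting or deleting entries by key. The sketch below uses a toy in-memory store as a stand-in for a vector database; the CatalogIndex class, the SKU values, and the vectors are hypothetical, and a real vector database would expose its own upsert and delete operations.

```python
class CatalogIndex:
    """Toy in-memory stand-in for a vector database, keyed by SKU."""

    def __init__(self):
        self._entries = {}  # sku -> {"vector": [...], "text": "..."}

    def upsert(self, sku, vector, text):
        # Writing to an existing SKU replaces the stale entry in place.
        self._entries[sku] = {"vector": vector, "text": text}

    def delete(self, sku):
        # Remove discontinued products so they can no longer be retrieved.
        self._entries.pop(sku, None)

    def texts(self):
        return sorted(entry["text"] for entry in self._entries.values())


index = CatalogIndex()
index.upsert("SKU-001", [0.2, 0.8], "Kettle, 1.5 L, price 30")
index.upsert("SKU-002", [0.7, 0.1], "Toaster, 2-slice, price 25")

# The kettle is revised and the toaster is discontinued.
index.upsert("SKU-001", [0.2, 0.9], "Kettle, 1.7 L, price 32")
index.delete("SKU-002")

print(index.texts())  # ['Kettle, 1.7 L, price 32']
```

The LLM itself is untouched by these updates; only the data it is given at query time changes.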

In addition to RAG, there are newer techniques that can improve your responses to users. These use the generative AI system to improve prompts and responses in the background so that the user can benefit from the overall work carried out. One example of this is RAG fusion, where the AI system creates additional versions of the initial prompt provided by the user and then evaluates the responses to those extra prompts alongside the original request. By combining these responses, the system should give the user a more useful answer that draws on all of the queries.
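One common way to combine the results from several query variants is reciprocal rank fusion (RRF). The article does not name a specific fusion method, so treat this as an assumption; the document IDs and rankings are invented, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists; documents ranked highly in many lists score best."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retrieval results for the original prompt and two generated variants.
rankings = [
    ["doc-a", "doc-b", "doc-c"],  # original prompt
    ["doc-b", "doc-a", "doc-d"],  # variant 1
    ["doc-b", "doc-c", "doc-a"],  # variant 2
]
print(reciprocal_rank_fusion(rankings))  # ['doc-b', 'doc-a', 'doc-c', 'doc-d']
```

Each extra variant means another retrieval call, so the gain in answer quality has to be weighed against the added round trips.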

Similarly, Forward-Looking Active Retrieval (FLARE) is an example of a multi-query RAG technique that provides custom instructions in your prompt to the LLM. This encourages the LLM to provide additional questions about key phrases that would help the overall system generate a better answer for the user. This approach relies on having more context data as part of the overall generative AI system that recalls what the user has asked about before. These techniques may add further trips between the generative AI system, the vector data used, and the LLM, but the result should be a more accurate response that is sent back to the user.

One thing to consider is how all of these elements integrate together. You may decide that you want to run this infrastructure yourself, so that you can have full control and flexibility. This can make it easier to respond to potential new developments or innovations, and it should also ensure that you are not locked into a specific provider. However, you will need to integrate and manage these components and dependencies, which is a technical overhead, particularly at scale.

Alternatively, you may want to rely on a single provider to manage the infrastructure. While this option might be simpler to get started with, you will have to work at another organization’s pace and will be locked into their technology, which can be limiting.

A stack-based approach is a useful alternative. Look at the elements around RAG, like vector databases and integrations, and decide how to bring the different components together into one stack. You then have the option to change elements within the stack as needed with minimal impact on integrations. Working with a stack-based approach gives more flexibility around future growth and integration, but also reduces the management and integration overhead compared to running everything yourself.
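One way to keep stack components swappable is to have application code depend on small interfaces rather than on a specific product. The sketch below assumes two hypothetical interfaces, VectorStore and LLM, with stub implementations; the keyword-overlap search is only a placeholder for real vector retrieval, and the stub LLM does not call any model.

```python
from typing import List, Protocol


class VectorStore(Protocol):
    def search(self, query: str, k: int) -> List[str]: ...


class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...


class InMemoryStore:
    """Stub store that ranks documents by keyword overlap with the query."""

    def __init__(self, docs):
        self._docs = docs

    def search(self, query, k):
        words = set(query.lower().split())
        ranked = sorted(self._docs, key=lambda doc: -len(words & set(doc.lower().split())))
        return ranked[:k]


class EchoLLM:
    """Stub model that returns the prompt so the data flow is visible."""

    def complete(self, prompt):
        return f"[stub answer based on] {prompt}"


def answer(question: str, store: VectorStore, llm: LLM, k: int = 1) -> str:
    # The application only sees the two interfaces, never a concrete vendor.
    context = "\n".join(store.search(question, k))
    return llm.complete(f"Context:\n{context}\n\nQuestion: {question}")


store = InMemoryStore(["Returns are accepted within 30 days", "Shipping takes 3 to 5 days"])
print(answer("how long does shipping take?", store, EchoLLM()))
```

Replacing InMemoryStore or EchoLLM with a production component would then mean writing a new class with the same methods, leaving answer unchanged.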

Developing Generative AI with Your Data

Generative AI applications and services are only helpful to customers or users if they can leverage data in effective ways. To be effective, this data has to be available, usable, and up to date. Without preparation to cope with scale, generative AI will be relegated to simple tasks rather than automating work and making processes more efficient. The expectation for generative AI is not just about creating smarter chatbots, but delivering genuine co-pilots that can work alongside employees in a wide range of roles to make them more productive.

As you scale up your use of AI, your focus will shift from testing new technologies to ensuring that those implementations consistently and cost-efficiently deliver value. Taking a stack-based approach can help, letting you concentrate on delivering the best possible service to users regardless of how much you scale up your vector data and generative AI services.