The challenges of Generative AI features and the missing infrastructure layer

Generative-AI-powered features are unusual. Their lifecycle is unlike traditional software features. Their inherent challenges are different as well. These range from cost and latency to quality and reliability.

The underlying cause of these problems is that generative AI models are complex and unpredictable systems. Systems built upon such sub-systems have three options: they can propagate these properties to their dependents, all the way up to the applications; they can attempt to handle the behaviour of such systems by introducing more complex logic, significantly increasing their own complexity; or they can do neither and become bastions of bugs. Currently much of the complexity and unpredictability of generative AI models is directly exposed to developers.

We believe the solution is a software infrastructure layer between the models and the applications. Some have suggested applications interface directly with models.[1] Others have called this the “tooling” layer.[2] We believe it is more fundamental than that.

We see this layer as the Operating System (OS) layer where the models are analogous to the hardware and firmware, and the applications interface with the OS, not the firmware.

In this article we explore the challenges of building with current generative AI systems, discuss the underlying problem of complex and unpredictable systems, and outline the broad strokes of our vision for the operating system layer. In future articles we will dive deeper into specific facets of the problem.

The lifecycle: experimentation, integration, evolution

In stage 1, you have an idea of how your product can leverage generative AI effectively and you start experimenting to understand how it might work. This experimentation normally involves using a powerful general model (e.g. GPT-4). You get something okay, but not incredible. Along the way you discover edge cases where the results are unexpected, or even unacceptable. You iterate on the prompt, fixing edge cases and discovering new ones.

It is at this point you first contemplate hiring a Prompt Engineer.

Once it is working “well enough”, you move to stage 2, production integration. You take your prompt from stage 1 and start hitting an API with it. You build the surrounding feature components (e.g. frontend work, new database tables), the traditional components.

You deploy. Your users love it! Your budget does not.

This powerful general model is expensive and you realize the costs are prohibitive at your scale. To add insult to injury, you discover your users have uncovered creative edge cases you did not anticipate.

Now you have to iterate on your feature to make it cost effective and to fix those pesky edge cases. Your feature is live. You have to move quickly or disable the feature.

You have reached stage 3, evolution hell. You are scrambling to reduce costs. You try a cheaper generalized model. The results are underwhelming. You try training a smaller model. The results are underwhelming. You decide to go with the cheaper generalized model for now. It is underwhelming, but you won’t go broke.

Costs addressed, for now, you start iterating on the prompt again to patch the edge cases that break the feature. During this work you realize that fixing one edge case causes a regression on another case. This whac-a-mole process is time-consuming and infuriating.

At some point you, hopefully, have an improvement. You can run your task on a slightly cheaper model, get acceptable results, and have fewer edge case problems. You deploy. The evolution cycle restarts. You have more edge cases to fix. New regressions to dodge. And you start noticing problems arising from the slowness of the model. In the next cycle you decide you will have to tackle latency.

We regularly hear stories of this lifecycle from founders, engineering teams, and product teams. Though the lifecycle varies based on context, the essence is the same and the experience is common.

Notably, the emphasis and magnitude of the problems shift with company size. Enterprises are stymied by common hosting challenges that generative AI exacerbates, including cost, data privacy, tenancy isolation, and regional hosting requirements. These enterprise challenges amplify the pressure for smaller, cheaper, easier-to-host models that maintain quality levels. In essence, these are the same problems, but more pressing.

Regardless of the context, the underlying technical problem is the same.

The underlying problem: complex and unpredictable systems

Though there are many short-, medium-, and long-term problems when building with generative AI, for now we are interested in understanding the underlying causes of these problems and their effect on the lifecycle.

Fundamentally, generative AI models are complex and unpredictable systems.[3] Unpredictable in that there is uncertainty and inconsistency regarding their output. Uncertainty and inconsistency cause complexity to propagate throughout dependent systems,[4][5] such as your features. If the complexity is not handled, instability is propagated.
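
To make the propagation concrete, here is a minimal sketch under a strong simplifying assumption: each generative step in a pipeline produces an acceptable result with a fixed probability, independently of the other steps. Real systems rarely behave this neatly, and the function name and the 0.95 figure are purely illustrative.

```python
from typing import Iterable


def chain_success_rate(step_rates: Iterable[float]) -> float:
    """Probability that every step yields an acceptable result.

    Assumes the steps succeed or fail independently of one another,
    which is a simplification made for this illustration.
    """
    total = 1.0
    for rate in step_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each rate must be a probability in [0, 1]")
        total *= rate
    return total


# One step that is acceptable 95% of the time (a made-up figure).
single = chain_success_rate([0.95])
# Five such steps chained together.
chained = chain_success_rate([0.95] * 5)
print(f"{single:.3f} {chained:.3f}")  # prints "0.950 0.774"
```

Under these assumptions, a feature built on five such steps gives an acceptable end-to-end result only about three times in four, even though each step looks reliable on its own.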

Unpredictable results lead to inconsistent quality. In cases where a feature depends on high-quality results, this can lead to user-facing bugs. In other cases the quality of a feature may simply reflect the quality of the result: a poor-quality result may produce a poor-quality feature rather than an outright bug. With features that rely on results to process or update data, poor results may lead to data corruption.

Good evaluation is a necessary component of stable, evolving systems. Without good evaluation it is at best time-consuming and at worst impossible to identify failure cases and avoid regressions. Currently many companies resort to time-consuming manual testing of generative AI systems and features, and a healthy dose of hope-and-see. Unfortunately, in all but the simplest of cases, manual testing is not comprehensive and failures are a matter of when, not if.
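
One small piece of such evaluation can be sketched as a regression check: run the same cases against the old and new versions of a feature and list the cases that used to pass but now fail. Everything below is hypothetical, including the case names, the checks, and the two stand-in feature functions; a real feature would call a model where these functions manipulate strings.

```python
from typing import Callable, Dict, List, Tuple

# A case is (name, input text, check on the output). All are made up.
Case = Tuple[str, str, Callable[[str], bool]]


def run_suite(feature: Callable[[str], str], cases: List[Case]) -> Dict[str, bool]:
    """Run every case through the feature and record pass/fail by name."""
    return {name: check(feature(text)) for name, text, check in cases}


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return cases that passed with the old version but fail with the new one."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


# Stand-ins for two prompt versions of the same feature.
def feature_v1(text: str) -> str:
    return text.strip().lower()


def feature_v2(text: str) -> str:
    # Fixes over-long outputs by truncating, but no longer lower-cases.
    return text.strip()[:10]


CASES: List[Case] = [
    ("lowercases", "  Hello ", lambda out: out == "hello"),
    ("short_output", "a very long input string", lambda out: len(out) <= 10),
]

before = run_suite(feature_v1, CASES)
after = run_suite(feature_v2, CASES)
regressions = find_regressions(before, after)
print(regressions)  # prints ['lowercases']
```

Here the second version fixes the `short_output` case and breaks `lowercases`, which is the whac-a-mole pattern described earlier. Exact-match checks like these only suit the simplest tasks; open-ended outputs need the scoring approaches discussed below.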

Currently, generative AI features are improved, both in quality and reliability, with manual evaluation and optimization. This is prompt engineering.

Manual prompt engineering is sub-optimal. It is hand-optimizing inputs to a complex system. Complex systems are better optimized by automated systems that can effectively explore the optimization space.

It is important to be clear here about what optimization is. It is improving a system against some goal (the optimization function). In traditional software engineering, optimization is commonly thought of as improving speed and/or cost. While that definition is still relevant, it is not what we are discussing here. As it relates to complexity and unpredictability, optimization means identifying better input (e.g. prompts) and model combinations to achieve the goal. This may, but does not necessarily, include training specialized models. Within this context, the optimization space represents all possible combinations of inputs and models.

An effective evaluation system is key to enabling auto-optimization. Auto-optimization works by exploring an optimization space. This exploration involves sampling values from the space and scoring those values to direct the exploration. In generative AI, sampling is generating some output (text, images, 3D, etc.) by performing inference on a model with a set of inputs, while scoring is evaluating that output and assigning it a quantified score.
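
The sample-and-score loop can be sketched as follows. This is a minimal illustration that uses plain random search, the simplest possible strategy: it keeps the best-scoring candidate but does not yet use scores to steer where it samples next, as a more capable optimizer would. The `generate` and `score` callables, the prompts, and the model names are all hypothetical stand-ins so that the example runs without a real model.

```python
import itertools
import random
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, str]  # (prompt, model name)


def optimize(
    prompts: List[str],
    models: List[str],
    generate: Callable[[str, str, str], str],
    score: Callable[[str, str], float],
    inputs: List[str],
    budget: int,
    seed: int = 0,
) -> Tuple[Optional[Candidate], float]:
    """Random search: sample candidates, score their outputs, keep the best."""
    rng = random.Random(seed)
    space = list(itertools.product(prompts, models))  # the optimization space
    best, best_score = None, float("-inf")
    for prompt, model in rng.sample(space, min(budget, len(space))):
        # Sampling: run inference once per evaluation input.
        outputs = [generate(model, prompt, x) for x in inputs]
        # Scoring: average the evaluator's score over the inputs.
        mean = sum(score(x, y) for x, y in zip(inputs, outputs)) / len(inputs)
        if mean > best_score:
            best, best_score = (prompt, model), mean
    return best, best_score


# Hypothetical stand-ins for a model call and an auto-evaluator.
def fake_generate(model: str, prompt: str, text: str) -> str:
    if not prompt.startswith("Uppercase"):
        return text  # this prompt does not ask for the task at all
    return text.upper() if model == "large-model" else text.capitalize()


def fake_score(text: str, output: str) -> float:
    """Fraction of characters matching the desired upper-case output."""
    target = text.upper()
    return sum(a == b for a, b in zip(output, target)) / len(target)


best, best_score = optimize(
    prompts=["Repeat the text", "Uppercase the text"],
    models=["small-model", "large-model"],
    generate=fake_generate,
    score=fake_score,
    inputs=["hello", "world"],
    budget=4,
)
print(best, best_score)
```

With a budget that covers the whole toy space, the search settles on the "Uppercase the text" prompt with the larger stand-in model. In practice the space is far too large to enumerate, which is why the quality and cost of the scoring function matter so much.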

Evaluation provides the scoring function for an auto-optimizer. Human evaluation is important but it is not scalable. It results in a bottleneck for the optimization mechanism. So we need good auto-evaluation systems that provide the necessary, scalable mechanism for auto-optimization.

Automatic evaluation of complex and unpredictable systems is inherently difficult. Auto-evaluation of generative AI deals with data that has not previously been seen (which is also why generative AI is powerful). Because auto-evaluation is difficult and manual evaluation does not scale, the solution requires a combination of both: human evaluation provides a grounded basis, and auto-evaluation provides scale.
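
One way to picture this combination, as a sketch rather than a prescription: measure how often an auto-evaluator agrees with a small set of human verdicts, and only use it on unlabeled outputs if the agreement is high enough. The evaluator, the labeled examples, and the thresholds below are all invented for illustration.

```python
from typing import Callable, List, Tuple

# (model output, human verdict) pairs. The data is made up.
Labeled = Tuple[str, bool]


def agreement(auto_eval: Callable[[str], bool], labeled: List[Labeled]) -> float:
    """Fraction of human-labeled outputs on which the auto-evaluator agrees."""
    if not labeled:
        raise ValueError("need at least one human-labeled example")
    hits = sum(auto_eval(output) == verdict for output, verdict in labeled)
    return hits / len(labeled)


def evaluate_at_scale(
    auto_eval: Callable[[str], bool],
    labeled: List[Labeled],
    unlabeled: List[str],
    min_agreement: float,
) -> List[bool]:
    """Apply the auto-evaluator only if it tracks human judgment well enough."""
    rate = agreement(auto_eval, labeled)
    if rate < min_agreement:
        raise RuntimeError(f"auto-evaluator agrees with humans only {rate:.0%}")
    return [auto_eval(output) for output in unlabeled]


# A deliberately crude, hypothetical auto-evaluator.
def ends_with_period(output: str) -> bool:
    return output.endswith(".")


HUMAN_LABELED: List[Labeled] = [
    ("Paris is the capital of France.", True),
    ("paris", False),
    ("The answer is 42.", True),
    ("I cannot help with that.", False),  # well-formed, but humans rejected it
]

rate = agreement(ends_with_period, HUMAN_LABELED)
verdicts = evaluate_at_scale(
    ends_with_period, HUMAN_LABELED, ["Done.", "done"], min_agreement=0.7
)
print(rate, verdicts)  # prints 0.75 [True, False]
```

The crude evaluator agrees with the human verdicts on three of the four examples, so it passes a 0.7 threshold but would be rejected at 0.9. The human labels stay small and grounded, while the evaluator that passes the check does the bulk of the scoring.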

In summary, generative systems are complex and unpredictable systems. Complexity and unpredictability are manageable with auto-optimization. Auto-optimization requires great evaluation that combines manual and automatic evaluation. Without this, these systems are impossible to comprehensively test, manage, and update.

Traditional software development infrastructure and practices are insufficient for solving these challenges or managing the generative AI lifecycle as a whole. Instead, we need new infrastructure for development and production.

The missing infrastructure layer

In the generative AI stack, there is a missing infrastructure layer, one for managing the complexity and unpredictability of generative AI systems. The stack has been described many times as three layers: the foundation model layer, the tooling layer, and the application layer. We view the tooling layer as more significant than existing discussions would suggest. In our framing, it is analogous to the traditional software operating system layer.

Foundation models represent the hardware and firmware. The operating system provides a set of abstractions that enable applications to leverage the hardware effectively. Applications interface with the operating system, rather than the underlying models.

The OS layer manages the complexity of foundation models and plays an abstraction role similar to that of a traditional OS. This includes evaluation systems (both human and automated), optimization systems, deployment and management systems, and auditing systems.

There is a long roadmap to be built here. It is a community effort. We will progressively describe our perspective and vision for this layer in future articles.

For now, there are three crucial components which we believe are missing or underdeveloped: efficient human evaluation systems, auto-evaluation systems, and auto-optimization systems.

Human evaluation systems are necessary to provide grounded starting blocks for tasks (this is different to traditional ML labelling ground truth) and to enable developers and decision makers to build confidence in generative AI systems. Human confidence is a crucial aspect of building and deploying real-world generative AI products.

While human evaluation provides important input to the process, it is also limiting. It simply does not scale sensibly for many problems.

Therefore, to unlock the potential of generative AI, auto-evaluation systems are necessary. As discussed above, auto-evaluation provides a scoring function for use in auto-optimization. This scoring function can be used to optimize model usage and train models for specific tasks, without requiring large pre-existing datasets.


Generative AI is an immensely valuable tool for software products.

Generative models are also complex and unpredictable, leading to significant quality variance.

The upside of the fact that generative systems produce variable quality is that you can improve the quality of your features by improving the underlying generative system. For many generative features, there is an almost limitless potential quality improvement. They can understand the user and context better. They can produce more relevant content. They can respond faster. With greater reliability, they can safely perform more actions for the user.

With the proliferation of generative AI throughout software products, and with improving ease-of-integration, the competitive pressures increase. If every product can serve the customer better by improving its generative features, every competitor will need to continually improve as well, or be left behind.

However, the complexity and unpredictability also give rise to a plethora of problems that applications leveraging generative AI must address. Traditional systems are ill-equipped to manage these challenges.

A new layer of software infrastructure is required, the Operating System (OS) layer. Three crucial components of this layer are human-evaluation, auto-evaluation, and auto-optimization. Though these components do not address all aspects of the OS layer, they provide an important starting point.

Though it presents challenges, generative AI usage will proliferate and these problems will be further amplified in the future.

As a community, it is important for us to broaden the infrastructure being developed, both open- and closed-source infrastructure. The challenges of leveraging generative AI extend well beyond the inference server.

If you are interested in working on these sorts of problems, or would like to discuss how you can solve them at your organization, contact us at [email protected]



  2. For example:

  3. Useful resources for complexity and unpredictability in LLMs:

  4. The system complexity introduced by LLMs is different to other forms of machine learning. The results are more open ended. The behaviour and outputs of models can vary greatly based on training process and model size (some describe this as “emergent behaviour”). By contrast, supervised learning models, for example, have defined labels and expected behaviours.

  5. The complexity of LLM-based systems can compound as multiple LLMs are combined. As more systems, which are unpredictable and complex, are added to an application, predicting the result of an input becomes more difficult. As does attributing the cause of some outcome.