The AI industry has a demo problem. Every week, a new agent framework promises to revolutionise how organisations operate. The demos look incredible: fluent multi-step reasoning, perfect tool use, seamless automation. Then teams try to ship these systems to real users and discover a gap that no amount of prompt tweaking will close.
I've spent the past few years helping organisations bridge that gap. The pattern is remarkably consistent, and the research now confirms what practitioners have been learning the hard way.
The maths of compound failure
The most important concept in production AI is one that rarely appears in demos: compound reliability failure. If a ten-step agent task achieves 95% accuracy at each step, the cumulative success rate drops to roughly 60%. At twenty steps, it falls to 36%. A recent analysis of the demo-to-deployment gap illustrates this starkly: a 1,000-step task at 99% per-step accuracy succeeds less than 0.005% of the time.
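The arithmetic is worth making explicit. The snippet below simply multiplies per-step accuracies together, assuming steps fail independently; the figures are the same ones quoted above.

```python
# Cumulative success rate of a chain of steps, assuming independent failures:
# P(success) = p ** n, where p is per-step accuracy and n is the number of steps.
def chain_success(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

print(f"{chain_success(0.95, 10):.2%}")    # ~59.87%
print(f"{chain_success(0.95, 20):.2%}")    # ~35.85%
print(f"{chain_success(0.99, 1000):.4%}")  # ~0.0043%
```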
This is not a theoretical concern. In my experience, most teams discover it the hard way. They build an agent that handles a five-step workflow beautifully in testing, expand it to fifteen steps for production, and watch reliability collapse. The fix is not a better model. It is a better architecture: shorter chains, deterministic checkpoints, and fallback paths at every decision point.
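To make "deterministic checkpoints and fallback paths" concrete, here is a minimal sketch of the pattern. The run, verify, and fallback functions are hypothetical placeholders, not any particular framework's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    run: Callable[[dict], dict]        # the model-backed action
    verify: Callable[[dict], bool]     # deterministic checkpoint: schema check, regex, business rule
    fallback: Callable[[dict], dict]   # what happens when verification fails: retry, template, human handoff

def run_chain(steps: list[Step], state: dict) -> dict:
    for step in steps:
        result = step.run(state)
        # Never let an unverified output feed the next step.
        state = result if step.verify(result) else step.fallback(state)
    return state
```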
A survey of 1,800 engineering leaders found that only 95 respondents (roughly 5%) actually operate AI agents in production. Even among those early adopters, most systems execute ten or fewer actions before requiring human intervention. The "year of the agent" remains premature.
Context engineering is the real discipline
If there is one shift I would highlight from the past year, it is the move from prompt engineering to context engineering. Anthropic's applied AI team describes context engineering as addressing a deeper question than "what should I say to the model." Instead, it asks: "what configuration of context is most likely to generate the desired behaviour?"
This matters because models experience what Anthropic calls "context rot": as the number of tokens in the window grows, retrieval accuracy diminishes. The transformer architecture relates every token to every other token, so pairwise interactions grow quadratically with sequence length, and longer sequences stretch attention thin. Treating context as an unlimited resource is a recipe for degraded performance.
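The "quadratic" part is just standard self-attention: for a sequence of n tokens, the score matrix is n by n, so the number of token-pair interactions the model has to manage grows as n².

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\qquad QK^{\top} \in \mathbb{R}^{n \times n} \;\Rightarrow\; O(n^2)\ \text{token-pair interactions}
```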
In production systems I've built, tool outputs consume far more tokens than user messages (one analysis of 1,200 production deployments puts it at 100x more). That ties context management directly to both quality and cost. The practical techniques that work: just-in-time context injection based on immediate user state, tool masking to expose only relevant fields, and aggressive pruning rather than dumping everything into the window.
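As a rough illustration of what those three techniques look like in code, here is a sketch; the field names, token budget, and four-characters-per-token estimate are all assumptions made for the example, not values from any real deployment.

```python
import json

TOKEN_BUDGET = 4_000                              # assumed budget for tool output in the prompt
RELEVANT_FIELDS = {"order_id", "status", "eta"}   # tool masking: expose only what the task needs

def mask_tool_output(raw: dict) -> dict:
    """Keep only the fields the current task actually uses."""
    return {k: v for k, v in raw.items() if k in RELEVANT_FIELDS}

def prune_to_budget(messages: list[str], budget: int = TOKEN_BUDGET) -> list[str]:
    """Drop the oldest messages first, rather than sending the full history."""
    kept, used = [], 0
    for msg in reversed(messages):                # newest first
        cost = len(msg) // 4                      # crude token estimate (~4 chars per token)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

def build_context(user_state: dict, tool_result: dict, history: list[str]) -> str:
    """Just-in-time injection: assemble context from the user's immediate state, not everything available."""
    return json.dumps({
        "user_state": user_state,
        "tool_result": mask_tool_output(tool_result),
        "history": prune_to_budget(history),
    })
```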
A team of experienced LLM practitioners documented this pattern clearly: simpler, focused prompts outperform complex catch-all versions that try to handle multiple edge cases at once. The same group showed that decomposing tasks into multi-step deterministic workflows dramatically improves accuracy. AlphaCodium, for instance, improved code generation accuracy from 19% to 44% by breaking the task into reflection, reasoning, generation, ranking, and iterative testing stages.
Why constrained architectures win
The industry is converging on a counterintuitive lesson: constraining your AI system makes it more capable, not less.
Successful production deployments share common traits. They use domain-specific designs with limited action sets rather than generalist systems. They implement human-in-the-loop patterns with confidence-based routing. They rely on predetermined workflows (prompt chaining, routing, parallelisation) rather than autonomous discovery.
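Confidence-based routing, in particular, is a small amount of code. The sketch below assumes a classify function that returns an intent and a calibrated confidence score; the threshold is illustrative and should be tuned against your own evaluation data.

```python
CONFIDENCE_THRESHOLD = 0.85   # assumed value; calibrate against your own evaluation set

def handle_request(request: str, classify, handlers: dict, escalate):
    """Route to a predetermined workflow when confident, to a human when not."""
    intent, confidence = classify(request)          # e.g. ("refund", 0.91)
    if confidence < CONFIDENCE_THRESHOLD or intent not in handlers:
        return escalate(request)                    # human-in-the-loop path
    return handlers[intent](request)                # constrained, domain-specific workflow
```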
Klarna's experience is instructive. The company initially reported impressive AI results: 2.3 million customer service conversations handled, with $40 million in projected savings. Then they reversed course and rehired human agents after recognising decreased service quality. The lesson is not that AI cannot handle customer support. It is that unconstrained deployment without proper evaluation and fallback paths creates problems that erode trust faster than they save money.
On the infrastructure side, the teams extracting real value treat LLMs as unreliable components requiring verification and containment. Stripe improved fraud detection from 59% to 97% accuracy using domain-specific foundation models. Amazon's Rufus scaled to 250 million users with a 60% increase in purchase completion. These are not demo results. They are production outcomes from teams that invested in evaluation, guardrails, and architectural discipline.
What actually works in practice
After building production AI systems for enterprise clients, I've found the successful teams converge on similar practices:
- Start with the business problem. "Our support team spends 40% of their time answering the same 15 questions" is a problem statement. "We need an AI agent" is not. Define success metrics before writing code.
- Invest in evaluation before scaling. Custom in-house evaluation methods dominate in production, particularly in regulated sectors where visibility into assessment processes matters. Combine assertion-based unit tests, LLM-as-judge comparisons, and what one practitioner group calls the "intern test": would a reasonable person succeed with the same input and context?
- Build safety into infrastructure, not prompts. Session tainting, dual-layer permissions, and circuit breakers provide guarantees that prompt engineering cannot. One automotive company implements cost and turn limits that automatically halt agents when thresholds are exceeded; a sketch of that pattern follows this list.
- Design for continuous change. Some 70% of regulated enterprises rebuild their AI infrastructure quarterly or more often. Modular architectures that allow component replacement without full rebuilds are not optional. They are survival.
- Ship incrementally. A system that handles ten document types reliably is more valuable than one that handles a hundred unreliably. Start with predictable, high-volume tasks (document processing, structured data extraction) where failures are visible and ROI is measurable.
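The circuit-breaker pattern mentioned in the third point is simple to sketch. The turn and cost limits below are illustrative assumptions, not anyone's production values; the point is that the halt happens in infrastructure, outside the model's control.

```python
class BudgetExceeded(Exception):
    """Raised by the circuit breaker, not by the model."""

class CircuitBreaker:
    """Hard limits enforced in infrastructure; the default figures are illustrative only."""
    def __init__(self, max_turns: int = 10, max_cost_usd: float = 2.00):
        self.max_turns, self.max_cost_usd = max_turns, max_cost_usd
        self.turns, self.cost_usd = 0, 0.0

    def record(self, cost_usd: float) -> None:
        """Call once per agent turn; halts the run when either threshold is crossed."""
        self.turns += 1
        self.cost_usd += cost_usd
        if self.turns > self.max_turns or self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"halted after {self.turns} turns / ${self.cost_usd:.2f}")
```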
The path forward
AI is genuinely transformative technology, but transformative does not mean effortless. The gap between demos and production reflects a need for architectural discipline, not technological immaturity. Better models shift where challenges occur rather than solving them fundamentally. Robinhood reduced latency by 50% through hierarchical tuning (prompt optimisation, trajectory tuning, LoRA fine-tuning) without upgrading to a frontier model.
The engineering is well understood. The patterns for building reliable systems (evaluation frameworks, guardrails, observability, human oversight) are proven and repeatable. The challenge is committing to the work: treating AI systems with the same rigour you would apply to any other critical infrastructure.
If you're building AI systems and want to discuss what's working in practice, get in touch.