From AI Demo to Production: The Gap Most Teams Underestimate
Running AI in production is a different discipline from building an AI demo. Here's what observability, utilization, and scheduling actually mean when real users are watching.

Most AI demos are deceptively convincing. You fire up a Jupyter notebook, chain a few API calls, and the output looks great. Then someone says "can we put this in front of real users?" and suddenly you're dealing with an entirely different set of problems that have nothing to do with the model.
A recent episode of the Stack Overflow Podcast, published May 26, 2026 at stackoverflow.blog, captures this well. Ryan Donovan sits down with Peter Salanki, CTO and co-founder of CoreWeave, to talk about what production AI really demands: observability, utilization, intelligent scheduling, and the discipline to avoid over-building too early. The conversation is framed around infrastructure, but the lessons apply to any team shipping an AI feature.
🔍 The Demo-to-Production Gap Is Wider Than You Think
A model that answers correctly 90% of the time in a notebook is impressive. A model that does the same in production, under concurrent load, with retries, fallbacks, cost caps, and logging, is an engineering achievement.
The gap shows up in three places most teams don't anticipate:
- Latency distribution, not average latency. Your median response time might look fine. Your p99 is the one that makes users close the tab.
- Error modes under load. Rate limits, context-window overflows, and model timeouts behave differently when ten users hit the endpoint simultaneously versus one.
- Cost at scale. A feature that costs $0.02 per query in testing can become a budget crisis when traffic arrives. GPU cloud pricing is typically quoted in USD; if you're budgeting from Sri Lanka, a currency converter can help you track the LKR equivalent as exchange rates shift.
Key takeaway: Production AI is not faster demo AI. It is a different discipline, and the sooner a team treats it that way, the fewer surprises they face at 2am.
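To make the latency and cost points concrete, here is a minimal Python sketch. The latency samples, the 50,000 queries per day, and the helper names `percentile` and `monthly_cost` are illustrative assumptions; only the $0.02 per query figure comes from the example above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def monthly_cost(cost_per_query, queries_per_day, days=30):
    """Simple linear projection; ignores retries, caching, and price tiers."""
    return cost_per_query * queries_per_day * days


# Made-up latencies in seconds: 97 quick responses and 3 slow ones.
latencies = [0.8] * 97 + [6.0, 9.0, 12.0]

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"mean={mean:.2f}s p50={p50:.1f}s p99={p99:.1f}s")

# Assumed traffic of 50,000 queries per day at $0.02 per query.
print(f"projected monthly cost: ${monthly_cost(0.02, 50_000):,.0f}")
```

With these made-up samples the median is 0.8 s and the mean is about 1.05 s, while the p99 is 9.0 s. A dashboard that shows only the average would hide the responses that make users leave.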
📊 Observability: What You Actually Need to Monitor
Traditional software observability focuses on CPU, memory, request throughput, and error rates. Those still matter, but AI workloads add a new set of metrics that most monitoring dashboards don't capture out of the box.
| Metric | Why it matters for AI |
|---|---|
| Token throughput (tokens/sec) | Directly affects how many concurrent users you can serve |
| Time to first token (TTFT) | Drives perceived responsiveness; a long wait for the first token makes an LLM feel slow even if total generation is fast |
| Cache hit rate | Prefix caching (reusing shared prompt prefixes) can cut costs significantly |
| GPU utilization % | Low utilization means you're paying for idle capacity |
| Queue depth | Requests waiting for a worker; spikes here predict latency spikes ahead |
| Model error rate by type | Timeout vs. context-length vs. content filter errors need different fixes |
Setting up this level of visibility requires intentional instrumentation from the start, not as an afterthought. OpenTelemetry and Prometheus can collect and store these metrics, and open-source inference servers like vLLM and Ollama report many of the underlying numbers themselves. If you're hosting your own inference, hooking these up before your first real user hits the endpoint is far less painful than retrofitting them afterwards.
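If your serving stack does not already report time to first token and token throughput, both can be measured around any streaming response. The sketch below uses only the Python standard library; `measure_stream`, `StreamStats`, and `fake_stream` are hypothetical names, and `fake_stream` merely stands in for a real streaming model call.

```python
import time
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_s: float        # seconds until the first token arrived
    total_s: float       # seconds until the stream ended
    tokens: int          # number of tokens received
    tokens_per_s: float  # tokens divided by total_s


def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token iterator and record TTFT and throughput."""
    start = clock()
    first = None
    count = 0
    for _ in token_iter:
        now = clock()
        if first is None:
            first = now
        count += 1
    total = clock() - start
    ttft = (first - start) if first is not None else total
    rate = count / total if total > 0 else 0.0
    return StreamStats(ttft, total, count, rate)


def fake_stream(n_tokens, first_delay_s, per_token_delay_s):
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_delay_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_delay_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(20, 0.05, 0.002))
print(f"TTFT={stats.ttft_s:.3f}s tokens/s={stats.tokens_per_s:.1f}")
```

In a real service, these numbers would go to whatever metrics backend you already use rather than to `print`. The injectable `clock` argument is there so the measurement logic can be tested without real delays.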
⚡ Utilization: The Number That Determines Whether Your Infrastructure Makes Sense
GPU time is expensive. What makes AI infrastructure economics work, as the conversation with CoreWeave's Peter Salanki reflects, is utilization: the fraction of allocated GPU capacity that is actually doing useful work at any given moment.
Low utilization is the norm for teams new to production AI. The common causes:
- Bursty traffic with no batching. If users arrive in clusters, the GPU sits idle between bursts. Continuous batching (admitting new requests into the running batch at each generation step instead of waiting for the whole batch to finish) is often the biggest single utilization lever for LLM serving.
- Over-provisioned safety headroom. Teams size for peak load and then run at 20% average utilization. Autoscaling and queue-based routing help, but they add complexity.
- Cold-start penalties. Spinning up a new inference worker takes time. Keeping a minimum number of warm replicas is a cost-vs-latency trade-off that needs a deliberate decision, not a default.
For small teams or solo developers, the implication is practical: start with a single-replica deployment and measure actual utilization before adding capacity. Most early-stage AI features don't need a cluster; they need a measured baseline.
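The effect of batching on utilization can be seen in a toy simulation. This is a sketch under strong assumptions, not how any particular inference server works: each request occupies one batch slot for a fixed number of generation steps, and prefill, memory limits, and arrival times are ignored. The function names and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes;
    slots freed early stay idle until the whole batch is done."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every step."""
    queue = deque(lengths)
    slots = []  # remaining steps for each active request
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())
        steps += 1
        slots = [r - 1 for r in slots if r - 1 > 0]
    return steps


def utilization(lengths, steps, batch_size):
    """Useful slot-steps divided by slot-steps paid for."""
    return sum(lengths) / (steps * batch_size)


# Made-up generation lengths (in steps) for eight queued requests.
lengths = [2, 8, 3, 7, 1, 6, 4, 5]
batch_size = 4

static_steps = static_batching_steps(lengths, batch_size)
continuous_steps = continuous_batching_steps(lengths, batch_size)

print(f"static:     {static_steps} steps, "
      f"{utilization(lengths, static_steps, batch_size):.0%} utilized")
print(f"continuous: {continuous_steps} steps, "
      f"{utilization(lengths, continuous_steps, batch_size):.0%} utilized")
```

For this workload, static batching needs 14 steps (about 64% utilization) and continuous batching needs 12 (75%). The numbers themselves are arbitrary; the point is that the same hardware does more useful work when freed slots are refilled immediately.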
🛠️ The Over-Architecting Trap
One of the clearest pieces of advice in the CoreWeave interview is to avoid the trap of over-architecting too early. This sounds obvious, but the AI infrastructure space makes it easy to do.
The temptation looks like this: you read about multi-region deployments, model routing layers, A/B testing frameworks, fine-tuning pipelines, and feature stores, and you start building all of them before you have a single confirmed user.
Warning: Building infrastructure for problems you don't yet have is a reliable way to run out of runway before you solve the problems you do have.
A more pragmatic sequencing for a small team or solo developer:
- Week 1: Ship a working inference endpoint. A single hosted model (OpenAI, Anthropic, or a self-hosted Ollama instance) behind a thin API. Log every request and response.
- Weeks 2-4: Measure. What are your actual latency numbers, error rates, and costs? Let that data tell you what to fix.
- Month 2+: Add the infrastructure your data says you need: caching, batching, fallbacks, autoscaling.
Most production AI problems reveal themselves once you have real traffic. The architecture you need at 1,000 users per day is different from what you need at 1,000,000, and building for the latter before you have the former is a distraction.
💡 What This Means for You
If you're a developer in Sri Lanka building an AI feature, or evaluating whether to integrate an LLM into a product, here's the concrete takeaway from the CoreWeave conversation.
The hard part of production AI is not the model. Free and low-cost models have closed the gap on the intelligence side. GPT-4o-mini, Claude Haiku, Gemini Flash, and open-weight models like Llama and Mistral are all capable enough for most practical applications. The differentiation now is in the engineering around the model.
The three things worth getting right from the start:
- Structured logging of every model call. Input, output, latency, tokens used, cost. Without this, debugging production failures is guesswork.
- A hard cost cap. Set a budget limit per user, per day, or per request before launch. AI costs have a way of surprising teams who didn't set one.
- A graceful fallback. If the model call fails or times out, what does the user see? A blank screen is worse than a cached response or a simple "try again" message.
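The three items above fit in one small wrapper around the model call. The following is a minimal sketch, assuming a hypothetical `call_model` function that returns the response text and the number of tokens used, and a flat price per 1,000 tokens. The names `CostCap`, `guarded_call`, and `FALLBACK_MESSAGE` are invented for illustration, and the cap lives in process memory with no daily reset, which a real deployment would need to handle.

```python
import json
import logging
import time

logger = logging.getLogger("model_calls")

FALLBACK_MESSAGE = "Sorry, something went wrong. Please try again."


class BudgetExceeded(Exception):
    """Raised when the spending limit has been reached."""


class CostCap:
    """In-memory spending limit (assumption: one process, no daily reset)."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def check(self):
        if self.spent_usd >= self.limit_usd:
            raise BudgetExceeded(f"limit of ${self.limit_usd} reached")

    def record(self, cost_usd):
        self.spent_usd += cost_usd


def guarded_call(call_model, prompt, cap, usd_per_1k_tokens):
    """Call the model with a cost cap, a fallback, and one log line per call."""
    record = {"prompt": prompt, "ok": False}
    start = time.perf_counter()
    try:
        cap.check()
        text, tokens = call_model(prompt)
        cost = tokens / 1000 * usd_per_1k_tokens
        cap.record(cost)
        record.update(ok=True, output=text, tokens=tokens, cost_usd=cost)
        return text
    except Exception as exc:  # broad on purpose: the user always gets a reply
        record["error"] = type(exc).__name__
        return FALLBACK_MESSAGE
    finally:
        record["latency_s"] = round(time.perf_counter() - start, 4)
        logger.info(json.dumps(record))


def demo_model(prompt):
    """Hypothetical stand-in for a real model client."""
    return f"echo: {prompt}", 500


logging.basicConfig(level=logging.INFO)
cap = CostCap(limit_usd=0.01)
first = guarded_call(demo_model, "hello", cap, usd_per_1k_tokens=0.02)
second = guarded_call(demo_model, "hello again", cap, usd_per_1k_tokens=0.02)
print(first)
print(second)
```

In this demo, the first call spends the whole budget, so the second call is refused by the cap and the user sees the fallback message instead of an error. Both calls produce a JSON log line with latency, and the successful one also records tokens and cost.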
Scheduling and utilization optimization, the topics that dominate infrastructure conversations at companies like CoreWeave, become relevant once you're past the initial deployment and starting to see predictable traffic. For most teams reading this, the near-term priority is simpler: ship something observable, measure it honestly, and let the data drive the next infrastructure decision.
The engineers who do that consistently are the ones who end up with production AI systems. The ones who don't tend to have very impressive demos.
Published May 26, 2026. Commentary based on the Stack Overflow Podcast episode featuring Peter Salanki, CTO of CoreWeave. No specific quotes are attributed; all analysis is original.
Original source: Do you have what it takes to run AI in production?