AI Agents from Prototype to Production: Why the Model Isn't the Problem

TL;DR

  • Most AI agent failures in production trace to the data and output layer, not the model itself.
  • There are three distinct failure modes: reliability, correctness, and trust — only reliability is a model problem.
  • Multi-tenant environments expose a phantom tenant risk: an agent that worked in a demo can return another customer's data in production.
  • Defining metrics in code and enforcing row-level security gives agents a governed foundation they cannot build themselves.
  • A production-readiness checklist must cover tenant isolation, access policies, auditable outputs, and environment promotion.

Why Your AI Agent Worked in the Demo

AI can scaffold an agent in an afternoon. The production reality arrives later, and it arrives hard.

Taking an AI agent from prototype to production means closing the gap between a controlled demo and a system handling real customers, real permissions, and real accountability. A prototype runs against a clean, single-tenant dataset with no row-level security, no versioned metric definitions, and no audit trail. Production requires all three: per-tenant isolation, version-controlled metrics, and auditable outputs customers can verify. The bottleneck is almost never the model. It is the governed data and output layer the agent depends on.

We've watched teams demo agent reasoning that looked genuinely impressive — clean answers, confident outputs, the right numbers every time. Then they start deploying AI agents to real enterprise customers, and agent behavior shifts. The dataset is no longer single-tenant. Permissions logic exists but was never wired in. Metric definitions turn out to be informal. There's no human in the loop when an answer goes wrong, and no audit trail to explain why it did.

That's the enterprise agent reality check nobody flags during evaluation. The model didn't change. The environment did.

The Three Ways AI Agents Fail in Production

We've seen teams collapse three distinct problems into one — and then spend months tuning prompts on the wrong thing. The failure modes are different, and only one of them belongs to the model.

| Failure type | What happens | Root cause | Who feels it |
| --- | --- | --- | --- |
| Reliability failure | Agent breaks under real load, times out, or returns errors after a schema change upstream | Model serving, infrastructure, schema drift | Engineering team |
| Correctness failure | Agent answers confidently using the wrong data: a stale metric, a misnamed field, another tenant's rows | Data layer: no governed definitions, no row-level security, no per-tenant isolation | The customer who got the wrong number |
| Trust failure | Customer can't verify the answer: no audit trail, no source reference, no version history for the metric cited | Output layer: no auditability, no lineage, no stable semantic model | Enterprise buyers, compliance, customer success |

Reliability is the failure mode everyone discusses. It's real, and tools like LangSmith do strong work on tracing and observability for it. But correctness and trust failures are quieter and more damaging. A customer who gets a wrong answer, especially one drawn from another tenant's data, doesn't file a bug report. They stop trusting the product.

That distinction matters for where you invest. Correctness and trust are not model problems. They're data-layer and output-layer problems: unversioned metric definitions, missing access policies, no stable semantic contract between the agent and the underlying data. Swapping the LLM doesn't fix them. Governing the layer underneath does.

What Multi-Tenancy Does to an AI Agent

Here's a failure scenario we've seen more than once. A SaaS team builds a customer-facing AI agent that answers questions about usage data. The demo runs beautifully: one dataset, one tenant, clean answers. They ship to production. Three weeks later, a customer emails to say the agent told them their monthly active users were 47,000. Their actual number is 3,200. The 47,000 belongs to a different customer entirely.

This is the phantom-tenant problem. It happens because the agent was never given the tools to self-police data access. In a single-tenant prototype, there's nothing to isolate: every query hits the same dataset and returns the right rows by accident. In a multi-tenant production environment, that accident stops working. Without row-level security enforced at the data layer and per-tenant environment isolation, the agent queries whatever the underlying connection exposes. It doesn't know tenant boundaries exist. It just answers.

No LLM solves this on its own.

Tools like Snowflake have robust row-level security primitives at the warehouse level, but those policies still need to be deliberately mapped to your tenant model and surfaced through whatever layer sits between the agent and the data. That wiring doesn't happen automatically. Platforms like Embeddable enforce it through defined access policies, but the point is structural: tenant isolation must be enforced below the agent, not delegated to it.
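To make "enforced below the agent" concrete, here is a minimal sketch. It is an illustration under stated assumptions, not any vendor's implementation: the `usage_events` table, the tenant names, and the `TenantScopedData` class are all hypothetical. SQLite has no native row-level security, so the wrapper only emulates the idea; in a real warehouse the policy would live in the database itself.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory multi-tenant table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE usage_events (tenant_id TEXT, user_id TEXT)")
    rows = [("acme", f"a{i}") for i in range(3)]
    rows += [("globex", f"g{i}") for i in range(5)]
    conn.executemany("INSERT INTO usage_events VALUES (?, ?)", rows)
    return conn


def unscoped_active_users(conn: sqlite3.Connection) -> int:
    """What a prototype agent effectively runs: no tenant filter at all."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM usage_events"
    ).fetchone()
    return row[0]


class TenantScopedData:
    """The only data handle the agent is given.

    tenant_id is fixed at construction time from the authenticated session
    (an assumption of this sketch) and never comes from the prompt, so the
    agent cannot widen its own scope.
    """

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def active_users(self) -> int:
        # The tenant filter is bound as a parameter on every query.
        row = self._conn.execute(
            "SELECT COUNT(DISTINCT user_id) FROM usage_events "
            "WHERE tenant_id = ?",
            (self._tenant_id,),
        ).fetchone()
        return row[0]


conn = make_demo_db()
acme = TenantScopedData(conn, "acme")
globex = TenantScopedData(conn, "globex")

print(acme.active_users())          # 3
print(globex.active_users())        # 5
print(unscoped_active_users(conn))  # 8: rows from both tenants mixed together
```

The unscoped query is the phantom-tenant failure in miniature: it returns a confident number that belongs to no single customer. The scoped handle gives each tenant only its own count, regardless of what the agent asks.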

Why Customers Can't Trust an AI Answer Without an Audit Trail

Correctness failure is bad. Trust failure is worse, and it's quieter.

A customer sees a number in an AI-generated answer. Revenue is down 12% month-over-month, the agent says. They want to know: which revenue definition? Which date range boundary? Is that figure using the same logic as last quarter's board report, or something the agent assembled on the fly from three loosely related tables? If there's no audit trail, they can't know. And in regulated industries or high-stakes decisions, "the AI said so" isn't an acceptable provenance chain.

We've seen this pattern repeatedly: the data was technically correct, but the customer escalated anyway because they couldn't verify it. That's not an LLM problem; it's a missing governance layer. The agent had no version-controlled metric definitions, no record of which semantic model version it queried, and no way to surface the calculation logic alongside the answer. Tools like dbt have made it much easier to version-control transformation logic and document lineage at the warehouse layer (dbt Labs), which is a genuine strength, but that lineage rarely travels all the way to a customer-visible AI output (Datafold).

The resolution is a governed analytics layer where metrics are defined in code, tied to an explicit semantic model, and queryable with a consistent, auditable contract. When the data layer is structured that way (as it is in platforms like Embeddable, where metrics and models are defined in code), an AI agent's answer inherits that provenance automatically. The customer can trace the number. That's what makes the answer trustworthy, not just correct.
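One way to picture an answer inheriting provenance is sketched below. Everything in it is hypothetical and chosen for illustration: the `MetricDefinition` and `Answer` types, the `REGISTRY` dictionary, the version string, and the SQL fragment. It is not Embeddable's or dbt's API, only the shape of the idea.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """A single authoritative, version-controlled metric definition."""
    name: str
    version: str
    sql: str
    description: str


# Hypothetical registry; in practice this would live in a reviewed repository.
REGISTRY = {
    "monthly_active_users": MetricDefinition(
        name="monthly_active_users",
        version="1.2.0",
        sql="COUNT(DISTINCT user_id)",
        description="Distinct users with at least one event in the month.",
    ),
}


@dataclass(frozen=True)
class Answer:
    """An agent answer that carries its own provenance."""
    value: float
    metric: str
    metric_version: str
    tenant_id: str
    environment: str


def answer_with_provenance(
    metric_name: str, value: float, tenant_id: str, environment: str
) -> Answer:
    # Raises KeyError if the agent cites a metric that has no definition,
    # so an improvised figure cannot be presented as a governed one.
    metric = REGISTRY[metric_name]
    return Answer(
        value=value,
        metric=metric.name,
        metric_version=metric.version,
        tenant_id=tenant_id,
        environment=environment,
    )


ans = answer_with_provenance("monthly_active_users", 3200, "acme", "production")
print(asdict(ans))
```

Because the answer object records the metric name, definition version, tenant, and environment, the same record can be shown to the customer and written to an audit log.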

The Production-Readiness Checklist for a Customer-Facing AI Agent

Most teams we talk to have nailed two or three of these. The ones that have shipped confidently to paying customers have all six.

  • Row-level security enforced at the data layer, not delegated to the agent's prompt or the application tier. The query itself must be scoped to the requesting tenant before the agent ever sees results.
  • Per-tenant data environments: staging and production data are isolated, so an agent tested against synthetic data doesn't inadvertently touch live customer rows during rollout.
  • Metrics defined in code: revenue, churn, usage, and every derived figure your agent might cite should have a single authoritative definition, version-controlled and auditable. Defining models in code is the pattern that makes this possible.
  • Access policies enforced at query time: role-aware access policies that travel with the data model, not scattered across the application layer.
  • Auditability of AI-generated outputs: every answer the agent surfaces should be traceable to a metric definition, a tenant context, and an environment.
  • Environment promotion from staging to production: per-tenant environments mean schema changes are validated before they reach customers.

Tools like Metabase (which offers embedded analytics features) handle some of this for simpler deployments and are worth knowing. But once multi-tenancy and agent-generated outputs are in scope, the checklist above is the floor, not a stretch goal. When the data layer satisfies all six, an agent built on top of it inherits governance it didn't have to build itself. That's the governed resolution pattern: the infrastructure earns the trust so the agent doesn't have to fake it.
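The six items can also be treated as a release gate rather than a document. The sketch below is one hypothetical way to encode that; the `ReadinessChecklist` class and its field names are assumptions made for illustration, and the booleans would need to be backed by real checks in practice.

```python
from dataclasses import dataclass, fields


@dataclass
class ReadinessChecklist:
    """One flag per checklist item; everything defaults to not done."""
    row_level_security: bool = False
    per_tenant_environments: bool = False
    metrics_defined_in_code: bool = False
    query_time_access_policies: bool = False
    auditable_outputs: bool = False
    environment_promotion: bool = False


def missing_items(checklist: ReadinessChecklist) -> list[str]:
    """Return the names of checklist items that are not yet satisfied."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def is_production_ready(checklist: ReadinessChecklist) -> bool:
    """All six must hold; two or three out of six is still a prototype."""
    return not missing_items(checklist)


partial = ReadinessChecklist(row_level_security=True, metrics_defined_in_code=True)
print(is_production_ready(partial))  # False
print(missing_items(partial))
```

Defaulting every flag to `False` means an item nobody has looked at counts as missing, which matches the "floor, not a stretch goal" framing.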

Frequently Asked Questions

Why Do AI Agents That Work in Demos Fail When Real Customers Use Them?

Demo environments are single-tenant, clean, and forgiving. Real production adds multi-tenancy, conflicting schema versions, and customers who will dispute a wrong number. The agent looks brilliant in the demo because the conditions are artificial.

What Is the Difference Between a Reliability Failure and a Correctness Failure in an AI Agent?

Reliability failure means the agent breaks, times out, or errors. Correctness failure means it answers confidently using the wrong data: a stale metric, a misnamed field, or another tenant's rows. Only reliability is a model problem; correctness is a data-layer problem.

How Does Multi-Tenancy Create Data Security Risks for Customer-Facing AI Agents?

Without row-level security enforced at the data layer, a natural language query can return rows belonging to a different tenant. The agent has no way to know the data is wrong; it answers confidently from whatever the query returns.

What Does It Mean to Define Metrics in Code, and Why Does It Matter for AI Agents?

Defining metrics in code means the business logic for a measure (say, monthly active users) lives in a version-controlled model rather than an ad-hoc SQL string. Agents that build on a defined metrics layer inherit consistent, auditable definitions instead of amplifying whatever inconsistency already exists in the database.

What Should a Production-Readiness Checklist for a Customer-Facing AI Agent Include?

At minimum: per-tenant row-level security, versioned metric definitions, isolated staging and production environments, and an audit trail for AI-generated outputs. If a customer cannot verify where an answer came from, the agent is not production-ready regardless of how accurate it feels.
