
Generative AI
You Shipped an AI Agent. Now What?
Learn how tracing and continuous evaluation give you the operational visibility to catch regressions before your users do.

Bauke Brenninkmeijer
Research Engineer

In traditional software, bugs are loud. In AI, bugs whisper.
When your API has a bug, you get a 500. A stack trace. An alert fires, someone gets paged, you fix it. You've built entire workflows around this kind of failure: error tracking, alerting, incident response. It works because the failure announces itself.
AI failures are silent. Your agent doesn't throw an exception when it starts hallucinating, and your RAG pipeline doesn't return an error when retrieval quality degrades. Everything keeps running. The metrics look green, latency is fine, uptime is 100%. And meanwhile the product is quietly getting worse.
This is the part nobody warns you about when you ship an AI agent. The fun part (prompt engineering, tool integration, retrieval pipelines, the demo that made everyone in the room lean forward) is over. What replaces it is a different problem entirely, and most teams aren't set up for it.
Everyone is building. Almost nobody is measuring
The numbers are out of balance in a way that should worry you.
McKinsey's 2025 State of AI report finds that 62% of organizations are at least experimenting with AI agents. Gartner expects 40% of enterprise applications to feature task-specific AI agents by the end of 2026, up from less than 5% the year before.
Meanwhile, only 18% of software engineering teams are actually using AI evaluation and observability platforms. That number is projected to hit 60% by 2028, but 2028 is not today.
62% experimenting. 18% measuring. That gap is where projects go to die.
Gartner also predicts more than 40% of agentic AI projects will be canceled by the end of 2027. The reasons cited (escalating costs, unclear business value, weak risk controls) all collapse into one underlying problem: teams can't see what their AI is doing, can't measure whether it's working, and can't make a defensible case to keep it funded.
What separates the projects that survive isn't better models or bigger GPU budgets. It's operational discipline: the ability to trace what the system is doing, evaluate whether it's doing it well, and prove that to the business.
What silent failure looks like in practice
This is already happening.
A model provider rolls out an update. OpenAI, Anthropic, Google: they all push rolling updates that change model behavior without changing the model name in your API call. Prompts that worked last week start producing subtly different outputs. The formatting breaks, the model gets verbose and burns tokens, it stops following an instruction it used to handle fine. No changelog, no error. Just different.
A customer support agent works great in testing, where conversations are short. In production, users have long, multi-turn conversations. Chroma's "Context Rot" study of 18 frontier models found that performance degrades well before the advertised context limits, even on simple tasks. The agent starts forgetting the system prompt, contradicting earlier responses, and hallucinating details, while the uptime dashboard shows 100%.
An e-commerce company's AI-powered product descriptions drift after the underlying model gets updated. Hallucinated specs creep in: wrong dimensions, incorrect compatibility claims. Pages look fine at a glance, but customer returns climb for weeks before anyone connects the dots.
These aren't hypotheticals from a threat-modeling exercise. Air Canada was ordered to honor a refund policy its chatbot invented. DPD's support bot was jailbroken into swearing at customers and writing a haiku about how useless it was. A Chevy dealer's chatbot agreed to sell a Tahoe for $1. Every one passed its uptime checks on the way to the front page.
You already have the mental model for this
If you're a software engineer, the concepts map directly. They just go by different names.
Tests become evals. You write unit and integration tests to verify behavior, run them before every deploy, and catch regressions. The AI equivalent is evaluation: running your system against a set of known inputs and checking that the outputs are still good. The catch is that AI evals can't rely on equality checks. Tests are like checking whether a calculator works; evals are like grading an essay. That's why teams reach for LLM-as-judge approaches and custom scoring functions — though these come with their own problems: judge models can be inconsistent, biased, and sometimes hallucinate their own scores, so your evals need testing too. The hard part is defining "good enough" for fuzzy outputs when you don't have ground truth on day one. But the underlying discipline is the one you already practice: define what "correct" looks like as best you can, automate the check, iterate on the definition as you learn.
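To make the LLM-as-judge idea concrete, here is a minimal sketch. `call_judge_model` is a placeholder for whatever provider client you actually use; it's stubbed with a trivial keyword check so the example is self-contained, and the prompt wording is invented for illustration. Note the range check on the returned score: judges need validating too.

```python
# Sketch of an LLM-as-judge eval. `call_judge_model` stands in for a real
# provider call (OpenAI, Anthropic, a local model); here it is a stub so
# the example runs on its own.

JUDGE_PROMPT = """Score the ANSWER for factual grounding in the CONTEXT.
Reply with a single integer from 1 (ungrounded) to 5 (fully grounded).

CONTEXT: {context}
ANSWER: {answer}"""

def call_judge_model(prompt: str) -> str:
    # Placeholder: a real implementation would call your provider's API.
    return "5" if "30 days" in prompt else "1"

def judge_groundedness(context: str, answer: str) -> int:
    raw = call_judge_model(JUDGE_PROMPT.format(context=context, answer=answer))
    score = int(raw.strip())
    if not 1 <= score <= 5:  # judge models hallucinate scores too; validate
        raise ValueError(f"judge returned out-of-range score: {raw!r}")
    return score
```

The validation step is the point: treat the judge's output as untrusted input, the same way you would any external API response.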
Logs become traces. You wouldn't run a web service without request logging or a microservice architecture without distributed tracing. AI traces are the same idea, adapted for AI primitives: LLM calls, retrieval, and tool execution instead of HTTP spans. When something goes wrong, you open the trace and walk it top to bottom.
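The shape of the idea fits in a few lines. This is a hand-rolled sketch, not a real tracer; in practice you'd use OpenTelemetry or a vendor SDK, and the span names and attributes here are invented for illustration.

```python
import time
from contextlib import contextmanager

# Minimal tracer sketch: each AI primitive (retrieval, LLM call, tool call)
# gets its own timed span, collected into one trace.

TRACE: list[dict] = []

@contextmanager
def span(name: str, **attrs):
    start = time.perf_counter()
    record = {"name": name, **attrs}
    TRACE.append(record)
    try:
        yield record
    finally:
        record["ms"] = round((time.perf_counter() - start) * 1000, 1)

# A toy pipeline with one span per step.
with span("retrieval", query="return policy shoes") as s:
    s["docs"] = ["policy_returns_apparel_v3"]
with span("llm.generate", model="gpt-4.1-mini") as s:
    s["output"] = "Yes, you can return within 30 days."
```

When something breaks, `TRACE` is what you walk top to bottom: every step, with its inputs, outputs, and timing attached.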
CI/CD becomes continuous evaluation. In modern software, you don't just test before deploy. You monitor in production. Error rates, latency percentiles, business metrics. The AI version is online evaluation: scoring a sample of production outputs continuously and alerting when quality drifts. Pair it with shift-left testing (building eval datasets early, running evals during development) and you get the single most important capability: a bidirectional feedback loop between offline and online evaluation, where production data improves your evals and evals inform what to monitor.
Code review becomes eval review. Before merging a prompt change or model swap, someone reviews how it moved the eval scores. Did accuracy drop? Did latency change? Did cost per conversation go up?
You don't need a new discipline. You need the one you already have, pointed at a non-deterministic system.
What tracing actually gives you
Tracing makes failures debuggable. Evaluation makes them measurable. Start with tracing.
When you instrument an AI system, you capture the full lifecycle of every request: from the user's input through every retrieval, tool call, and model interaction, to the final output. A simple plaintext trace looks like this:
```
trace_id: tr_9f31
request: "Can I return shoes bought 40 days ago?"

span 1  retrieval      42ms
  query: "return policy shoes 40 days"
  docs: [policy_returns_apparel_v3]   # WRONG: apparel policy, not footwear
  score: 0.61                         # low confidence, retrieved anyway

span 2  llm.generate   810ms
  model: gpt-4.1-mini
  input_tokens: 812
  output_tokens: 146
  cost_usd: 0.0042

span 3  tool.http      95ms
  tool: order_lookup
  status: 200

final_answer: "Yes, you can return within 30 days."
```
Real traces are messier than this (nested agents, retry loops, ambiguous failures), but even this simplified version shows the root cause. Retrieval pulled the apparel return policy (30 days) instead of the footwear policy (45 days), with a low similarity score nobody was alerting on. The model dutifully answered from the wrong document. The HTTP layer and the LLM call both look green — the bug lives entirely in the retrieval span.
That single trace gives you three things you couldn't otherwise have.
First, debugging that's actually possible. A user reports a bad answer. Without traces you're guessing, and non-deterministic systems don't reproduce reliably. With traces you pull up the exact request and walk through it: retrieval returned irrelevant documents, or the model ignored the system prompt, or a tool failed silently and the model improvised. You can see it.
Second, cost attribution that makes sense. Your inference bill spiked 40% this month. With traces you can break it down per workflow, per step, per model, and find the prompt template that's burning 3x the tokens it should. One team we worked with found their agents spending $20 per run on web research, then hitting a timeout and wasting the entire budget. The issue only surfaced intermittently because of non-deterministic tool use, which made it impossible to debug without tracing.
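Once spans carry a cost field, attribution is a group-by. A sketch, with toy span records standing in for what your tracing backend would return:

```python
from collections import defaultdict

# Toy span records; in a real system these come from your tracing backend,
# and the workflow/step names are whatever your instrumentation uses.
spans = [
    {"workflow": "support_agent", "step": "llm.generate", "cost_usd": 0.0042},
    {"workflow": "support_agent", "step": "llm.generate", "cost_usd": 0.0051},
    {"workflow": "web_research",  "step": "llm.generate", "cost_usd": 0.4200},
]

def cost_by(spans: list[dict], key: str) -> dict:
    """Sum cost_usd across spans, grouped by any span attribute."""
    totals: dict = defaultdict(float)
    for s in spans:
        totals[s[key]] += s.get("cost_usd", 0.0)
    return dict(totals)
```

Group by `"workflow"` to find the expensive agent, by `"step"` to find the expensive prompt template, by `"model"` to sanity-check a model swap.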
Third, latency you can actually fix. Users complain the agent is slow. Traces give you the per-span breakdown the same way Jaeger or Datadog APM would for microservices. One client reported slow agents. We pulled up the waterfall view and showed them that nearly all time was spent on LLM inference, not retrieval or tool calls. That meant the fix was in prompting (shorter outputs, fewer reasoning steps), not infrastructure. Without the trace, they'd have been optimizing the wrong layer.
If you've used distributed tracing before, you already know this workflow — the span types are different, but everything else is the same.
What evaluation actually gives you
Tracing shows you what happened. Evaluation tells you whether it was any good.
Most teams "evaluate" their AI by trying a few prompts manually and eyeballing the output — the equivalent of testing a web app by clicking around for five minutes before shipping. It catches the obvious problems and misses everything subtle.
Systematic evaluation works in two modes:
Offline, before deployment. You maintain a set of test cases (your test suite). Before deploying a change, run the system against them and check the results. Did accuracy hold? Did edge cases break? This catches what manual checking misses: the prompt tweak that improves the common case but breaks the edge case, the model upgrade that reasons better but follows format instructions worse.
Online, in production. You continuously score a sample of real production outputs. Think of it as production monitoring, but for quality: hallucination rates, relevance scores, safety checks. Track these over time. Alert when they drift. This is how you catch slow degradation before it becomes a customer-facing incident: the silent provider update, the stale retrieval index, the shifted input distribution.
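The online mode reduces to two pieces: a scorer and a drift rule. A minimal sketch, where `score_output` is a stand-in for whatever quality metric you run on sampled production outputs (a judge model, a safety classifier, a heuristic), and the placeholder heuristic here is invented:

```python
from statistics import mean

def score_output(output: str) -> float:
    # Placeholder heuristic: treat empty or refused answers as zero quality.
    # A real scorer would be a judge model or task-specific check.
    return 0.0 if not output or "I cannot" in output else 1.0

def drifted(scores: list[float], baseline: float, tolerance: float = 0.1) -> bool:
    """Alert when sampled quality drops more than `tolerance` below baseline."""
    return mean(scores) < baseline - tolerance
```

The baseline comes from your offline eval runs; the alert fires on the gap between the two. That comparison is the bidirectional loop in miniature.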
What systematic evals catch that vibes-checking doesn't:
Hallucination rates creeping up 2-3% per week after a provider model update. Invisible day-to-day, obvious over a month.
Safety regressions from a system prompt tweak that's better at answering questions but worse at refusing harmful ones.
Performance that's great for English but terrible for Spanish, or great for short queries but terrible for long ones. Slicing evals by segment reveals these patterns.
Cost per conversation doubling because the model started generating longer responses after a temperature change.
Where do you stand?
A self-assessment. Everyone starts at the bottom. The question is whether you're moving up.
| Level | What it looks like |
|---|---|
| 0. Vibes | "It seems to work." You try a few prompts manually after changes. |
| 1. Logging | Inputs and outputs get saved somewhere. You can investigate after users complain. |
| 2. Tracing | Full request lifecycle captured. You can debug, attribute costs, profile latency. |
| 3. Evaluation | Automated quality scoring on production traffic. You catch regressions before users do. |
| 4. Continuous | Evals in CI, drift alerts, traces feeding eval datasets. You ship changes with confidence. |
Most teams we talk to are at vibes or logging. In our experience, the ones still in production a year later tend to be at evaluation or higher.
The parallel to traditional software maturity is exact: "it compiles, ship it" → unit tests → CI/CD → production monitoring → observability platforms. AI is on the same path, a few years behind. The teams that make the jump first will outpace the rest, just as early CI adopters left manual testers behind.
Start small. Start now
You don't need a full evaluation platform before your next deploy. You need to take one step up from where you are.
| If you're at... | Do this next |
|---|---|
| Vibes | Add tracing. Instrument your main workflow. You'll learn more about your system in a day of reading traces than in a month of building it. |
| Logging | Add structure. Capture the full request lifecycle, not just the final input and output. That's the difference between a log line and a trace. |
| Tracing | Add one offline eval. Pick the workflow that would embarrass you if it broke and build a small test set. Run it before every deploy. Even "does the output contain the expected entity" catches real issues. |
| Evaluation | Close the loop. Use traces to build eval datasets from real production data. Set up alerts on quality metrics. Run evals in CI. |
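That first offline eval really can be this small. A sketch, where `run_agent` is a placeholder for your actual pipeline and the test cases are invented; real ones would live in a versioned dataset file:

```python
# The smallest useful offline eval: does the output contain the expected
# entity? `run_agent` is a stand-in for your real system.

EVAL_SET = [
    {"input": "Return window for shoes?",  "must_contain": "45 days"},
    {"input": "Return window for shirts?", "must_contain": "30 days"},
]

def run_agent(user_input: str) -> str:
    # Placeholder: call your actual pipeline here.
    return "Footwear can be returned within 45 days; apparel within 30 days."

def run_evals() -> int:
    """Return the number of failing cases (zero means the gate passes)."""
    failures = 0
    for case in EVAL_SET:
        output = run_agent(case["input"])
        if case["must_contain"] not in output:
            print(f"FAIL: {case['input']!r} missing {case['must_contain']!r}")
            failures += 1
    return failures
```

Wire the failure count into your CI job's exit code and you have a deploy gate. It's crude, and it will still catch the regression that a five-minute manual check misses.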
Hamel Husain and others call this approach eval-driven development: the AI equivalent of TDD. Build your evaluation datasets. Use them as the scoring rubric during development. Run them before every deploy. Monitor drift in production. Feed production data back into better evals. It's the closest thing the field has to a settled best practice.
None of this requires exotic tooling or a dedicated ML ops team. It requires the engineering rigor you already have, pointed at a new kind of system.
The whisper test
Here's the question to take back to your team on Monday.
If your agent quietly started getting worse tomorrow — not failing, just drifting — how long before you'd notice? Hours? Days? The next quarterly review? A customer tweet?
If the honest answer is anything other than "within hours, automatically," you've got work to do. The teams whose agents are still running in 2027 will be the ones who answered that question honestly and built the feedback loops to back it up.
Bugs whisper. Build something that listens.
This is the problem we work on every day at orq.ai. We're one of the representative vendors in the Gartner Market Guide for AI Evaluation and Observability Platforms, the same guide that puts current adoption at 18%. If any of this sounds familiar, come take a look.
In traditional software, bugs are loud. In AI, bugs whisper.
When your API has a bug, you get a 500. A stack trace. An alert fires, someone gets paged, you fix it. You've built entire workflows around this kind of failure: error tracking, alerting, incident response. It works because the failure announces itself.
AI failures are silent. Your agent doesn't throw an exception when it starts hallucinating, and your RAG pipeline doesn't return an error when retrieval quality degrades. Everything keeps running. The metrics look green, latency is fine, uptime is 100%. And meanwhile the product is quietly getting worse.
This is the part nobody warns you about when you ship an AI agent. The fun part (prompt engineering, tool integration, retrieval pipelines, the demo that made everyone in the room lean forward) is over. What replaces it is a different problem entirely, and most teams aren't set up for it.
Everyone is building. Almost nobody is measuring
The numbers are out of balance in a way that should worry you.
McKinsey's 2025 State of AI report finds that 62% of organizations are at least experimenting with AI agents. Gartner expects 40% of enterprise applications to feature task-specific AI agents by the end of 2026, up from less than 5% the year before.
Meanwhile, only 18% of software engineering teams are actually using AI evaluation and observability platforms. That number is projected to hit 60% by 2028, but 2028 is not today.
62% experimenting. 18% measuring. That gap is where projects go to die.
Gartner also predicts more than 40% of agentic AI projects will be canceled by the end of 2027. The reasons cited (escalating costs, unclear business value, weak risk controls) all collapse into one underlying problem: teams can't see what their AI is doing, can't measure whether it's working, and can't make a defensible case to keep it funded.
What separates the projects that survive isn't better models or bigger GPU budgets. It's operational discipline: the ability to trace what the system is doing, evaluate whether it's doing it well, and prove that to the business.
What silent failure looks like in practice
This is already happening.
A model provider rolls out an update. OpenAI, Anthropic, Google: they all push rolling updates that change model behavior without changing the model name in your API call. Prompts that worked last week start producing subtly different outputs. The formatting breaks, the model gets verbose and burns tokens, it stops following an instruction it used to handle fine. No changelog, no error. Just different.
A customer support agent works great in testing, where conversations are short. In production, users have long, multi-turn conversations. Chroma's "Context Rot" study of 18 frontier models found that performance degrades well before the advertised context limits, even on simple tasks. The agent starts forgetting the system prompt, contradicting earlier responses, and hallucinating details, while the uptime dashboard shows 100%.
An e-commerce company's AI-powered product descriptions drift after the underlying model gets updated. Hallucinated specs creep in: wrong dimensions, incorrect compatibility claims. Pages look fine at a glance, but customer returns climb for weeks before anyone connects the dots.
These aren't hypotheticals from a threat-modeling exercise. Air Canada was ordered to honor a refund policy its chatbot invented. DPD's support bot was jailbroken into swearing at customers and writing a haiku about how useless it was. A Chevy dealer's chatbot agreed to sell a Tahoe for $1. Every one passed its uptime checks on the way to the front page.
You already have the mental model for this
If you're a software engineer, the concepts map directly. They just go by different names.
Tests become evals. You write unit and integration tests to verify behavior, run them before every deploy, and catch regressions. The AI equivalent is evaluation: running your system against a set of known inputs and checking that the outputs are still good. The catch is that AI evals can't rely on equality checks. Tests are like checking whether a calculator works; evals are like grading an essay. That's why teams reach for LLM-as-judge approaches and custom scoring functions — though these come with their own problems: judge models can be inconsistent, biased, and sometimes hallucinate their own scores, so your evals need testing too. The hard part is defining "good enough" for fuzzy outputs when you don't have ground truth on day one. But the underlying discipline is the one you already practice: define what "correct" looks like as best you can, automate the check, iterate on the definition as you learn.
Logs become traces. You wouldn't run a web service without request logging or a microservice architecture without distributed tracing. AI traces are the same idea, adapted for AI primitives: LLM calls, retrieval, and tool execution instead of HTTP spans. When something goes wrong, you open the trace and walk it top to bottom.
CI/CD becomes continuous evaluation. In modern software, you don't just test before deploy. You monitor in production. Error rates, latency percentiles, business metrics. The AI version is online evaluation: scoring a sample of production outputs continuously and alerting when quality drifts. Pair it with shift-left testing (building eval datasets early, running evals during development) and you get the single most important capability: a bidirectional feedback loop between offline and online evaluation, where production data improves your evals and evals inform what to monitor.
Code review becomes eval review. Before merging a prompt change or model swap, someone reviews how it moved the eval scores. Did accuracy drop? Did latency change? Did cost per conversation go up?
You don't need a new discipline. You need the one you already have, pointed at a non-deterministic system.
What tracing actually gives you
Tracing makes failures debuggable. Evaluation makes them measurable. Start with tracing.
When you instrument an AI system, you capture the full lifecycle of every request: from the user's input through every retrieval, tool call, and model interaction, to the final output. A simple plaintext trace looks like this:
trace_id: tr_9f31 request: "Can I return shoes bought 40 days ago?" span 1 retrieval 42ms query: "return policy shoes 40 days" docs: [policy_returns_apparel_v3] # WRONG: apparel policy, not footwear score: 0.61 # low confidence, retrieved anyway span 2 llm.generate 810ms model: gpt-4.1-mini input_tokens: 812 output_tokens: 146 cost_usd: 0.0042 span 3 tool.http 95ms tool: order_lookup status: 200 final_answer: "Yes, you can return within 30 days."
Real traces are messier than this (nested agents, retry loops, ambiguous failures), but even this simplified version shows the root cause. Retrieval pulled the apparel return policy (30 days) instead of the footwear policy (45 days), with a low similarity score nobody was alerting on. The model dutifully answered from the wrong document. The HTTP layer and the LLM call both look green — the bug lives entirely in the retrieval span.
That single trace gives you three things you couldn't otherwise have.
First, debugging that's actually possible. A user reports a bad answer. Without traces you're guessing, and non-deterministic systems don't reproduce reliably. With traces you pull up the exact request and walk through it: retrieval returned irrelevant documents, or the model ignored the system prompt, or a tool failed silently and the model improvised. You can see it.
Second, cost attribution that makes sense. Your inference bill spiked 40% this month. With traces you can break it down per workflow, per step, per model, and find the prompt template that's burning 3x the tokens it should. One team we worked with found their agents spending $20 per run on web research, then hitting a timeout and wasting the entire budget. The issue only surfaced intermittently because of non-deterministic tool use, which made it impossible to debug without tracing.
Third, latency you can actually fix. Users complain the agent is slow. Traces give you the per-span breakdown the same way Jaeger or Datadog APM would for microservices. One client reported slow agents. We pulled up the waterfall view and showed them that nearly all time was spent on LLM inference, not retrieval or tool calls. That meant the fix was in prompting (shorter outputs, fewer reasoning steps), not infrastructure. Without the trace, they'd have been optimizing the wrong layer.
If you've used distributed tracing before, you already know this workflow — the span types are different, but everything else is the same.
What evaluation actually gives you
Tracing shows you what happened. Evaluation tells you whether it was any good.
Most teams "evaluate" their AI by trying a few prompts manually and eyeballing the output — the equivalent of testing a web app by clicking around for five minutes before shipping. It catches the obvious problems and misses everything subtle.
Systematic evaluation works in two modes:
Offline, before deployment. You maintain a set of test cases (your test suite). Before deploying a change, run the system against them and check the results. Did accuracy hold? Did edge cases break? This catches what manual checking misses: the prompt tweak that improves the common case but breaks the edge case, the model upgrade that reasons better but follows format instructions worse.
Online, in production. You continuously score a sample of real production outputs. Think of it as production monitoring, but for quality: hallucination rates, relevance scores, safety checks. Track these over time. Alert when they drift. This is how you catch slow degradation before it becomes a customer-facing incident: the silent provider update, the stale retrieval index, the shifted input distribution.
What systematic evals catch that vibes-checking doesn't:
Hallucination rates creeping up 2-3% per week after a provider model update. Invisible day-to-day, obvious over a month.
Safety regressions from a system prompt tweak that's better at answering questions but worse at refusing harmful ones.
Performance that's great for English but terrible for Spanish, or great for short queries but terrible for long ones. Slicing evals by segment reveals these patterns.
Cost per conversation doubling because the model started generating longer responses after a temperature change.
Where do you stand?
A self-assessment. Everyone starts at the bottom. The question is whether you're moving up.
Level | What it looks like |
|---|---|
0. Vibes | "It seems to work." You try a few prompts manually after changes. |
1. Logging | Inputs and outputs get saved somewhere. You can investigate after users complain. |
2. Tracing | Full request lifecycle captured. You can debug, attribute costs, profile latency. |
3. Evaluation | Automated quality scoring on production traffic. You catch regressions before users do. |
4. Continuous | Evals in CI, drift alerts, traces feeding eval datasets. You ship changes with confidence. |
Most teams we talk to are at vibes or logging. In our experience, the ones still in production a year later tend to be at evaluation or higher.
The parallel to traditional software maturity is exact: "it compiles, ship it" → unit tests → CI/CD → production monitoring → observability platforms. AI is on the same path, a few years behind. The teams that make the jump first will outpace the rest, just as early CI adopters left the manual-testers behind.
Start small. Start now
You don't need a full evaluation platform before your next deploy. You need to take one step up from where you are.
If you're at... | Do this next |
|---|---|
Vibes | Add tracing. Instrument your main workflow. You'll learn more about your system in a day of reading traces than in a month of building it. |
Logging | Add structure. Capture the full request lifecycle, not just the final input and output. That's the difference between a log line and a trace. |
Tracing | Add one offline eval. Pick the workflow that would embarrass you if it broke and build a small test set. Run it before every deploy. Even "does the output contain the expected entity" catches real issues. |
Evaluation | Close the loop. Use traces to build eval datasets from real production data. Set up alerts on quality metrics. Run evals in CI. |
Hamel Husain and others call this approach eval-driven development: the AI equivalent of TDD. Build your evaluation datasets. Use them as the scoring rubric during development. Run them before every deploy. Monitor drift in production. Feed production data back into better evals. It's the closest thing the field has to a settled best practice.
None of this requires exotic tooling or a dedicated ML ops team. It requires the engineering rigor you already have, pointed at a new kind of system.
The whisper test
Here's the question to take back to your team on Monday.
If your agent quietly started getting worse tomorrow — not failing, just drifting — how long before you'd notice? Hours? Days? The next quarterly review? A customer tweet?
If the honest answer is anything other than "within hours, automatically," you've got work to do. The teams whose agents are still running in 2027 will be the ones who answered that question honestly and built the feedback loops to back it up.
Bugs whisper. Build something that listens.
This is the problem we work on every day at orq.ai. We're one of the representative vendors in the Gartner Market Guide for AI Evaluation and Observability Platforms, the same guide that puts current adoption at 18%. If any of this sounds familiar, come take a look.
In traditional software, bugs are loud. In AI, bugs whisper.
When your API has a bug, you get a 500. A stack trace. An alert fires, someone gets paged, you fix it. You've built entire workflows around this kind of failure: error tracking, alerting, incident response. It works because the failure announces itself.
AI failures are silent. Your agent doesn't throw an exception when it starts hallucinating, and your RAG pipeline doesn't return an error when retrieval quality degrades. Everything keeps running. The metrics look green, latency is fine, uptime is 100%. And meanwhile the product is quietly getting worse.
This is the part nobody warns you about when you ship an AI agent. The fun part (prompt engineering, tool integration, retrieval pipelines, the demo that made everyone in the room lean forward) is over. What replaces it is a different problem entirely, and most teams aren't set up for it.
Everyone is building. Almost nobody is measuring
The numbers are out of balance in a way that should worry you.
McKinsey's 2025 State of AI report finds that 62% of organizations are at least experimenting with AI agents. Gartner expects 40% of enterprise applications to feature task-specific AI agents by the end of 2026, up from less than 5% the year before.
Meanwhile, only 18% of software engineering teams are actually using AI evaluation and observability platforms. That number is projected to hit 60% by 2028, but 2028 is not today.
62% experimenting. 18% measuring. That gap is where projects go to die.
Gartner also predicts more than 40% of agentic AI projects will be canceled by the end of 2027. The reasons cited (escalating costs, unclear business value, weak risk controls) all collapse into one underlying problem: teams can't see what their AI is doing, can't measure whether it's working, and can't make a defensible case to keep it funded.
What separates the projects that survive isn't better models or bigger GPU budgets. It's operational discipline: the ability to trace what the system is doing, evaluate whether it's doing it well, and prove that to the business.
What silent failure looks like in practice
This is already happening.
A model provider rolls out an update. OpenAI, Anthropic, Google: they all push rolling updates that change model behavior without changing the model name in your API call. Prompts that worked last week start producing subtly different outputs. The formatting breaks, the model gets verbose and burns tokens, it stops following an instruction it used to handle fine. No changelog, no error. Just different.
A customer support agent works great in testing, where conversations are short. In production, users have long, multi-turn conversations. Chroma's "Context Rot" study of 18 frontier models found that performance degrades well before the advertised context limits, even on simple tasks. The agent starts forgetting the system prompt, contradicting earlier responses, and hallucinating details, while the uptime dashboard shows 100%.
An e-commerce company's AI-powered product descriptions drift after the underlying model gets updated. Hallucinated specs creep in: wrong dimensions, incorrect compatibility claims. Pages look fine at a glance, but customer returns climb for weeks before anyone connects the dots.
These aren't hypotheticals from a threat-modeling exercise. Air Canada was ordered to honor a refund policy its chatbot invented. DPD's support bot was jailbroken into swearing at customers and writing a haiku about how useless it was. A Chevy dealer's chatbot agreed to sell a Tahoe for $1. Every one passed its uptime checks on the way to the front page.
You already have the mental model for this
If you're a software engineer, the concepts map directly. They just go by different names.
Tests become evals. You write unit and integration tests to verify behavior, run them before every deploy, and catch regressions. The AI equivalent is evaluation: running your system against a set of known inputs and checking that the outputs are still good. The catch is that AI evals can't rely on equality checks. Tests are like checking whether a calculator works; evals are like grading an essay. That's why teams reach for LLM-as-judge approaches and custom scoring functions — though these come with their own problems: judge models can be inconsistent, biased, and sometimes hallucinate their own scores, so your evals need testing too. The hard part is defining "good enough" for fuzzy outputs when you don't have ground truth on day one. But the underlying discipline is the one you already practice: define what "correct" looks like as best you can, automate the check, iterate on the definition as you learn.
Logs become traces. You wouldn't run a web service without request logging or a microservice architecture without distributed tracing. AI traces are the same idea, adapted for AI primitives: LLM calls, retrieval, and tool execution instead of HTTP spans. When something goes wrong, you open the trace and walk it top to bottom.
CI/CD becomes continuous evaluation. In modern software, you don't just test before deploy. You monitor in production. Error rates, latency percentiles, business metrics. The AI version is online evaluation: scoring a sample of production outputs continuously and alerting when quality drifts. Pair it with shift-left testing (building eval datasets early, running evals during development) and you get the single most important capability: a bidirectional feedback loop between offline and online evaluation, where production data improves your evals and evals inform what to monitor.
Code review becomes eval review. Before merging a prompt change or model swap, someone reviews how it moved the eval scores. Did accuracy drop? Did latency change? Did cost per conversation go up?
You don't need a new discipline. You need the one you already have, pointed at a non-deterministic system.
What tracing actually gives you
Tracing makes failures debuggable. Evaluation makes them measurable. Start with tracing.
When you instrument an AI system, you capture the full lifecycle of every request: from the user's input through every retrieval, tool call, and model interaction, to the final output. A simple plaintext trace looks like this:
```
trace_id: tr_9f31
request: "Can I return shoes bought 40 days ago?"

span 1  retrieval      42ms
  query: "return policy shoes 40 days"
  docs: [policy_returns_apparel_v3]   # WRONG: apparel policy, not footwear
  score: 0.61                         # low confidence, retrieved anyway

span 2  llm.generate   810ms
  model: gpt-4.1-mini
  input_tokens: 812
  output_tokens: 146
  cost_usd: 0.0042

span 3  tool.http      95ms
  tool: order_lookup
  status: 200

final_answer: "Yes, you can return within 30 days."
```
Real traces are messier than this (nested agents, retry loops, ambiguous failures), but even this simplified version shows the root cause. Retrieval pulled the apparel return policy (30 days) instead of the footwear policy (45 days), with a low similarity score nobody was alerting on. The model dutifully answered from the wrong document. The HTTP layer and the LLM call both look green — the bug lives entirely in the retrieval span.
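Capturing this structure doesn't require heavy machinery to start. A hand-rolled sketch of the span model, assuming nothing beyond the standard library (a real system would use OpenTelemetry or a vendor SDK; every name here is illustrative):

```python
import time
from dataclasses import dataclass, field

# Illustrative span/trace structure: time each step, attach metadata,
# keep everything on one trace object keyed by trace_id.

@dataclass
class Span:
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)

@dataclass
class Trace:
    trace_id: str
    request: str
    spans: list = field(default_factory=list)

    def record(self, name: str, fn, **attributes):
        # Run a step, measure it, and record it as a span.
        start = time.perf_counter()
        result = fn()
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.spans.append(Span(name, elapsed_ms, attributes))
        return result

trace = Trace(trace_id="tr_9f31",
              request="Can I return shoes bought 40 days ago?")
docs = trace.record("retrieval",
                    lambda: ["policy_returns_apparel_v3"], score=0.61)
answer = trace.record("llm.generate",
                      lambda: "Yes, you can return within 30 days.",
                      model="gpt-4.1-mini", cost_usd=0.0042)
```

The point is the shape, not the code: every step gets a name, a duration, and arbitrary attributes, and the whole request shares one trace id you can look up later.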
That single trace gives you three things you couldn't otherwise have.
First, debugging that's actually possible. A user reports a bad answer. Without traces you're guessing, and non-deterministic systems don't reproduce reliably. With traces you pull up the exact request and walk through it: retrieval returned irrelevant documents, or the model ignored the system prompt, or a tool failed silently and the model improvised. You can see it.
Second, cost attribution that makes sense. Your inference bill spiked 40% this month. With traces you can break it down per workflow, per step, per model, and find the prompt template that's burning 3x the tokens it should. One team we worked with found their agents spending $20 per run on web research, then hitting a timeout and wasting the entire budget. The issue only surfaced intermittently because of non-deterministic tool use, which made it impossible to debug without tracing.
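Once spans carry workflow and cost metadata, attribution is a group-by. A sketch over invented span records:

```python
from collections import defaultdict

# Sketch: aggregate per-span LLM cost by workflow. The span records
# here are invented for illustration.

spans = [
    {"workflow": "support_agent", "step": "llm.generate", "cost_usd": 0.0042},
    {"workflow": "support_agent", "step": "llm.generate", "cost_usd": 0.0051},
    {"workflow": "web_research",  "step": "llm.generate", "cost_usd": 0.8100},
]

def cost_by_workflow(spans):
    totals = defaultdict(float)
    for s in spans:
        totals[s["workflow"]] += s["cost_usd"]
    return dict(totals)

totals = cost_by_workflow(spans)
# web_research dominates despite making fewer calls
```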
Third, latency you can actually fix. Users complain the agent is slow. Traces give you the per-span breakdown the same way Jaeger or Datadog APM would for microservices. One client reported slow agents. We pulled up the waterfall view and showed them that nearly all time was spent on LLM inference, not retrieval or tool calls. That meant the fix was in prompting (shorter outputs, fewer reasoning steps), not infrastructure. Without the trace, they'd have been optimizing the wrong layer.
If you've used distributed tracing before, you already know this workflow — the span types are different, but everything else is the same.
What evaluation actually gives you
Tracing shows you what happened. Evaluation tells you whether it was any good.
Most teams "evaluate" their AI by trying a few prompts manually and eyeballing the output — the equivalent of testing a web app by clicking around for five minutes before shipping. It catches the obvious problems and misses everything subtle.
Systematic evaluation works in two modes:
Offline, before deployment. You maintain a set of test cases (your test suite). Before deploying a change, run the system against them and check the results. Did accuracy hold? Did edge cases break? This catches what manual checking misses: the prompt tweak that improves the common case but breaks the edge case, the model upgrade that reasons better but follows format instructions worse.
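The pre-deploy check this implies can be sketched as a regression gate: compare a change's eval scores against a stored baseline and fail on meaningful drops. The metric names, scores, and tolerance below are illustrative.

```python
# Sketch of a pre-deploy regression gate for eval scores.

def regression_gate(baseline: dict, candidate: dict,
                    tolerance: float = 0.02) -> dict:
    # Flag any metric that dropped by more than the tolerance.
    regressions = {
        name: (baseline[name], score)
        for name, score in candidate.items()
        if score < baseline[name] - tolerance
    }
    return {"ok": not regressions, "regressions": regressions}

baseline = {"accuracy": 0.91, "format_adherence": 0.97}
# A model upgrade that reasons better but follows format worse:
candidate = {"accuracy": 0.93, "format_adherence": 0.88}
report = regression_gate(baseline, candidate)
```

Wired into CI, a failing gate blocks the deploy the same way a failing unit test would.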
Online, in production. You continuously score a sample of real production outputs. Think of it as production monitoring, but for quality: hallucination rates, relevance scores, safety checks. Track these over time. Alert when they drift. This is how you catch slow degradation before it becomes a customer-facing incident: the silent provider update, the stale retrieval index, the shifted input distribution.
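The sampling-and-alerting loop can be sketched in a few lines. The scorer, window size, sample rate, and threshold below are invented for illustration:

```python
import random
from collections import deque

# Sketch of online evaluation: score a sample of production outputs
# and alert when a rolling quality average drifts below a threshold.

class DriftMonitor:
    def __init__(self, window: int = 100, threshold: float = 0.85,
                 sample_rate: float = 0.1):
        self.scores = deque(maxlen=window)
        self.threshold = threshold
        self.sample_rate = sample_rate

    def observe(self, output: str, scorer, rng=random.random) -> bool:
        # Returns True once the rolling average drifts below threshold.
        if rng() > self.sample_rate:
            return False  # not sampled this time
        self.scores.append(scorer(output))
        avg = sum(self.scores) / len(self.scores)
        return len(self.scores) >= 10 and avg < self.threshold

monitor = DriftMonitor(sample_rate=1.0)  # sample everything for the demo
degraded = [monitor.observe(out, scorer=lambda o: 0.6)
            for out in ["..."] * 20]
# The alert fires once enough low-scoring samples accumulate.
```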
What systematic evals catch that vibes-checking doesn't:
Hallucination rates creeping up 2-3% per week after a provider model update. Invisible day-to-day, obvious over a month.
Safety regressions from a system prompt tweak that's better at answering questions but worse at refusing harmful ones.
Performance that's great for English but terrible for Spanish, or great for short queries but terrible for long ones. Slicing evals by segment reveals these patterns.
Cost per conversation doubling because the model started generating longer responses after a temperature change.
Where do you stand?
A self-assessment. Everyone starts at the bottom. The question is whether you're moving up.
| Level | What it looks like |
|---|---|
| 0. Vibes | "It seems to work." You try a few prompts manually after changes. |
| 1. Logging | Inputs and outputs get saved somewhere. You can investigate after users complain. |
| 2. Tracing | Full request lifecycle captured. You can debug, attribute costs, profile latency. |
| 3. Evaluation | Automated quality scoring on production traffic. You catch regressions before users do. |
| 4. Continuous | Evals in CI, drift alerts, traces feeding eval datasets. You ship changes with confidence. |
Most teams we talk to are at vibes or logging. In our experience, the ones still in production a year later tend to be at evaluation or higher.
The parallel to traditional software maturity is exact: "it compiles, ship it" → unit tests → CI/CD → production monitoring → observability platforms. AI is on the same path, a few years behind. The teams that make the jump first will outpace the rest, just as early CI adopters left manual testers behind.
Start small. Start now
You don't need a full evaluation platform before your next deploy. You need to take one step up from where you are.
| If you're at... | Do this next |
|---|---|
| Vibes | Add tracing. Instrument your main workflow. You'll learn more about your system in a day of reading traces than in a month of building it. |
| Logging | Add structure. Capture the full request lifecycle, not just the final input and output. That's the difference between a log line and a trace. |
| Tracing | Add one offline eval. Pick the workflow that would embarrass you if it broke and build a small test set. Run it before every deploy. Even "does the output contain the expected entity" catches real issues. |
| Evaluation | Close the loop. Use traces to build eval datasets from real production data. Set up alerts on quality metrics. Run evals in CI. |
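Closing the loop can start as simply as promoting well-rated production traces into eval cases. A sketch, with an invented trace-record shape and rating field:

```python
# Sketch: turn human-reviewed production traces into eval cases.
# The record shape and rating field are illustrative.

def traces_to_eval_cases(traces: list[dict],
                         min_user_rating: int = 4) -> list[dict]:
    # Keep traces a human rated highly; their outputs become references.
    return [
        {"input": t["request"], "reference": t["final_answer"]}
        for t in traces
        if t.get("user_rating", 0) >= min_user_rating
    ]

traces = [
    {"request": "Can I return shoes bought 40 days ago?",
     "final_answer": "Yes, footwear can be returned within 45 days.",
     "user_rating": 5},
    {"request": "Do you ship to Mars?",
     "final_answer": "Yes, absolutely.",  # bad answer, rated low
     "user_rating": 1},
]
cases = traces_to_eval_cases(traces)
```

Run nightly, this keeps the offline test set growing from exactly the distribution your users actually produce.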
Hamel Husain and others call this approach eval-driven development: the AI equivalent of TDD. Build your evaluation datasets. Use them as the scoring rubric during development. Run them before every deploy. Monitor drift in production. Feed production data back into better evals. It's the closest thing the field has to a settled best practice.
None of this requires exotic tooling or a dedicated ML ops team. It requires the engineering rigor you already have, pointed at a new kind of system.
The whisper test
Here's the question to take back to your team on Monday.
If your agent quietly started getting worse tomorrow — not failing, just drifting — how long before you'd notice? Hours? Days? The next quarterly review? A customer tweet?
If the honest answer is anything other than "within hours, automatically," you've got work to do. The teams whose agents are still running in 2027 will be the ones who answered that question honestly and built the feedback loops to back it up.
Bugs whisper. Build something that listens.
This is the problem we work on every day at orq.ai. We're one of the representative vendors in the Gartner Market Guide for AI Evaluation and Observability Platforms, the same guide that puts current adoption at 18%. If any of this sounds familiar, come take a look.

