
Delta evaluation: Production replay pipeline

November 2025 — April 2026 | Project: Delta Evaluation

Delta Evaluation (DE) is an end-to-end offline evaluation pipeline that replays recent production conversations through modified prompts, models, model parameters, or prompt-rendering logic and uses an LLM-as-judge to compare outputs against a baseline.

Armand Silviu Gurgu
Senior Machine Learning Scientist
Yasmine Messikh
Associate Machine Learning Scientist

Two-stage replay for increased throughput

Each replay produces a (baseline, candidate) pair. A cheap stage marks the pair as equivalent when it can: first via deterministic checks (matching tool calls and outputs), then via a lightweight, faster, and cheaper LLM equivalence pass for outputs that look textually different but might be semantically the same. Anything the cheap stage can't confidently mark equivalent goes to a stronger reasoning judge that sees the full conversation context, tools, and prompts and determines which output was better. Without this pre-filter, the expensive judge would have to run on every pair, and DE would be too slow to use before merging a prompt change. A typical run replays thousands of conversations for both baseline and candidate in tens of minutes of wall-clock time under high-concurrency async execution.
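As a minimal sketch of this escalation logic (assuming each replay is reduced to output text plus an ordered list of tool calls, with the equivalence model and the reasoning judge passed in as async callables; the names here are illustrative, not DE's actual interfaces):

    import asyncio
    from dataclasses import dataclass
    from typing import Awaitable, Callable, Literal

    Verdict = Literal["equivalent", "baseline_wins", "candidate_wins", "tie"]

    @dataclass
    class ReplayOutput:
        text: str
        tool_calls: list[dict]  # tool name, arguments, and result, in order

    @dataclass
    class ReplayPair:
        conversation_id: str
        baseline: ReplayOutput
        candidate: ReplayOutput

    async def judge_pair(
        pair: ReplayPair,
        cheap_equivalence: Callable[[ReplayPair], Awaitable[bool]],
        strong_judge: Callable[[ReplayPair], Awaitable[Verdict]],
    ) -> Verdict:
        # Stage 1a: deterministic check (identical tool calls and identical text).
        if (pair.baseline.tool_calls == pair.candidate.tool_calls
                and pair.baseline.text == pair.candidate.text):
            return "equivalent"
        # Stage 1b: lightweight LLM pass for outputs that differ textually
        # but may be semantically the same.
        if await cheap_equivalence(pair):
            return "equivalent"
        # Stage 2: stronger reasoning judge sees the full conversation,
        # tools, and prompts, and decides which output is better.
        return await strong_judge(pair)

    async def judge_run(pairs, cheap_equivalence, strong_judge, concurrency=64):
        # High-concurrency async execution keeps wall time in tens of minutes
        # even for thousands of replays.
        sem = asyncio.Semaphore(concurrency)

        async def bounded(pair):
            async with sem:
                return await judge_pair(pair, cheap_equivalence, strong_judge)

        return await asyncio.gather(*(bounded(p) for p in pairs))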

LLM-as-judge calibration

The judge runs on a reasoning model and emits a summarized reasoning trace alongside each verdict. Its prompt is calibrated to product and business intent. The verdicts aggregate into win-rate metrics, and the reasoning traces are clustered post hoc into themes that map back to specific prompt sections to revise. One illustrative theme that surfaced: the judge preferred a baseline that gave up over a candidate that asked a necessary clarifying question, because the judge prompt rewarded surface confidence over productive uncertainty. That observation drove a further recalibration of the judge prompt against the business-specific intent. The key insight is that verdicts and traces are complementary signals: verdicts give the win-rate aggregate that gates rollouts, while traces give the diagnostic detail that points at what to fix next. The verdicts feed our quality estimates; the traces let us track (and improve) the business alignment of the judges.
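The verdict side of that split reduces to a simple aggregation. A small illustration, reusing the hypothetical verdict labels from the sketch above (the exact labels and denominators are assumptions, not DE's schema):

    from collections import Counter

    def win_rate_metrics(verdicts: list[str]) -> dict[str, float]:
        # Aggregate per-pair verdicts into the win-rate metrics that gate rollouts.
        if not verdicts:
            return {}
        counts = Counter(verdicts)
        total = len(verdicts)
        decided = counts["baseline_wins"] + counts["candidate_wins"]
        return {
            # Share of contested pairs where the candidate was judged better.
            "candidate_win_rate": counts["candidate_wins"] / decided if decided else 0.0,
            # Share of all pairs where the change made no meaningful difference.
            "equivalent_or_tie_rate": (counts["equivalent"] + counts["tie"]) / total,
            # Share of all pairs where the judge preferred the baseline.
            "regression_rate": counts["baseline_wins"] / total,
        }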

Stratified sub-sampling for representative metrics

DE runs on a sample of recent production conversations, stratified to preserve the relative proportions of customers, languages, and conversation types from the source data. Metrics measured on the sample can be reweighted to estimate full-population behavior.
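A rough sketch of the sampling and reweighting, assuming each conversation carries a stratum key such as (customer, language, conversation type); the function names and the exact weighting scheme are illustrative assumptions:

    import random
    from collections import defaultdict

    def stratified_sample(conversations, stratum_of, sample_size, seed=0):
        # Sample so each stratum keeps roughly its share of production traffic.
        rng = random.Random(seed)
        by_stratum = defaultdict(list)
        for conv in conversations:
            by_stratum[stratum_of(conv)].append(conv)
        total = len(conversations)
        sample = []
        for items in by_stratum.values():
            k = max(1, round(sample_size * len(items) / total))
            sample.extend(rng.sample(items, min(k, len(items))))
        return sample

    def reweighted_mean(values, strata, population_share):
        # Estimate the full-population metric from per-stratum sample means,
        # weighted by each stratum's share of production (shares sum to 1).
        by_stratum = defaultdict(list)
        for value, stratum in zip(values, strata):
            by_stratum[stratum].append(value)
        return sum(
            population_share[s] * (sum(vs) / len(vs))
            for s, vs in by_stratum.items()
        )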

Coding-agent skill on top of the pipeline

Two coding-agent skills sit on top of the pipeline: an orchestration skill that runs end-to-end, and an analytical skill that loads run outputs and answers follow-up questions about them (which behaviors changed between baseline and candidate, whether each change is acceptable, what to try next). The analytical skill's first iteration consumed a substantial fraction of the agent's context window from a single large skill file; subsequent hardening moved most of the work into a Python module that the agent calls into, with caching for repeated analyses and an LLM step that clusters detected regressions into themes.
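One plausible shape for that module: the agent calls into a small analysis API instead of pulling raw run output into its context window, with caching for repeated follow-up questions and an injected LLM step for the theme clustering. The file layout, field names, and method names below are assumptions:

    import json
    from functools import cached_property
    from pathlib import Path

    class DeltaRunAnalysis:
        def __init__(self, run_dir: str):
            self.run_dir = Path(run_dir)

        @cached_property
        def verdicts(self) -> tuple[dict, ...]:
            # Parsed once; repeated follow-up questions reuse the cached result.
            with open(self.run_dir / "verdicts.jsonl") as f:
                return tuple(json.loads(line) for line in f)

        def regressions(self) -> list[dict]:
            # Pairs where the judge preferred the baseline over the candidate.
            return [v for v in self.verdicts if v["verdict"] == "baseline_wins"]

        def regression_themes(self, cluster_with_llm) -> dict[str, list[str]]:
            # cluster_with_llm is an injected callable that groups the judge's
            # reasoning traces into named themes; each theme maps back to the
            # prompt sections worth revising.
            traces = [(r["conversation_id"], r["reasoning"]) for r in self.regressions()]
            return cluster_with_llm(traces)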

Used to evaluate prompt, model, and parameter changes before rollout

Recent prompt, model, and parameter changes to Ada's customer-facing reasoning agent went through DE before merging. By March, the rest of the team was using DE for their own iterations, sometimes running many variants over a couple of days.

Market implication

DE compresses iteration cycles for prompt, model, and parameter changes. The previous default was either a curated eval (higher quality, but costly to build, label, and maintain across many hard-to-anticipate production scenarios) or shipping the change in an A/B test and watching production metrics (slow, and risky while the test runs). DE replays thousands of recent production conversations against the candidate in tens of minutes, returning both an aggregate win-rate signal and per-trace diagnostics. The iteration loop shrinks from days to hours, rollouts can be scoped before traffic moves, and iterating on prompts, models, model parameters, or prompt-rendering logic becomes something the broader team does day to day rather than a dedicated experiment cycle. The coding-agent skill integration makes DE directly usable by coding agents, unlocking a high degree of automation as well as better developer, analyst, and user experiences.