ACX Improvement Loop

Custom metrics for conversation analysis

April 2026 | Project: Out-of-the-Box (OOTB) Metrics

How we built a per-conversation quality-scoring judge that scales to millions of conversations and stays calibrated to each customer's definition of "good."

Christos Melidis
Staff Machine Learning Scientist

Background

Customer service AI agents handle thousands of conversations a day, so ACX managers (the customer's team responsible for monitoring the agent) need a scalable way to know how those agents are performing. Ada's Custom Metrics feature includes a library of pre-built quality measures (Accuracy, Return-Caller risk, Appropriate Escalation, Policy Adherence, Predicted NPS, and others) that score every production conversation automatically. The score is produced by a large language model acting as a judge: given a transcript, the model returns a structured verdict per metric (success, fail, or not applicable) along with its reasoning. This LLM-as-judge pattern is increasingly standard, but it surfaces two problems we worked through in four phases: scale (every LLM call per conversation adds material cost) and calibration (an out-of-the-box judge has no way to know what each individual customer means by, say, "promoter" in their context).
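
Concretely, a single metric's verdict has a small, fixed shape. Below is a minimal sketch of that shape in Python; the field and type names are illustrative, not Ada's production schema.

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative shape of a single-metric verdict. Field names are assumptions,
# not Ada's production schema.
Verdict = Literal["success", "fail", "not_applicable"]

@dataclass
class MetricVerdict:
    metric: str        # e.g. "Accuracy", "Policy Adherence", "Predicted NPS"
    reasoning: str     # the judge's free-text justification for the verdict
    verdict: Verdict   # the structured outcome aggregated into dashboards
    confidence: float  # 0.0-1.0, used to route borderline verdicts to human review
```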

Phase 1 — Per-criterion evaluation: the experimentation harness

Our first design runs one LLM call per metric per conversation. This yields the richest signal we can extract: per-metric verdicts, free-text reasoning, confidence scores, and clean cross-model comparisons. We still use this design offline as our experimentation harness: when we need to compare prompts, swap models, or test design changes, the per-metric signal is what makes those comparisons trustworthy. A recent run scored ~1,800 conversations across 8 quality metrics. Production needs a different shape: with one LLM call per metric, cost grows with metrics × conversations, a multiplier that becomes prohibitive at production volume.
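
In code terms, the harness simply nests metrics inside conversations, so the call count is conversations × metrics. A sketch under that assumption, reusing the MetricVerdict shape from the earlier sketch; judge_one_metric is a hypothetical stand-in for the real prompt-and-parse call.

```python
def judge_one_metric(transcript: str, metric: str) -> MetricVerdict:
    """One LLM call scoring a single metric on a single transcript.

    Hypothetical stand-in: the real call builds a metric-specific prompt,
    calls the model, and parses the response into a MetricVerdict.
    """
    ...

def run_harness(conversations: list[str], metrics: list[str]) -> list[MetricVerdict]:
    # Offline experimentation harness: richest possible signal, but the call count
    # is conversations x metrics (e.g. ~1,800 conversations x 8 metrics = ~14,400 calls).
    results: list[MetricVerdict] = []
    for transcript in conversations:
        for metric in metrics:
            results.append(judge_one_metric(transcript, metric))
    return results
```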

Phase 2 — A tentative consolidation: one judge for all metrics

The most straightforward way to take this design into production is to consolidate: ask a single LLM call to score all of a customer's active metrics in one structured response. Cost is now linear in conversations alone — the only architecture that scales. This obvious move works, but comes with a known issue we set out to solve next. The model formats its response as a JSON object with one field per metric. If we ask for JSON without enforcing a schema, a single malformed entry gets silently dropped during validation — eight metrics scored, one parse failure, the conversation appears to have only seven scores. Across millions of conversations, this creates invisible holes that look like missing metrics rather than bugs.
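
The failure mode is easy to reproduce with a lenient validator. The sketch below is a hypothetical parser, not our production code; it shows how one malformed entry silently becomes a "missing" metric.

```python
import json

# Hypothetical lenient validator: keeps well-formed per-metric entries, drops the rest.
def parse_consolidated_response(raw: str, expected_metrics: list[str]) -> dict:
    payload = json.loads(raw)
    scores = {}
    for metric in expected_metrics:
        entry = payload.get(metric)
        if isinstance(entry, dict) and entry.get("verdict") in {"success", "fail", "not_applicable"}:
            scores[metric] = entry
        # else: the entry is silently skipped. Eight metrics requested, seven come back,
        # and downstream this looks like a missing metric rather than a parse failure.
    return scores
```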

Phase 3 — Hardening the production response

The fix is the OpenAI API's strict-schema response mode, which constrains the model's output at generation time to exactly match a defined JSON schema. With strict schema turned on, the model cannot return a malformed entry. This single change eliminates an entire class of invisible data-quality errors at no additional cost.

While tightening the schema, we made two refinements per metric. First, the model emits a short chain-of-thought before the verdict, not after. This pattern is well established in the LLM-as-judge literature: giving the model space to reason before committing yields more accurate verdicts, and it lets a human reviewer see at a glance why the model decided as it did. Second, each metric carries a confidence score, sourced either by asking the model to emit it directly (the simpler path, which we ship today) or by reading the model's own probability over the verdict tokens, an approach well supported in recent research that we keep open as a refinement. Confidence surfaces borderline verdicts and focuses human review where it adds the most signal. Refining metric definitions alone moved a risk-detection metric by ~+11pp and a sentiment-prediction metric by ~+6pp, before any per-customer calibration. By the end of this phase, the pipeline was reliable and well-instrumented, but it was occasionally still wrong about what counted as good, particularly around the edges of each customer's specific definitions.
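
A minimal sketch of what the hardened call looks like with the OpenAI Python SDK's strict json_schema response format. The model name, metric names, and exact fields are illustrative; the production schema covers every metric the customer has active.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative per-metric entry. Listing "reasoning" before "verdict" is deliberate:
# the model writes its chain-of-thought first, then commits to the verdict, then a
# confidence score. Field and metric names here are assumptions, not Ada's schema.
METRIC_ENTRY = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "verdict": {"type": "string", "enum": ["success", "fail", "not_applicable"]},
        "confidence": {"type": "number"},
    },
    "required": ["reasoning", "verdict", "confidence"],
    "additionalProperties": False,  # required by strict mode
}

def judge_conversation(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # any model that supports structured outputs
        messages=[
            {"role": "system", "content": "You are a conversation-quality judge. Score every metric."},
            {"role": "user", "content": transcript},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "metric_verdicts",
                "strict": True,  # output is constrained to this schema at generation time
                "schema": {
                    "type": "object",
                    "properties": {
                        "accuracy": METRIC_ENTRY,
                        "policy_adherence": METRIC_ENTRY,  # production lists all active metrics
                    },
                    "required": ["accuracy", "policy_adherence"],
                    "additionalProperties": False,
                },
            },
        },
    )
    # Guaranteed to parse: the response either matches the schema or the call fails loudly.
    return response.choices[0].message.content
```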

Phase 4 — Closing the calibration gap: ShotJudge

The remaining errors showed up the moment we compared the model's predicted NPS distributions against the ground-truth surveys customers reported on the same conversations. The shapes didn't match: the model's predictions were skewed relative to what the customer's own data said. The diagnosis: the judge was being asked to evaluate something it lacked the context to evaluate. What "good" looks like depends on context that lives in the customer's domain expertise, not in any individual transcript. A zero-shot LLM judge, working only from a transcript and a generic prompt, has no way to learn that context.
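
The comparison itself is mechanical: bucket both sources into the same NPS categories for the same conversations and look at the per-bucket shares. A sketch with hypothetical helper names follows.

```python
from collections import Counter

def category_shares(labels: list[str]) -> dict[str, float]:
    """Share of conversations per NPS bucket (promoter / passive / detractor)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {bucket: count / total for bucket, count in counts.items()}

def distribution_gap(predicted: list[str], surveyed: list[str]) -> dict[str, float]:
    """Per-bucket gap between the judge's predictions and the customer's surveys,
    computed over the same set of conversations."""
    p, s = category_shares(predicted), category_shares(surveyed)
    return {bucket: p.get(bucket, 0.0) - s.get(bucket, 0.0) for bucket in set(p) | set(s)}
```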

We adopted ShotJudge, the calibration paradigm from the XpertBench paper (Liu et al.): a handful of expert-curated examples close most of the calibration gap. Our inversion is who the experts are. In the paper, they are researchers producing a static reference set. In our version, they are the customer's AI managers, tagging conversations during normal review. The exemplar set isn't a fixed corpus; it's a living signal the AI managers write continuously.

A dynamic few-shot system is defined as much by how it selects examples as by the examples themselves. Coaching, another Ada feature built on few-shot examples, selects them by semantic similarity. ShotJudge selects by recency, because what's drifting isn't "which conversations look like this one" but "what this customer currently means by 'promoter.'" Our initial implementation of Custom Metrics hard-codes a hand-curated exemplar per metric; the AI-manager tagging UI and live exemplar pipeline are the next iteration we are building.
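
A sketch of recency-based selection under those assumptions: the exemplar store, field names, and prompt format are illustrative, and the live tagging pipeline that feeds the store is the part still being built.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Exemplar:
    metric: str          # which metric this tagged conversation calibrates, e.g. "Predicted NPS"
    transcript: str      # the conversation the AI manager reviewed
    verdict: str         # the manager's judgment: what "good" means for this customer
    tagged_at: datetime  # when the manager tagged it, the dimension we select on

def select_exemplars(store: list[Exemplar], metric: str, k: int = 3) -> list[Exemplar]:
    # ShotJudge-style selection by recency: the freshest manager judgments define what
    # this customer currently means by the metric. Coaching, by contrast, selects by
    # semantic similarity to the conversation being scored.
    tagged = [e for e in store if e.metric == metric]
    return sorted(tagged, key=lambda e: e.tagged_at, reverse=True)[:k]

def render_few_shot(exemplars: list[Exemplar]) -> str:
    # Injected ahead of the transcript being judged, as calibration context.
    return "\n\n".join(
        f"Example conversation:\n{e.transcript}\nExpert verdict: {e.verdict}"
        for e in exemplars
    )
```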

Why this generalizes

Three takeaways apply beyond Ada.

Run your experimentation harness and your production LLM judge as two different systems, on purpose. The harness is deliberately expensive: per-metric signal is what makes architectural comparisons trustworthy, and it's where you find the best prompt. Production runs that same prompt in a shape that scales with traffic.

Strict structured output is not optional at scale. Silent malformed outputs are the worst kind of failure, because they look like missing data rather than bugs.

Dynamic few-shot is a class of solutions, not a single one. The design question isn't just whether to use it, but which dimension of similarity to select examples along: semantic similarity when the issue is "does this conversation look like one I've seen before," recency when the issue is "has the customer's definition of 'good' moved."