Evals
Unified reasoner evaluation science
Ada's Unified Reasoner replaced its Modular Reasoner, alongside a re-baselined eval harness and a Legitimacy Classifier. Adversarial pass rate rose from 88% to 97%.
Delta evaluation: Production replay pipeline
Delta evaluation (DE) replays production conversations through modified prompts and models, with an LLM-as-judge scoring the results. Verdicts are aggregated into win-rate metrics; traces are grouped into themes.
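The replay-and-aggregate step could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `Verdict` type, the `replay` and `win_rate` functions, the three outcome labels, and the choice to exclude ties from the win-rate denominator are all assumptions made for the example. The judge and both model variants are passed in as plain callables, so no particular LLM API is implied.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judge decision for one replayed conversation (hypothetical shape)."""
    conversation_id: str
    outcome: str  # assumed labels: "candidate", "baseline", or "tie"


def replay(conversations, baseline, candidate, judge):
    """Run each (id, prompt) pair through both variants and collect verdicts.

    `baseline` and `candidate` map a prompt to a response; `judge` maps
    (prompt, baseline_response, candidate_response) to an outcome label.
    """
    verdicts = []
    for conversation_id, prompt in conversations:
        outcome = judge(prompt, baseline(prompt), candidate(prompt))
        verdicts.append(Verdict(conversation_id, outcome))
    return verdicts


def win_rate(verdicts):
    """Share of decided comparisons won by the candidate.

    Ties are excluded from the denominator (an assumption; other
    conventions count a tie as half a win). Returns None when there
    are no decided comparisons.
    """
    counts = Counter(v.outcome for v in verdicts)
    decided = counts["candidate"] + counts["baseline"]
    if decided == 0:
        return None
    return counts["candidate"] / decided
```

Keeping the judge behind a callable means the same aggregation can be exercised with a deterministic stub in tests and with a real LLM judge in production replay.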