How We Evaluate Our AI: Scenarios, Benchmarks, and Continuous Quality Tracking

Floris Weers · April 7, 2026

We're Biscuit, a platform that turns your idea into a real business. App, payments, hosting, support, all of it. To do this, we run many models. This post is about how we evaluate every one of them, continuously, with quality tracked over time. We know which model wins on quality, which wins on speed, and where the best tradeoff lives. We used to ship and hope. Now we ship and know.

We don't run a single prompt. We run many different models across many different tasks: one for generating UI code, another for planning data schemas, another for debugging, another for feature suggestions. Each model has its own strengths, and each task has its own requirements. Every prompt tweak, every model swap, every new workflow could make one thing better and break another.

A few months ago, shipping a change looked like this. An engineer would update a prompt or swap a model. They'd test a few things by hand. It looked fine. They'd ship it. Then they'd hope. Hope it didn't break something unrelated. Hope the ten other things that model needed to do still worked. There was no system. Just skilled people doing their best with manual spot checks.

The problem is that AI failures are quiet. The AI doesn't crash. It doesn't throw an error. It just does half the job, or does the wrong thing confidently.

One day we asked the AI to add a priority field to a todo app and display it as colored badges. It added the field. But the badges didn't show up in the UI. The data was there, the code compiled, but the feature was half-built. You don't notice these until you happen to try the exact task that broke.

Our platform handles hundreds of different interactions across every kind of app. Editing code, clicking buttons, filling forms, creating data models, generating features. We needed a way to capture each of these as a repeatable, automated test. And we needed to run them constantly.

So we built one.

Part I · The Building Blocks

Scenarios: the atomic unit of evaluation

We turned the priority-field problem into what we call a scenario: a self-contained test case for the AI. A scenario sets up a specific context, gives the AI a prompt, and then verifies whether the AI actually did the whole job.

What makes a scenario useful is that it captures the full context the AI needs. Not just a prompt, but an entire mini-app to work with, a specific starting state, and a way to check the result. This means we can replay the exact same test, against any model, at any time.

Here's what one looks like in practice:

Testing whether the editor can add tags, and clean them up again afterwards.
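To make this concrete, here's a minimal sketch of how a scenario might be captured in code. The shape and helper methods (`model_has_field`, `view_has_badge`) are illustrative, not our actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scenario:
    """Illustrative shape of a scenario: context, prompt, and verification."""
    name: str
    app_template: str      # the mini-app the AI starts from
    prompt: str            # the instruction we send the model
    verify: Callable       # deterministic check on the resulting app state
    tags: list[str] = field(default_factory=list)

# The priority-badge failure from earlier, captured as a scenario.
# `app.model_has_field` and `app.view_has_badge` are hypothetical helpers.
priority_badges = Scenario(
    name="todo-priority-badges",
    app_template="todo-basic",
    prompt="Add a priority field to each todo and show it as colored badges",
    verify=lambda app: app.model_has_field("Todo", "priority")
                       and app.view_has_badge("priority"),
    tags=["editor", "data-model", "ui"],
)
```

Because the scenario bundles the starting app, the prompt, and the check, it can be replayed against any model at any time.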

We started with a handful. Today we have over five hundred, spanning twenty different workflows, covering everything from the editor to billing safety checks.

We keep scenarios organized and tagged by workflow, both for ourselves and for any AI navigating them.

Scoring: deterministic checks and rubric-based judging

Some scenarios have clear-cut success criteria. Does the code compile? Is the priority field in the data model? Does the view contain an element with the right CSS color? For these, we write deterministic checks that inspect the actual code, data store, or rendered output. Pass or fail, no ambiguity.

But many of the things our AI does are harder to judge with a simple assertion. Did it write clean code? Are the feature descriptions concise? Did it find the root cause of a bug, or just describe a symptom? For these qualitative outcomes, we use LLM-as-a-judge: a separate model evaluates the result against a structured rubric.

LLM-as-a-judge is exactly what it sounds like: you use a strong language model to evaluate the output of another model. Early work in this area, including MT-Bench and Chatbot Arena[1] and G-Eval[2], showed that the approach can correlate well with human assessments, but also surfaced real failure modes: judges can be biased toward verbose answers, toward their own outputs, or toward whichever option appears first. The field has matured quickly since then. For a good overview, Li et al.[3] offer a comprehensive survey of where the field stands.

The key lesson from this research is that how you ask the judge matters enormously. A vague "rate this from 1 to 5" prompt produces noisy, inconsistent scores. What works is giving the judge a detailed rubric with explicit criteria, concrete score-level descriptions, and decomposed dimensions. This is well-supported: LLM-Rubric[4] showed that multidimensional rubrics significantly improve agreement with human ratings. FLASK[5] decomposed evaluation into fine-grained skills like readability and conciseness. CheckEval[6] found that binary checklist questions reduce variance and improve interpretability. And LMUNIT[7] went further by turning criteria into natural-language unit tests, an idea that maps naturally to the kind of technical and style constraints we care about.

Our judge follows the rubric-based approach. Each rubric defines weighted criteria on a 1–5 scale, and every criterion has explicit descriptions for every score level. Instead of interpreting vague labels, the judge matches the output against concrete descriptions of what a 3 versus a 5 looks like for that specific criterion. Individual criteria can also enforce minimum thresholds, so a scenario can require at least a 4 on correctness even if style scores are more lenient. This is especially powerful for evaluating feature generation quality, code style, and complex editor tasks where "correct" is a spectrum.

This approach isn't perfect, and we're careful about where we trust it. Large-scale studies[8] show that judge reliability varies significantly across task types and the properties being evaluated. Context-grounded judging[9], such as checking whether generated code actually follows a codebase's conventions, is substantially harder than generic quality assessment. We account for this by tailoring criteria to each scenario type rather than relying on a single universal rubric.

There's much more to say about designing effective LLM judges, from rubric calibration and bias mitigation to choosing judge models, ensemble approaches[14][15][8], and prompt engineering for evaluators. We'll cover those in detail in a follow-up post. For now, the takeaway: structured, criterion-level rubrics turn "does this feel right?" into something repeatable and measurable.

Multi-turn scenarios

Scenarios can span multiple conversation turns. Each turn sends a prompt, waits for the AI to act, and verifies before continuing. This lets us test whether the AI retains context across a conversation, not just whether it can handle a single instruction. For example: "Change the button color to blue" → verify → "Oh actually, make it red instead" → verify that it's red and that blue is gone.
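The runner for a multi-turn scenario is conceptually a loop with a verify step between turns. A sketch, with hypothetical `agent` and `app` objects:

```python
# Hypothetical multi-turn scenario: each turn pairs a prompt with a check.
turns = [
    ("Change the button color to blue",
     lambda app: app.button_color == "blue"),
    ("Oh actually, make it red instead",
     lambda app: app.button_color == "red"),
]

def run_multi_turn(agent, app, turns) -> bool:
    """Send each prompt, let the agent act, and verify before continuing."""
    for prompt, check in turns:
        agent.act(app, prompt)
        if not check(app):
            return False  # fail fast: later turns depend on earlier state
    return True
```

Failing fast matters here: if turn one already broke, turn two's result would measure recovery from a broken state, not context retention.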

A scenario that tests the patience of the editor: the user really doesn't know what color they like.

Part II · From Scenarios to Signal

Benchmarks: from scenarios to scorecards

One scenario tells you whether one task works. That's useful, but it doesn't answer the bigger questions. Is the AI good at forms? Is it getting better at code editing? How does it compare across all UI interactions?

A benchmark answers these questions. It's a curated collection of scenarios, organized by capabilities.

Benchmark: "ui-basics"
  Capability: Click Interactions    12 scenarios
  Capability: Text Input             8 scenarios
  Capability: Form Components       10 scenarios

When a benchmark runs, you don't just see a single pass rate. You see a breakdown by capability, and this immediately tells you where to focus. A single number hides too much. Capabilities make the score actionable.

A capability groups scenarios that test the same skill, so you can track it independently across models and over time. If the score drops, you know exactly where to look. We try to make capabilities meaningfully distinct from each other: things like "handles multi-file changes" or "follows design tokens," not internal implementation details that users never see. Getting the granularity right takes a few tries: too broad and problems get averaged away, too narrow and you don't have enough scenarios to say anything useful.
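Aggregating per-capability pass rates from individual scenario results is a simple fold. A sketch, with an assumed `(capability, passed)` result shape:

```python
from collections import defaultdict

def capability_pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: one (capability, passed) pair per scenario run."""
    tally: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for capability, passed in results:
        tally[capability][0] += int(passed)
        tally[capability][1] += 1
    return {cap: passes / runs for cap, (passes, runs) in tally.items()}
```
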

We organize benchmarks into three tiers. Smoke benchmarks are 12–15 scenarios that run in minutes on every PR, serving as a quick sanity check. Core benchmarks cover essential capabilities like UI basics, form handling, code editing, and bug fixing. Extended benchmarks provide comprehensive coverage: multi-turn conversations, long context, multilingual support, complex tasks. Smoke runs on every change; core and extended run nightly.

Between scheduled runs, we also trigger benchmarks based on what changed. If files matching certain patterns are modified, the relevant benchmarks run automatically. This means we don't have to wait for the next nightly run to catch a regression: we get a signal within minutes of the change landing.
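Change-triggered runs can be driven by a small pattern-to-benchmark map. A sketch using glob patterns (the patterns and benchmark names are hypothetical):

```python
import fnmatch

# Hypothetical trigger map: changed-file patterns -> benchmarks to run.
TRIGGERS = {
    "prompts/editor/*": ["editor-smoke", "editor-core"],
    "schemas/*.ts": ["data-modeling"],
}

def benchmarks_for(changed_files: list[str]) -> set[str]:
    """Return every benchmark whose trigger pattern matches a changed file."""
    to_run: set[str] = set()
    for pattern, benchmarks in TRIGGERS.items():
        if any(fnmatch.fnmatch(path, pattern) for path in changed_files):
            to_run.update(benchmarks)
    return to_run
```
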

Did this change break anything?

After every benchmark run, the system automatically compares against the previous run on the same benchmark, model, and branch. Each scenario is classified: regressed, improved, or stable.

But here's the subtle part: with AI, some amount of variation is normal. A model might fail a scenario one run and pass it the next, purely due to sampling randomness. So how do we tell real regressions from noise?

McNemar Calculator

Imagine you run the same benchmark before and after a prompt change. Some scenarios regressed, some improved.

[Interactive: add or remove regression/improvement dots to see when the difference becomes statistically significant. Example shown: 5 regressions vs. 1 improvement — no clear change, p = 0.2188 (≥ 0.05), so the difference could be noise.]

Reference: minimum discordant pairs

| Discordant pairs (m) | Best-case p-value | Can reach significance? |
|---|---|---|
| 1 | 1.0000 | No |
| 2 | 0.5000 | No |
| 3 | 0.2500 | No |
| 4 | 0.1250 | No |
| 5 | 0.0625 | No |
| 6 | 0.0313 | Yes ✓ |
| 7 | 0.0156 | Yes ✓ |
| 8 | 0.0078 | Yes ✓ |
| 9 | 0.0039 | Yes ✓ |
| 10 | 0.0020 | Yes ✓ |

We use the exact McNemar test[10], a paired test for binary data. The idea: ignore scenarios where both runs agree (both pass or both fail). Only look at discordant pairs, scenarios that flipped. Under the null hypothesis (no real change), each flip is equally likely to be a regression or an improvement, like a fair coin. This approach was later formalized for comparing classifiers by Dietterich[11].

So if you have m discordant pairs and b of them go in one direction, the p-value asks: if we flipped m fair coins, how likely is a split at least this lopsided? That's the two-sided binomial tail, 2·P(X ≥ max(b, m−b)) for X ~ Binomial(m, ½), capped at 1.

We require at least 6 discordant pairs to even attempt the test. Why 6? Because with fewer, even the most extreme result (all flips in one direction) can't reach p < 0.05. The minimum achievable p-value for m discordant pairs is 2^(1−m). At m = 5 that's 0.0625, not enough. At m = 6 it's 0.03125, just barely significant.
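The whole test fits in a few lines. A sketch of the two-sided exact form, which reproduces the 5-vs-1 example above:

```python
from math import comb

def mcnemar_exact(regressed: int, improved: int) -> float:
    """Two-sided exact McNemar test on discordant pairs.

    Ignores concordant scenarios entirely; under the null, each flip
    is a fair coin, so the p-value is a doubled binomial tail."""
    m = regressed + improved
    if m == 0:
        return 1.0
    b = max(regressed, improved)
    # P(X >= b) for X ~ Binomial(m, 0.5)
    tail = sum(comb(m, k) for k in range(b, m + 1)) / 2**m
    return min(1.0, 2 * tail)
```

For example, `mcnemar_exact(5, 1)` gives 0.21875 (not significant), while `mcnemar_exact(6, 0)` gives 0.03125, the best case possible with six discordant pairs.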

The practical consequence: small benchmarks (N < 15) will rarely produce significant results unless changes are dramatic, while large benchmarks provide much more statistical sensitivity.

Key assumptions and limitations. The test assumes scenario outcomes are independent, but in practice, scenarios that test related capabilities may be correlated, which can inflate false positives. We don't apply multiple testing correction across consecutive comparisons (Bonferroni, BH, etc.), so viewing many trend comparisons increases the chance of a spurious result. We're honest about these limitations because the alternative, no statistical reasoning at all, is worse.

Content hashing: honest comparisons

Early on, we hit a frustrating problem. A benchmark score would drop from 92% to 78%, we'd spend an hour investigating, and it would turn out someone had changed the test, not that the AI got worse. The trend line was untrustworthy because we couldn't distinguish "the AI got worse" from "we changed the test."

So every scenario and benchmark gets a content hash: a SHA-256 of the resolved definition. When a scenario's prompt or verification logic changes, the hash changes, and the system knows the test itself moved rather than the AI.
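Computing such a hash is straightforward; a sketch, assuming the resolved definition is JSON-serializable:

```python
import hashlib
import json

def content_hash(definition: dict) -> str:
    """SHA-256 of a canonical JSON serialization of the definition.

    Sorting keys makes the hash insensitive to dict ordering, so only
    real changes to prompts or verification logic change the hash."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```
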

Before any statistical comparison, the system checks that the content hashes match and that no scenarios were added or removed. If they don't match, the comparison is marked not_comparable, so no misleading p-values and no false alarms.

So now you have hundreds of scenarios, grouped into benchmarks, with statistical rigor baked in. The natural next question: are we getting better or worse over time?

Part III · Seeing the Big Picture

Are we getting better?

The Trends page tracks benchmark pass rates over time, broken down by capability. This is where the system starts to feel like a real observability tool, not just "did this run pass?" but "what's the trajectory?"

Trend Chart

[Line chart: pass rate (0–100%) over time. Markers distinguish ordinary data points, definition changes, improvements, and regressions.]

Trends answer questions that no single benchmark run can. Is the AI getting better or worse over time? Are some capabilities improving while others degrade? And critically, when you see a dip, was it a real change or did the test definition change? The content hash markers make this immediately visible.

Which model should we use?

To make this concrete, let's look at one benchmark in detail: the editor benchmark. It tests the agent end-to-end: give it a task, let it work, then check whether the result is correct. It covers dozens of capabilities across the everyday work the editor performs, from fixing bugs and building UI components to managing state and multi-turn conversations.

To give you a feel for what these scenarios look like, here are a few examples:

"Add a delete button to each item in the list"

The agent needs to add a button to each row, wire it to a delete action, and make sure the list updates. We test whether it handles click interactions and data mutations correctly.

"The form doesn't save when I click submit. Can you fix it?"

The agent has to diagnose the issue, find the broken handler, and repair it without breaking anything else. We test bug fixing and form understanding.

"Add a settings page with a theme toggle, then link to it from the navbar"

The agent builds a new page, adds interactive state, and wires up routing across files. We test multi-turn editing, state management, and navigation.

We run every model against the same set of these scenarios. So how do different models actually perform? Three views tell the story: the overall rankings, the quality-vs-speed tradeoff, and a per-capability breakdown.

Overall rankings

The leaderboard answers the question we used to resolve by gut feeling: which model is best for our product? Models are ranked by pass rate on the editor benchmark. But a single ranking hides the interesting part: the trade-offs.

Overall Rankings

| Rank | Model | Pass Rate | Passed / Total | Date |
|---|---|---|---|---|
| #1 | claude-4.6-opus | 86.2% | 168 / 195 | Apr 7 |
| #2 | gpt-5.4-concise | 85.6% | 167 / 195 | Apr 7 |
| #3 | gemini-3-flash | 82.6% | 161 / 195 | Apr 7 |
| #4 | gpt-5.4 | 82.0% | 160 / 195 | Apr 7 |
| #5 | gpt-5.2 | 79.5% | 155 / 195 | Apr 7 |

Quality vs. speed

Sometimes the fastest model is good enough. Sometimes you need the most capable one regardless of speed. A scatter plot of pass rate vs. execution time on the editor scenarios, with the Pareto frontier highlighted, shows the trade-off clearly. Models on the frontier offer the best balance, and no other model is both faster and more accurate.

Quality vs. Speed Scatter

[Scatter plot: pass rate (0–100%) vs. run duration (24–44 minutes) per model, with Pareto-optimal models highlighted.]
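Computing the frontier itself is simple: a model is Pareto-optimal when no other model is at least as fast and at least as accurate with a strict edge in one. A sketch (the example tuples in the test are illustrative, not our real measurements):

```python
def pareto_frontier(models: list[tuple[str, float, float]]) -> list[str]:
    """models: (name, pass_rate, duration). Returns names of models
    not dominated by any other model (higher-or-equal pass rate AND
    lower-or-equal duration, strictly better in at least one)."""
    frontier = []
    for name, rate, dur in models:
        dominated = any(
            other != name and r >= rate and d <= dur and (r > rate or d < dur)
            for other, r, d in models
        )
        if not dominated:
            frontier.append(name)
    return frontier
```
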

Capability heatmap

The heatmap shows everything at once. Rows are editor capabilities (click interactions, form components, bug fixing, navigation, and so on), columns are models, and cells are color-coded by pass rate. The key metric here is spread: the difference between the best and worst model on a capability. High spread means that capability strongly differentiates models. Low spread means all models handle it about the same.

Capability Heatmap
| Capability | claude-4.6-opus | gpt-5.4-concise | gemini-3-flash | gpt-5.4 | gpt-5.2 | claude-4.5-haiku | gemini-3-flash-lite-no-thinking | claude-4.6-sonnet | gemini-3-flash-lite | grok-4-1-fast-reasoning | gpt-5.4-mini | gpt-5.4-nano | grok-4-1-fast-non-reasoning | grok-code-fast-1 | gpt-oss-120b | gemini-3-pro | glm-4.7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Multi-Location Edits | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 0% | 100% | 0% | 100% | 100% | 0% | 0% | 0% | 100% | 0% |
| Creating New Models | 60% | 100% | 80% | 100% | 100% | 80% | 100% | 80% | 80% | 80% | 100% | 80% | 80% | 60% | 40% | 20% | 0% |
| Bug Detection in Large Files | 0% | 0% | 100% | 100% | 0% | 0% | 0% | 100% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Modifying Tokens | 100% | 100% | 100% | 75% | 100% | 75% | 100% | 100% | 100% | 50% | 100% | 100% | 25% | 75% | 100% | 100% | 0% |
| Targeted Edits | 100% | 50% | 50% | 0% | 50% | 50% | 50% | 50% | 50% | 50% | 50% | 50% | 50% | 50% | 50% | 0% | 0% |
| Using Tokens in Code | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% | 100% | 50% | 0% | 50% | 100% | 50% |
| Type Errors | 75% | 75% | 100% | 75% | 75% | 75% | 50% | 75% | 75% | 50% | 75% | 75% | 75% | 50% | 50% | 0% | 50% |
| Workflow & API Features | 74% | 89% | 78% | 83% | 74% | 70% | 76% | 63% | 59% | 50% | 57% | 65% | 50% | 44% | 50% | 41% | 28% |
| Visual & Content Features | 100% | 100% | 70% | 100% | 80% | 90% | 90% | 70% | 90% | 80% | 80% | 40% | 70% | 70% | 60% | 80% | 40% |
| Fixing Model Policies | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 60% | 40% | 80% |
| Token Features | 100% | 92% | 100% | 92% | 100% | 83% | 100% | 92% | 100% | 50% | 75% | 75% | 42% | 42% | 92% | 83% | 50% |
| General Features | 100% | 100% | 88% | 88% | 75% | 88% | 75% | 88% | 75% | 88% | 50% | 88% | 75% | 88% | 63% | 63% | 50% |
| Logic Bugs | 67% | 83% | 50% | 67% | 67% | 83% | 50% | 67% | 50% | 67% | 67% | 67% | 67% | 83% | 33% | 50% | 33% |
| Adding Tokens | 88% | 88% | 84% | 84% | 81% | 81% | 79% | 74% | 74% | 68% | 62% | 65% | 63% | 57% | 57% | 57% | 42% |
| Async/State Bugs | 85% | 94% | 88% | 91% | 82% | 88% | 82% | 82% | 77% | 77% | 79% | 79% | 74% | 59% | 62% | 50% | 50% |
| Adding Model Fields | 100% | 100% | 100% | 100% | 80% | 100% | 100% | 100% | 100% | 100% | 80% | 80% | 100% | 60% | 80% | 60% | 80% |
| State Management | 88% | 96% | 88% | 96% | 80% | 88% | 84% | 92% | 88% | 84% | 88% | 80% | 88% | 72% | 68% | 56% | 56% |
| UI Features | 90% | 90% | 90% | 93% | 80% | 87% | 73% | 80% | 80% | 77% | 83% | 73% | 67% | 60% | 70% | 60% | 60% |
| Document Assets | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 67% | 100% |
| Token Validation | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |

So the trends show us direction, and the leaderboard shows us who's ahead. But how much should you trust a pass rate?

Part IV · Understanding Uncertainty

How confident should you be in a pass rate?

A pass rate of 80% on 10 scenarios means something very different from 80% on 100 scenarios. With 10 scenarios, the true capability of the model could plausibly be anywhere from 50% to 95%. With 100 scenarios, it's much tighter: probably between 71% and 87%.

We use Wilson score intervals to put uncertainty bands around every pass rate. Try it yourself. Drag the sliders to see how sample size and pass count affect confidence:

Confidence Interval Explorer

[Interactive: sliders for scenarios passed and total. Example shown: an 80.0% pass rate whose 95% Wilson interval spans 62.7%–90.5% (width 27.8pp) — a reasonable sample size, where the interval gives useful guidance.]

The Wilson score interval[12] is our chosen method because it behaves well even when pass rates are near 0% or 100% and when sample sizes are small, both common in our benchmarks.

Unlike the naive Wald interval, p ± z·√(p(1−p)/n), which can give nonsensical results like negative confidence bounds, Wilson is always well-behaved[13]. Unlike the Clopper-Pearson "exact" interval, it doesn't over-cover (producing uselessly wide intervals). Wilson hits the sweet spot: near-nominal coverage with a simple closed-form formula.
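For reference, the Wilson interval is a short closed-form computation:

```python
from math import sqrt

def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval (by default) for a binomial proportion."""
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half
```

For 80 passes out of 100 this gives roughly (0.711, 0.867), matching the "probably between 71% and 87%" figure above; note the interval is pulled toward 50% rather than centered exactly on the raw pass rate.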

Key caveat: the interval assumes independent, identically distributed Bernoulli trials. In practice, our scenarios aren't identically distributed (each one has a different difficulty). And they may not be fully independent either, since related scenarios share the same underlying model capabilities. So the interval should be read as a useful approximation, not a precise guarantee. If many scenarios test closely related capabilities, the true uncertainty may be wider than what Wilson shows.

In our dashboards, we show confidence interval bars around every pass rate, which you see as a thin range around the main result. This simple visualization keeps us honest: it's a quick visual reminder that a seemingly dramatic shift might still be within the noise, or that what looks like a small difference might actually be significant. No single number is the full story.

Part V · What Changed

Shipping with confidence

Today, when we notice the AI half-building a feature, here's what happens. We write a scenario. It joins a benchmark automatically through its tags. The next CI run picks it up. If the fix works, it shows as an improvement in the comparison view. If it regresses later, Slack tells us within minutes.

We ship prompt changes and model updates with confidence because we know within minutes whether something broke. We choose between models based on data, not intuition. We can see at a glance whether our AI is getting better at forms, faster at code editing, or worse at multi-turn conversations.

Where this is going

We can tell you how every model scores today. We can't yet tell you how they'll score next quarter.

The building blocks are there. The next step is to curate a set of "frontier scenarios": real tasks we actually want solved, where today's best models succeed maybe 30% of the time.

Once you have enough of those, you can start asking a more interesting question: how fast is the frontier moving? Run every new model release against the same battery. Track what fraction of frontier scenarios cross the "50% reliability threshold" over time. Fit a curve. This is the approach behind work like METR's autonomy evaluations: build a hard-enough benchmark, measure it consistently, and let the trendline tell you what's coming.

If we can predict that a capability (say, multi-file refactoring) will be reliably solvable in six months, we can start building for it now. Ship the feature just as the model catches up.

We're not there yet. But the evaluation system described in this post is the foundation it would run on.

What we haven't solved yet

We're honest about the limitations. We don't yet run the same scenario multiple times to estimate non-determinism; each scenario runs once per benchmark execution, so we can't separately measure "is this failure real or just sampling noise." We don't apply multiple testing correction across trend comparisons, which means if you look at enough charts, you'll find a "significant" result that isn't. And capability-level pass rates on small groups (2–5 scenarios) have confidence intervals so wide they're almost decorative.

These are known trade-offs. The system is designed to give us useful signal at realistic scale, not to be a peer-reviewed statistical study.

We used to ship and hope. Now we ship and know.


If this kind of work excites you, we're hiring engineers who care about AI quality. And if you want to see Biscuit in action, join the waitlist.

References

  1. Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023.
  2. Liu, Y. et al. (2023). "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." EMNLP 2023.
  3. Li, D. et al. (2025). "From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge." EMNLP 2025.
  4. Hashemi, H. et al. (2024). "LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts." ACL 2024.
  5. Ye, S. et al. (2024). "FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets." ICLR 2024.
  6. Lee, Y. et al. (2025). "CheckEval: A Reliable LLM-as-a-Judge Framework for Evaluating Text Generation Using Checklists." EMNLP 2025.
  7. Saad-Falcon, J. et al. (2025). "LMUNIT: Fine-grained Evaluation with Natural Language Unit Tests." Findings of EMNLP 2025.
  8. Bavaresco, A. et al. (2025). "LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks." ACL 2025.
  9. Xu, A. et al. (2025). "Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings." ACL 2025.
  10. McNemar, Q. (1947). "Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages." Psychometrika, 12(2), 153–157.
  11. Dietterich, T. G. (1998). "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation, 10(7), 1895–1923.
  12. Wilson, E. B. (1927). "Probable Inference, the Law of Succession, and Statistical Inference." Journal of the American Statistical Association, 22(158), 209–212.
  13. Brown, L. D. et al. (2001). "Interval Estimation for a Binomial Proportion." Statistical Science, 16(2), 101–133.
  14. Verga, P. et al. (2024). "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models." arXiv preprint.
  15. Li, Z. et al. (2025). "LLM-as-a-Judge Failures at Automating the Identification of Poor Quality Outputs in Free-Form Texts." NewSum @ EMNLP 2025.