Calculator
AI Eval Set Size calculator.
Calculate how many eval examples you need to detect a quality regression with confidence.
How we calibrated this
Used internally before any AI eval engagement.
Inputs
Tell us about your project.
This is a static reference card. For interactive calculators, talk to us — we tune the assumptions per client.
Current quality baseline
Range: 50–99 % · Default: 80 %
Smallest regression you want to detect
Range: 1–20 % · Default: 5 %
Statistical confidence
- 90%: 0.7×
- 95%: 1×
- 99%: 1.4×
How it's calculated
The formula.
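Power-analysis style: the required number of examples n is derived from the baseline quality, the smallest regression you want to detect (the effect size), and the chosen confidence level.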
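The card does not spell out the exact formula, so the following is only a minimal sketch of one way such a calculation could look. It assumes a normal-approximation sample size for a proportion, n = z² · p(1 − p) / δ², evaluated at the 95% level (z = 1.96) and then scaled by the confidence multipliers listed above. The function name `eval_set_size`, the choice of formula, and the reading of the regression as absolute percentage points are all assumptions for illustration, not the calibrated version we use with clients.

```python
import math

# Multipliers copied from the "Statistical confidence" options on this card.
CONFIDENCE_MULTIPLIER = {90: 0.7, 95: 1.0, 99: 1.4}

# Assumption: 95% is the reference level, using the two-sided normal
# critical value. The card itself does not state this.
Z_95 = 1.96


def eval_set_size(baseline_pct, regression_pct, confidence_pct=95):
    """Rough number of eval examples needed (hypothetical helper).

    baseline_pct   -- current quality baseline, 50 to 99 (percent)
    regression_pct -- smallest regression to detect, 1 to 20
                      (assumed to be absolute percentage points)
    confidence_pct -- 90, 95, or 99
    """
    if not 50 <= baseline_pct <= 99:
        raise ValueError("baseline must be between 50 and 99 percent")
    if not 1 <= regression_pct <= 20:
        raise ValueError("regression must be between 1 and 20 percent")
    if confidence_pct not in CONFIDENCE_MULTIPLIER:
        raise ValueError("confidence must be 90, 95, or 99")

    p = baseline_pct / 100
    delta = regression_pct / 100

    # Normal-approximation sample size at the 95% reference level.
    base_n = (Z_95 ** 2) * p * (1 - p) / (delta ** 2)

    # Scale by the card's multiplier and round up to whole examples.
    return math.ceil(base_n * CONFIDENCE_MULTIPLIER[confidence_pct])


# Card defaults: 80% baseline, 5-point regression, 95% confidence.
print(eval_set_size(80, 5, 95))  # 246
```

Under these assumptions the defaults give roughly 246 examples. Because δ appears squared in the denominator, halving the detectable regression roughly quadruples the set size, which is why the result is best read as a band.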
Output
Recommended eval set size
Examples needed.
Output
Cost to build (engineer days)
Approximate dataset-build time.
Output
Cost to run (LLM API)
Per regression-test run.
Want a real estimate?
This is a band, not a quote.
For a real estimate calibrated to your specific project, brief us. We get back to you within two business days.
Brief us on evals