Feb 9, 2026

The Requirements Layer Your AI System is Missing

Most AI teams think they have a testing problem. But the real issue is they aren't defining what "good" actually means. Learn how specs can transform your AI development workflow.

Most AI teams have a testing problem. Or so they think.

They're building evals, generating synthetic data, and running benchmarks. The infrastructure is there, the process exists, but something feels off. Eval scores don't match user complaints, teams argue about whether outputs are "good enough," and improvements in one area mysteriously break another.

The real problem isn't testing. It's that teams aren't defining what "good" actually means.

The Problem

Software engineers figured this out decades ago. Before you write tests, you write requirements. Acceptance criteria. Specifications (or specs). These are what force you to articulate what you're building before you test that you built it.

AI development teams often skip this step entirely. They jump straight from a task description ("build a customer service bot") to vibes-based evaluation ("does this feel right?"). When they do create evals, they're often holistic: a single judge scoring overall quality on a dataset of examples.

This approach has a fundamental flaw. Quality isn't one metric; it's dozens of independent dimensions:

  • Does the response follow the brand voice?
  • Does it refuse to discuss competitors?
  • Does it correctly retrieve information from the knowledge base?
  • Does it avoid hallucinating product features?
  • Does it detect and reject toxic requests?
  • Does it stay in character?

These requirements are orthogonal. Being accurate doesn't make you on-brand. Being safe doesn't make you helpful. Each dimension needs its own definition, its own examples, its own evaluation.

The Judge's Dilemma

Here's what happens when you give an LLM judge a checklist of ten requirements to evaluate at once:

It becomes a bad judge.

This isn't a bug in any particular model. It's a fundamental limitation of how attention and instruction-following work. When you ask a judge to simultaneously evaluate tone, accuracy, safety, formatting, and six other things, some criteria get deprioritized. The judge develops implicit preferences. Important requirements get overlooked.

Worse, you can't fix this by tuning. If your multi-criteria judge is too strict on tone but too lenient on accuracy, what do you adjust? Every change creates trade-offs across all ten dimensions. And when an aggregate score goes up, you have no idea which specific capabilities went down. With imbalanced data, entire failure modes stay invisible while your overall score looks great. You're playing whack-a-mole with quality. We go over this in more detail in our blog post: Many Small Evals Beat One Big Eval, Every Time.

The solution is obvious once you see it: one judge, one job.

A judge that only cares about tone can be precisely calibrated for tone. A judge that only cares about factual accuracy can be tuned for accuracy. These single-purpose judges don't interfere with each other. They can be developed, tested, and improved independently.

This is what specs enable. Each spec defines one requirement. Each spec gets its own judge. The judges run in parallel, producing independent scores that you can analyze, trend, and act on without cross-contamination.
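To make that concrete, here's a minimal sketch of the one-judge-one-job pattern. Everything in it is illustrative rather than Kiln's actual API: `Spec` and `call_judge_model` are hypothetical names standing in for whatever judge model and plumbing you use.

```python
from dataclasses import dataclass


@dataclass
class Spec:
    name: str          # one requirement, e.g. "brand_voice"
    judge_prompt: str  # instructions for a judge that checks only this requirement


def call_judge_model(prompt: str) -> bool:
    """Hypothetical helper: send the prompt to your judge LLM and parse a PASS/FAIL verdict."""
    raise NotImplementedError


def evaluate(response: str, specs: list[Spec]) -> dict[str, bool]:
    # Each spec is judged independently, so scores never contaminate each other.
    return {
        spec.name: call_judge_model(f"{spec.judge_prompt}\n\nResponse to evaluate:\n{response}")
        for spec in specs
    }


specs = [
    Spec("brand_voice", "Does the response follow our brand voice? Answer PASS or FAIL."),
    Spec("no_competitors", "Does the response avoid discussing competitors? Answer PASS or FAIL."),
    Spec("factual_accuracy", "Is every claim supported by the knowledge base? Answer PASS or FAIL."),
]
```

Because each judge sees exactly one criterion, you can tighten the brand-voice judge without touching the accuracy judge.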

The Synthetic Data Unlock

There's a second benefit to specs that's less obvious but equally important.

Generating good synthetic data is hard. And you need it not just for evaluating your AI system, but for calibrating the judges themselves. Validation data for a customer service bot might need to cover:

  • Different tones and sentiment levels
  • Various types of requests across different domains
  • Edge cases and boundary conditions
  • Attempts to circumvent rules or policies
  • Format variations and error handling

But you don't actually need data that covers each constraint at once. Many teams believe realistic data is the best way to evaluate a system, but generating examples that satisfy multiple constraints simultaneously (the right tone, the right content, and the right edge case) has compounding complexity. The generation becomes fragile, and the data often doesn't quite hit the target.

With specs, synthetic data generation becomes trivially simple because you only need to validate one thing at a time.

Testing "refuse to discuss competitors"? Generate conversations about competitors. Testing "detect toxic requests"? Generate toxic requests. The data doesn't need to be realistic across every dimension, it just needs to trigger the specific behaviour you're validating. You're creating targeted data for a single, well-defined requirement, not a representative sample of all possible interactions.

In software engineering terms: unit tests are faster to write, easier to debug, and less fragile than end-to-end integration tests. The same principle applies here. Each spec gets its own focused dataset, optimized for testing that spec. You can generate data for each one in parallel without the exponential complexity. And if a specific capability breaks, that spec's score drops. You see exactly what failed and exactly what to fix.
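As a rough sketch, per-spec generation reduces to one short, single-purpose instruction per requirement. The `generate_targeted_inputs` helper below is hypothetical, standing in for a call to whatever data-generation model you use.

```python
def generate_targeted_inputs(spec_name: str, instruction: str, n: int = 20) -> list[str]:
    """Hypothetical helper: ask a generation model for n inputs that exercise one requirement."""
    raise NotImplementedError


# One focused dataset per spec; each can be generated independently and in parallel.
datasets = {
    "no_competitors": generate_targeted_inputs(
        "no_competitors",
        "Write user messages that ask our support bot to compare us to named competitors.",
    ),
    "toxicity_detection": generate_targeted_inputs(
        "toxicity_detection",
        "Write user messages containing hostile or abusive language.",
    ),
}
```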

Testing multiple requirements becomes trivial: separate specs with focused datasets, each easy to generate. Together, they give you complete coverage. And with focused synthetic data and human ratings, you can validate each judge—measuring precision and recall, catching when a judge is too strict or too lenient, and tuning with confidence.
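Judge validation itself is a small confusion-matrix exercise. A minimal sketch, assuming both the judge verdicts and the human ratings are booleans where True means "violation flagged":

```python
def precision_recall(judge: list[bool], human: list[bool]) -> tuple[float, float]:
    """Compare a judge's violation flags against human ratings for the same examples."""
    tp = sum(j and h for j, h in zip(judge, human))      # both flagged a violation
    fp = sum(j and not h for j, h in zip(judge, human))  # judge flagged, human disagreed (too strict)
    fn = sum(h and not j for j, h in zip(judge, human))  # human flagged, judge missed it (too lenient)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Low precision means the judge is too strict; low recall means it's too lenient.
print(precision_recall(
    judge=[True, True, False, True, False],
    human=[True, False, False, True, True],
))  # -> (0.666..., 0.666...)
```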

Building Effective Specs

Using many single-purpose judges sounds like more work than one big eval, but in practice it's significantly less. A single focused spec can be added in minutes, and because it only tests one thing, it's unlikely to break when the system design changes. Compare that to maintaining one monolithic eval that has to balance dozens of criteria and needs constant retuning. Best of all, different people can own different specs, letting your team scale eval coverage without bottlenecks.

The harder part is precision. Each spec needs you to articulate requirements you've never consciously thought about. What exactly is your brand voice? When should the bot apologize vs. explain? How much retrieval error is acceptable? And each spec needs a calibrated judge: creating the prompt, generating test data, getting human annotations, and verifying accuracy. Done manually, this adds up.

Kiln Copilot: AI that builds AI

This is where AI can help create AI.

The challenge we just described, the manual work required to write and calibrate specs, is what stops most teams. But with Kiln Copilot, spec creation is a breeze.

Kiln turns spec creation into a conversation. You start with a rough idea like "I want the bot to sound professional" and Kiln asks clarifying questions:

  • What does "professional" mean for your brand? Formal? Friendly? Concise?
  • Should the bot use contractions?
  • Can it use emoji?
  • How should it handle informal requests?

As you answer, Kiln refines your spec and generates a set of examples: some that would pass the spec and some that would fail. Reviewing these examples helps you refine your mental model and provides feedback for Kiln to refine the spec further. For example, "Actually, that's too formal. We want professional but warm."

When you're satisfied with the definition, Kiln automatically creates the spec evaluator. It generates a judge prompt calibrated with your feedback and creates an eval dataset with synthetic data so it's ready to use. What used to take days of manual work happens in minutes. The result is a spec that actually captures what you meant, not what you wrote down the first time.

Kiln Specs Walkthrough

Building a Spec-Driven Workflow

Here's how to get started:

1. Start with failures

When you find a bug or user complaint, don't just fix it. Create a spec. "The bot should not recommend products we don't sell." "The bot should detect when it doesn't know the answer." Each failure becomes a permanent regression test.
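For example (reusing the hypothetical `Spec` sketch from earlier), the complaint itself becomes the spec's first failing example:

```python
# A user complaint becomes a permanent requirement plus a regression example.
regression_spec = Spec(
    name="no_unsold_products",
    judge_prompt="Does the response recommend only products from our current catalog? Answer PASS or FAIL.",
)

# Illustrative seed example: the kind of exchange that triggered the complaint.
seed_failures = [
    "User: Do you have anything for pet owners?\n"
    "Bot: Try our PetCare Plus plan!",  # a product we don't actually sell
]
```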

2. Separate concerns ruthlessly

If you're tempted to create a spec like "accurate and well-formatted responses," split it. Accuracy and formatting are different requirements with different failure modes. They need different judges.

3. Think adversarially

The hardest part of writing specs is imagining all the ways they could be violated. For each spec, ask: what edge cases could break this? What would a user trying to bypass this requirement do? Build your dataset to include these adversarial examples. Kiln Copilot automatically generates adversarial examples, edge cases, and failure modes you wouldn't have thought of. It helps you build comprehensive coverage without the comprehensive effort.

4. Let specs drive your roadmap

When you look at your spec scores, you'll see exactly where quality is weakest. No more guessing about what to improve next.

Evaluating different run configurations across different specs.
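As a toy illustration (all numbers below are made up), sorting per-spec pass rates puts the weakest requirement at the top of your roadmap:

```python
# Hypothetical per-spec pass rates from the latest eval run.
spec_scores = {
    "brand_voice": 0.94,
    "no_competitors": 0.99,
    "factual_accuracy": 0.81,
    "toxicity_detection": 0.97,
}

# Weakest spec first: that's what to improve next.
for name, score in sorted(spec_scores.items(), key=lambda item: item[1]):
    print(f"{name:20s} {score:.0%}")
```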

The Missing Layer

Specs. Requirements. Acceptance criteria. Whatever you call them, they serve the same purpose: forcing you to define what "good" means before you test for it.

Many AI systems have been flying without this layer, with teams relying on holistic vibes, multi-criteria judges that can't be tuned, and synthetic data that doesn't quite hit the target.

Specs fix all of this by decomposing quality into independent dimensions. They enable single-purpose judges that can be precisely calibrated, and make synthetic data generation simple and focused.

Most importantly, they turn "does this feel right?" into "does this meet our requirements?"

That's not a testing problem; that's requirements engineering. And it's about time AI development caught up.
