Evaluations (Evals)

Describe your problem, Kiln builds evals.

Auto-generate evals from just a description of your goal: automatic judge prompts, synthetic eval data, and alignment to your preferences, all in 5 minutes.

macOS · Windows · Linux

From idea to eval in 4 steps

01
Define

Describe what good looks like—Kiln auto-generates a judge prompt from your task definition and requirements.

02
Generate

Populate eval and golden datasets with synthetic data, auto-tagged and ready for rating.

03
Calibrate

Benchmark multiple judges against human ratings using Kendall Tau, Spearman, and Pearson correlation.

04
Optimize

Pit run methods against each other—models, prompts, temperatures, fine-tunes—and find the best combination.

Evals in Kiln

Auto-generate evals in minutes

Kiln Auto-Evals collapse judge creation, edge-case discovery, data generation, and calibration into one copilot session. You get a production-ready eval, synthetic dataset, and validated judge in ~5 minutes.

[Animation: eval specs generated row by row, each marked with a green check or red ✗]
Generating Eval

Compare across every eval at once

Kiln benchmarks every run configuration against all of your evals, plus cost and latency. Find the optimal configuration for your agent.

Faster collaboration with small evals

Large evals take weeks to develop, break down when requirements change, and miss regressions on subsets of their dataset. Kiln supports dozens of small, parallel evals that are easier to create, with each team owning the areas it cares about: PMs, QA, subject matter experts, business teams, and engineers.

Optimize for evals

Kiln's Auto-Optimizer will find the ideal prompt for your task, across every eval.

RAG and Tool Evals

RAG evals generate Q&A pairs from your documents with known-correct references. Tool-use evals verify the right tool was called with the right parameters across the full trace. One system covers every dimension of AI quality.

[Illustration: a RAG eval card (Doc → Q&A → Score) beside a tool-use eval trace (USER: "Run the analysis"; AGENT: calling fetch_data…; TOOL: { status: 200 }; tool ✓ args ✓)]

Everything you need to measure AI quality

LLM-as-Judge

Chain-of-thought reasoning plus rubric scoring for nuanced evaluation.

G-Eval

Logprob-weighted scoring for finer granularity than binary pass/fail.

Kiln Auto-Evals

Auto-generated evals in 5 minutes with copilot-guided alignment.

Judge benchmarking

Compare judges against human ratings with Kendall Tau and more.

RAG accuracy evals

Document-sourced Q&A pairs test retrieval without circular judging.

Tool use evals

Verify agents call the right tool with the right parameters.

Built-in templates

Toxicity, bias, factual correctness, and jailbreak susceptibility, out of the box.

Issue-backed evals

Create evals from your issues.

AI quality before and after Kiln Evals

Without Kiln
  • Write a judge prompt, generate test data, label a golden set, benchmark the judge, and compare run methods—in five separate tools.
  • Evals take days or weeks to develop.
  • Ship a prompt change and hope it doesn't regress three other evals you forgot about.
  • Only the data scientist on the team can build or modify an eval, so the backlog grows.
With Kiln
  • Everything you need for evals, in one tool.
  • Kiln Auto-Evals collapse the entire eval workflow into one guided session anyone can complete in minutes.
  • Cross-eval compare view benchmarks every change across all evals at once, with cost breakdown.
  • Subject matter experts build production-quality evals through the copilot—no data science background needed.

Frequently asked

What eval algorithms does Kiln support?

LLM-as-Judge (chain-of-thought + rubric scoring) and G-Eval (logprob-weighted scoring). G-Eval produces nuanced scores instead of binary pass/fail, but requires models with logprobs.
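As an illustration of the logprob-weighting idea, here is a minimal sketch. The `score_logprobs` input shape is assumed for the example and is not Kiln's API:

```python
import math

def g_eval_score(score_logprobs):
    """Logprob-weighted score in the G-Eval style.

    `score_logprobs` maps candidate score tokens (e.g. "1".."5") to the
    log-probabilities the judge model assigned them (hypothetical shape).
    """
    # Convert logprobs to probabilities, then renormalize over score tokens.
    probs = {int(tok): math.exp(lp) for tok, lp in score_logprobs.items()}
    total = sum(probs.values())
    # Expected value of the score under the renormalized distribution.
    return sum(score * p / total for score, p in probs.items())

# A judge that mostly says "4" but leaks some mass onto "3" and "5":
weighted = g_eval_score({"3": math.log(0.2), "4": math.log(0.7), "5": math.log(0.1)})
print(round(weighted, 2))  # 3.9 — finer-grained than a bare "4"
```

Weighting by the model's own uncertainty is what turns a 5-point rubric into a near-continuous score.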

How do I know my judge correlates with human preference?

Kiln benchmarks multiple judges against human ratings on a golden dataset using Kendall Tau, Spearman, and Pearson correlation. Pick the judge that best matches human preference before trusting it.

If you use Kiln Auto-Evals, a human-preference alignment phase ensures the judge aligns with your preferences. Keep iterating until it matches.
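The rank-correlation idea behind judge benchmarking can be sketched in a few lines. This is a toy tau-a that ignores ties, on made-up ratings; Kiln's benchmarking uses full Kendall Tau, Spearman, and Pearson implementations:

```python
from itertools import combinations

def kendall_tau(human, judge):
    """Kendall tau-a between two equal-length rating lists (ties ignored).

    Counts +1 when the judge orders a pair of examples the same way the
    humans did, -1 when it reverses them, then averages over all pairs.
    """
    pairs = list(combinations(range(len(human)), 2))
    concordant = sum(1 for i, j in pairs
                     if (human[i] - human[j]) * (judge[i] - judge[j]) > 0)
    discordant = sum(1 for i, j in pairs
                     if (human[i] - human[j]) * (judge[i] - judge[j]) < 0)
    return (concordant - discordant) / len(pairs)

human_ratings = [5, 3, 4, 1, 2]
judge_a = [5, 2, 4, 1, 3]  # mostly agrees with the humans
print(kendall_tau(human_ratings, judge_a))  # 0.8
```

A judge scoring near 1.0 tracks human preference pair-for-pair; one near 0 is no better than chance at ordering outputs.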

Can I evaluate RAG accuracy?

Yes. RAG evals generate Q&A pairs from your own documents — realistic queries paired with reference answers from the source content. The judge compares model responses to the known-correct reference, avoiding the circular problem of an LLM judging knowledge it lacks.
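The reference-grounded judging described above can be sketched as a prompt builder. This helper and its wording are illustrative, not Kiln's actual judge prompt:

```python
def build_judge_prompt(question, reference, response):
    """Build a judge prompt for a document-sourced Q&A pair (illustrative).

    Because the reference answer comes from the source document itself, the
    judge only has to compare two texts; it never needs to possess the
    document's knowledge, which avoids circular judging.
    """
    return (
        "You are grading an answer against a known-correct reference.\n"
        f"Question: {question}\n"
        f"Reference answer (from the source document): {reference}\n"
        f"Model answer: {response}\n"
        "Reply with pass or fail."
    )

print(build_judge_prompt(
    "What is the refund window?",
    "30 days from delivery.",
    "Customers have 30 days after delivery to request a refund.",
))
```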

Can I evaluate tool use?

Yes. Tool-use evals verify the right tool was called with the right parameters across the full conversation trace — not just the final response.
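As a sketch of that check, the snippet below scans a trace for the expected call. The `trace` shape and `check_tool_call` helper are hypothetical, not Kiln's schema:

```python
def check_tool_call(trace, expected_tool, expected_args):
    """Scan a full conversation trace for the expected tool call.

    Looks at every step, not just the final response, and grades the first
    call to the expected tool: was the right tool used, and with the right
    parameters?
    """
    for step in trace:
        if step.get("role") == "agent" and step.get("tool") == expected_tool:
            return {"tool": True, "args": step.get("args") == expected_args}
    return {"tool": False, "args": False}

trace = [
    {"role": "user", "content": "Run the analysis"},
    {"role": "agent", "tool": "fetch_data", "args": {"source": "sales"}},
    {"role": "tool", "content": {"status": 200}},
]
print(check_tool_call(trace, "fetch_data", {"source": "sales"}))
# {'tool': True, 'args': True}
```

Splitting the verdict into `tool` and `args` makes it easy to tell "called the wrong tool" apart from "right tool, wrong parameters" when a trace fails.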

AI you can measure is AI you can improve.

Auto-generated specs, judge benchmarking, and cross-eval comparison.