Evaluations (Evals).
Describe your problem. Kiln builds the evals.
Auto-generate evals from just a description of your goal: automatic judge prompts, synthetic eval data, and alignment to your preferences, all in about 5 minutes.
From idea to eval in 4 steps
1. Describe what good looks like: Kiln auto-generates a judge prompt from your task definition and requirements.
2. Populate eval and golden datasets with synthetic data, auto-tagged and ready for rating.
3. Benchmark multiple judges against human ratings using Kendall Tau, Spearman, and Pearson correlation.
4. Pit run methods against each other (models, prompts, temperatures, fine-tunes) and find the best combination.
Evals in Kiln
Auto-generate evals in minutes
Kiln Auto-Evals collapse judge creation, edge-case discovery, data generation, and calibration into one copilot session. You get a production-ready eval, synthetic dataset, and validated judge in ~5 minutes.
Compare across every eval at once
Kiln benchmarks every run configuration against all of your evals, along with cost and latency. Find the optimal configuration for your agent.
Faster collaboration with small evals
Large evals take weeks to develop, break down when requirements change, and miss regressions on subsets of their dataset. Kiln lets you run dozens of small, parallel evals: each is easier to create, and each team (PMs, QA, subject matter experts, business teams, engineers) owns the areas it cares about.
- Tone consistency
- Brand voice
- User clarity
- Latency budget
- Token cost
- Tool reliability
- Domain accuracy
- Citation correctness
- Edge-case coverage
Optimize for evals
Kiln's Auto-Optimizer will find the ideal prompt for your task, across every eval.
converges in ~1 hour
RAG and Tool Evals
RAG evals generate Q&A pairs from your documents with known-correct references. Tool-use evals verify the right tool was called with the right parameters across the full trace. One system covers every dimension of AI quality.
Everything you need to measure AI quality
Chain-of-thought reasoning plus rubric scoring for nuanced evaluation.
Logprob-weighted scoring for finer granularity than binary pass/fail.
Auto-generated evals in 5 minutes with copilot-guided alignment.
Compare judges against human ratings with Kendall Tau and more.
Document-sourced Q&A pairs test retrieval without circular judging.
Verify agents call the right tool with the right parameters.
Toxicity, bias, factual correctness, and jailbreak susceptibility, all out of the box.
Create evals from your issues.
AI quality before and after Kiln Evals
Before Kiln Evals:
- Write a judge prompt, generate test data, label a golden set, benchmark the judge, and compare run methods, in five separate tools.
- Evals take days or weeks to develop.
- Ship a prompt change and hope it doesn't regress three other evals you forgot about.
- Only the data scientist on the team can build or modify an eval, so the backlog grows.
After Kiln Evals:
- Everything you need for evals, in one tool.
- Kiln Auto-Evals collapse the entire eval workflow into one guided session anyone can complete in minutes.
- The cross-eval compare view benchmarks every change across all evals at once, with cost breakdown.
- Subject matter experts build production-quality evals through the copilot, no data science background needed.
Frequently asked
What eval algorithms does Kiln support?
LLM-as-Judge (chain-of-thought + rubric scoring) and G-Eval (logprob-weighted scoring). G-Eval produces nuanced scores instead of binary pass/fail, but requires a model that exposes token logprobs.
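For intuition, here is a minimal sketch of logprob-weighted scoring in the spirit of G-Eval. It is not Kiln's internal implementation; the score tokens and logprob values are invented for illustration.

```python
import math

def geval_weighted_score(score_logprobs: dict[str, float]) -> float:
    """Collapse the judge's logprobs over score tokens into one continuous score.

    `score_logprobs` maps candidate score tokens (e.g. "1".."5") to the
    log-probabilities the judge model assigned them. Weighting each score by
    its probability gives finer granularity than taking the most likely token.
    """
    probs = {token: math.exp(lp) for token, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(int(token) * p for token, p in probs.items()) / total

# The judge is torn between 3 and 4, so the eval records ~3.6 instead of a hard 4.
print(geval_weighted_score({"3": -0.9, "4": -0.6, "5": -3.0}))
```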
How do I know my judge correlates with human preference?
Kiln benchmarks multiple judges against human ratings on a golden dataset using Kendall Tau, Spearman, and Pearson correlation. Pick the judge that best matches human preference before trusting it.
If you use Kiln Auto-Evals, a human-preference alignment phase ensures the judge aligns with your preferences; keep iterating until it matches them.
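To make the comparison concrete, here is a small sketch of the statistics involved, using SciPy. The ratings are made up for illustration; Kiln computes and reports these correlations for you.

```python
from scipy import stats

# Hypothetical golden-set ratings from humans vs. scores from two candidate judges
human   = [5, 4, 2, 5, 1, 3, 4, 2]
judge_a = [5, 4, 3, 4, 1, 3, 5, 2]
judge_b = [3, 5, 2, 3, 4, 2, 3, 5]

for name, scores in [("judge_a", judge_a), ("judge_b", judge_b)]:
    tau, _ = stats.kendalltau(human, scores)
    rho, _ = stats.spearmanr(human, scores)
    r, _   = stats.pearsonr(human, scores)
    print(f"{name}: Kendall Tau={tau:.2f}  Spearman={rho:.2f}  Pearson={r:.2f}")

# The judge with the highest correlations is the one whose scores you can trust.
```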
Can I evaluate RAG accuracy?
Yes. RAG evals generate Q&A pairs from your own documents — realistic queries paired with reference answers from the source content. The judge compares model responses to the known-correct reference, avoiding the circular problem of an LLM judging knowledge it lacks.
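As a rough sketch of what one eval item looks like (an illustrative structure, not Kiln's actual schema):

```python
from dataclasses import dataclass

@dataclass
class RagEvalItem:
    """One document-sourced eval item (illustrative, not Kiln's schema)."""
    question: str          # realistic query generated from your document
    reference_answer: str  # known-correct answer drawn from the source content
    source_chunk: str      # the passage the answer came from

item = RagEvalItem(
    question="What is the maximum file size for uploads?",
    reference_answer="Uploads are limited to 25 MB per file.",
    source_chunk="...each uploaded file may not exceed 25 MB...",
)

# The judge grades your RAG system's answer against `item.reference_answer`,
# so it never has to rely on knowledge it may not have.
```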
Can I evaluate tool use?
Yes. Tool-use evals verify the right tool was called with the right parameters across the full conversation trace — not just the final response.
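In spirit, the check looks something like the sketch below. The trace format and helper are hypothetical, purely to illustrate what a tool-use eval asserts.

```python
# Hypothetical conversation trace: one tool call, its result, and the final reply.
trace = [
    {"role": "assistant", "tool_call": {"name": "get_weather",
                                        "arguments": {"city": "Paris", "unit": "celsius"}}},
    {"role": "tool", "content": "18°C, clear"},
    {"role": "assistant", "content": "It's 18°C and clear in Paris."},
]

def called_tool(trace: list[dict], name: str, **expected_args) -> bool:
    """True if any step in the trace called `name` with at least `expected_args`."""
    for step in trace:
        call = step.get("tool_call")
        if call and call["name"] == name:
            if all(call["arguments"].get(k) == v for k, v in expected_args.items()):
                return True
    return False

assert called_tool(trace, "get_weather", city="Paris")  # right tool, right parameters
```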
AI you can measure is AI you can improve.
Auto-generated specs, judge benchmarking, and cross-eval comparison.