Kiln for Engineers
From prototype to production—without the rewrite
Run the same Python library in prototyping and prod — plus evals as CI, an Assistant that designs your next experiment, and your whole team in one workflow.
Get what you need from your team
Stakeholders define success as evals — not PRDs that drift from the model. Goals you can run, score, and trend, instead of docs to interpret and Slack threads to chase.
Every issue lands with the exact run configuration: prompt, model, tools, RAG context, seed. Reproduce in one click and turn the trace into a labeled eval example.
The Kiln app lets SMEs contribute to evals, generate synthetic data, and label datasets — without passing around spreadsheets or building custom tooling.
Prototype with the same engine you ship
The Kiln Python library is the same code that runs inside the desktop app. Load your project, run tasks, and access datasets — from notebooks, CI, or production. No port, no rewrite at the prod boundary.
from kiln_ai.core import Project

# Load an existing project, pick a task, and run it
project = Project.load("./my_project")
task = project.tasks()[0]
result = task.run(
    input="Summarize the Q3 product roadmap",
)

Move fast — without breaking what works
Evals: catch regressions before they ship
Every prompt, model, or fine-tune change runs against your eval suite — LLM-as-Judge, G-Eval, structured rubrics. Compare run methods side-by-side with cost. Wire EvalRunner into CI to gate deploys. Iterate fast without fearing the next regression.
- eval/accuracy 9.2 +0.3
- eval/tone 8.4 −0.1
- eval/cost $0.18 ≈
- eval/accuracy 6.1 −2.8
- eval/tone 8.5 ≈
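A CI gate over scores like the ones above can be sketched in a few lines. This is an illustrative stand-in, not the kiln_ai EvalRunner API: the metric names, floors, and `gate` function are all hypothetical, and the scores mirror the example runs shown.

```python
# Hypothetical CI gate: fail the build when any eval metric drops
# below its floor. Metric names and thresholds are illustrative.
BASELINE = {"eval/accuracy": 9.2, "eval/tone": 8.4}
FLOORS = {"eval/accuracy": 8.5, "eval/tone": 8.0}

def gate(scores: dict) -> list:
    """Return failure messages; an empty list means the gate passes."""
    failures = []
    for metric, floor in FLOORS.items():
        score = scores.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from eval run")
        elif score < floor:
            failures.append(
                f"{metric}: {score} < floor {floor} "
                f"(baseline {BASELINE[metric]})"
            )
    return failures

# The regressed run above (accuracy 6.1) trips the gate:
failures = gate({"eval/accuracy": 6.1, "eval/tone": 8.5})
assert failures  # non-empty -> exit non-zero, deploy blocked
```

In CI, a non-empty failure list becomes a non-zero exit code, so a prompt or model change that regresses accuracy never reaches a deploy step.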
Assistant: your virtual data scientist
An agent that reads your evals and traces, proposes experiments, finds better configs, and helps you debug hard-to-diagnose issues. It knows every model, prompt technique, RAG pattern, and fine-tuning option — so you spend less time tuning and more time shipping.
AI engineering before and after Kiln
Before:
- Prototype in notebooks, then rewrite everything for production — different code, different data format, different deployment story.
- Evaluate by running one-off scripts, eyeballing outputs, and hoping your prompt change didn't break something else.
- Stitch together provider SDKs, agent frameworks, eval harnesses, RAG plumbing, and fine-tuning pipelines across five different tools.

After:
- The same Python library that powers the desktop app runs in your CI pipeline and production server — no rewrite.
- Systematic evals with LLM-as-Judge and G-Eval scoring, run-method comparison, and tool-use verification — wired into the same project.
- One tool for agents, evals, fine-tuning, RAG, and deployment. 15+ providers, one interface, typed classes, Pydantic validation.
The developer surface, at a glance
- Pydantic-validated data model with iterators and DataFrame export.
- Full Swagger docs for the REST API; generate clients in any language.
- Every change is a Git commit. Diff, branch, revert, and merge.
- One interface for OpenAI, Anthropic, Gemini, Fireworks, and more.
- Connect any MCP server; run Kiln as an MCP server.
- Every model request logged with its parameters and a replayable curl command.
- The Python library collects no analytics. MIT-licensed, fully auditable.
- 60+ models with automatic deployment. Pay per token, no infra to manage.
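The DataFrame export mentioned above amounts to flattening validated records into rows. A minimal sketch — the `RunRecord` type and its fields are hypothetical stand-ins for Kiln's Pydantic models, shown only to illustrate the row shape a DataFrame consumes:

```python
from dataclasses import dataclass, asdict

# Hypothetical record type standing in for Kiln's Pydantic models;
# the real schema lives in kiln_ai. This only shows the export shape.
@dataclass
class RunRecord:
    task: str
    model: str
    score: float

runs = [
    RunRecord("summarize", "gpt-4o", 9.2),
    RunRecord("summarize", "claude-3-5-sonnet", 8.9),
]

# Flatten to a list of dicts -- the shape pandas.DataFrame accepts:
rows = [asdict(r) for r in runs]
# import pandas as pd; df = pd.DataFrame(rows)
```

Because every record is validated up front, the resulting table has consistent columns and types — no per-row cleanup before analysis.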
Frequently asked
Is the Python library the same code as the desktop app?
Yes. The kiln_ai package is the same engine that powers the desktop app — no wrapper, no reduced feature set. Both read and write the same .kiln project files.
Can I use Kiln from languages other than Python?
Yes. The kiln_server package exposes a FastAPI REST API with full OpenAPI docs. Generate typed clients for TypeScript, Go, Rust, Java, or any supported language.
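FastAPI apps serve their OpenAPI schema at /openapi.json, and that single document is what client generators consume. A sketch of the enumeration step — the schema below is a toy stand-in for illustration, not Kiln's actual API surface:

```python
import json

# Toy OpenAPI document (hypothetical paths, not kiln_server's API):
schema = json.loads("""
{
  "openapi": "3.1.0",
  "paths": {
    "/api/projects": {"get": {"summary": "List projects"}},
    "/api/tasks/{id}/run": {"post": {"summary": "Run a task"}}
  }
}
""")

# Enumerate endpoints, as a client generator would:
endpoints = sorted(
    f"{method.upper()} {path}"
    for path, ops in schema["paths"].items()
    for method in ops
)
# -> ["GET /api/projects", "POST /api/tasks/{id}/run"]
```

Point a tool such as openapi-generator at the live /openapi.json and it emits typed TypeScript, Go, Rust, or Java clients from the same document.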
What does the Kiln Assistant actually do?
It's an agent that reads your evals, traces, and configurations, then proposes experiments to improve quality, latency, or cost. It can launch eval runs, run optimization passes, and answer questions about prompt techniques, RAG patterns, and fine-tuning options. Learn more about Kiln Assistant.
Can I evaluate existing agents without rewriting them?
Yes. Wrap any agent as an MCP tool and connect it to a Kiln task. Kiln evaluates it with the same evals as Kiln-native agents.
Ship AI you'd bet your on-call on.
pip install kiln_ai—the same library that powers the app, now in your codebase, your CI, and your production server.