Kiln for Engineers
From prototype to production—without the rewrite
Run the same Python library in prototyping and prod — plus evals as CI, an Assistant that designs your next experiment, and your whole team in one workflow.
Get what you need from your team
Stakeholders define success as evals — not PRDs that drift from the model. Goals you can run, score, and trend, instead of docs to interpret and Slack threads to chase.
Every issue lands with the exact run configuration: prompt, model, tools, RAG context, seed. Reproduce in one click and turn the trace into a labeled eval example.
The Kiln app lets SMEs contribute to evals, generate synthetic data, and label datasets — without passing around spreadsheets or building custom tooling.
Prototype with the same engine you ship
The Kiln Python library is the same code that runs inside the desktop app. Load your project, run tasks, and access datasets — from notebooks, CI, or production. No port, no rewrite at the prod boundary.
from kiln_ai.core import Project

# Load an existing project, pick a task, and run it
project = Project.load("./my_project")
task = project.tasks()[0]
result = task.run(
    input="Summarize the Q3 product roadmap",
)

Move fast — without breaking what works
Evals: catch regressions before they ship
Every prompt, model, or fine-tune change runs against your eval suite — LLM-as-Judge, G-Eval, structured rubrics. Compare run methods side-by-side with cost. Wire EvalRunner into CI to gate deploys. Iterate fast without fearing the next regression.
- eval/accuracy 9.2 +0.3
- eval/tone 8.4 −0.1
- eval/cost $0.18 ≈
- eval/accuracy 6.1 −2.8
- eval/tone 8.5 ≈
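A CI gate over scores like the ones above can be sketched in a few lines. This is an illustrative stand-in, not the kiln_ai EvalRunner API: the metric names, floors, and `gate` function are all hypothetical, and the scores mirror the example runs shown.

```python
# Hypothetical CI gate: fail the build when any eval metric drops
# below its floor. Metric names and thresholds are illustrative.
BASELINE = {"eval/accuracy": 9.2, "eval/tone": 8.4}
FLOORS = {"eval/accuracy": 8.5, "eval/tone": 8.0}

def gate(scores: dict) -> list:
    """Return failure messages; an empty list means the gate passes."""
    failures = []
    for metric, floor in FLOORS.items():
        score = scores.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from eval run")
        elif score < floor:
            failures.append(
                f"{metric}: {score} < floor {floor} "
                f"(baseline {BASELINE[metric]})"
            )
    return failures

# The regressed run above (accuracy 6.1) trips the gate:
failures = gate({"eval/accuracy": 6.1, "eval/tone": 8.5})
assert failures  # non-empty -> exit non-zero, deploy blocked
```

In CI, a non-empty failure list becomes a non-zero exit code, so a prompt or model change that regresses accuracy never reaches a deploy step.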
Assistant: your virtual data scientist
An agent that reads your evals and traces, proposes experiments, finds better configs, and helps you debug hard-to-diagnose issues. It knows every model, prompt technique, RAG pattern, and fine-tuning option — so you spend less time tuning and more time shipping.
AI engineering before and after Kiln
Before:
- Prototype in notebooks, then rewrite everything for production — different code, different data format, different deployment story.
- Evaluate by running one-off scripts, eyeballing outputs, and hoping your prompt change didn't break something else.
- Stitch together provider SDKs, agent frameworks, eval harnesses, RAG plumbing, and fine-tuning pipelines across five different tools.

After:
- The same Python library that powers the desktop app runs in your CI pipeline and production server — no rewrite.
- Systematic evals with LLM-as-Judge and G-Eval scoring, run-method comparison, and tool-use verification — wired into the same project.
- One tool for agents, evals, fine-tuning, RAG, and deployment. 15+ providers, one interface, typed classes, Pydantic validation.
The developer surface, at a glance
- Pydantic-validated data model with iterators and DataFrame export.
- Full Swagger docs for the REST API; generate clients in any language.
- Every change is a Git commit. Diff, branch, revert, and merge.
- One interface for OpenAI, Anthropic, Gemini, Fireworks, and more.
- Connect any MCP server; run Kiln as an MCP server.
- Every model request logged with its parameters and a replayable curl command.
- The Python library collects no analytics. MIT-licensed, fully auditable.
- 60+ models with automatic deployment. Pay per token, no infra to manage.
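The DataFrame export mentioned above amounts to flattening validated records into rows. A minimal sketch — the `RunRecord` type and its fields are hypothetical stand-ins for Kiln's Pydantic models, shown only to illustrate the row shape a DataFrame consumes:

```python
from dataclasses import dataclass, asdict

# Hypothetical record type standing in for Kiln's Pydantic models;
# the real schema lives in kiln_ai. This only shows the export shape.
@dataclass
class RunRecord:
    task: str
    model: str
    score: float

runs = [
    RunRecord("summarize", "gpt-4o", 9.2),
    RunRecord("summarize", "claude-3-5-sonnet", 8.9),
]

# Flatten to a list of dicts -- the shape pandas.DataFrame accepts:
rows = [asdict(r) for r in runs]
# import pandas as pd; df = pd.DataFrame(rows)
```

Because every record is validated up front, the resulting table has consistent columns and types — no per-row cleanup before analysis.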
Frequently asked
Is the Python library the same code as the desktop app?
Yes. The kiln_ai package is the same engine that powers the desktop app — no wrapper, no reduced feature set. Both read and write the same .kiln project files.
Can I use Kiln from languages other than Python?
Yes. The kiln_server package exposes a FastAPI REST API with full OpenAPI docs. Generate typed clients for TypeScript, Go, Rust, Java, or any supported language.
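FastAPI apps serve their OpenAPI schema at /openapi.json, and that single document is what client generators consume. A sketch of the enumeration step — the schema below is a toy stand-in for illustration, not Kiln's actual API surface:

```python
import json

# Toy OpenAPI document (hypothetical paths, not kiln_server's API):
schema = json.loads("""
{
  "openapi": "3.1.0",
  "paths": {
    "/api/projects": {"get": {"summary": "List projects"}},
    "/api/tasks/{id}/run": {"post": {"summary": "Run a task"}}
  }
}
""")

# Enumerate endpoints, as a client generator would:
endpoints = sorted(
    f"{method.upper()} {path}"
    for path, ops in schema["paths"].items()
    for method in ops
)
# -> ["GET /api/projects", "POST /api/tasks/{id}/run"]
```

Point a tool such as openapi-generator at the live /openapi.json and it emits typed TypeScript, Go, Rust, or Java clients from the same document.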
What does the Kiln Assistant actually do?
It's an agent that reads your evals, traces, and configurations, then proposes experiments to improve quality, latency, or cost. It can launch eval runs, run optimization passes, and answer questions about prompt techniques, RAG patterns, and fine-tuning options. Learn more about Kiln Assistant.
Can I evaluate existing agents without rewriting them?
Yes. Wrap any agent as an MCP tool and connect it to a Kiln task. Kiln evaluates it with the same evals as Kiln-native agents.
Ship AI you'd bet your on-call on.
pip install kiln_ai—the same library that powers the app, now in your codebase, your CI, and your production server.