Kiln for Data Scientists

Rigorous AI experiments—without the notebook spaghetti

Run experiments across 190+ models, build agents in config that ship to production unchanged, and hand off to engineering with no rewrite.

Download Kiln · Read the docs

macOS · Windows · Linux

Get what you need from your team

Engineers ship what you built

Hand off the exact run config you iterated on — model, prompt, tools, RAG, subagents. Engineering deploys the same artifact, no reimplementation, no port at the prod boundary.

PMs & subject matter experts deliver evals

Stakeholders define success as evals — not PRDs that drift from the model. PMs and SMEs contribute directly in the Kiln app: build evals, label datasets, generate synthetic data — no spreadsheets, no custom tooling.

QA delivers reproducible traces

Every issue lands with the exact run configuration: prompt, model, tools, RAG context, seed. Reproduce in one click and turn the trace into a labeled eval example.

Built for rigor, ready to ship

Experiments ready to ship, not just prototypes

Build agents in configuration on Kiln's production agent library — tools, RAG, skills, prompts, and subagents are all part of the run config. No code cleanup, no port to the prod stack. The config you iterate on in a notebook is the artifact engineering deploys.

Agent Run Configuration (runnable)

model: claude-sonnet-4
prompt: v3 · 412 tokens
tools: search_docs, web_fetch, +3
rag: embeddings/v2 · top-k 8
skills: cite_sources, refuse_pii
subagents: classifier, summarizer

Kiln production
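To make the "one artifact" idea concrete, here is a minimal sketch of a run config that round-trips through a file format without changing. The RunConfig class, its field names, and the JSON layout are assumptions made for illustration; they are not Kiln's actual schema. The values are the ones shown in the card above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical stand-in for a run config; not Kiln's actual schema."""

    model: str
    prompt_version: str
    tools: tuple = ()
    rag_index: Optional[str] = None
    rag_top_k: Optional[int] = None
    skills: tuple = ()
    subagents: tuple = ()


def to_artifact(config: RunConfig) -> str:
    """Serialize with sorted keys so the same config always yields the same text."""
    return json.dumps(asdict(config), sort_keys=True, indent=2)


def from_artifact(text: str) -> RunConfig:
    """Rebuild the config from its serialized form."""
    raw = json.loads(text)
    # JSON has no tuple type, so list fields are converted back to tuples.
    for key in ("tools", "skills", "subagents"):
        raw[key] = tuple(raw[key])
    return RunConfig(**raw)


# The card lists "+3" further tools without naming them, so only the two
# named tools appear here.
config = RunConfig(
    model="claude-sonnet-4",
    prompt_version="v3",
    tools=("search_docs", "web_fetch"),
    rag_index="embeddings/v2",
    rag_top_k=8,
    skills=("cite_sources", "refuse_pii"),
    subagents=("classifier", "summarizer"),
)

artifact = to_artifact(config)      # what a data scientist would hand off
restored = from_artifact(artifact)  # what engineering would load
print(restored == config)
```

Because the config is frozen and serialized deterministically, the object engineering loads compares equal to the one that was iterated on, which is the property the handoff relies on.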

Rigor built in

Frozen dataset splits never shift when new data lands. Withheld validation sets stay withheld. Evaluate with G-Eval and LLM-as-Judge, and benchmark each judge against golden datasets using your choice of correlation method (Kendall's tau, Spearman, or Pearson). The result: trustworthy comparisons across hundreds of runs.

#001 train golden

#002 val withheld

#003 test golden fine_tuning

#004 train synthetic

#005 test fine_tuning
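For readers who want to see what benchmarking a judge against a golden dataset involves, the sketch below computes the three correlation measures named under "Rigor built in" in plain Python. It is not Kiln's implementation, and the two score lists are made-up placeholders standing in for human golden ratings and an LLM judge's ratings of the same items.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # Undefined (division by zero) if either list is constant.
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: pairwise ordering agreement, adjusted for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both sides: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + only_x_tied) * (untied + only_y_tied)
    )


# Placeholder scores for illustration only, not real evaluation results.
golden_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 3, 2, 4, 5]

for name, fn in [("pearson", pearson), ("spearman", spearman), ("kendall", kendall_tau_b)]:
    print(name, round(fn(golden_scores, judge_scores), 3))
```

Pearson rewards linear agreement on the raw scores, while Spearman and Kendall only look at ordering, so the rank-based measures are often a better fit when the judge and the human raters use a scale differently.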

Your AI assistant for experiments

An agent that designs and runs experiments while you focus on the science. Ask for a grid search across models and prompts, or a fine-tune sweep — come back to a ranked answer with cost and quality side-by-side. It knows every model, prompt technique, RAG pattern, and fine-tuning option Kiln supports.

15+

AI providers, one interface

190+

Models ready to call from a single config

60+

Fine-tunable models across 4 providers

Kiln data in your notebooks

The Kiln Python library imports into any notebook — Jupyter, Colab, or your favorite IDE. Load any project, iterate over runs with typed Pydantic classes, and convert to pandas or polars DataFrames. The same data the desktop app shows, with zero import/export friction.

Library docs

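analyze.py (Python)
from kiln_ai.core import Project
import pandas as pd  # or: import polars as pl

# Load a Kiln project from its folder and take its first task
project = Project.load("./my_project")
task = project.tasks()[0]
runs = list(task.runs())

# One row per run, built from each run's fields
df = pd.DataFrame([r.dict() for r in runs])
# Summarize the selected columns
print(df[["input", "output", "rating"]].describe())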

AI experimentation before and after Kiln

Without Kiln

Eval scripts, data formatters, and tracking code scattered across notebooks that don't compose, version, or share.
Train/test splits shift when new data arrives. Experiment comparison is unreliable by default.
Handoff to engineering means a rewrite. The notebook prototype and the prod system live in different code, formats, and infrastructure.

With Kiln

Agents, evals, datasets, and fine-tuning share one config. Iterate in the UI or from your notebook with pip install kiln_ai.
Frozen splits never shift, and withheld validation sets stay withheld. Run hundreds of experiments against exactly the same data.
Engineering deploys the same run config you iterated on — model, prompt, tools, RAG, subagents — no reimplementation at the prod boundary.

Frequently asked

Is the agent config I iterate on what runs in production?

Yes. Tools, RAG, skills, prompts, and subagents are all part of the run config. The kiln_ai package is the same engine the desktop app and prod use — no wrapper, no reduced feature set, no rewrite at the handoff.

How does Kiln integrate with my notebooks?

Run pip install kiln_ai, then point the library at a Kiln project folder. Load datasets and runs into pandas or polars DataFrames, iterate with typed Pydantic classes, and run tasks programmatically. Works in Jupyter, Colab, or any IDE. Zero telemetry.

How do frozen dataset splits work?

When you create a split, Kiln permanently assigns items to train, validation, and test subsets. Assignments don't shift when new data is added, so you can run multiple experiments on exactly the same data and compare fairly.
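One way to get assignments that never shift is to derive each item's split from a stable hash of its ID, so the result depends on the item alone and not on how much data exists. The sketch below illustrates that property; it is an assumption about one possible approach, not a description of how Kiln stores splits. The assign_split function, the item IDs, and the 80/10/10 weights are all hypothetical.

```python
import hashlib


def assign_split(item_id: str, weights: dict) -> str:
    """Map an item ID to a split name using a stable hash of the ID alone."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    return name  # fallback if float rounding leaves the total just under 1.0


# Hypothetical split proportions.
WEIGHTS = {"train": 0.8, "val": 0.1, "test": 0.1}

original_ids = [f"item-{i:03d}" for i in range(1, 101)]
before = {item_id: assign_split(item_id, WEIGHTS) for item_id in original_ids}

# New data arrives later. Every assignment is recomputed from IDs only.
grown_ids = original_ids + [f"item-{i:03d}" for i in range(101, 151)]
after = {item_id: assign_split(item_id, WEIGHTS) for item_id in grown_ids}

# The original 100 items land in exactly the same splits as before.
unchanged = all(after[item_id] == before[item_id] for item_id in original_ids)
print(unchanged)
```

A naive alternative, such as shuffling the full dataset and slicing by position, reassigns old items whenever the dataset grows, which is the failure mode frozen splits are meant to prevent.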

What does the Kiln Assistant actually do for experiments?

It's an agent that reads your evals, traces, and configs, then designs and runs experiments — grid searches across models and prompts, fine-tune sweeps, RAG ablations. It returns ranked results with cost and quality trade-offs. Learn more about Kiln Assistant.
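The shape of a grid search with ranked results can be sketched in a few lines. This is an illustration of the general technique, not the Kiln Assistant's implementation. The grid_search function, the Result type, the model and prompt names, and the scores returned by placeholder_evaluate are all hypothetical; in practice the evaluator would run a real eval and report measured cost.

```python
from itertools import product
from typing import Callable, NamedTuple


class Result(NamedTuple):
    model: str
    prompt: str
    quality: float  # higher is better
    cost: float     # lower is better


def grid_search(models, prompts, evaluate: Callable) -> list:
    """Evaluate every model/prompt pair, then rank by quality and break ties on cost."""
    results = [Result(m, p, *evaluate(m, p)) for m, p in product(models, prompts)]
    return sorted(results, key=lambda r: (-r.quality, r.cost))


def placeholder_evaluate(model: str, prompt: str):
    """Stand-in scorer with made-up numbers; replace with a real eval run."""
    base_quality = {"model-a": 0.70, "model-b": 0.80}[model]
    prompt_bonus = {"prompt-v1": 0.00, "prompt-v2": 0.05}[prompt]
    cost = {"model-a": 1.0, "model-b": 3.0}[model]
    return base_quality + prompt_bonus, cost


ranked = grid_search(
    ["model-a", "model-b"],
    ["prompt-v1", "prompt-v2"],
    placeholder_evaluate,
)

for r in ranked:
    print(f"{r.model:8} {r.prompt:10} quality={r.quality:.2f} cost={r.cost:.1f}")
```

Keeping quality and cost as separate fields, rather than folding them into one score, leaves the trade-off visible so the final choice stays with the person reading the ranking.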

Experiment faster. Hand off without the rewrite.

190+ models, frozen splits, an AI assistant for experiments, and a config that ships to production unchanged.