Kiln for Data Scientists.
Rigorous AI experiments—without the notebook spaghetti
Run experiments across 190+ models, build agents in config that ship to production unchanged, and hand off to engineering with no rewrite.
Get what you need from your team
Hand off the exact run config you iterated on — model, prompt, tools, RAG, subagents. Engineering deploys the same artifact, no reimplementation, no port at the prod boundary.
Stakeholders define success as evals — not PRDs that drift from the model. PMs and SMEs contribute directly in the Kiln app: build evals, label datasets, generate synthetic data — no spreadsheets, no custom tooling.
Every issue lands with the exact run configuration: prompt, model, tools, RAG context, seed. Reproduce in one click and turn the trace into a labeled eval example.
Built for rigor, ready to ship
Experiments ready to ship, not just prototypes
Build agents in configuration on Kiln's production agent library — tools, RAG, skills, prompts, and subagents are all part of the run config. No code cleanup, no port to the prod stack. The config you iterate on in a notebook is the artifact engineering deploys.
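To make the "same artifact" idea concrete, here is a minimal sketch of a run config that serializes, reloads, and fingerprints identically on both sides of a handoff. It is not Kiln's schema: `RunConfig`, its field names, and the helper functions are all hypothetical and use only the Python standard library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical fields for illustration only; not Kiln's actual schema.
    model: str
    prompt: str
    tools: tuple = ()
    rag_source: Optional[str] = None
    subagents: tuple = ()


def to_json(config):
    """Serialize with sorted keys so equal configs produce identical text."""
    return json.dumps(asdict(config), sort_keys=True)


def from_json(text):
    """Rebuild the config; JSON arrays are converted back to tuples."""
    data = json.loads(text)
    data["tools"] = tuple(data["tools"])
    data["subagents"] = tuple(data["subagents"])
    return RunConfig(**data)


def fingerprint(config):
    """Short content hash that both sides can compare before deploying."""
    return hashlib.sha256(to_json(config).encode("utf-8")).hexdigest()[:12]


iterated = RunConfig(model="model-a", prompt="Summarize the ticket.", tools=("search",))
deployed = from_json(to_json(iterated))

print(deployed == iterated)
print(fingerprint(deployed) == fingerprint(iterated))
```

If the two fingerprints match, the deployed config is field-for-field the one that was evaluated. Changing any field produces a different fingerprint.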
Rigor built in
Frozen dataset splits never shift when new data lands. Withheld validation sets stay withheld. Evaluate with G-Eval and LLM-as-Judge, benchmark each judge against golden datasets, and pick the correlation method (Kendall Tau, Spearman, or Pearson). The result is trustworthy comparisons across hundreds of runs.
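For readers who want to see what the three correlation methods measure, the sketch below compares a judge's scores with human golden ratings using plain-Python implementations. It illustrates the statistics only and is not Kiln's code; the function names and the sample scores are made up for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: linear agreement between raw scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: concordant vs. discordant pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied on both sides; excluded from every count
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))


# Made-up scores for illustration: human golden ratings vs. a judge model.
human = [1, 2, 3, 4, 5]
judge = [1, 3, 2, 5, 4]

print("pearson ", round(pearson(human, judge), 3))
print("spearman", round(spearman(human, judge), 3))
print("kendall ", round(kendall_tau_b(human, judge), 3))
```

Pearson rewards linear agreement on the raw scores, while Spearman and Kendall only look at ordering. Rank-based measures are often a better fit when the judge's scale differs from the human one.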
Your AI co-pilot for experiments
An agent that designs and runs experiments while you focus on the science. Ask for a grid search across models and prompts, or a fine-tune sweep — come back to a ranked answer with cost and quality side-by-side. It knows every model, prompt technique, RAG pattern, and fine-tuning option Kiln supports.
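The shape of such a grid search can be sketched in a few lines. This is an illustration of the idea, not the assistant's implementation: `Result`, `grid_search`, `fake_run`, the model and prompt names, and the scores are all hypothetical placeholders rather than measurements.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Result:
    model: str
    prompt: str
    quality: float   # mean eval score, higher is better
    cost_usd: float  # total spend for the run


def grid_search(models, prompts, run_experiment):
    """Run every model/prompt pair, then rank by quality and break ties on cost."""
    results = [run_experiment(m, p) for m, p in product(models, prompts)]
    return sorted(results, key=lambda r: (-r.quality, r.cost_usd))


# Placeholder numbers standing in for real evaluation runs.
FAKE_SCORES = {
    ("model-a", "zero-shot"): (0.71, 0.40),
    ("model-a", "few-shot"): (0.78, 0.55),
    ("model-b", "zero-shot"): (0.78, 0.30),
    ("model-b", "few-shot"): (0.83, 0.90),
}


def fake_run(model, prompt):
    """Hypothetical experiment runner that looks up a canned score."""
    quality, cost = FAKE_SCORES[(model, prompt)]
    return Result(model, prompt, quality, cost)


ranked = grid_search(["model-a", "model-b"], ["zero-shot", "few-shot"], fake_run)
for r in ranked:
    print(f"{r.model:8} {r.prompt:10} quality={r.quality:.2f} cost=${r.cost_usd:.2f}")
```

Sorting on quality first and cost second keeps the cheaper of two equally good configurations ahead, which is one simple way to show cost and quality side by side.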
Kiln data in your notebooks
The Kiln Python library imports into any notebook — Jupyter, Colab, or your favorite IDE. Load any project, iterate over runs with typed Pydantic classes, and convert to pandas or polars DataFrames. The same data the desktop app shows, with zero import/export friction.
from kiln_ai.core import Project
import pandas as pd # or: import polars as pl
project = Project.load("./my_project")
task = project.tasks()[0]
runs = list(task.runs())
df = pd.DataFrame([r.dict() for r in runs])
print(df[["input", "output", "rating"]].describe())
AI experimentation before and after Kiln
Before
- Eval scripts, data formatters, and tracking code scattered across notebooks that don't compose, version, or share.
- Train/test splits shift when new data arrives. Experiment comparison is unreliable by default.
- Handoff to engineering means a rewrite. The notebook prototype and the prod system live in different code, formats, and infrastructure.
After
- Agents, evals, datasets, and fine-tuning share one config. Iterate in the UI or from your notebook with pip install kiln_ai.
- Frozen splits never shift, and withheld validation sets stay withheld. Run hundreds of experiments against exactly the same data.
- Engineering deploys the same run config you iterated on — model, prompt, tools, RAG, subagents — no reimplementation at the prod boundary.
Frequently asked
Is the agent config I iterate on what runs in production?
Yes. Tools, RAG, skills, prompts, and subagents are all part of the run config. The kiln_ai package is the same engine the desktop app and prod use — no wrapper, no reduced feature set, no rewrite at the handoff.
How does Kiln integrate with my notebooks?
Run pip install kiln_ai, then point the library at a Kiln project folder. Load datasets and runs into pandas or polars DataFrames, iterate with typed Pydantic classes, and run tasks programmatically. Works in Jupyter, Colab, or any IDE. Zero telemetry.
How do frozen dataset splits work?
When you create a split, Kiln permanently assigns items to train, validation, and test subsets. Assignments don't shift when new data is added, so you can run multiple experiments on exactly the same data and compare fairly.
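One common way to get assignments that never shift is to derive each item's split from a stable hash of its ID, so the result does not depend on what other data exists. The sketch below shows that technique under stated assumptions. It is not necessarily how Kiln implements frozen splits, and `assign_split`, the item IDs, and the 80/10/10 fractions are hypothetical.

```python
import hashlib

DEFAULT_FRACTIONS = (("train", 0.8), ("validation", 0.1), ("test", 0.1))


def assign_split(item_id, fractions=DEFAULT_FRACTIONS):
    """Map an item ID to a split using a stable hash of the ID alone."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, fraction in fractions:
        cumulative += fraction
        if bucket < cumulative:
            return name
    return fractions[-1][0]  # guard against floating-point rounding


original = [f"item-{i}" for i in range(1000)]
before = {item: assign_split(item) for item in original}

# New data lands later; assignments for the earlier items come out identical.
expanded = original + [f"item-{i}" for i in range(1000, 1500)]
after = {item: assign_split(item) for item in expanded}

unchanged = all(after[item] == before[item] for item in original)
print("earlier assignments unchanged:", unchanged)
```

Because the split depends only on the item's own ID, adding, removing, or reordering other items cannot move an existing item between train, validation, and test.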
What does the Kiln Assistant actually do for experiments?
It's an agent that reads your evals, traces, and configs, then designs and runs experiments: grid searches across models and prompts, fine-tune sweeps, and RAG ablations. It returns ranked results with cost and quality trade-offs.
Experiment faster. Hand off without the rewrite.
190+ models, frozen splits, an AI co-pilot for experiments, and a config that ships to production unchanged.