Auto-Optimize.
Optimize your agents, automatically
Prompt optimization, fine-tuning, RAG tuning, and model swaps — compare them all against the same evals, with cost data alongside quality.
How Auto-Optimize works
Eval-driven prompt optimization
The Auto Prompt Optimizer runs hundreds of eval-scored mutations, converging on a better prompt in about an hour. The output is a human-readable string you ship by changing one value — no retraining, no infrastructure.
converges in ~1 hour
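Under the hood this is a hill climb over prompt space. Here is a minimal sketch of the idea, not Kiln's actual implementation: `mutate` and `eval_score` are hypothetical stand-ins for an LLM-backed rewriter and your eval suite.

```python
import random

def mutate(prompt: str) -> str:
    # Stand-in: a real optimizer asks an LLM to rewrite the prompt.
    tweaks = [" Be concise.", " Think step by step.", " Cite your sources."]
    return prompt + random.choice(tweaks)

def eval_score(prompt: str) -> float:
    # Stand-in: a real run scores the prompt against your eval dataset.
    return random.random()

def optimize(seed: str, rounds: int = 300) -> str:
    best, best_score = seed, eval_score(seed)
    for _ in range(rounds):              # hundreds of eval-scored mutations
        candidate = mutate(best)
        score = eval_score(candidate)
        if score > best_score:           # keep a mutation only if it wins
            best, best_score = candidate, score
    return best                          # a plain string you can ship as-is

print(optimize("You are a helpful support agent."))
```

The key property: every candidate is judged by the same evals, so "better" means better on your data, not on vibes.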
Pick the right lever per task
Prompt optimization is fastest. When you need more, fine-tune across 60+ models, tune RAG configs with auto-generated evals, or swap models and re-run the whole suite.
every lever in one place
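To make the RAG lever concrete, here is a minimal sketch of a config sweep, assuming a hypothetical `run_evals` scorer; the chunk sizes and embedding names are illustrative.

```python
from itertools import product

def run_evals(chunk_size: int, embedding: str) -> float:
    # Stand-in: score auto-generated Q&A evals against this RAG config.
    base = 1.0 - abs(chunk_size - 512) / 1024
    return base + (0.05 if embedding == "embed-large" else 0.0)

chunk_sizes = [256, 512, 1024]
embeddings = ["embed-small", "embed-large"]

scores = {
    (size, emb): run_evals(size, emb)
    for size, emb in product(chunk_sizes, embeddings)
}
best_size, best_emb = max(scores, key=scores.get)
print(f"best config: chunk_size={best_size}, embedding={best_emb}")
```

Each configuration is scored by the same auto-generated Q&A evals, so the winner is picked by measurement rather than convention.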
Evals tie everything together
Whether you changed a prompt, swapped a model, or fine-tuned, you measure with the same evals. The Compare view scores every approach side-by-side with cost and latency.
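In code terms, the side-by-side comparison looks roughly like this. A minimal sketch, not the Compare view's API; `run_suite`, the candidate names, and the numbers are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Result:
    quality: float    # mean eval score, 0 to 1
    cost_usd: float   # total cost of the eval run
    latency_s: float  # median response latency

def run_suite(candidate: str) -> Result:
    # Stand-in: run the *same* eval set for every candidate.
    # Placeholder numbers, purely illustrative.
    fake = {
        "baseline prompt":  (0.71, 0.40, 1.2),
        "optimized prompt": (0.84, 0.42, 1.3),
        "fine-tuned model": (0.86, 1.10, 0.9),
    }
    return Result(*fake[candidate])

for name in ("baseline prompt", "optimized prompt", "fine-tuned model"):
    r = run_suite(name)
    print(f"{name:>16}  quality={r.quality:.2f}  "
          f"cost=${r.cost_usd:.2f}  latency={r.latency_s:.1f}s")
```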
AI-driven experiments
Kiln Assistant can propose and run experiments to optimize your agent — picking the right lever, kicking off the run, and reporting back when results are ready.
Every optimization lever, in one place
- Eval-driven prompt evolution: hundreds of mutations, about an hour.
- 60+ models across 4 providers, with serverless deployment.
- Topic trees produce diverse training and eval datasets.
- Optimize chunk size, embeddings, and retrieval, with auto-generated Q&A evals.
- Run any model/prompt/config combo against all your evals.
- Quality scores and cost per run, side-by-side.
- Every lever scores against the same eval framework.
- Your data stays on your machine.
Prompt optimization vs. fine-tuning
Both solve similar problems. Most teams should start with prompt optimization.
| Dimension | Prompt Optimizer | Fine-Tuning |
|---|---|---|
| Effort to start | Low | High |
| Optimization target | Evals | Supervised training data |
| Time to result | ~1 hour | 20 min to 1 day |
| Interpretability | Read your new prompt | Can’t interpret weight changes |
| Deployment | Change your prompt string (sketched below) | Host a custom model |
| Overfitting risk | Easy to detect and avoid | Requires careful validation |
| Data science skills | Not required | Recommended |
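To illustrate the Deployment row: shipping a prompt-optimizer result is a one-value change in your existing stack, whereas a fine-tune means hosting a new model. A sketch with hypothetical prompt text and request shape, not any specific provider's API.

```python
# Hypothetical prompt text and request shape; paste whatever string
# your optimization run actually produced.
SYSTEM_PROMPT = (
    "You are a support agent. Ground every answer in the retrieved docs, "
    "cite the doc you used, and say you don't know when the docs are silent."
)  # previously: "You are a helpful support agent."

def build_request(user_message: str) -> dict:
    # The rest of the stack is untouched; only the string above changed.
    return {
        "model": "your-current-model",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }

print(build_request("How do I reset my password?"))
```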
Optimization before and after Kiln
Before Kiln
- Try a new prompt, guess if it's better, ship it.
- Fine-tune in a notebook, prompt-optimize in a different tool, compare in a spreadsheet.
- Ship a prompt change and silently regress three other evals you forgot about.
- Make model decisions based on whoever tested last, not systematic comparison.
- A data scientist manually proposes and runs experiments.
With Kiln
- Run the Auto Prompt Optimizer and get a measurably better prompt in about an hour.
- Prompt optimization, fine-tuning, RAG, and model comparison live in one tool with shared evals.
- Compare every approach across every eval before shipping, with quality and cost side-by-side.
- Kiln Assistant can propose and execute experiments for you.
Frequently asked
How is the Auto Prompt Optimizer different from asking an LLM to write my prompt?
A single LLM call produces one prompt from general training data. Kiln runs hundreds of iterative mutations scored against your evals, converging on the best candidate through systematic search.
Does the Auto Prompt Optimizer require a paid plan?
Yes. It runs on Kiln’s servers and consumes millions of tokens per run, so it requires a paid plan. All other features — model comparison, fine-tuning, synthetic data, RAG tuning — are in the free version.
Stop guessing. Start optimizing.
Auto-optimize prompts, fine-tune models, tune RAG, and compare every approach against the same evals.