Auto-Optimize.
Optimize your agents, automatically
Prompt optimization, fine-tuning, RAG tuning, and model swaps — compare them all against the same evals, with cost data alongside quality.
How Auto-Optimize works
Eval-driven prompt optimization
The Auto Prompt Optimizer runs hundreds of eval-scored mutations, converging on a better prompt in about an hour. The output is a human-readable string you ship by changing one value — no retraining, no infrastructure.
converges in ~1 hour
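Under the hood this is a hill climb over prompt space. Here is a minimal sketch of the idea, not Kiln's actual implementation: `mutate` and `eval_score` are hypothetical stand-ins for an LLM-backed rewriter and your eval suite.

```python
import random

def mutate(prompt: str) -> str:
    # Stand-in: a real optimizer asks an LLM to rewrite the prompt.
    tweaks = [" Be concise.", " Think step by step.", " Cite your sources."]
    return prompt + random.choice(tweaks)

def eval_score(prompt: str) -> float:
    # Stand-in: a real run scores the prompt against your eval dataset.
    return random.random()

def optimize(seed: str, rounds: int = 300) -> str:
    best, best_score = seed, eval_score(seed)
    for _ in range(rounds):              # hundreds of eval-scored mutations
        candidate = mutate(best)
        score = eval_score(candidate)
        if score > best_score:           # keep a mutation only if it wins
            best, best_score = candidate, score
    return best                          # a plain string you can ship as-is

print(optimize("You are a helpful support agent."))
```

The key property: every candidate is judged by the same evals, so "better" means better on your data, not on vibes.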
Pick the right lever per task
Prompt optimization is fastest. When you need more, fine-tune across 60+ models, tune RAG configs with auto-generated evals, or swap models and re-run the whole suite.
every lever in one place
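To make the RAG lever concrete, here is a minimal sketch of a config sweep, assuming a hypothetical `run_evals` scorer; the chunk sizes and embedding names are illustrative.

```python
from itertools import product

def run_evals(chunk_size: int, embedding: str) -> float:
    # Stand-in: score auto-generated Q&A evals against this RAG config.
    base = 1.0 - abs(chunk_size - 512) / 1024
    return base + (0.05 if embedding == "embed-large" else 0.0)

chunk_sizes = [256, 512, 1024]
embeddings = ["embed-small", "embed-large"]

scores = {
    (size, emb): run_evals(size, emb)
    for size, emb in product(chunk_sizes, embeddings)
}
best_size, best_emb = max(scores, key=scores.get)
print(f"best config: chunk_size={best_size}, embedding={best_emb}")
```

Each configuration is scored by the same auto-generated Q&A evals, so the winner is picked by measurement rather than convention.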
Evals tie everything together
Whether you changed a prompt, swapped a model, or fine-tuned, you measure with the same evals. The Compare view scores every approach side-by-side with cost and latency.
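In code terms, the side-by-side comparison looks roughly like this. A minimal sketch, not the Compare view's API; `run_suite`, the candidate names, and the numbers are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Result:
    quality: float    # mean eval score, 0 to 1
    cost_usd: float   # total cost of the eval run
    latency_s: float  # median response latency

def run_suite(candidate: str) -> Result:
    # Stand-in: run the *same* eval set for every candidate.
    # Placeholder numbers, purely illustrative.
    fake = {
        "baseline prompt":  (0.71, 0.40, 1.2),
        "optimized prompt": (0.84, 0.42, 1.3),
        "fine-tuned model": (0.86, 1.10, 0.9),
    }
    return Result(*fake[candidate])

for name in ("baseline prompt", "optimized prompt", "fine-tuned model"):
    r = run_suite(name)
    print(f"{name:>16}  quality={r.quality:.2f}  "
          f"cost=${r.cost_usd:.2f}  latency={r.latency_s:.1f}s")
```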
AI-driven experiments
Kiln Assistant can propose and run experiments to optimize your agent — picking the right lever, kicking off the run, and reporting back when results are ready.
Every optimization lever, in one place
- Eval-driven prompt evolution: hundreds of mutations, about an hour.
- 60+ models across 4 providers, with serverless deployment.
- Topic trees produce diverse training and eval datasets.
- Optimize chunk size, embeddings, and retrieval, with auto-generated Q&A evals.
- Run any model/prompt/config combo against all your evals.
- Quality scores and cost per run, side-by-side.
- Every lever scores against the same eval framework.
- Your data stays on your machine.
Prompt optimization vs. fine-tuning
Both solve similar problems. Most teams should start with prompt optimization.
| Dimension | Prompt Optimizer | Fine-Tuning |
|---|---|---|
| Effort to start | Low | High |
| Optimization target | Evals | Supervised training data |
| Time to result | ~1 hour | 20 min to 1 day |
| Interpretability | Read your new prompt | Can’t interpret weight changes |
| Deployment | Change your prompt string (sketched below) | Host a custom model |
| Overfitting risk | Easy to detect and avoid | Requires careful validation |
| Data science skills | Not required | Recommended |
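To illustrate the Deployment row: shipping a prompt-optimizer result is a one-value change in your existing stack, whereas a fine-tune means hosting a new model. A sketch with hypothetical prompt text and request shape, not any specific provider's API.

```python
# Hypothetical prompt text and request shape; paste whatever string
# your optimization run actually produced.
SYSTEM_PROMPT = (
    "You are a support agent. Ground every answer in the retrieved docs, "
    "cite the doc you used, and say you don't know when the docs are silent."
)  # previously: "You are a helpful support agent."

def build_request(user_message: str) -> dict:
    # The rest of the stack is untouched; only the string above changed.
    return {
        "model": "your-current-model",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }

print(build_request("How do I reset my password?"))
```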
Optimization before and after Kiln
Before Kiln
- Try a new prompt, guess if it's better, ship it.
- Fine-tune in a notebook, prompt-optimize in a different tool, compare in a spreadsheet.
- Ship a prompt change and silently regress three other evals you forgot about.
- Make model decisions based on whoever tested last, not systematic comparison.
- A data scientist manually proposes and runs experiments.
With Kiln
- Run the Auto Prompt Optimizer and get a measurably better prompt in about an hour.
- Prompt optimization, fine-tuning, RAG, and model comparison live in one tool with shared evals.
- Compare every approach across every eval before shipping, with quality and cost side-by-side.
- Kiln Assistant can propose and execute experiments for you.
Frequently asked
How is the Auto Prompt Optimizer different from asking an LLM to write my prompt?
A single LLM call produces one prompt from general training data. Kiln runs hundreds of iterative mutations scored against your evals, converging on the best candidate through systematic search.
Does the Auto Prompt Optimizer require a paid plan?
Yes. It runs on Kiln’s servers and consumes millions of tokens per run, so it requires a paid plan. All other features — model comparison, fine-tuning, synthetic data, RAG tuning — are in the free version.
Stop guessing. Start optimizing.
Auto-optimize prompts, fine-tune models, tune RAG, and compare every approach against the same evals.