Auto-Optimize

Optimize your agents, automatically

Prompt optimization, fine-tuning, RAG tuning, and model swaps — compare them all against the same evals, with cost data alongside quality.

macOS · Windows · Linux

How Auto-Optimize works

Eval-driven prompt optimization

The Auto Prompt Optimizer runs hundreds of eval-scored mutations, converging on a better prompt in about an hour. The output is a human-readable string you ship by changing one value — no retraining, no infrastructure.
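The core idea is a mutate-score-select loop: propose a variant, score it against your evals, and keep it only if it measurably improves. A minimal illustrative sketch (the `mutate` and `score_on_evals` functions below are toy stand-ins, not Kiln's actual operators or scoring):

```python
import random

def mutate(prompt: str, rng: random.Random) -> str:
    # Toy mutation: append one of a few candidate instructions.
    tweaks = ["Be concise.", "Answer step by step.", "Cite sources."]
    return prompt + " " + rng.choice(tweaks)

def score_on_evals(prompt: str) -> float:
    # Stand-in for running the prompt against a real eval suite.
    # Here: longer, more specific prompts score higher, capped at 1.0.
    return min(len(prompt) / 120.0, 1.0)

def optimize(seed_prompt: str, generations: int = 50, seed: int = 0) -> tuple[str, float]:
    rng = random.Random(seed)
    best, best_score = seed_prompt, score_on_evals(seed_prompt)
    for _ in range(generations):
        candidate = mutate(best, rng)
        s = score_on_evals(candidate)
        if s > best_score:  # keep only eval-verified improvements
            best, best_score = candidate, s
    return best, best_score

better_prompt, score = optimize("You are a helpful support agent.")
```

Because every mutation is accepted or rejected by the evals, the output stays a plain prompt string you can read and ship directly.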

Pick the right lever per task

Prompt optimization is fastest. When you need more, fine-tune across 60+ models, tune RAG configs with auto-generated evals, or swap models and re-run the whole suite. Every lever in one place.

Evals tie everything together

Whether you changed a prompt, swapped a model, or fine-tuned, you measure with the same evals. The Compare view scores every approach side-by-side with cost and latency.
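Conceptually, the comparison reduces to scoring every approach on the same eval suite and lining up quality next to cost. A toy sketch (the field names and numbers below are hypothetical, not Kiln's API or data):

```python
# Each run: one approach scored on the SAME set of evals, with its cost.
runs = [
    {"approach": "baseline prompt",  "eval_scores": [0.71, 0.64], "cost_usd": 0.8},
    {"approach": "optimized prompt", "eval_scores": [0.84, 0.79], "cost_usd": 0.8},
    {"approach": "fine-tuned model", "eval_scores": [0.86, 0.81], "cost_usd": 2.4},
]

def summarize(run):
    # Average quality across shared evals, paired with cost per run.
    quality = sum(run["eval_scores"]) / len(run["eval_scores"])
    return run["approach"], round(quality, 3), run["cost_usd"]

# Rank approaches by quality, keeping cost visible for the tradeoff call.
table = sorted((summarize(r) for r in runs), key=lambda row: -row[1])
```

With cost in the same row as quality, "is the fine-tune worth 3x the price for a few points of quality?" becomes an explicit decision rather than a guess.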

AI-driven experiments

Kiln Assistant can propose and run experiments to optimize your agent — picking the right lever, kicking off the run, and reporting back when results are ready.

Every optimization lever, in one place

Auto Prompt Optimizer

Eval-driven prompt evolution — hundreds of mutations, ~1 hour.

Fine-Tuning

60+ models across 4 providers, with serverless deployment.

Synthetic data generation

Topic trees produce diverse training and eval datasets.

RAG tuning

Optimize chunk size, embeddings, and retrieval — auto-generated Q&A evals.
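RAG tuning amounts to a search over configurations scored by the same kind of evals. An illustrative sketch of a small grid search (the embedding names and `qa_eval_score` function are invented stand-ins for answering and grading auto-generated Q&A pairs):

```python
from itertools import product

CHUNK_SIZES = [256, 512, 1024]
EMBEDDING_MODELS = ["embed-small", "embed-large"]  # hypothetical names
TOP_K = [3, 5]

def qa_eval_score(chunk_size: int, embedding: str, top_k: int) -> float:
    # Stand-in for running auto-generated Q&A evals against one config.
    base = {"embed-small": 0.70, "embed-large": 0.78}[embedding]
    penalty = abs(chunk_size - 512) / 4096  # toy: 512 is the sweet spot
    return round(base - penalty + 0.01 * top_k, 4)

def best_config():
    # Exhaustively score every (chunk size, embedding, top-k) combination.
    configs = product(CHUNK_SIZES, EMBEDDING_MODELS, TOP_K)
    return max(configs, key=lambda c: qa_eval_score(*c))

winner = best_config()
```

The same pattern extends to any retrieval knob: as long as each config gets an eval score, picking the winner is mechanical.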

Model comparison

Pit any model/prompt/config combo against another, head to head, across all your evals.

Cost-aware tradeoffs

Quality scores and cost per run, side-by-side.

Eval-driven everywhere

Every lever scores against the same eval framework.

Open-source library, source-available app

Your data stays on your machine.

Prompt optimization vs. fine-tuning

Both solve similar problems. For most teams, prompt optimization is the right place to start.

Capability            Prompt Optimizer            Fine-Tuning
Effort to start       Low                         High
Optimization target   Evals                       Supervised training data
Time to result        ~1 hour                     20 min to 1 day
Interpretability      Read your new prompt        Can’t interpret weight changes
Deployment            Change your prompt string   Host a custom model
Overfitting risk      Easy to detect and avoid    Requires careful validation
Data science skills   Not required                Recommended

Optimization before and after Kiln

Without Kiln
  • Try a new prompt, guess if it’s better, ship it. Fine-tune in a notebook, prompt-optimize in a different tool, compare in a spreadsheet.
  • Ship a prompt change and silently regress three other evals you forgot about.
  • Make model decisions based on whoever tested it last, not on systematic comparison.
  • Data scientist manually proposes and runs experiments.
With Kiln
  • Run the Auto Prompt Optimizer and get a measurably better prompt in about an hour.
  • Prompt optimization, fine-tuning, RAG, and model comparison live in one tool with shared evals.
  • Compare every run method across every eval before shipping — quality and cost side-by-side.
  • Kiln Assistant can propose and execute experiments.

Frequently asked

How is the Auto Prompt Optimizer different from asking an LLM to write my prompt?

A single LLM call produces one prompt from general training data. Kiln runs hundreds of iterative mutations scored against your evals, converging on the best candidate through systematic search.

Does the Auto Prompt Optimizer require a paid plan?

Yes. It runs on Kiln’s servers and consumes millions of tokens per run, so it requires a paid plan. All other features — model comparison, fine-tuning, synthetic data, RAG tuning — are in the free version.

Stop guessing. Start optimizing.

Auto-optimize prompts, fine-tune models, tune RAG, and compare every approach against the same evals.