We just released Kiln v0.18.1, our biggest release ever!
Overview
- Evals V2: simplified eval creation, improved prompts, eval templates, and a new "compare" view to find the best model.
- Synthetic Data V2: more use cases, including fine-tuning, generating eval data, issue resolution, and more.
- Introducing "Kiln Issues": Kiln Issues is like a software bug tracker, but for AI systems.
- New Models: Added Kimi K2, Grok 4, Llama 4, DeepSeek R1 0528, and more!
- And More: improved UI, model suggestions, model call logs, new hyperparams, and bug fixes
- How to Help: we're open source and could use your help!
Evals V2
We've made a huge step forward in our mission to make evals easy and fast for AI teams.
- New Compare Screen: compare multiple models across all your evals in one view, including cost differences between models and prompts.
- New Eval Templates: we've added 6 new eval templates, which now integrate deeply with synthetic data generation.
- Improved UI: the eval UI will walk you through all steps of creating an eval (datasets, prompts, judges, results).
Check out our eval guide for a complete walkthrough.

Synthetic Data Generation V2
Different goals require different synthetic datasets. For fine-tuning, you want high-quality outputs across a broad range of possible inputs. For evals, you want broad inputs that often lead to specific types of failure.
Our new synthetic data gen system is all about building the right data for your current goal. Select one of our many new built-in templates, or customize the prompts to target specific use cases.

Beyond goal-specific templates, synthetic data sessions are now saved across page loads, which makes the tool easier to use.
Check out our synthetic data guide for more details.
Introducing "Kiln Issues"
Software teams have tools like GitHub Issues and Jira for tracking issues. What should AI product teams use?
We're launching Kiln Issues, an AI-native issue tracker for AI teams. It doesn't just collect bug reports; it keeps the structured data needed to reproduce, evaluate, and fix issues in AI systems. This is an early release, with lots more to come. Let us know what you think!
To get started, check out our new Issues documentation.
To learn more about our philosophy on improving AI system quality over time, read our recent blog post: Many Small Evals Beat One Big Eval, Every Time.
New Models: Kimi, Grok and More
This release includes a range of new models, including Kimi K2, Grok 4, DeepSeek R1 0528, and Llama 4. Kimi K2 has been showing great promise for evals, and the uncensored Grok 4 is great for generating adversarial synthetic data.
We also added the ability to add new models without app updates. Going forward, we'll push new models over the air as they become available.
And More
This release includes many improvements and bug fixes:
- Improved UI for filtering long lists: just type to filter
- More model suggestions for data gen and evals
- Model call logs: see the exact prompts and options passed to models
- New hyperparams: set temperature, top_p, JSON mode, and more
- Simplified chain-of-thought calls to non-thinking models
- Bug fixes: fixed how model outputs are passed to evals, and several other small issues
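For readers new to the sampling hyperparameters mentioned above, here is a minimal sketch of what temperature and top_p conventionally do to a model's next-token distribution. This is a generic illustration, not Kiln's code or any provider's implementation; the function names and the example scores are made up for this sketch.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=1.0):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Hypothetical scores for three candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.1]
probs = softmax_with_temperature(logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.9)
```

Lower temperatures concentrate probability on the top-scoring token, while a lower top_p trims the unlikely tail before sampling. In practice you would usually tune one of the two and leave the other at its default.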
How to Help (or Get Help)
We're an open source project and we really appreciate community feedback and support! Some ways you can help:
- Share Kiln with friends and co-workers
- Star our repo on GitHub - we're about to hit 4,000 stars!
- Give feedback on Discord
- Contribute by fixing or filing issues on our GitHub
Thanks for all your support!
Steve - The Kiln Maintainer