Kiln Release v0.18.1: Evals V2, Synthetic Data V2, Kiln Issues, New Models, and more!

We just released Kiln v0.18.1, our biggest release ever!

Overview

Evals V2: simplified eval creation, improved prompts, eval templates, and a new "compare" view to find the best model.
Synthetic Data V2: more use cases, including fine-tuning, generating eval data, issue resolution, and more.
Introducing 'Kiln Issues': Kiln Issues is like a software bug tracker, but for AI systems.
New Models: Added Kimi K2, Grok 4, Llama 4, DeepSeek R1 0528, and more!
And More: improved UI, model suggestions, model call logs, new hyperparams, and bug fixes
How to Help: we're open source and could use your help!

Evals V2

We've made a huge step forward in our mission to make evals easy and fast for AI teams.

New Compare Screen: compare multiple models across all your evals in one view, including cost differences between models and prompts.
New Eval Templates: we've added 6 new eval templates, which now integrate deeply with synthetic data generation.
Improved UI: the eval UI will walk you through all steps of creating an eval (datasets, prompts, judges, results).

Check out our eval guide for a complete walkthrough.

Figure: Comparing many models and evals in Kiln

Synthetic Data Generation V2

Different goals require different synthetic datasets. For fine-tuning, you want high-quality outputs across a broad range of possible inputs. For evals, you want broad inputs that often lead to specific types of failure.

Our new synthetic data gen system is all about building the right data for your current goal. Select one of our many new built-in templates, or customize the prompts to target specific use cases.

Figure: Selecting a data generation goal in Kiln

In addition to goals, synthetic data sessions are now saved across page loads, making the tool easier to use.

Check out our synthetic data guide for more details.

Introducing "Kiln Issues"

Software teams have tools like GitHub Issues and Jira for tracking issues. What should AI product teams use?

We're launching Kiln Issues, an AI-native issue tracker for AI teams. It doesn't just collect bug reports; it also keeps the structured data needed to reproduce, evaluate, and fix issues in AI systems. This is an early release, with lots more to come. Let us know what you think!

To get started, check out our new Issues documentation.

To learn more about our philosophy on improving AI system quality over time, read our recent blog post: Many Small Evals Beat One Big Eval, Every Time.

New Models: Kimi, Grok and More

This release includes a range of new models, including Kimi K2, Grok 4, DeepSeek R1 0528, and Llama 4. Kimi K2 has been showing great promise for evals, and the uncensored Grok 4 is great for generating adversarial synthetic data.

We can also now add new models without app updates. Going forward, we'll push new models over the air as they become available.

And More

This release includes many improvements and bug fixes:

Improved UI for filtering long lists: just type to filter
More model suggestions for data gen and evals
Model call logs: see exactly the prompts and options being passed to models
New hyperparams: set temperature, top_p, JSON mode, and more
Simplified chain of thought calls to non-thinking models
Bug fixes: fixed how model outputs are passed to evals, and several other small issues
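
If the sampling hyperparameters above are new to you, here is a minimal Python sketch of what temperature and top_p typically do to a model's next-token distribution. It is a generic illustration under simplifying assumptions, not Kiln's code or any provider's implementation; the function names and the example logits are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (hypothetical helper).

    Temperatures below 1 sharpen the distribution; above 1 flatten it.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize (hypothetical helper)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Made-up logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.0]
probs = apply_temperature(logits, temperature=1.0)
sharper = apply_temperature(logits, temperature=0.5)
filtered = top_p_filter(probs, top_p=0.8)  # drops the least likely token
```

In this sketch, lowering the temperature moves probability toward the top token, while top_p removes the unlikely tail before sampling. Real providers may combine or order these steps differently.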

How to Help (or Get Help)

We're an open source project and we really appreciate community feedback and support! Some ways you can help:

Share Kiln with friends and co-workers
Star our repo on GitHub - we're about to hit 4,000 stars!
Give feedback on Discord
Contribute by fixing or filing issues on our GitHub

Thanks for all your support!
Steve - The Kiln Maintainer