Users report that the agent gives wrong answers about the refund policy. The PM files a Jira ticket, but engineering can't reproduce the problem and closes the ticket as 'works for me.'
The PM creates a Kiln Issue with the exact input/output pair, generates synthetic data that exercises the refund scenario, and builds an eval that fails on the bad behavior.
The next optimization run picks up the eval automatically and resolves the issue. Every future change is validated against the eval, so the refund behavior cannot silently regress.
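
As a rough illustration of what "an eval that fails on the bad behavior" can mean, here is a minimal Python sketch. It does not use Kiln's API. The names `RegressionCase`, `passes`, `buggy_agent`, and `fixed_agent` are hypothetical, and the refund policy text is made up for the example. The idea is only that the reported input/output pair becomes data, and a check over that data fails for the bad answer and passes for a corrected one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegressionCase:
    """A reported failure captured as data (all field names are hypothetical)."""
    user_input: str
    bad_output: str    # the exact wrong answer the user saw
    must_contain: str  # a phrase a correct answer is assumed to include


def passes(case: RegressionCase, agent: Callable[[str], str]) -> bool:
    """Return True only if the agent avoids the bad answer and includes the required phrase."""
    answer = agent(case.user_input).lower()
    if answer == case.bad_output.lower():
        return False
    return case.must_contain.lower() in answer


# Illustrative data only; the policy wording is invented for this sketch.
case = RegressionCase(
    user_input="Can I get a refund after 45 days?",
    bad_output="Yes, refunds are available at any time.",
    must_contain="30 days",
)


def buggy_agent(prompt: str) -> str:
    # Stand-in for the agent before the fix: reproduces the reported answer.
    return "Yes, refunds are available at any time."


def fixed_agent(prompt: str) -> str:
    # Stand-in for the agent after the fix.
    return "No. Refunds are only available within 30 days of purchase."


assert not passes(case, buggy_agent)  # the check fails on the reported behavior
assert passes(case, fixed_agent)      # and passes once the behavior is corrected
```

A substring check is the simplest possible judge and is used here only to keep the sketch self-contained. A real eval would more likely score answers with a rubric or a model-based judge, but the contract is the same: the case stays in the suite and every later change has to pass it.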