At AI Engineer Europe 2026, Safe Intelligence CEO Steven Willmott gave a talk titled “Spec-Driven Testing for Agents With A Brain the Size of A Planet.”
AI agents are getting more capable.
That is the good news.
It is also the thing that makes them much harder to test.
When an agent can reason, use tools, follow workflows, retrieve information, and act across systems, the space of possible behaviours grows very large, very quickly.
A fixed prompt set can tell you what happened on the examples you remembered to write down.
It cannot tell you whether the agent will behave correctly across the messy, varied, adversarial, and changing conditions it will meet in the real world.
That is where specification-driven testing comes in.
Instead of starting with a static list of prompts, teams start by defining what the agent should do, what it must not do, what robustness properties matter, and what evidence is needed to trust the result.
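To make that concrete, here is a minimal sketch of what a specification might look like in code. It is an illustration only and is not taken from the talk or from any Spec27.ai product: the `Spec` class, the `run_spec` helper, the toy agent, and the refund rules are all hypothetical names invented for this example. The spec lists what the agent must do, what it must not do, and which input variations it should be robust to, and the runner returns one evidence record per check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for this sketch: an agent maps a prompt to a response,
# and a check inspects a (prompt, response) pair.
Agent = Callable[[str], str]
Check = Callable[[str, str], bool]


@dataclass
class Spec:
    """A toy specification: required behaviour, forbidden behaviour, robustness."""
    name: str
    must: Dict[str, Check] = field(default_factory=dict)
    must_not: Dict[str, Check] = field(default_factory=dict)
    # Each perturbation rewrites a seed prompt into a variant the agent should also handle.
    perturbations: List[Callable[[str], str]] = field(default_factory=list)


def run_spec(agent: Agent, spec: Spec, seed_prompts: List[str]) -> List[dict]:
    """Run every rule against every seed prompt and its perturbed variants."""
    evidence = []
    for seed in seed_prompts:
        variants = [seed] + [perturb(seed) for perturb in spec.perturbations]
        for prompt in variants:
            response = agent(prompt)
            for label, check in spec.must.items():
                evidence.append({"prompt": prompt, "rule": f"must:{label}",
                                 "passed": check(prompt, response)})
            for label, check in spec.must_not.items():
                # A must_not rule passes when the forbidden condition is absent.
                evidence.append({"prompt": prompt, "rule": f"must_not:{label}",
                                 "passed": not check(prompt, response)})
    return evidence


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent; it only recognises the exact word 'refund'."""
    if "refund" in prompt.lower():
        return "I can help with your refund. Please share your order number."
    return "Sorry, I can only help with refunds."


refund_spec = Spec(
    name="refund-assistant",
    must={"asks_for_order_number": lambda p, r: "order number" in r.lower()},
    must_not={"promises_payout": lambda p, r: "refund has been issued" in r.lower()},
    perturbations=[
        str.upper,                                # shouting
        lambda p: p + " please!!!",               # extra noise
        lambda p: p.replace("refund", "refnd"),   # typo
    ],
)

evidence = run_spec(toy_agent, refund_spec, ["I want a refund for my order"])
failures = [e for e in evidence if not e["passed"]]
for failure in failures:
    print(failure["rule"], "failed on:", failure["prompt"])
```

In this sketch the single seed prompt passes every rule, and the only failure comes from the typo variant, which is the kind of case a fixed prompt list tends to miss. A real specification would use far richer checks than substring matching, but the shape is the same: rules and robustness properties come first, and the test cases are derived from them.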
In this talk, Steve explores why dataset-driven evaluation is not enough for increasingly capable agents, and why specifications give teams a more structured way to define, test, and monitor agent behaviour over time.
Watch the talk: https://www.youtube.com/watch?v=UQKg0td-Bf4
Since you’re here, you can also check out the Spec27.ai early access programme for validation and monitoring of AI agents.