Every new wave of technology brings a new wave of conferences, summits, and meetups. AI is no different, and a dizzying array of new events has sprung up. One of the most interesting of these is the AI Engineer Summit series (and its counterpart, the AI Worlds Fair). The events resonate with our work at Safe Intelligence because they focus on the engineering of products, services, and systems around AI, rather than just AI models themselves.
Sure, you can train a model, but how do you put it in the right execution context, test the resulting products, and run it at scale?
The recent event in New York touched on many exciting themes and wasn’t explicitly focused on safety, but production quality and reliability shone through in many talks. Here are just a few:
- Lux Capital’s Grace Isford talked about the accumulation of small errors, looking at the AI market and agents in particular. The more we use AI-powered autonomous systems to execute tasks on their own, and the more complex those tasks become, the more the small errors at each step add up. These errors aren’t theoretical; in fact, they are almost guaranteed when solving messy real-world problems where it becomes hard to know how much “context” to factor in. What’s relevant in flight booking, for example? Just the flight schedules and prices? Seat preferences? Traffic or commute times? Weather? Airline status?
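To see how quickly per-step errors compound, here is a toy calculation (the numbers are illustrative, not from the talk): an agent that is 99% reliable at each step still fails often over a long chain of steps.

```python
# Toy illustration (hypothetical numbers): overall success rate of an
# agent pipeline when each step succeeds independently with probability p.
def chain_success_rate(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

# A 99%-reliable step looks safe, but reliability decays over long chains.
for n in (1, 10, 50, 100):
    print(f"{n:>3} steps: {chain_success_rate(0.99, n):.1%}")
# → 100.0%, 90.4%, 60.5%, 36.6%
```

The independence assumption is itself optimistic: in real agent workflows, one step's error often corrupts the context for every step after it.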
- Contextual’s Douwe Kiela focused on the need for context in order for models to resolve tasks accurately. The less specific the knowledge an agent has, the less likely it is to get the right answers. Kiela’s talk focused on LLM RAG systems that provide reasoning context, but the same is true for any type of model input relevant to an action. Vision models need to be tested with data from the actual sensor clusters they are attached to, and trading models need to be tested against (and receive signals from) as many of the market indicators that might affect their decisions as possible.
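The basic shape of a RAG system can be sketched in miniature. This toy version (the function names and the word-overlap scorer are my own illustration, not from the talk; real systems use embedding-based retrieval) fetches the most relevant snippet and prepends it to the prompt as context.

```python
# Minimal RAG sketch (hypothetical names, toy word-overlap scoring):
# retrieve the most relevant document, then build a context-grounded prompt.
def retrieve(query: str, docs: list[str]) -> str:
    q_words = set(query.lower().split())
    # Score each document by how many query words it shares.
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(query: str, docs: list[str]) -> str:
    context = retrieve(query, docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

docs = [
    "Flight AA100 departs JFK at 08:00 and costs $320.",
    "Seat 12A is a window seat with extra legroom.",
]
print(build_prompt("What time does flight AA100 depart?", docs))
```

The point of the pattern is the same at any scale: the model answers from the retrieved context, so the quality of retrieval bounds the quality of the answer.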
- Anthropic’s team dug deep into the hard topic of explainability for LLMs. Alexander Bricken and Joe Bayley talked through the company’s roadmap of research to better understand the linkage between LLM outputs and the structures represented in model weights. This is a hugely challenging task, and it may never be fully solved. Still, understanding what models are doing, what conditions could cause failures, and how things are represented in a model is extremely valuable if it can be teased out.
- In one of the other highlight talks, Mustafa Ali (Method Financial) and Kyle Corbitt (OpenPipe) talked through how they built early prototypes of an LLM-based application using powerful off-the-shelf LLMs, and then produced a smaller, distilled open-source model for large-scale deployment. The engineering mindset here is to focus fine-tuning on specific tasks and make the smallest model that works, thereby gaining efficiency and high performance.
All in all, it was a great event, and we’ll be back for the 2025 AI Worlds Fair!