AsgardBench: A benchmark for visually grounded interactive planning

At a glance

To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.

AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.

Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.

Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.
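
To make the last point concrete, here is a minimal sketch of how one fixed instruction can expand into different action sequences depending on what the agent observes. The mug scenario, the `MugObservation` type, the `plan_make_coffee` function, and the action strings are all hypothetical illustrations. They are not taken from AsgardBench or its API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MugObservation:
    """Hypothetical observed state of a mug (not an AsgardBench type)."""
    location: str   # e.g. "counter" or "cabinet"
    is_dirty: bool


def plan_make_coffee(obs: MugObservation) -> list[str]:
    """Return an illustrative action sequence for one fixed instruction.

    The instruction never changes; only the observed state does.
    """
    actions = []
    # A mug stored in a closed cabinet needs an extra step to reach it.
    if obs.location == "cabinet":
        actions.append("open cabinet")
    actions.append(f"pick up mug from {obs.location}")
    # A dirty mug must be washed before it can be used.
    if obs.is_dirty:
        actions += [
            "put mug in sink",
            "turn on faucet",
            "turn off faucet",
            "pick up mug from sink",
        ]
    actions += ["put mug in coffee machine", "turn on coffee machine"]
    return actions


clean_plan = plan_make_coffee(MugObservation("counter", is_dirty=False))
dirty_plan = plan_make_coffee(MugObservation("cabinet", is_dirty=True))
print(len(clean_plan), len(dirty_plan))  # prints "3 8"
```

In this toy setting, an agent that ignores the observation and always emits the three-step plan would fail whenever the mug is dirty or out of reach, which is the kind of failure a visually grounded benchmark is meant to expose.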
