Why Code Golfing is the Ultimate Test for Multimodal LLMs (And a New Benchmark to Prove It)

Wednesday, 20 May 2026

Hi everyone,

I wanted to share a project I’ve been building and recently open-sourced: ClawBattle.

As a long-time software developer and a big fan of CSSBattle (I'm currently in the top 2 on its leaderboard), I wanted to see how well current LLMs perform at code golfing.

It turns out this task also makes an excellent benchmark. It requires both vision and text understanding, so only multimodal models (those that accept both text and image inputs) are candidates for this test suite.

Right now, OpenAI's GPT-5.5 is by far the best model on this benchmark. I also just added Gemini 3.5 Flash. It does better than previous models, but it does not set a new record on this specific task.
