Benchmarking LLM Structured Outputs

Cross-posted from carrick.tools.

When you read the API documentation for OpenAI, Anthropic, or Google Gemini, the feature called "structured outputs" looks like a solved problem: pass a JSON schema, get back JSON that conforms to it.

In production, it is not a contract. It is a well-typed, best-effort suggestion.

At Carrick, the code-analysis scanner I work on, our post-LLM pipeline relies on a four-stage fallback parser. We attempt a direct parse, strip markdown fences, scan for array bounds inside surrounding garbage text, and finally apply regex cleanup. If all four fail, we drop the payload and proceed. If structured outputs worked as advertised, this would be a single serde_json::from_str(response).
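The four stages described above can be sketched roughly as follows. This is a minimal illustration in Python, not Carrick's actual implementation. The function names are hypothetical, the payload is assumed to be a JSON array (since stage three scans for array bounds), and the "regex cleanup" stage is assumed to mean stripping trailing commas, which the post does not specify.

```python
import json
import re
from typing import Optional

# Matches a response that is wrapped entirely in one markdown code fence.
FENCE_RE = re.compile(r"^\s*`{3}[\w-]*[ \t]*\n(.*?)\n?[ \t]*`{3}\s*$", re.DOTALL)
# Assumed cleanup rule: drop a comma that directly precedes ] or }.
# Note this can also alter string values that happen to contain ",]".
TRAILING_COMMA_RE = re.compile(r",\s*([\]}])")


def try_parse(text: str) -> Optional[list]:
    """Return the parsed value if it is a JSON array, otherwise None."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, list) else None


def strip_fences(text: str) -> str:
    """Remove a surrounding markdown fence, if there is one."""
    match = FENCE_RE.match(text)
    return match.group(1) if match else text


def slice_array(text: str) -> Optional[str]:
    """Cut out the span from the first '[' to the last ']'."""
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return None
    return text[start : end + 1]


def parse_llm_payload(response: str) -> Optional[list]:
    """Run the four fallback stages in order; None means drop the payload."""
    # Stage 1: direct parse.
    parsed = try_parse(response)
    if parsed is not None:
        return parsed
    # Stage 2: strip markdown fences.
    unfenced = strip_fences(response)
    parsed = try_parse(unfenced)
    if parsed is not None:
        return parsed
    # Stage 3: scan for array bounds inside surrounding text.
    sliced = slice_array(unfenced)
    candidate = sliced if sliced is not None else unfenced
    parsed = try_parse(candidate)
    if parsed is not None:
        return parsed
    # Stage 4: regex cleanup, then one last attempt.
    return try_parse(TRAILING_COMMA_RE.sub(r"\1", candidate))
```

Each stage only runs when the previous one fails, so a well-formed response costs a single parse and the looser heuristics never touch it.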

To isolate why this defensive parsing is necessary, I built a benchmark that runs eight synthetic schemas against six models (a flagship and a cheaper tier from each provider). Each schema isolates one structural stressor: a flat baseline, a 3-level nested object, a 7-level nested chain, a long enum, a oneOf tagged union, nullable + format fields, a 20-item array, and a closed object with additionalProperties: false. Every response is validated against the original schema using two independent validators (ajv and hyperjump). A response counts as strict adherence only when both validators accept it.
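The agreement rule can be sketched as a small classifier. This is an illustration only. The two checker functions are hypothetical hand-written stand-ins for ajv and hyperjump, which are JavaScript libraries and are not called here. The "disputed" verdict is my own assumption about how to record a disagreement. The lenient checker deliberately ignores extra keys to force a disagreement, and says nothing about how either real validator behaves.

```python
from enum import Enum
from typing import Any, Callable, Dict

Validator = Callable[[Any], bool]


class Verdict(Enum):
    STRICT = "strict"      # every validator accepted the payload
    REJECTED = "rejected"  # every validator rejected the payload
    DISPUTED = "disputed"  # validators disagreed (assumed extra category)


def classify(payload: Any, validators: Dict[str, Validator]) -> Verdict:
    """Count a payload as strict adherence only if all validators accept it."""
    if not validators:
        raise ValueError("at least one validator is required")
    results = [bool(check(payload)) for check in validators.values()]
    if all(results):
        return Verdict.STRICT
    if not any(results):
        return Verdict.REJECTED
    return Verdict.DISPUTED


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def closed_check(payload: Any) -> bool:
    """Stand-in for a closed object: exactly 'id' (integer) and 'name' (string)."""
    return (
        isinstance(payload, dict)
        and set(payload) == {"id", "name"}
        and _is_int(payload["id"])
        and isinstance(payload["name"], str)
    )


def lenient_check(payload: Any) -> bool:
    """Stand-in that checks the same fields but tolerates extra keys."""
    return (
        isinstance(payload, dict)
        and _is_int(payload.get("id"))
        and isinstance(payload.get("name"), str)
    )


VALIDATORS = {"closed": closed_check, "lenient": lenient_check}
```

With these stand-ins, a payload carrying an extra key is accepted by one checker and rejected by the other, so it is classified as disputed rather than strict.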
