A practical guide to prompt engineering for structured data extraction

Extracting structured data from unstructured text is one of the most practical uses of language models in production. Advisory feeds, incident reports, job postings, legal documents — they all contain structured information buried in natural language. Getting that information out reliably requires more than a naive "respond in JSON" instruction.

This tutorial walks through the full stack: system prompt design, few-shot examples, chain-of-thought for ambiguous fields, JSON mode, and Pydantic validation with retry logic. The running example is CVE advisory extraction, which is genuinely hard because advisories vary wildly in format and verbosity.
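
As a rough preview of the last layer in that stack, the sketch below validates a model reply against a Pydantic schema and retries with the validation error appended to the prompt. Everything named here is an assumption made for illustration: the `Advisory` fields, the `extract_with_retry` helper, the prompt wording, and the `call_model` callable, which stands in for whatever client sends a prompt and returns raw text. The stub replies use placeholder values, not real advisory data.

```python
import json
from typing import Callable, List, Optional

from pydantic import BaseModel, ValidationError


class Advisory(BaseModel):
    """Hypothetical target schema; field names are illustrative only."""
    advisory_id: str
    cve_ids: List[str]
    severity: Optional[str] = None


def extract_with_retry(
    text: str,
    call_model: Callable[[str], str],
    max_attempts: int = 3,
) -> Advisory:
    """Ask for JSON, validate it, and feed any error back into the next attempt."""
    prompt = (
        "Extract advisory_id, cve_ids and severity from the advisory below. "
        "Respond with a single JSON object.\n\n" + text
    )
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            # json.loads rejects malformed JSON; the model constructor rejects
            # missing or mistyped fields; ** rejects non-object JSON (TypeError).
            return Advisory(**json.loads(raw))
        except (json.JSONDecodeError, TypeError, ValidationError) as exc:
            last_error = exc
            prompt += (
                f"\n\nYour previous reply was rejected: {exc}\n"
                "Return corrected JSON only."
            )
    raise ValueError(f"no valid output after {max_attempts} attempts") from last_error


def make_stub() -> Callable[[str], str]:
    """Stand-in for a real model client: first reply lacks cve_ids, second is valid."""
    replies = iter([
        '{"advisory_id": "CERTFR-2025-AVI-0312"}',
        # CVE-0000-0000 is a placeholder, not a real identifier.
        '{"advisory_id": "CERTFR-2025-AVI-0312", "cve_ids": ["CVE-0000-0000"]}',
    ])
    return lambda prompt: next(replies)


advisory = extract_with_retry("CERT-FR CERTFR-2025-AVI-0312", make_stub())
print(advisory.advisory_id, advisory.cve_ids)
```

Passing the error text back to the model is one common design choice; a stricter variant might cap the prompt length or rebuild the prompt from scratch on each attempt rather than appending to it.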

What we are extracting

Given raw advisory text like this:

CERT-FR CERTFR-2025-AVI-0312
