I Built a Columnar File Format in Pure Python — a tiny, readable Parquet

A while back I caught myself in an interview saying "yeah, I use Parquet a lot" — and then realized I couldn't actually explain why it's faster than a CSV beyond hand-waving about "columnar." That bugged me. So I did the thing that always fixes this for me: I rebuilt it from scratch.

The result is Columna, a columnar storage engine and file format written in pure Python. No pandas, no pyarrow, no numpy. Just struct and zlib from the standard library. It's about 3,000 lines, it has 81 tests, and on a 50,000-row dataset it ends up 91% smaller than the equivalent CSV while reading 98% fewer bytes to answer a filtered query.
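To make the "reads fewer bytes" claim concrete, here is a minimal sketch of the general idea using only struct and zlib. It is not Columna's actual on-disk format: the `write_columns` and `read_column` helpers, the in-memory index, and the int64-only columns are all assumptions made for illustration. Each column is packed and compressed on its own, so a query that needs one column only has to touch that column's bytes.

```python
import struct
import zlib


def write_columns(columns):
    """Pack each column separately and return (blob, index).

    Hypothetical layout for illustration only: the index maps
    column name -> (offset, compressed_length, row_count).
    """
    blob = bytearray()
    index = {}
    for name, values in columns.items():
        # Assumes every value fits in a little-endian signed 64-bit integer.
        raw = struct.pack(f"<{len(values)}q", *values)
        packed = zlib.compress(raw)
        index[name] = (len(blob), len(packed), len(values))
        blob += packed
    return bytes(blob), index


def read_column(blob, index, name):
    """Decompress and unpack a single column, ignoring all others."""
    offset, length, count = index[name]
    raw = zlib.decompress(blob[offset:offset + length])
    return list(struct.unpack(f"<{count}q", raw))


columns = {
    "user_id": list(range(1000)),
    "status": [200] * 1000,
}
blob, index = write_columns(columns)

# Only the "status" slice of the blob is read and decompressed here.
status = read_column(blob, index, "status")
bytes_touched = index["status"][1]
print(f"touched {bytes_touched} of {len(blob)} bytes")
```

A row-oriented file like CSV interleaves every field of every row, so the same query would have to scan the whole file. Storing similar values next to each other also tends to compress well, which is where much of the size difference comes from.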

Live demo + file inspector: https://hajirufai.github.io/columna/

Code: https://github.com/hajirufai/columna

Here's what I learned building it.
