Apache Iceberg in Production: Compaction, Catalogs, and the Pitfalls Nobody Warns You About

Apache Iceberg looked like the answer to everything when we first adopted it. Open format, ACID transactions, time travel, schema evolution. We migrated our Hive tables, ran a few queries, and felt good about life.

Three months later, our S3 costs doubled. Queries that used to take 10 seconds were taking 4 minutes. Metadata operations were timing out. Nobody on the team could explain why.

That was the beginning of a real education in how Iceberg actually behaves in production. This post covers what I wish someone had told us before we went all-in.

The Small Files Problem Is Not Optional

Iceberg is append-friendly by design. Every micro-batch write, every streaming insert, every incremental load creates new Parquet files. Each file also gets its own entry in the table's manifest metadata, so the metadata grows right along with the file count.
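To make the growth concrete, here is a rough back-of-the-envelope sketch in Python. The workload numbers (one commit per minute, 8 partitions touched per commit, 50 GB ingested per day) are hypothetical and not taken from our tables, and the model assumes each commit writes exactly one file per touched partition, which is a lower bound in practice.

```python
def files_created(commits_per_hour: int, partitions_per_commit: int, hours: int) -> int:
    """Estimate new data files, assuming one file per touched partition per commit."""
    return commits_per_hour * partitions_per_commit * hours


def avg_file_size_mb(total_mb: float, file_count: int) -> float:
    """Average data file size when total_mb is spread evenly over file_count files."""
    return total_mb / file_count


# Hypothetical workload: a micro-batch every minute, each touching 8 partitions.
files_per_day = files_created(commits_per_hour=60, partitions_per_commit=8, hours=24)
files_per_quarter = files_created(commits_per_hour=60, partitions_per_commit=8, hours=24 * 90)

print(files_per_day)      # 11520
print(files_per_quarter)  # 1036800

# Hypothetical volume: 50 GB per day spread over those files.
print(round(avg_file_size_mb(50 * 1024, files_per_day), 2))  # 4.44
```

Under these assumptions, three months of uncompacted writes leaves about a million files averaging under 5 MB each, and every one of them has to be tracked in metadata and opened at query time.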

