PostgreSQL for Data Engineers: Indexes, Bulk Loads, and the Patterns That Actually Matter

The LedgerSync pipeline was inserting 1.5 million rows into PostgreSQL using pandas.to_sql(). It took four minutes per run. I switched to psycopg2's COPY command and it dropped to 18 seconds. Same data, same schema, same machine. That is not an optimization tip. It is the difference between a pipeline that fits in an Airflow schedule and one that does not. This article is about patterns like that: the ones that matter when you are building pipelines that run on a schedule, not when you are writing ad-hoc queries.

Loading Data: to_sql vs execute_values vs COPY

There are three ways to write rows from Python into PostgreSQL, and the performance gap between them is significant.

pandas to_sql issues one INSERT statement per row by default, or a multi-row INSERT with method="multi". It is the easiest to write and the slowest for any serious volume.

psycopg2 execute_values batches many rows into a single multi-row INSERT VALUES statement. About 5x faster than to_sql for medium-sized loads.

Loading Data: to_sql vs execute_values vs COPY

There are three ways to write rows from Python into PostgreSQL, and the performance gap between them is significant.

pandas to_sql issues one INSERT statement per row by default, or a multi-row INSERT with method="multi". It is the easiest to write and the slowest for any serious volume.

psycopg2 execute_values batches many rows into a single multi-row INSERT VALUES statement. About 5x faster than to_sql for medium-sized loads.

PostgreSQL for Data Engineers: Indexes, Bulk Loads, and the Patterns That Actually Matter

PostgreSQL for Data Engineers: Indexes, Bulk Loads, and the Patterns That Actually Matter

Related reading

Stop Guessing Batch Sizes: High-Performance Bulk Inserts in Spring Boot with…

Optimizing SQL Queries or Tracking Dangerous Criminals

ETL Pipeline: Fetching Real-Time News Data with Python and Postgres

PostgreSQL Query Optimization: Reducing API Response Time from 2.8 Seconds to…

Rewriting PostgreSQL in Rust: Technical Challenges and Lessons Learned

Designing High-Performance Blockchain Indexers

Related reading

Stop Guessing Batch Sizes: High-Performance Bulk Inserts in Spring Boot with…

Optimizing SQL Queries or Tracking Dangerous Criminals

ETL Pipeline: Fetching Real-Time News Data with Python and Postgres

PostgreSQL Query Optimization: Reducing API Response Time from 2.8 Seconds to…

Rewriting PostgreSQL in Rust: Technical Challenges and Lessons Learned

Designing High-Performance Blockchain Indexers