ETL with Python + DuckDB: CSV to parquet to SQL

DuckDB makes most “big data” pipelines obsolete for the laptop-scale work most of us actually do.

Install

pip install duckdb

Convert CSV to parquet

import duckdb
duckdb.sql("COPY (SELECT * FROM 'events.csv') TO 'events.parquet' (FORMAT PARQUET)")

10x smaller, 100x faster to query.

Query like Postgres

duckdb.sql("SELECT user_id, COUNT(*) FROM 'events.parquet' GROUP BY 1").show()

What went wrong

First run hit OutOfMemory on a 12GB CSV. Fix: stream via read_csv_auto instead of loading whole file.