DuckDB makes most “big data” pipelines obsolete for the laptop-scale work most of us actually do.
Install
pip install duckdb
Convert CSV to parquet
import duckdb
duckdb.sql("COPY (SELECT * FROM 'events.csv') TO 'events.parquet' (FORMAT PARQUET)")
Typically around 10x smaller on disk, and often orders of magnitude faster to query, since Parquet is columnar and compressed.
Query like Postgres
duckdb.sql("SELECT user_id, COUNT(*) FROM 'events.parquet' GROUP BY 1").show()
What went wrong
First run hit an out-of-memory error on a 12 GB CSV. Fix: stream the file with read_csv_auto instead of materializing the whole thing in memory.