Banned Pandas: Building Fast, Scalable Data Pipelines with DuckDB + SQL on Azure

✍️ By Abhishek Kumar | #FirstCrazyDeveloper

🧠 Why This Change Matters

Our data pipelines were fast — until they weren’t.
Large joins started crashing memory, inconsistent datetime types caused nightly job failures, and debugging hidden Python logic in Jupyter notebooks became a nightmare.

So, we did something radical.

We banned pandas.
Every transformation now runs in DuckDB, using pure SQL, directly on Parquet files in Azure Blob Storage.

And the results?
Our pipelines are faster, easier to debug, and simpler to maintain.

⚙️ 1️⃣ SQL-First Design = Reviewable and Reproducible

Traditional Pandas pipelines scatter logic across multiple scripts and functions.
With DuckDB, every transformation is just a SQL query — easy to review, test, and reproduce.

Example (Pandas ➡ DuckDB)

Before – Pandas

import pandas as pd

# The entire file is loaded into memory before any filtering happens.
orders = pd.read_parquet('orders.parquet')
orders = orders[orders['amount'] > 0]
result = orders.groupby('customer_id')['amount'].sum().reset_index()

After – DuckDB (SQL-Only)

SELECT customer_id, SUM(amount) AS total_amount
FROM 'az://company-data/orders/*.parquet'
WHERE amount > 0
GROUP BY customer_id;

No hidden logic, no scattered functions — just clean, testable SQL.
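
If you orchestrate from Python, the SQL stays the single source of truth. A minimal sketch of running the same query with the duckdb package (the local path mirrors the example above):

import duckdb

# An in-memory session; the transformation logic stays in reviewable SQL.
con = duckdb.connect()

result = con.execute("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM 'orders.parquet'
    WHERE amount > 0
    GROUP BY customer_id
""").fetchall()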

☁️ 2️⃣ Direct on Azure Blob Storage = Zero Local Copies

DuckDB reads Parquet files directly from Azure Blob Storage using its built-in Azure extension.
No staging, no downloads, no temporary local storage.

INSTALL azure;
LOAD azure;

CREATE SECRET az_conn (TYPE AZURE, CONNECTION_STRING 'DefaultEndpointsProtocol=...');

SELECT *
FROM 'az://company-data/orders/date=2025-10-*/part-*.parquet'
WHERE region = 'EU';

💡 Unlike Pandas, which loads data into memory, DuckDB queries the files in place — even large datasets that exceed RAM.
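
From Python, the same Azure setup is a few lines. A minimal sketch, assuming the connection string is supplied via an AZURE_STORAGE_CONNECTION_STRING environment variable (the variable name is our convention, not DuckDB's):

import os
import duckdb

con = duckdb.connect()
con.execute("INSTALL azure")
con.execute("LOAD azure")

# Keep the secret out of source control; read it from the environment.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
con.execute(f"CREATE SECRET az_conn (TYPE AZURE, CONNECTION_STRING '{conn_str}')")

rows = con.execute(
    "SELECT count(*) FROM 'az://company-data/orders/date=2025-10-*/part-*.parquet'"
).fetchone()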

🧩 3️⃣ Handles Big Data Without Crumbling Memory

Pandas DataFrames live entirely in memory; join two large DataFrames and you’ll often hit OOM.

DuckDB was built differently:

  • Columnar storage + vectorized execution
  • Predicate & projection pushdown
  • Works seamlessly with Parquet and Arrow
  • Spills gracefully to disk when a query exceeds RAM

SELECT region, SUM(sales) AS total_sales
FROM 'az://company-data/sales/**/*.parquet'
WHERE sales_date >= DATE '2025-01-01'
GROUP BY region;

This can easily scan hundreds of millions of rows without memory errors.
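
The spilling behavior can also be tuned explicitly. A minimal sketch capping DuckDB's working memory and pointing it at a spill directory (the 4GB limit and path are illustrative):

import duckdb

con = duckdb.connect()

# Cap working memory; operators that exceed it spill to disk instead of failing.
con.execute("SET memory_limit = '4GB'")

# Directory for temporary spill files during large joins and sorts.
con.execute("SET temp_directory = '/tmp/duckdb_spill'")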

🧱 4️⃣ Simpler Setup = One Engine to Rule Them All

We replaced a multi-layer stack:

  • pandas for DataFrames
  • pyarrow for I/O
  • SQLAlchemy for SQL interfacing
  • temp SQLite/Postgres for joins

➡️ With just DuckDB + Parquet + Arrow

No extra dependencies, no environment mismatch, no need to maintain multiple engines.
DuckDB handles ingest + transform + export in one consistent runtime.
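
Here's a minimal sketch of that single runtime at work: read from blob storage, aggregate, and export the result as Parquet, all in one statement (the output path is illustrative, and we write locally here):

import duckdb

con = duckdb.connect()
con.execute("INSTALL azure")
con.execute("LOAD azure")  # plus the secret setup shown earlier

# Ingest + transform + export in one statement: aggregate the remote
# Parquet files and write the result out as a new Parquet file.
con.execute("""
    COPY (
        SELECT region, SUM(sales) AS total_sales
        FROM 'az://company-data/sales/**/*.parquet'
        GROUP BY region
    ) TO 'sales_by_region.parquet' (FORMAT PARQUET)
""")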

🧮 5️⃣ Consistent Schemas via Apache Arrow

If you’ve ever fought with Pandas dtype quirks (object vs string, timezone offsets, nullable ints), you’ll love DuckDB’s Arrow-based schema consistency.

DuckDB keeps everything aligned — column names, data types, nested structs — across the pipeline.

✅ Fewer type mismatches
✅ No datetime conversion bugs
✅ Predictable schema evolution
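
To see this in practice, fetch a query result as a pyarrow Table and inspect its schema. A minimal sketch (file and columns follow the earlier example):

import duckdb

con = duckdb.connect()

# Results come back as a pyarrow.Table; names and types survive intact.
tbl = con.execute("""
    SELECT order_id, amount, order_ts::TIMESTAMP AS ts
    FROM 'orders.parquet'
""").arrow()

print(tbl.schema)  # e.g. order_id: int64, amount: double, ts: timestamp[us]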

🧠 Real-World Example: Customer Revenue Aggregation

Here’s how a full transformation pipeline now looks.

WITH orders AS (
  SELECT *
  FROM 'az://data/orders/date=2025-10-*/part-*.parquet'
),
filtered AS (
  SELECT order_id, customer_id, amount, order_ts::TIMESTAMP AS ts
  FROM orders
  WHERE amount > 0
),
enriched AS (
  SELECT f.*, c.segment
  FROM filtered f
  JOIN 'az://data/customers/*.parquet' c USING (customer_id)
)
SELECT segment, DATE_TRUNC('day', ts) AS day, SUM(amount) AS revenue
FROM enriched
GROUP BY 1, 2
ORDER BY day;

This single SQL script performs filtering, joining, and aggregation — without touching a DataFrame.
It’s fully auditable and can be wrapped in automated tests.
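
As an example of such a test, here is a minimal pytest sketch: build a tiny fixture Parquet file, run the same query shape against it, and assert on the output (all names are illustrative):

import duckdb

def test_revenue_excludes_non_positive_amounts(tmp_path):
    # Tiny fixture: one valid order and one that must be filtered out.
    fixture = tmp_path / "orders.parquet"
    duckdb.sql("""
        SELECT * FROM (VALUES
            (1, 101, 50.0),
            (2, 101, -5.0)
        ) AS t(order_id, customer_id, amount)
    """).write_parquet(str(fixture))

    # Same query shape as production, pointed at the fixture.
    result = duckdb.sql(f"""
        SELECT customer_id, SUM(amount) AS total_amount
        FROM '{fixture}'
        WHERE amount > 0
        GROUP BY customer_id
    """).fetchall()

    assert result == [(101, 50.0)]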

⚖️ Decision Matrix: When to Choose DuckDB vs Pandas

| Use Case | DuckDB | Pandas |
| --- | --- | --- |
| Querying Parquet in Azure Blob | ✅ Native support (no local copies) | ⚠️ Requires downloads / in-memory load |
| Big Joins & Aggregations | ✅ Handles larger-than-RAM | 🚫 Crashes on large joins |
| SQL Auditability & Testing | ✅ 100% SQL-based | ⚠️ Hidden Python logic |
| Schema Consistency | ✅ Arrow-based types | ⚠️ dtype mismatch headaches |
| EDA / Feature Engineering | ⚠️ SQL only | ✅ Rich Python ecosystem |

If your workflow centers around data pipelines, not EDA, DuckDB wins — hands down.

🏗️ Migration Patterns (From Pandas to SQL)

| Pandas Pattern | DuckDB SQL Equivalent |
| --- | --- |
| df.query('amount > 0') | SELECT * FROM t WHERE amount > 0 |
| pd.merge(df1, df2, on='id') | SELECT * FROM t1 JOIN t2 USING (id) |
| df.groupby(['region']).sum() | SELECT region, SUM(val) FROM t GROUP BY region |
| pd.concat([df1, df2]) | SELECT * FROM t1 UNION ALL SELECT * FROM t2 |
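
The migration doesn't have to happen in one shot, either: DuckDB can query a live pandas DataFrame by name, so legacy steps can be ported incrementally. A minimal sketch:

import duckdb
import pandas as pd

df = pd.DataFrame({"region": ["EU", "US", "EU"], "val": [1, 2, 3]})

# DuckDB resolves 'df' from the surrounding Python scope, so existing
# DataFrames can be queried with SQL while the pipeline is being ported.
out = duckdb.sql("SELECT region, SUM(val) AS val FROM df GROUP BY region").df()
print(out)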

🔍 When You Still Need Pandas (Optional Cases)

Pandas still shines for:

  • Quick ad-hoc exploration on small datasets (< 1 GB)
  • Machine-learning feature engineering that must stay in Python
  • Tight loops with NumPy / scikit-learn

But if your pipelines live in production — DuckDB’s SQL-first design makes debugging, scaling, and maintenance dramatically easier.

🧩 Final Verdict

We didn’t tune Pandas.
We removed it.

Now, every transformation:

  • Runs directly on Azure Blob (Parquet + Arrow)
  • Uses only DuckDB SQL
  • Is auditable, testable, and reproducible
  • Scales from MB to hundreds of GB without memory pain

DuckDB is not just a “smaller database.”
It’s a new paradigm for analytical pipelines, one that fits the modern cloud-native data stack perfectly.

✍️ Abhishek's Take

If your data workflow starts with “load into pandas,” you’ve already limited yourself to RAM.
If it starts with “run SQL on storage,” you’ve unlocked scalability by design.

The best optimization isn’t tuning Pandas — it’s removing it.

🧭 References

  • DuckDB Official Docs – Azure Extension & SQL Interface (duckdb.org/docs)
  • Apache Arrow Interchange Format Docs
  • Pandas Performance Notes & Scaling Guide
  • DuckDB Benchmark Tracking Portal

#DuckDB #DataEngineering #SQL #AzureBlob #Parquet #Arrow #BigData #Azure #CloudArchitecture #Python #Pandas #Performance #LinkedInTech #TechCommunity
