Polars vs Pandas: A Field Report

After six months of using Polars in production pipelines, here's an honest assessment.

I switched a production data pipeline from Pandas to Polars about six months ago. Not as an experiment, but as a necessity: the pipeline was processing ~15 million rows of claims data per run, and the Pandas version was hitting memory limits on our Azure VM.

Here's what I found.

The Speed Claims Are Real

Polars is genuinely fast. The lazy evaluation engine and Rust internals aren't marketing. On our pipeline, the same transformation logic ran 4.2x faster and used about 60% less peak memory.

The gains are most dramatic when:

  • You're doing complex groupby + aggregation chains
  • You have wide DataFrames (many columns) with selective reads
  • You're doing string operations at scale

The API Takes Getting Used To

The mental model shift is real. Pandas encourages in-place mutation. Polars is immutable by design: every operation returns a new DataFrame. At first this feels verbose. After a month it feels correct.

The expression API is where Polars shines:

# Polars - readable, composable, lazy
import polars as pl

result = (
    df.lazy()
      .filter(pl.col("status") == "ACTIVE")
      .with_columns([
          pl.col("amount").cast(pl.Float64),
          (pl.col("amount") * pl.col("rate")).alias("weighted_amount"),
      ])
      .group_by("provider_id")
      .agg(pl.col("weighted_amount").sum())
      .collect()
)

Where I Still Use Pandas

  • Exploratory work in notebooks - the ecosystem (seaborn, statsmodels) still expects pandas
  • Small datasets - the overhead of thinking in Polars isn't worth it under ~100k rows
  • When I need .apply() on complex Python objects - Polars has map_elements but it's slower than pandas .apply() for arbitrary Python

The Verdict

If you're building production pipelines on large datasets, switch. The performance gains are real, the API is better designed, and the error messages are actually informative.

For notebooks and exploration, Pandas isn't going anywhere.