Sail (PySail) Benchmark Report

Platform: Apple M1 MacBook Pro — 8GB RAM
Engine: Sail 0.6.0 (Rust-native, Apache Arrow + DataFusion)
Runtime: Docker container, --memory=6g
Date: 15 April 2026

Overview

Sail is a Rust-native, JVM-free compute engine compatible with the Apache Spark Connect protocol. It implements the Spark SQL and DataFrame APIs, so existing PySpark code can connect to Sail over sc://localhost:{port} without rewrites.
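
As a minimal sketch of what connecting looks like from the client side, the snippet below builds the Spark Connect URL and opens a session through the standard PySpark `SparkSession.builder.remote(...)` entry point. The helper names `sail_remote_url` and `connect` and the port 50051 are assumptions for illustration; the report does not state which port was used.

```python
def sail_remote_url(port: int, host: str = "localhost") -> str:
    """Build the Spark Connect URL for a Sail server (sc://host:port)."""
    return f"sc://{host}:{port}"


def connect(port: int):
    """Open a Spark Connect session against an already running Sail server.

    pyspark is imported lazily so the URL helper works without it installed.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(sail_remote_url(port)).getOrCreate()


# Hypothetical usage, assuming a Sail server is listening on port 50051:
#   spark = connect(50051)
#   spark.sql("SELECT 1 AS ok").show()
print(sail_remote_url(50051))
```

Everything after `connect(...)` is ordinary PySpark code; only the remote URL points at Sail rather than a Spark cluster.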

This report documents a series of progressively complex workload tests run entirely inside a Docker container on a constrained 8GB M1 machine, using pysail and pyspark-client as the only dependencies.

Environment

Item	Detail
Machine	Apple MacBook Pro M1
RAM	8GB unified memory
Docker memory limit	6GB
Python	3.14
PySail version	0.6.0
PySpark client	pyspark-client (Spark 4.x)
Storage	Local `/tmp` inside container, mounted volume for Parquet output

Test Results

Test 1 — Simple Aggregation (5M rows)

Workload: Generate 5M rows, join to 2 dimension tables, run 6 aggregations.

Metric	Value
Rows	5,000,000
Joins	2 (star schema)
Aggregations	6 (count, sum, avg, max, min, stddev)
Total time	0.59s

Test 2 — Heavy Aggregation with Derived Metrics (10M rows)

Workload: 10M rows, 2 joins, 7 derived columns (sqrt, log, pow, abs, discount, tax, gross profit), 13 aggregations, 2 post-aggregation ratio columns.

Metric	Value
Rows	10,000,000
Joins	2
Derived columns	7 (sqrt, log, pow, abs, discount, tax, gross profit)
Aggregations	13
Post-agg columns	2 (profit margin %, tax ratio)
Total time	1.39s

Test 3 — Window Functions (8M rows)

Workload: 8M rows, 2 joins, 3 partition-only window functions (avg over tier, max over category, avg over region), 9 final aggregations.

Metric	Value
Rows	8,000,000
Joins	2
Window functions	3 (partition-only, no unbounded sort)
Aggregations	9
Total time	4.27s

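Window functions require a full pass over each partition and are inherently more memory-intensive than plain aggregations. Per row, Test 3 costs roughly 4x to 4.5x as much as the plain aggregations in Tests 1 and 2 (about 0.53 µs/row versus 0.12 to 0.14 µs/row). This is expected behaviour, not a Sail limitation.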
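
The per-row comparison can be checked directly from the reported timings. The sketch below divides each test's total time by its row count and compares the window test against the two plain aggregation tests; with the figures in this report it gives roughly 4.5x and 3.8x. It uses only the numbers from the result tables above and ignores differences in aggregation count between the tests.

```python
# (rows, seconds) taken from the Test 1-3 result tables in this report.
tests = {
    "simple_agg": (5_000_000, 0.59),
    "heavy_agg": (10_000_000, 1.39),
    "window": (8_000_000, 4.27),
}

# Microseconds of wall time per input row.
us_per_row = {name: secs / rows * 1e6 for name, (rows, secs) in tests.items()}

for name, cost in us_per_row.items():
    print(f"{name}: {cost:.3f} us/row")

# Per-row cost of the window test relative to each plain aggregation test.
ratio_vs_simple = us_per_row["window"] / us_per_row["simple_agg"]
ratio_vs_heavy = us_per_row["window"] / us_per_row["heavy_agg"]
print(f"window vs simple: {ratio_vs_simple:.1f}x")
print(f"window vs heavy: {ratio_vs_heavy:.1f}x")
```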

Test 4 — Parquet Read/Write + Query (10M rows)

Workload: Generate 10M rows, write to partitioned Parquet (by tier + category), read back, apply filter + 6 aggregations.

Step	Time
Write (partitioned Parquet)	1.61s
Read + row count	0.01s
Filter + 6 aggregations on Parquet	0.11s
Total wall time	1.74s

The filtered aggregation query on 10M rows read from disk took 0.11s. Partition pruning on tier + category reduces the effective scan significantly.
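
To illustrate why pruning shrinks the scan, here is a small stand-alone sketch of the idea, not Sail's implementation. It assumes the Hive-style `column=value` directory layout that a partitioned Parquet write produces, and that the query filters on the partition columns. The tier and category values, the `out/` prefix, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical Hive-style layout for data partitioned by tier and category.
files = [
    "out/tier=gold/category=auto/part-0.parquet",
    "out/tier=gold/category=home/part-0.parquet",
    "out/tier=silver/category=auto/part-0.parquet",
    "out/tier=silver/category=home/part-0.parquet",
]


def partition_values(path):
    """Extract column=value pairs from the directory names of a file path."""
    parts = PurePosixPath(path).parts[:-1]  # drop the file name
    return dict(p.split("=", 1) for p in parts if "=" in p)


def prune(paths, **wanted):
    """Keep only files whose partition values match every requested filter."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


selected = prune(files, tier="gold", category="auto")
print(selected)
print(f"scanned {len(selected)} of {len(files)} files")
```

A filter on both partition columns selects a single directory here, so only a fraction of the files ever needs to be opened.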

Test 5 — Parallel SCD Type 2 (5 dimensions, 4M rows total)

Workload: 5 dimension tables processed concurrently via ThreadPoolExecutor(max_workers=5). Each dimension undergoes a full SCD Type 2 merge — row hash comparison, expiry of changed records, insertion of new versions.

Dimension	Existing	Updates	Final	Expired	Time
dim_agent	750,000	250,000	1,000,000	250,000	2.46s
dim_branch	250,000	100,000	350,000	100,000	1.38s
dim_customer	1,000,000	300,000	1,300,000	300,000	2.73s
dim_policy	1,500,000	500,000	2,000,000	500,000	2.92s
dim_product	500,000	200,000	700,000	200,000	2.05s
TOTAL	4,000,000	1,350,000	5,350,000	1,350,000	2.92s wall

All 5 SCD2 jobs completed in 2.92s wall time, bounded by the largest dimension (dim_policy). This demonstrates Sail's ability to handle concurrent Spark Connect sessions from a single server process.
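
The concurrency pattern can be sketched with the standard library alone. In the snippet below, `run_scd2` is a stand-in that simply sleeps for each dimension's reported time (scaled down 100x); a real job would issue Spark Connect queries instead. The point is only to show that wall time with five workers tracks the longest job rather than the sum (the per-dimension times in the table add up to 11.54s).

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Per-dimension times from the Test 5 table, scaled down 100x for a quick demo.
SCALE = 0.01
job_seconds = {
    "dim_agent": 2.46,
    "dim_branch": 1.38,
    "dim_customer": 2.73,
    "dim_policy": 2.92,
    "dim_product": 2.05,
}


def run_scd2(name, seconds):
    """Stand-in for one SCD2 merge; sleeps instead of running real queries."""
    time.sleep(seconds * SCALE)
    return name


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns results in submission order once each job has finished.
    finished = list(pool.map(run_scd2, job_seconds.keys(), job_seconds.values()))
wall = time.perf_counter() - start

serial = sum(job_seconds.values()) * SCALE
longest = max(job_seconds.values()) * SCALE
print(f"serial estimate {serial:.3f}s, longest job {longest:.3f}s, wall {wall:.3f}s")
```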

Test 6 — End-to-End Daily SCD2 Pipeline with Parquet I/O (5M rows)

Workload: Full simulated daily pipeline — write T-1 snapshot to Parquet, read back, apply 30% change batch, run SCD2 merge, write T snapshot to Parquet, validate.

Step	Detail	Time
Write T-1 snapshot	5M rows → Parquet	0.64s
Read T-1 from Parquet	5M rows	0.01s
Generate change batch	1.5M records (30% churn)	—
SCD2 merge	Join + expire + new versions	0.00s
Write T snapshot	6.5M rows → Parquet	5.04s
Validate output	count + filter checks	0.18s
Total wall time		5.91s

Output validation:

Metric	Value
T-1 rows	5,000,000
T total rows	6,500,000
Current records	5,000,000
Expired records	1,500,000
Net new versions	1,500,000

Parquet files were written to the mounted local volume and verified as readable after the container exited.
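
The validation figures follow from simple SCD2 bookkeeping, which the sketch below reproduces. It assumes every record in the change batch matches an existing key and actually changes it, which is consistent with the counts reported above; the variable names are illustrative.

```python
# Figures from the Test 6 tables.
t1_rows = 5_000_000
churn_pct = 30

changed = t1_rows * churn_pct // 100  # records in the change batch
expired = changed                     # each change closes one old version
new_versions = changed                # and opens one new current version
current = t1_rows - expired + new_versions
t_total = current + expired

print(changed, current, expired, t_total)
```

The current record count stays at the T-1 size because each expired version is replaced by exactly one new version, while the total grows by the size of the change batch.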

Summary

Test	Rows	Key Operations	Time
Simple aggregation	5M	2 joins, 6 aggs	0.59s
Heavy aggregation	10M	2 joins, 13 aggs, 7 derived cols	1.39s
Window functions	8M	2 joins, 3 windows, 9 aggs	4.27s
Parquet query	10M	filter + 6 aggs off disk	0.11s
Parallel SCD2 (5 dims)	4M total	5 concurrent merges	2.92s wall
Daily SCD2 + Parquet I/O	5M + 1.5M delta	full pipeline	5.91s

Key Observations

No JVM overhead. Sail starts in under 2 seconds and has a small memory footprint at idle, with no JVM warmup, GC pauses, or heap tuning required.

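Parquet performance is strong. Opening the 5M-row Parquet snapshot takes 0.01s, which most likely reflects lazy plan construction and metadata access rather than a full scan. A filtered aggregation on 10M rows off partitioned Parquet takes 0.11s. No other engines were benchmarked in this report, so these figures are not a head-to-head comparison.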

Concurrent session support works. Five independent SCD2 pipelines ran in parallel against a single Sail server process with no contention issues. Wall time was bounded by the largest job, not the sum of all jobs.

Memory ceiling on constrained hardware. On a 6GB Docker allocation, the practical limit is approximately 5–8M rows for the most memory-intensive operations, which are window functions with unbounded sorts and large Parquet writes. Lighter workloads went higher: the 10M-row tests (Tests 2 and 4) completed within the same allocation. On a 32GB+ cloud VM, 10M+ row workloads should have considerably more headroom.

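SCD Type 2 timing needs care. The merge step in Test 6 (join, filter, union) is reported at 0.00s at 5M rows, but DataFrame transformations are lazy, so that figure most likely measures plan construction only, with the merge actually executing inside the 5.04s T snapshot write. Test 5, which times each merge end to end, shows 1.38s to 2.92s per dimension. The tests therefore do not separate transformation cost from Parquet I/O, and the combined merge-and-write step is the dominant cost in the pipeline.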
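
The 0.00s reported for the SCD2 merge step in Test 6 is consistent with lazy evaluation: DataFrame transformations only build a plan, and the work runs when an action such as a write or count is triggered. The toy class below is not the PySpark API; `LazyFrame`, `transform`, and `collect` are hypothetical names used only to show why timing the transformation step alone reports almost nothing.

```python
import time


class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not the PySpark API)."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def transform(self, fn):
        # Only records the step; no data is touched here.
        return LazyFrame(self.rows, self.steps + (fn,))

    def collect(self):
        # The action: every recorded step runs now.
        out = list(self.rows)
        for fn in self.steps:
            out = [fn(r) for r in out]
        return out


def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


base = LazyFrame(range(200_000))
plan, plan_secs = timed(lambda: base.transform(lambda x: x * 2).transform(lambda x: x + 1))
rows, run_secs = timed(plan.collect)

print(f"plan built in {plan_secs:.6f}s, executed in {run_secs:.6f}s")
```

To time a merge on its own in a benchmark like this one, the merged DataFrame would need to be forced with an action before the write step is started.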

Architecture Notes

The SCD2 implementation used in these tests follows a hash-based change detection pattern:

1. Compute MD5(concat_ws("|", tracked_cols)) as row_hash on both existing and incoming records.
2. Left join the existing dimension to the incoming updates on the natural key.
3. Split into three sets: unchanged (no match or same hash), expired (hash changed → set is_current=False, close expiry_date), and new versions (hash changed → insert with new effective_date, expiry_date=9999-12-31, is_current=True).
4. Union all three sets as the final output.

All columns in concat_ws must be cast to string before hashing — Sail enforces strict type checking on concat_ws and does not auto-coerce integer types.
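
The pattern above can be sketched in plain Python to make the three output sets concrete. This is an illustration under assumptions, not the code used in the tests: the column names (`key`, `name`, `tier`), the `TRACKED` list, and the sample rows are hypothetical, and rows are dicts rather than DataFrame rows. Note the explicit `str(...)` cast before hashing, mirroring the concat_ws requirement.

```python
import hashlib
from datetime import date

OPEN_END = date(9999, 12, 31)
TRACKED = ["name", "tier"]  # hypothetical tracked columns


def row_hash(row):
    """MD5 over the tracked columns joined with '|', each cast to str first."""
    joined = "|".join(str(row[c]) for c in TRACKED)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def scd2_merge(existing, updates, today):
    """Return unchanged + expired + new-version rows for one dimension."""
    incoming = {u["key"]: u for u in updates}  # lookup by natural key
    out = []
    for row in existing:
        upd = incoming.get(row["key"])  # left join: existing -> updates
        changed = (
            row["is_current"]
            and upd is not None
            and row_hash(upd) != row_hash(row)
        )
        if not changed:
            out.append(row)  # unchanged: no match or same hash
            continue
        # Expired: close the old version.
        out.append({**row, "is_current": False, "expiry_date": today})
        # New version: open-ended current record.
        out.append({**upd, "effective_date": today,
                    "expiry_date": OPEN_END, "is_current": True})
    return out


existing = [
    {"key": 1, "name": "Ann", "tier": 1, "effective_date": date(2026, 1, 1),
     "expiry_date": OPEN_END, "is_current": True},
    {"key": 2, "name": "Bob", "tier": 2, "effective_date": date(2026, 1, 1),
     "expiry_date": OPEN_END, "is_current": True},
]
updates = [{"key": 1, "name": "Ann", "tier": 3}]

merged = scd2_merge(existing, updates, today=date(2026, 4, 15))
for r in merged:
    print(r["key"], r["tier"], r["is_current"], r["expiry_date"])
```

Like the left join described above, this sketch only walks existing rows, so incoming records with a brand-new natural key would need a separate insert step.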

Conclusion

Sail delivers credible Spark-compatible performance on severely constrained hardware. On production-grade infrastructure (cloud VMs, Kubernetes), the throughput numbers suggest it is a viable replacement for Spark in batch ETL, dimension processing, and analytical query workloads — with significantly lower operational overhead and no JVM dependency.