Spark Performance for Data Teams
Partitioning and cache strategy
Avoid over-partitioning, under-partitioning, and accidental cache pressure.
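Partitioning helps when the partition columns match the filters that queries commonly use, because Spark can then skip partitions that cannot contain matching rows. Caching helps when it avoids recomputing the same expensive result. Both can hurt when applied reflexively: too many small partitions slow down planning and reads, and cached data that is never reused occupies memory that other work could use.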
Practical guidance
- Partition on stable, low-to-medium cardinality columns used in filters.
- Avoid partitioning on IDs with millions of values.
- Cache only when the same intermediate result feeds repeated actions.
- Unpersist cached data when the workflow moves on.
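The last two points can be sketched as a small context manager that caches a DataFrame for a block of work and releases it afterwards. This is a minimal sketch, not a Spark API: the helper name `cached` is hypothetical, and it only assumes the object exposes `cache()` and `unpersist()`, as PySpark DataFrames do.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of a with-block, then release it.

    Hypothetical helper: works with any object that has cache() and
    unpersist() methods, such as a PySpark DataFrame.
    """
    df.cache()  # lazy in Spark: the first action inside the block fills the cache
    try:
        yield df
    finally:
        df.unpersist()  # runs even if an action inside the block raises


# Usage sketch (assumes an active `spark` session; not executed here):
# with cached(spark.table("main.silver.events").dropDuplicates()) as events:
#     events.count()                                   # first action fills the cache
#     events.groupBy("customer_id").count().show(10)   # second action reuses it
```

Tying the cache to a `with` block makes the "unpersist when the workflow moves on" step hard to forget. If only one action runs inside the block, the cache is probably not paying for itself.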
Skew smell
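When one key dominates a group or join, adding workers may not help: rows with the same key are shuffled to the same task, so that one task becomes the bottleneck while the others sit idle. Consider salting, pre-aggregation, or a different join strategy, such as a broadcast join when one side is small.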
# Top 10 customer_id values by row count; one key far ahead of the rest suggests skew.
spark.table("main.silver.events").groupBy("customer_id").count().orderBy("count", ascending=False).show(10)
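To make the salting idea concrete, here is a rough sketch in plain Python so it runs without a cluster. It counts rows in two stages: first per (key, salt) pair, then per key. The function name `salted_counts`, the bucket count of 8, and the sample data are all illustrative assumptions, not values from a real workload.

```python
import random
from collections import Counter


def salted_counts(keys, num_salts=8, seed=0):
    """Count rows per key in two stages, spreading each key over salt buckets.

    Hypothetical helper for illustration; num_salts=8 is an arbitrary choice.
    """
    rng = random.Random(seed)
    # Stage 1: group on (key, salt). A hot key is split into up to num_salts groups.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and sum the partial counts per key.
    totals = Counter()
    for (key, _salt), n in partial.items():
        totals[key] += n
    return partial, totals


# Illustrative data: one dominant key plus a few small ones.
keys = ["big"] * 9_000 + ["a", "b", "c"] * 100
partial, totals = salted_counts(keys)

largest_unsalted_group = max(Counter(keys).values())
largest_salted_group = max(partial.values())
print(totals["big"], largest_unsalted_group, largest_salted_group)
```

The final totals match an unsalted count, but the largest group handled in stage 1 is much smaller than the hot key's full row count. In PySpark the same shape would typically be a salt column built with `pyspark.sql.functions.rand`, a `groupBy` on the key and the salt, then a second `groupBy` on the key alone. Salting adds an extra aggregation stage, so it is worth it only when the skew check shows a clearly dominant key.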
Review question
Which access pattern is this table optimized for, and is that still the pattern that matters?