Spark Performance for Data Teams

Partitioning and cache strategy

Avoid over-partitioning, under-partitioning, and accidental cache pressure.

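Partitioning helps when the partition columns match the filters that queries actually use, because Spark can then skip whole partitions instead of scanning them. Caching helps when it avoids recomputing the same expensive result. Both can hurt when applied reflexively.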
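
As a rough illustration of how partitioning can hurt, the sketch below estimates the average amount of data per partition for a few candidate columns. The table size, column names, and distinct-value counts are hypothetical, and the arithmetic assumes rows spread evenly across values, which real data rarely does.

```python
def avg_partition_mb(table_gb: float, distinct_values: int) -> float:
    """Average data per partition, assuming rows spread evenly across values."""
    return table_gb * 1024 / distinct_values


# Hypothetical 500 GB table and hypothetical candidate partition columns.
candidates = [("event_date", 365), ("country", 200), ("customer_id", 5_000_000)]

for column, distinct in candidates:
    print(f"{column}: ~{avg_partition_mb(500, distinct):.2f} MB per partition")
```

A column whose estimate falls well below a megabyte per partition is likely to produce a very large number of tiny files, which is the over-partitioning case to avoid.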

Practical guidance

  • Partition on stable, low-to-medium cardinality columns used in filters.
  • Avoid partitioning on IDs with millions of values; each distinct value becomes its own partition, which tends to produce many tiny files.
  • Cache only when the same intermediate result feeds repeated actions.
  • Unpersist cached data once the workflow no longer needs it, so the memory is released for later stages.
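
One way to make the last two points hard to forget is to tie caching to a scope. The helper below is a minimal sketch, not part of Spark: `cached` is a hypothetical name, and it only assumes that the object passed in exposes `cache()` and `unpersist()`, as a Spark DataFrame does.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of the block, then release it."""
    df.cache()  # lazy: marks df for caching; the first action materializes it
    try:
        yield df
    finally:
        df.unpersist()  # runs even if an action inside the block fails


# Usage sketch, assuming an active SparkSession named `spark`.
# The filter on `event_type` is a hypothetical column and value.
#
# purchases = spark.table("main.silver.events").filter("event_type = 'purchase'")
# with cached(purchases) as df:
#     df.count()                                   # first action fills the cache
#     df.groupBy("customer_id").count().show(10)   # second action reuses it
```

If only one action runs inside the block, the cache adds cost without saving any work, so the helper is worth using only when at least two actions share the same intermediate result.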

Skew smell

When one key dominates a group or join, adding workers may not help, because every row for that key is still processed by a single task. Consider salting, pre-aggregation, or a different join strategy.

# List the ten most frequent customer_id values; one dominant count is a sign of skew.
spark.table("main.silver.events").groupBy("customer_id").count().orderBy("count", ascending=False).show(10)
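
The query above surfaces a dominant key. To see why salting helps once one is found, the following pure-Python sketch simulates hash partitioning without Spark. Everything in it is illustrative: the key names, record counts, partition count, and the CRC32-based `partition_of` function are assumptions standing in for a real partitioner, and the salt is assigned round-robin where a real job would often use a random one.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions: int, salt_buckets: int = 1) -> Counter:
    """Count records per partition, optionally appending a salt to each key."""
    loads = Counter()
    for i, key in enumerate(keys):
        if salt_buckets > 1:
            # Round-robin salt: the hot key becomes salt_buckets distinct keys.
            key = f"{key}#{i % salt_buckets}"
        loads[partition_of(key, num_partitions)] += 1
    return loads


# Hypothetical skewed data: one customer owns 90% of the records.
keys = ["big_customer"] * 9_000 + [f"customer_{i}" for i in range(1_000)]

unsalted = partition_loads(keys, num_partitions=8)
salted = partition_loads(keys, num_partitions=8, salt_buckets=16)

print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:   ", max(salted.values()))
```

Without a salt, all 9,000 records for the hot key land in one partition regardless of how many partitions exist. In a Spark job the same idea usually means aggregating on the key plus the salt first, then aggregating again on the key alone to combine the partial results.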

Review question

Which access pattern is this table optimized for, and is that still the pattern that matters?
