Partitioning and cache strategy

Avoid over-partitioning, under-partitioning, and accidental cache pressure.

Partitioning is useful when it matches common filters. Caching is useful when it avoids repeated expensive work. Both can hurt when applied reflexively.

Practical guidance

- Partition on stable, low-to-medium cardinality columns used in filters.
- Avoid partitioning on IDs with millions of values.
- Cache only when the same intermediate result feeds repeated actions.
- Unpersist cached data when the workflow moves on.

Skew smell

When one key dominates a group or join, adding workers may not help. Consider salting, pre-aggregation, or a different join strategy.

    spark.table("main.silver.events").groupBy("customer_id").count().orderBy("count", ascending=False).show(10)

Review question

Which access pattern is this table optimized for, and is that still the pattern that matters?
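
To make the salting idea from the skew discussion concrete, here is a minimal sketch in plain Python (no Spark session needed). It is a simulation under stated assumptions, not how Spark is implemented: `zlib.crc32` stands in for a hash partitioner, the key names and row counts are invented, and `partition_of`, `partition_loads`, and `two_stage_count` are hypothetical helpers. It shows two things: a dominant key pins one partition no matter how many partitions exist, and appending a small salt spreads that key out while a second aggregation step still recovers the original per-key counts.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a hash partitioner (not Spark's real hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions, salt_buckets=1):
    """Count rows per partition, optionally salting each key first."""
    loads = Counter()
    for i, key in enumerate(keys):
        if salt_buckets > 1:
            # Round-robin salt keeps the example deterministic; a real job
            # would more likely use a random or hash-derived salt.
            key = f"{key}#{i % salt_buckets}"
        loads[partition_of(key, num_partitions)] += 1
    return [loads.get(p, 0) for p in range(num_partitions)]


def two_stage_count(keys, salt_buckets):
    """Aggregate on (key, salt) first, then fold the salt away."""
    partial = Counter()
    for i, key in enumerate(keys):
        partial[(key, i % salt_buckets)] += 1
    final = Counter()
    for (key, _salt), n in partial.items():
        final[key] += n
    return final


# Invented, illustrative data: one customer owns 90% of the rows.
keys = ["big_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]

unsalted = partition_loads(keys, num_partitions=8)
salted = partition_loads(keys, num_partitions=8, salt_buckets=16)

print("heaviest partition without salt:", max(unsalted))
print("heaviest partition with salt:   ", max(salted))
```

Salting trades one extra aggregation pass for a flatter distribution, so it only pays off when a few keys truly dominate. The top-keys query in the skew section is a reasonable way to check that before reaching for it.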

Partitioning practice