In my last post, I promised a Spark UI walkthrough you can actually use under pressure. This is that post.

The goal is simple: open the Spark UI (on EMR) and identify whether your job is suffering from:

  • Skew (a few tasks do all the work)
  • Spill (not enough memory, so Spark spills to disk and gets slow)
  • Too many small tasks (scheduler overhead dominates; lots of tiny partitions)

If you follow the steps below, you can usually diagnose the root cause in under 10 minutes.

[Screenshot: Spark UI on EMR, Stages tab]

How to open Spark UI on EMR (quick)

You typically have two “Spark UIs” on EMR:

  • For running apps: the driver UI (often port 4040 on the driver host; changes if multiple apps run).
  • For completed apps: Spark History Server (commonly port 18080 on the primary node).

Common access patterns:

  • From the EMR console: on newer EMR releases you can open persistent UIs from the cluster UI links.
  • Via SSH tunnel (simple and reliable):
ssh -i /path/to/key.pem -N -L 18080:localhost:18080 hadoop@<EMR_PRIMARY_PUBLIC_DNS>

Then open http://localhost:18080 in your browser.
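
If you need the live driver UI for a running app instead, the same tunnel pattern usually works, assuming the driver is on the node you SSH to (e.g. spark-shell or client mode on the primary); with several apps running, the port increments (4041, 4042, ...).

ssh -i /path/to/key.pem -N -L 4040:localhost:4040 hadoop@<EMR_PRIMARY_PUBLIC_DNS>

Then open http://localhost:4040.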

The 10-minute triage flow (what I do every time)

This is the exact order that gives you signal fast.

Minute 0–1: find the bottleneck stage

  1. Open the application in Spark UI.
  2. Go to Jobs.
  3. Click the slowest job (or the job with the biggest wall-clock time).
  4. Jump to the linked Stages from that job.

What you’re looking for:

  • A stage that accounts for most of the runtime.
  • A stage with huge shuffle read/write.
  • A stage where “some tasks are way slower than others”.

Minute 1–4: diagnose skew (the fastest win)

On the Stages page, click the stage that dominates runtime. Then look for:

  • Task time spread: if you see a handful of tasks taking dramatically longer than the median, that’s skew.
  • Shuffle read skew: a few tasks reading far more shuffle data than the rest.
  • Input records skew: the same pattern, but on the input side.

In the stage details, I focus on “summary metrics” style columns (names vary slightly by Spark version):

  • Duration / Task Time
  • Shuffle Read Size / Records
  • Shuffle Write Size
  • Input Size / Records
  • Spill (Memory/Disk) (if present)

Rule of thumb:

  • If p95 task time is ~10x the median, you almost certainly have skew (a quick way to check this ratio numerically is sketched below).
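
If you'd rather check that ratio numerically than eyeball the task table, the Spark monitoring REST API (served by the same UI / History Server) exposes per-stage task data. A minimal sketch, assuming a History Server on localhost:18080 and an app ID / stage ID copied from the UI; exact field names can vary a bit across Spark versions:

# Sketch: p95 vs. median task duration for one stage, via the monitoring REST API.
# app_id and stage_id are hypothetical -- copy the real ones from the UI.
# Apps with multiple attempts may need an extra /<attempt-id>/ path segment.
import requests
import statistics

base = "http://localhost:18080/api/v1"       # use :4040 for a live application
app_id = "application_1700000000000_0042"    # hypothetical
stage_id = 7                                 # the bottleneck stage

tasks = requests.get(
    f"{base}/applications/{app_id}/stages/{stage_id}/0/taskList",
    params={"length": 100000},
).json()

durations = sorted(t["duration"] for t in tasks if "duration" in t)
median = statistics.median(durations)
p95 = durations[int(len(durations) * 0.95)]
print(f"median={median} ms, p95={p95} ms, ratio={p95 / median:.1f}x")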

Fast fixes (in order; a short PySpark sketch follows the list):

  • Enable AQE + skew join handling (Spark 3.x):
    • spark.sql.adaptive.enabled=true
    • spark.sql.adaptive.skewJoin.enabled=true
  • Broadcast the small side (if it’s truly small and stable).
  • Salt the key (when you have extreme “one key owns the world” skew).
  • Repartition by the join/agg key before the expensive operation.
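
A minimal PySpark sketch of the first two fixes, assuming Spark 3.x; the DataFrame names, join key, and S3 paths are hypothetical placeholders:

# Sketch: enable AQE + skew-join handling, then broadcast the genuinely small side.
# `facts`, `dims`, the join key and the S3 paths are placeholders -- adapt to your job.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("skew-triage")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

facts = spark.read.parquet("s3://my-bucket/facts/")
dims = spark.read.parquet("s3://my-bucket/dims/")

# Explicit broadcast hint: only when the small side truly fits in memory.
joined = facts.join(F.broadcast(dims), "customer_id")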

Minute 4–7: diagnose spill (memory pressure)

Spill shows up as “everything got slow” during wide stages (joins, groupBy, orderBy, distinct). In the stage details look for:

  • Spilled bytes (memory and/or disk spill)
  • High shuffle read + high spill (classic “not enough memory for shuffle” signature)

Also check the Executors tab:

  • Executors with high GC time
  • Executors dying/restarting (OOM, container killed)
  • Consistently high memory usage with little headroom

What spill usually means:

  • Your partitions are too large, or
  • Your executors are too small (memory/overhead), or
  • The query plan is creating a massive shuffle (bad join strategy, huge explode, no pre-aggregation)

Fast fixes (in order; a config sketch follows the list):

  • Fix partition sizing first:
    • Increase effective parallelism (more, smaller partitions) before a big shuffle.
    • Let AQE coalesce later (so you don’t write millions of tiny files).
  • Right-size executors:
    • Increase spark.executor.memory and/or spark.executor.memoryOverhead.
  • Reduce shuffle volume:
    • Filter early, project fewer columns, pre-aggregate before joins.
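
A hedged sketch of those levers as session config; the sizes below are illustrative starting points, not recommendations for your workload:

# Sketch: typical spill levers, set before the session starts. The values are
# placeholders -- validate them against the Executors tab rather than copying.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spill-triage")
    .config("spark.executor.memory", "8g")             # bigger heap per executor
    .config("spark.executor.memoryOverhead", "2g")     # headroom for shuffle / off-heap
    .config("spark.sql.shuffle.partitions", "800")     # more, smaller shuffle partitions
    .config("spark.sql.adaptive.enabled", "true")      # let AQE coalesce them afterwards
    .getOrCreate()
)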

Minute 7–9: diagnose “too many small tasks” (scheduler overhead)

This one is sneaky because the cluster looks “busy” but makes slow progress.

Signs:

  • Stages with tens/hundreds of thousands of tasks
  • Individual tasks are very short (sub-second to a few seconds)
  • Total stage time is large anyway because overhead dominates

What it usually means:

  • You’re reading tons of small files, or
  • You created an extreme number of partitions (often by blindly setting spark.sql.shuffle.partitions too high without AQE), or
  • A previous step wrote a partition layout that exploded file counts.

Fast fixes (a sketch follows the list):

  • Fix small files at the source (compaction): write fewer, larger files (often 128–512 MB is healthy).
  • Coalesce before writing (reduce output files) while keeping enough parallelism:
    • df.coalesce(200).write...
  • Use AQE partition coalescing:
    • spark.sql.adaptive.enabled=true
    • spark.sql.adaptive.coalescePartitions.enabled=true
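
A sketch that combines the write-side and read-side fixes; the paths, the 256m target, and coalesce(200) are placeholders, as in the bullet above:

# Sketch: compact small files on write, and pack many small input files into
# larger read tasks. Paths and sizes are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("small-files-triage")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.files.maxPartitionBytes", "256m")   # larger read partitions over many small files
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/events/")
df.coalesce(200).write.mode("overwrite").parquet("s3://my-bucket/events_compacted/")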

Minute 9–10: confirm with one “sanity metric”

Before you leave the UI, confirm your diagnosis with one metric:

  • Skew: long tail of task times in the bottleneck stage.
  • Spill: spilled bytes + high GC + wide stages.
  • Too many tasks: task count is massive and median task runtime is tiny.

If you can’t confirm it, you don’t have a diagnosis; you have a guess.

The fastest Spark UI pages (and what they’re best at)

  • Jobs: which action/job is slow (high-level entry point).
  • Stages: the truth (shuffle, skew, spill, task count).
  • SQL (if present): which SQL query / operator is blowing up (joins, aggregations).
  • Executors: GC time, executor loss/restarts, memory pressure.
  • Environment: confirm configs (AQE enabled? shuffle partitions? dynamic allocation?).

What I change first (opinionated defaults)

If you want a small set of defaults that pay off on EMR (a combined config sketch follows the list):

  • Turn on AQE and skew join handling (Spark 3.x).
  • Start with a sane executor shape (often 4–6 cores per executor) and enough memory to avoid constant spilling.
  • Treat small files as a production bug, not a “nice to have”.
  • Don’t “fix” performance by randomly turning knobs; use the Spark UI to test one hypothesis at a time.
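
Those defaults as one hedged config block; the executor sizes are placeholders that depend on your instance types and workload:

# Sketch: the defaults above in one place. Executor sizes are placeholders --
# derive them from your instance type, not from this post.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sane-defaults")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.executor.cores", "5")          # 4-6 cores per executor is a common shape
    .config("spark.executor.memory", "16g")       # enough heap to avoid constant spilling
    .config("spark.executor.memoryOverhead", "3g")
    .getOrCreate()
)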

That’s it for today. In my next post I will discuss how a single skewed join key made a 12-minute job run for 2 hours, and how to fix it without resizing the cluster.

Thank you for reading.

Cheers!
Jason
