
Quick apology up front: this post is late. I meant to ship it last week, but a production incident reminded me (again) why lineage matters. So here it is: a practical guide to implementing data lineage from scratch in a Spark on AWS EMR environment.

This is Part 3 of the series. If you are starting from zero, the goal is simple: emit lineage events from Spark jobs, capture them centrally, and make them visible.


What “from scratch” actually means on EMR

Most EMR clusters start with:

  • Spark jobs (batch or streaming)
  • S3 as the data lake
  • Airflow or an internal scheduler
  • Zero lineage beyond “the job logs”

To get to end-to-end lineage, you need three pieces:

1) Lineage emission (your jobs publish events)
2) Lineage collection (events land in a system)
3) Lineage visibility (a UI or API you can query)

The most common open-source path is OpenLineage + Marquez. OpenLineage defines the event spec, and Marquez stores and visualizes it.


Step 1: Stand up a lineage backend (Marquez)

You can run Marquez on ECS, EKS, or a small EC2 box. For a simple start, docker-compose on a small host is fine.

# docker-compose.yml (minimal; see the Marquez repo for the full quickstart file)
services:
  marquez:
    image: marquezproject/marquez:latest
    ports:
      - "5000:5000"
    depends_on:
      - db
  marquez-web:
    image: marquezproject/marquez-web:latest
    ports:
      - "3000:3000"
  db:
    image: postgres:14  # Marquez requires a PostgreSQL backend
    environment:
      - POSTGRES_USER=marquez
      - POSTGRES_PASSWORD=marquez
      - POSTGRES_DB=marquez

Once the API is live, you will have an endpoint to send lineage events to.
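Before wiring up Spark, it is worth posting a minimal OpenLineage event by hand to confirm the endpoint accepts it. The sketch below builds such an event in Python; the job name and producer URL are placeholders, not part of any real deployment.

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage event; job name and producer are placeholders.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "prod", "name": "lineage_smoke_test"},
    "inputs": [],
    "outputs": [],
    "producer": "https://example.com/lineage-smoke-test",
}
payload = json.dumps(event)

# To actually send it (requires a reachable Marquez API):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:5000/api/v1/lineage",
#     data=payload.encode(),
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
```

If Marquez accepts the POST, the smoke-test job appears in the UI under the namespace you chose.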


Step 2: Emit lineage from Spark on EMR

On EMR, you can configure Spark to emit OpenLineage events via the OpenLineage Spark listener.

In your Spark submit or job configuration:

# Pin openlineage-spark to the version matching your Spark/Scala build.
# transport.url is the Marquez base URL; the endpoint defaults to /api/v1/lineage.
spark-submit \
  --packages io.openlineage:openlineage-spark_2.12:1.9.1 \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=http \
  --conf spark.openlineage.transport.url=http://marquez.yourdomain:5000 \
  --conf spark.openlineage.namespace=prod \
  s3://your-bucket/jobs/revenue_daily.py

Now a basic Spark job will emit lineage automatically based on inputs and outputs.


Step 3: Make your jobs explicit about datasets

Spark can infer a lot, but you get better lineage when you make datasets explicit. Use structured paths and consistent namespaces.

# revenue_daily.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("revenue_daily").getOrCreate()

orders = spark.read.parquet("s3://lake/warehouse/orders_clean")
result = orders.groupBy("order_date").agg(
    F.sum("order_total").alias("total_revenue")
)
result.write.mode("overwrite").parquet("s3://lake/warehouse/revenue_daily")

When your paths are consistent, the lineage graph becomes readable and stable.
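One lightweight way to enforce that consistency is a shared path helper used by every job. This is a hypothetical convention, not part of any library; the lake root and zone names are placeholders.

```python
# Hypothetical helper shared across jobs so every dataset path
# follows the same s3://<lake>/<zone>/<name> convention.
LAKE_ROOT = "s3://lake"

def dataset_path(zone: str, name: str) -> str:
    """Build a canonical dataset path, e.g. s3://lake/warehouse/revenue_daily."""
    return f"{LAKE_ROOT}/{zone}/{name}"

# Jobs read and write via the helper instead of hand-typed strings:
print(dataset_path("warehouse", "revenue_daily"))
```

Because every job derives paths from one place, the lineage graph never splits a dataset into near-duplicate nodes with slightly different spellings.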


Step 4: Capture orchestration context

Lineage is more useful when you can connect each run to the scheduler that triggered it. If you are using Airflow, propagate run metadata into the Spark job.

export OPENLINEAGE_PARENT_RUN_ID=""    # fill from the scheduler's run ID
export OPENLINEAGE_PARENT_JOB_NAME=""  # fill with the DAG or task name

This lets you trace lineage from the DAG run to the Spark job to the dataset.
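If you prefer not to rely on environment variables, the OpenLineage Spark integration also accepts parent-run settings as Spark confs (`spark.openlineage.parentJobName` and `spark.openlineage.parentRunId`). A sketch, with a placeholder DAG name and a placeholder variable for the run ID:

```shell
# parentJobName/parentRunId values here are placeholders; in practice
# they would come from the Airflow run context.
spark-submit \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=http \
  --conf spark.openlineage.transport.url=http://marquez.yourdomain:5000 \
  --conf spark.openlineage.namespace=prod \
  --conf spark.openlineage.parentJobName=revenue_dag.submit_revenue_daily \
  --conf spark.openlineage.parentRunId="$PARENT_RUN_ID" \
  s3://your-bucket/jobs/revenue_daily.py
```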


Step 5: Validate that lineage is working

Before you declare success, check three things:

  • You can see datasets, jobs, and runs in the UI.
  • You can trace from a dataset to its upstream and downstream nodes.
  • The graph updates after new deploys.
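The first check can be scripted against the Marquez REST API, which lists jobs per namespace. In this sketch, `MARQUEZ` is an assumed host and `prod` is the namespace configured in Step 2.

```python
import json
import urllib.request

MARQUEZ = "http://localhost:5000"  # assumption: wherever the Marquez API runs

def jobs_url(namespace: str) -> str:
    # Marquez exposes recorded jobs per namespace under this REST path.
    return f"{MARQUEZ}/api/v1/namespaces/{namespace}/jobs"

def list_jobs(namespace: str) -> list:
    """Return the jobs Marquez has recorded in a namespace."""
    with urllib.request.urlopen(jobs_url(namespace)) as resp:
        return json.load(resp)["jobs"]

# Example (requires a reachable Marquez API):
# print([j["name"] for j in list_jobs("prod")])
print(jobs_url("prod"))
```

An empty `jobs` list after a run is a quick, UI-free signal that events are not arriving.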

If the graph is empty, the cause is usually one of:

  • the Spark listener is not installed on the cluster
  • the OpenLineage endpoint is unreachable from EMR
  • the namespace or transport URL is wrong

Common EMR gotchas

EMR is not the problem. The defaults are.

  • No outbound connectivity from private subnets means no events.
  • Ephemeral clusters lose config unless you bake it in or use bootstrap actions.
  • Mixed runtimes (PySpark + Spark SQL + custom Java) may emit partial lineage.

Treat lineage configuration as part of the cluster definition, not a per-job hack.
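One way to bake it in, assuming you use EMR's `spark-defaults` configuration classification, is to set the listener at cluster creation; the URL and namespace below mirror the earlier placeholders.

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
      "spark.openlineage.transport.type": "http",
      "spark.openlineage.transport.url": "http://marquez.yourdomain:5000",
      "spark.openlineage.namespace": "prod"
    }
  }
]
```

Pass this via `aws emr create-cluster --configurations`, and make sure the openlineage-spark jar itself reaches the classpath (bootstrap action or `--packages`), since the classification only sets configuration, not dependencies.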


Final thought

Lineage on EMR is not magic, but it is straightforward if you treat it like infrastructure. Once events flow, the rest of the system is just visibility.

Thanks for reading!

Cheers!
Jason

Jason Rich