I first started caring about end-to-end data lineage during a Friday night incident. A dashboard metric moved, everyone panicked, and the only honest question in the room was: where did that number actually come from?
This post is Part 1 of 4 in a short lineage series. We start with what lineage is, the top tools people use, and a quick comparison to ontology.
What “end-to-end data lineage” actually means
End-to-end data lineage is the trace of a data product from its origin to every downstream use, across systems, transforms, and time. It answers questions like:
- Where did this metric start?
- What transformations touched it?
- What dashboards, models, or APIs depend on it?
- If I change this upstream table, who breaks downstream?
Think of it as the causal chain for data. It is not just table-to-table mapping; it also covers:
- source systems and ingestion
- transformations (SQL, Spark, Python, dbt, etc.)
- orchestration and scheduling
- metadata and governance context
Here is a tiny example of the kind of trace lineage systems build:
-- orders_raw -> orders_clean -> revenue_daily
INSERT INTO revenue_daily (ds, total_revenue)
SELECT
    order_date AS ds,
    SUM(order_total) AS total_revenue
FROM orders_clean
GROUP BY order_date;
That simple query produces a useful lineage edge: orders_clean -> revenue_daily.
A real lineage system would also connect orders_clean back to orders_raw, and then to the source system that produced it.
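To make that chain concrete, here is a minimal sketch of a lineage graph in Python. It only stores (upstream, downstream) edges and walks them in either direction. The name orders_source_system is a placeholder I made up for whatever system produces orders_raw, and the helper names are mine, not any tool's API.

```python
from collections import defaultdict, deque

# Hypothetical edges as (upstream, downstream) pairs.
# "orders_source_system" is a placeholder for the real source of orders_raw.
EDGES = [
    ("orders_source_system", "orders_raw"),
    ("orders_raw", "orders_clean"),
    ("orders_clean", "revenue_daily"),
]


def build_index(edges):
    """Return (parents, children) adjacency maps for a list of edges."""
    parents, children = defaultdict(set), defaultdict(set)
    for upstream, downstream in edges:
        parents[downstream].add(upstream)
        children[upstream].add(downstream)
    return parents, children


def walk(start, neighbors):
    """Breadth-first walk; returns every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


PARENTS, CHILDREN = build_index(EDGES)


def upstream_of(dataset):
    """Everything this dataset was derived from."""
    return walk(dataset, PARENTS)


def downstream_of(dataset):
    """Everything that could break if this dataset changes."""
    return walk(dataset, CHILDREN)
```

Calling upstream_of("revenue_daily") answers "where did this metric start?", and downstream_of("orders_raw") answers "who breaks if I change this table?". Real systems add columns, jobs, and run history on top, but the core is this kind of graph walk.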
The top 5 tools used for data lineage
There are many tools in this space, but these five show up repeatedly in enterprise stacks:
1) OpenLineage (framework)
OpenLineage is an open framework for collecting lineage events across pipelines. It’s commonly used to standardize lineage metadata across many tools and runtimes.
2) Alation
Alation is a data catalog with strong analyst adoption. Lineage is typically integrated with data discovery and governance workflows.
3) DataHub (open source)
DataHub is an open-source metadata platform originally built at LinkedIn. It’s popular with engineering teams who want deep integration and extensibility.
4) Apache Atlas (open source)
Apache Atlas is a governance and metadata system commonly used in Hadoop ecosystems. It’s foundational for lineage in some legacy stacks and government environments.
5) Marquez (open source)
Marquez is an open-source lineage project that tracks jobs, datasets, and runs. It’s often paired with OpenLineage as a reference backend and UI.
Each tool takes a different stance on how lineage is captured:
- Passive: scan metadata and infer lineage from queries
- Active: instrument pipelines to emit lineage events
- Hybrid: combine inferred edges with pipeline-provided edges
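The three capture styles above can be sketched in a few lines of Python. This is a toy under stated assumptions: the passive side uses deliberately naive regexes on a single INSERT ... SELECT, and the event dictionary is a shape I invented for illustration, not the OpenLineage schema or any vendor's format.

```python
import re

SQL = """
INSERT INTO revenue_daily (ds, total_revenue)
SELECT order_date AS ds, SUM(order_total) AS total_revenue
FROM orders_clean
GROUP BY order_date;
"""


def infer_edges(sql):
    """Passive: guess (source, target) edges from one INSERT ... SELECT.

    Naive on purpose: these regexes miss CTEs, subqueries, and quoted names.
    """
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    if not target:
        return set()
    return {(src, target.group(1)) for src in sources}


# Hypothetical event emitted by an instrumented pipeline step.
EVENT = {"job": "clean_orders", "inputs": ["orders_raw"], "outputs": ["orders_clean"]}


def edges_from_event(event):
    """Active: read edges from an event the pipeline chose to emit."""
    return {(i, o) for i in event["inputs"] for o in event["outputs"]}


def merge(passive, active):
    """Hybrid: union of both, remembering how each edge was learned."""
    provenance = {}
    for edge in passive:
        provenance.setdefault(edge, set()).add("inferred")
    for edge in active:
        provenance.setdefault(edge, set()).add("emitted")
    return provenance


LINEAGE = merge(infer_edges(SQL), edges_from_event(EVENT))
```

Keeping the provenance of each edge is the useful part of the hybrid approach: when an inferred edge and an emitted edge disagree, you know which one to question first.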
Data lineage vs ontology: similar goals, different artifacts
This is a common point of confusion, so it helps to state it plainly.
Data lineage describes movement and transformation over time. Ontology describes meaning and relationships in a domain.
The easiest way to feel the difference is to imagine two different moments in your week. On Tuesday, a product manager asks, “What exactly is a churned customer?” That is an ontology question. On Thursday, the finance team asks, “Why did churn jump 12% last week?” That is a lineage question. Both are about trust, but they’re different kinds of trust.
Lineage is the story of how the number was made. Ontology is the story of what the number means. If you only have lineage, you can trace the steps but still disagree on the definition. If you only have ontology, you can agree on the definition but still not know where the number came from.
In practice, ontology often lives in data models and business glossaries. Lineage lives in query history, pipeline metadata, and the lineage graphs your catalog can build. They meet when you can click a metric, see its definition, and then traverse the full upstream chain that produced it.
Lineage is about:
- “This field came from that field.”
- “This model depends on that table.”
- “This report will break if that upstream job changes.”
Ontology is about:
- “A customer has accounts.”
- “An account can have transactions.”
- “A transaction has a category and a merchant.”
Here is a tiny ontology-style example:
Customer:
  hasAccount: Account
Account:
  hasTransaction: Transaction
Transaction:
  hasCategory: Category
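The same structure can be held in code and queried. Below is a rough Python sketch that stores those three relationships in a dictionary and finds how one concept relates to another. The function name relation_path is my own, and real ontologies (RDF, OWL, and so on) carry far more than this.

```python
# The relationships from the example above, as concept -> {relation: concept}.
ONTOLOGY = {
    "Customer": {"hasAccount": "Account"},
    "Account": {"hasTransaction": "Transaction"},
    "Transaction": {"hasCategory": "Category"},
    "Category": {},
}


def relation_path(start, goal, ontology=ONTOLOGY):
    """Return (subject, relation, object) triples linking start to goal.

    Returns an empty list when start == goal and None when no path exists.
    """
    stack = [(start, [])]
    visited = set()
    while stack:
        concept, path = stack.pop()
        if concept == goal:
            return path
        if concept in visited:
            continue
        visited.add(concept)
        for relation, target in ontology.get(concept, {}).items():
            stack.append((target, path + [(concept, relation, target)]))
    return None
```

Asking relation_path("Customer", "Category") returns the chain of meaning between the two concepts. Notice that nothing here says which table or job produced a category; that is the lineage side.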
Similarities:
- Both are metadata-driven.
- Both are about making data understandable and trustworthy.
- Both help with governance and impact analysis.
Differences:
- Lineage is a timeline and dependency graph.
- Ontology is a semantic model of the world.
- Lineage tells you how data moved; ontology tells you what the data means.
You often want both: ontology to make data comprehensible, and lineage to make it safe to change. That combination is what turns a metric from a rumor into a product. When definitions and dependencies live together, a team can agree on what they are measuring and quickly understand how that measurement is produced. It is the difference between debating a number in a meeting and confidently shipping a change that improves it.
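As a rough illustration of definitions and dependencies living together, here is a small Python sketch that joins a glossary entry with an upstream chain. The definition text paraphrases the revenue_daily query from earlier in this post; GLOSSARY, UPSTREAM, MetricCard, and describe are names I made up for the example.

```python
from dataclasses import dataclass

# Ontology side: what the metric means.
GLOSSARY = {"revenue_daily": "Sum of order_total per order_date."}

# Lineage side: direct upstream datasets, from the chain used earlier in the post.
UPSTREAM = {
    "revenue_daily": ["orders_clean"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
}


@dataclass
class MetricCard:
    name: str
    definition: str
    upstream_chain: list


def describe(metric):
    """Combine the definition with every upstream dataset, nearest first."""
    chain, frontier = [], list(UPSTREAM.get(metric, []))
    while frontier:
        node = frontier.pop(0)
        if node not in chain:
            chain.append(node)
            frontier.extend(UPSTREAM.get(node, []))
    definition = GLOSSARY.get(metric, "(no definition recorded)")
    return MetricCard(metric, definition, chain)
```

A call like describe("revenue_daily") gives you one object with both answers: what the number means and what it was built from.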
Final thought
If you can’t trace a metric end-to-end, you don’t really own it. Lineage is the map that makes data change survivable.
Thanks for reading.
Cheers!
Jason