By now, if you have been following this series, you know what lineage is, where teams go wrong, and how to wire it up on EMR. This final post is about the bigger picture: how data lineage, ontology, and data contracts work together as a system instead of three separate projects that your platform team keeps fighting about.
This is Part 4 of 4. If you have not read the earlier posts, the short version is: lineage without meaning is just a graph of tables, and meaning without accountability is just documentation that gets stale.
## The three-layer model for data trust
I think about data trust in three layers, and I have seen teams fail on all three:
| Layer | Question it answers | Artifact |
|---|---|---|
| Ontology | What does this data mean? | Business glossary, entity model |
| Data Contract | What is promised about this data? | Schema, SLA, ownership |
| Data Lineage | How did this data get here? | Pipeline graph, transformation history |
Each layer is useful alone. But the real payoff comes when they reference each other.
Without ontology, lineage graphs show you that acct_rev_adj_v2 feeds rpt_fin_q3, and you still have no idea what either of those means.
Without contracts, lineage shows you the path, but not whether the path delivered what was promised.
Without lineage, contracts and definitions exist in a catalog that no one trusts because no one can prove the data actually got there.
## How ontology anchors the lineage graph
Lineage is most powerful when the nodes in the graph carry business meaning, not just technical identifiers.
A raw lineage edge looks like this:
```
orders_raw -> orders_clean -> revenue_daily
```
That tells you the path. It does not tell you what "revenue" means in this company, or whether orders_raw includes or excludes returns and refunds.
An ontology-anchored lineage edge looks like this:
```yaml
# Lineage node enriched with ontology reference
dataset: revenue_daily
ontologyTerm: "Net Revenue"
definition: "Total order value excluding returns, refunds, and adjustments, net of discounts."
ownedBy: "Finance"
upstreamOf:
  - "dashboard://finance/weekly-revenue"
  - "model://ml/churn-predictor"
```
Now the node is not just a table name. It is a defined business concept with a traceable history.
The practical way to build this: use your data catalog or metadata platform to link lineage nodes to glossary terms. Tools like DataHub, Alation, and Atlan support this directly; you can attach a business term to a dataset so that the lineage graph and the business glossary point to the same thing.
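Stripped of any particular catalog's API, the link amounts to a join between lineage nodes and glossary entries. Here is a minimal sketch of that join in plain Python; the dict structures and field names are illustrative, not taken from DataHub, Alation, or Atlan:

```python
# Sketch: anchoring lineage nodes to glossary terms with a plain dict join.
# In practice a catalog stores and serves this link; the structures here
# are illustrative only.

glossary = {
    "Net Revenue": {
        "definition": "Total order value excluding returns, refunds, "
                      "and adjustments, net of discounts.",
        "owner": "Finance",
    },
}

# Mapping from dataset name to the glossary term attached to it
term_links = {"revenue_daily": "Net Revenue"}

def enrich_node(dataset):
    """Return a lineage node annotated with its glossary entry, if any."""
    term = term_links.get(dataset)
    entry = glossary.get(term, {})
    return {
        "dataset": dataset,
        "ontologyTerm": term,
        "definition": entry.get("definition"),
        "ownedBy": entry.get("owner"),
    }

print(enrich_node("revenue_daily")["ownedBy"])  # Finance
```

The point of the sketch: once the link exists, every tool that renders the lineage graph can show the definition and owner next to the node, for free.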
When a data consumer clicks on revenue_daily in a lineage graph and sees the definition, the owning team, and the upstream chain, data stops being a rumor and starts being a product.
## How data contracts make lineage enforceable
Lineage shows you what happened. Contracts define what should happen.
A data contract is a formal agreement between the team that produces a dataset and the teams that consume it. It specifies:
- Schema: the columns, types, and nullability that are guaranteed
- Freshness: when the data will be ready and how often it updates
- Quality: what assertions are always true (no nulls in customer_id, revenue is always non-negative)
- Ownership: who to call when something breaks
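The schema guarantees can be checked mechanically before a dataset is published. A minimal sketch, with the column list mirroring the contract below; the `violations` helper is my own illustration, not part of any contract library:

```python
# Sketch: checking rows against a contract's schema guarantees before
# publishing. The helper is illustrative, not from a specific library.

schema = [
    {"name": "ds", "type": str, "nullable": False},
    {"name": "total_revenue", "type": float, "nullable": False},
    {"name": "order_count", "type": int, "nullable": False},
]

def violations(rows):
    """Return a human-readable list of schema violations."""
    problems = []
    for i, row in enumerate(rows):
        for col in schema:
            value = row.get(col["name"])
            if value is None:
                if not col["nullable"]:
                    problems.append(f"row {i}: {col['name']} is null")
            elif not isinstance(value, col["type"]):
                problems.append(f"row {i}: {col['name']} has wrong type")
    return problems

rows = [
    {"ds": "2024-06-01", "total_revenue": 125000.0, "order_count": 4312},
    {"ds": "2024-06-02", "total_revenue": None, "order_count": 4100},
]
print(violations(rows))  # ['row 1: total_revenue is null']
```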
Here is a simple contract definition:
```yaml
# data-contract: revenue_daily
schema:
  - name: ds
    type: date
    nullable: false
  - name: total_revenue
    type: decimal(18, 2)
    nullable: false
  - name: order_count
    type: integer
    nullable: false
freshness:
  expectedUpdateHour: 6   # 6 AM UTC
  maxDelayMinutes: 30
quality:
  - assertion: "total_revenue >= 0"
  - assertion: "order_count > 0"
owner: "data-platform"
consumers:
  - "dashboard://finance/weekly-revenue"
  - "model://ml/churn-predictor"
```
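The freshness clause is also checkable. A minimal sketch, assuming one reading of `expectedUpdateHour` plus `maxDelayMinutes`: data that lands after the deadline on its scheduled day is late. Adapt the rule to your scheduler's semantics:

```python
from datetime import datetime, timezone

# Sketch: enforcing the contract's freshness clause. Constants come from
# the contract above; the deadline rule is one plausible interpretation.

EXPECTED_UPDATE_HOUR = 6    # 6 AM UTC
MAX_DELAY_MINUTES = 30

def is_fresh(landed_at):
    """True if the dataset landed within the contracted window."""
    deadline = landed_at.replace(
        hour=EXPECTED_UPDATE_HOUR, minute=0, second=0, microsecond=0
    )
    delay_minutes = (landed_at - deadline).total_seconds() / 60
    return delay_minutes <= MAX_DELAY_MINUTES

on_time = datetime(2024, 6, 1, 6, 20, tzinfo=timezone.utc)
late = datetime(2024, 6, 1, 7, 5, tzinfo=timezone.utc)
print(is_fresh(on_time), is_fresh(late))  # True False
```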
The contract is useful on its own, but it becomes powerful when it is validated as a lineage event.
Here is what that looks like in practice:
```python
import requests

# Metric values computed by the pipeline run (placeholders here)
total_revenue = 125_000.00
order_count = 4_312

# Run contract assertions after the pipeline writes its output
assertions = [
    ("total_revenue >= 0", total_revenue >= 0),
    ("order_count > 0", order_count > 0),
]
passed = [name for name, ok in assertions if ok]
failed = [name for name, ok in assertions if not ok]

# Emit the result as an OpenLineage event to Marquez
lineage_event = {
    "eventType": "COMPLETE",
    "job": {"namespace": "prod", "name": "revenue_daily"},
    "outputs": [{
        "namespace": "warehouse",
        "name": "revenue_daily",
        "facets": {
            "dataQuality": {
                "contractValidated": len(failed) == 0,
                "assertionsPassed": passed,
                "assertionsFailed": failed,
            }
        }
    }]
}

requests.post(
    "http://marquez.yourdomain:5000/api/v1/lineage",
    json=lineage_event,
)
```
Now your lineage graph does not just show that revenue_daily was written.
It shows whether the contract was honored when it was written.
That is the difference between a pipeline that “ran” and a pipeline that delivered.
## What the collaboration actually looks like in practice
The integration is not a single tool. It is a pattern that runs across your stack.
Here is the flow when all three layers work together:
```
Source system
  → ingestion job (emits lineage event)
  → orders_raw    (ontology: "Raw order events, pre-validation")
  → transform job (emits lineage event)
  → orders_clean  (ontology: "Validated orders", contract: schema + quality SLA)
        ↓
    contract validation runs → result recorded in lineage facet
        ↓
  → revenue_daily (ontology: "Net Revenue", contract: freshness + assertions)
        ↓
    contract validation runs → result recorded in lineage facet
        ↓
  → downstream dashboards, models, APIs
```
At every step:
- The ontology answers “what is this?”
- The contract answers “what was promised?”
- The lineage answers “what actually happened, and did it honor the promise?”
When an incident happens, you don’t start a meeting with twenty people and a shared screen. You open the lineage graph, click the failing metric, see the upstream jobs, and look for the first node where the contract validation turned red. Most of the time, that is the root cause.
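That "first red node" walk is just a graph traversal. A sketch with a toy upstream graph and validation flags; in practice both come out of your lineage backend (for example, Marquez queries):

```python
# Sketch: walking upstream from a failing metric to the deepest node whose
# contract validation failed. Graph and flags are toy data.

upstream = {
    "dashboard://finance/weekly-revenue": ["revenue_daily"],
    "revenue_daily": ["orders_clean"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
}

contract_validated = {
    "revenue_daily": False,   # assertions failed on the latest run...
    "orders_clean": False,    # ...because its input was already bad
    "orders_raw": True,       # no violation at the source
}

def first_red_upstream(node):
    """Depth-first: return the deepest upstream node that failed validation."""
    for parent in upstream.get(node, []):
        deeper = first_red_upstream(parent)
        if deeper is not None:
            return deeper
        if contract_validated.get(parent) is False:
            return parent
    return None

print(first_red_upstream("dashboard://finance/weekly-revenue"))  # orders_clean
```

Here orders_clean is the answer: it is the furthest-upstream failure, so everything red below it is a symptom, not the cause.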
## Common mistakes when trying to combine them
I have seen teams try to build all three layers at once and get none of them right. The failure mode is always the same: too much tooling, not enough discipline.
Mistake: owning the tools but not the definitions. You can deploy a data platform, write YAML contracts, and instrument every pipeline, but if no one owns the glossary terms or reviews failed contract assertions, the system decays. Assign humans to maintain definitions the same way you assign humans to maintain pipelines.
Mistake: contracts that describe what the data is instead of what it promises. A contract that just repeats the schema is not a contract. It is a README. A real contract makes assertions that can fail, with a named owner who is responsible when they do.
Mistake: lineage that stops at the warehouse. If your dashboards, ML models, and APIs are not lineage nodes, you cannot trace impact end-to-end. The contract for revenue_daily means nothing if you cannot see that three dashboards and one churn model depend on it.
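End-to-end impact analysis is the downstream mirror of the root-cause walk: if dashboards and models are ordinary lineage nodes, a reachability query finds every consumer of a dataset. A sketch over toy edges; a real graph comes from your lineage backend:

```python
# Sketch: listing everything downstream of a dataset. Dashboards and
# models are plain nodes, so the walk reaches them too. Toy edges only.

downstream = {
    "orders_raw": ["orders_clean"],
    "orders_clean": ["revenue_daily"],
    "revenue_daily": [
        "dashboard://finance/weekly-revenue",
        "model://ml/churn-predictor",
    ],
}

def impacted(node):
    """Every node reachable downstream of `node`."""
    seen = set()
    stack = [node]
    while stack:
        for child in downstream.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

print(sorted(impacted("orders_clean")))
```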
## A practical starting point
If you are starting from zero, here is the order in which I would tackle this:
1. Get lineage flowing first. Empty graphs cannot be anchored to anything. Follow the steps in Part 3 of this series.
2. Pick 5-10 critical datasets and write contracts for them. Start with datasets that cause incidents when they break or go stale. Define schema, freshness, and two or three quality assertions each.
3. Attach ontology terms to those same datasets. Link each dataset to a business glossary definition. Even a short, precise sentence is enough.
4. Wire contract validation into the pipeline as a lineage facet. Now failures are visible in the graph, not just in a separate monitoring tool.
5. Expand from there. Once the pattern is established on ten datasets, it is much easier to scale.
The goal is not to do all three perfectly for every dataset. The goal is to have the pattern working end-to-end for the datasets that matter most. Start there, prove the value, then grow it.
## Final thought
Data lineage, ontology, and data contracts are not competing frameworks. They are three answers to three different questions every data team eventually has to answer.
Lineage tells you how the data arrived. Ontology tells you what it means. Contracts tell you what was promised and whether that promise was kept.
When all three work together, data stops being something you debate in meetings. It becomes something you can trust, change safely, and hand off with confidence.
That is the goal this whole series has been building toward. Thanks for sticking with me through all four parts.
Cheers!
Jason