Article Image
Article Image
read

This is Part 5 of the lineage series, and it is the most opinionated one.

The first four posts talked about what lineage is, where companies break it, and how to wire it up on EMR. This post is about building the full picture: lineage, ontology, and data contracts working together on a real stack.

The stack is:

  • Databricks E2 / Serverless on AWS
  • RDS (PostgreSQL / MySQL) as transactional databases
  • Unity Catalog for lineage and governance inside Databricks
  • Marquez for lineage outside Databricks (RDS origins, pre-ingestion transforms)
  • Protégé for building ontological objects
  • Great Expectations for data contracts

The goal is a system where you can ask three questions at any time:

  1. Where did this data come from? (lineage)
  2. What does this data mean? (ontology)
  3. Is this data correct? (contracts)

The architecture at a high level

Data starts in RDS, moves through ingestion into Databricks, gets transformed, and ends up in dashboards and ML models. Lineage needs to cover the entire chain.

RDS (source of truth)
  ↓  [captured by Marquez]
S3 landing zone
  ↓  [ingested by Databricks]
Bronze tables (Unity Catalog)
  ↓  [validated by Great Expectations]
Silver tables (Unity Catalog lineage)
  ↓
Gold tables → dashboards, ML features

Unity Catalog gives you lineage from bronze onward. Marquez gives you lineage from RDS to the landing zone. Great Expectations validates the handoffs. Protégé defines what the entities actually mean.


Lineage inside Databricks: Unity Catalog

Unity Catalog captures table-level and column-level lineage automatically for operations that run through the metastore. If you are using Delta tables and writing standard Spark SQL or DataFrame operations, lineage is emitted as part of the query lifecycle.

CREATE OR REPLACE TABLE silver.orders_clean AS
SELECT
  order_id,
  customer_id,
  order_date,
  order_total
FROM bronze.orders_raw
WHERE order_total > 0;

Unity Catalog records the edge: bronze.orders_rawsilver.orders_clean. You can view this in the Catalog Explorer or query it through the lineage API.

import requests

response = requests.get(
    f"{workspace_url}/api/2.0/lineage-tracking/table-lineage",
    headers={"Authorization": f"Bearer {token}"},
    json={"table_name": "main.silver.orders_clean"}
)
upstream = response.json().get("upstreams", [])

The critical thing to understand is that Unity Catalog lineage only covers what happens inside Databricks. Data that arrives from RDS through an ingestion pipeline has no lineage until you create it.


Lineage outside Databricks: Marquez for RDS origins

Your transactional data lives in RDS. Lineage needs to start there, not at the bronze table. Marquez handles this gap.

The pattern is: your ingestion job (whether it is a Glue job, a custom Python script, or an Airflow task) emits OpenLineage events to Marquez as it reads from RDS and writes to S3.

from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, InputDataset, OutputDataset

client = OpenLineageClient(url="http://marquez.internal:5000")

client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        run=Run(runId="abc-123"),
        job=Job(namespace="ingestion", name="rds_orders_to_s3"),
        inputs=[InputDataset(namespace="rds", name="prod.public.orders")],
        outputs=[OutputDataset(namespace="s3", name="s3://lake/landing/orders/")]
    )
)

Now Marquez knows: rds:prod.public.orderss3://lake/landing/orders/. Databricks knows: bronze.orders_rawsilver.orders_clean. The full chain is RDS → S3 → bronze → silver → gold, split across two systems but traceable.


Bridging the two lineage systems

The gap between Marquez and Unity Catalog is the ingestion boundary. Marquez knows that rds:prod.public.orders became s3://lake/landing/orders/. Unity Catalog knows that bronze.orders_raw became silver.orders_clean. But neither system knows about the other, and that gap is where trust breaks down.

In the first draft of this post I described two lightweight options. In practice, a production bridge needs more structure than either one alone. Here is how to build it properly.

Layer 1: A shared mapping registry

Start with a mapping file that connects Marquez dataset identifiers to Unity Catalog table names. This file is the single source of truth for the boundary.

# lineage_bridge/mappings.yaml
bridges:
  - marquez_namespace: "s3"
    marquez_dataset: "s3://lake/landing/orders/"
    unity_catalog: "main.bronze.orders_raw"
    s3_path: "s3://lake/landing/orders/"

  - marquez_namespace: "s3"
    marquez_dataset: "s3://lake/landing/customers/"
    unity_catalog: "main.bronze.customers_raw"
    s3_path: "s3://lake/landing/customers/"

  - marquez_namespace: "s3"
    marquez_dataset: "s3://lake/landing/products/"
    unity_catalog: "main.bronze.products_raw"
    s3_path: "s3://lake/landing/products/"

Keep this file in version control alongside your pipeline code. When someone adds a new ingestion path, they add a mapping entry in the same PR. If a mapping is missing, the bridge reports it instead of silently producing an incomplete graph.

Layer 2: A stitching service that queries both APIs

The bridge reads the mapping registry, calls the Marquez API for upstream lineage, calls the Unity Catalog API for downstream lineage, and joins them at the S3 boundary.

import yaml
import requests

def load_mappings(path="lineage_bridge/mappings.yaml"):
    with open(path) as f:
        return yaml.safe_load(f)["bridges"]

def get_marquez_upstream(marquez_url, namespace, dataset):
    resp = requests.get(
        f"{marquez_url}/api/v1/namespaces/{namespace}/datasets/{dataset}",
    )
    resp.raise_for_status()
    return resp.json().get("facets", {}).get("inputDatasets", [])

def get_uc_downstream(workspace_url, token, table_name):
    resp = requests.get(
        f"{workspace_url}/api/2.0/lineage-tracking/table-lineage",
        headers={"Authorization": f"Bearer {token}"},
        json={"table_name": table_name, "direction": "DOWNSTREAM"},
    )
    resp.raise_for_status()
    return resp.json().get("downstreams", [])

def build_full_lineage(marquez_url, workspace_url, token):
    mappings = load_mappings()
    full_graph = []

    for m in mappings:
        upstream = get_marquez_upstream(
            marquez_url, m["marquez_namespace"], m["marquez_dataset"]
        )
        downstream = get_uc_downstream(
            workspace_url, token, m["unity_catalog"]
        )
        full_graph.append({
            "marquez_upstream": upstream,
            "bridge_point": {
                "marquez_dataset": m["marquez_dataset"],
                "unity_table": m["unity_catalog"],
                "s3_path": m["s3_path"],
            },
            "uc_downstream": downstream,
        })

    return full_graph

This is not a real-time system. It runs on a schedule (hourly or daily) and writes its output to a metadata store or a simple JSON file that your catalog UI can read.

Layer 3: Validation and drift detection

The bridge is only useful if it stays accurate. Add a validation step that catches problems before they become invisible lineage gaps.

def validate_bridge(marquez_url, workspace_url, token):
    mappings = load_mappings()
    issues = []

    for m in mappings:
        try:
            get_marquez_upstream(
                marquez_url, m["marquez_namespace"], m["marquez_dataset"]
            )
        except requests.HTTPError:
            issues.append(f"Marquez dataset missing: {m['marquez_dataset']}")

        try:
            get_uc_downstream(workspace_url, token, m["unity_catalog"])
        except requests.HTTPError:
            issues.append(f"Unity Catalog table missing: {m['unity_catalog']}")

    if issues:
        raise RuntimeError(
            f"Lineage bridge validation failed:\n" +
            "\n".join(f"  - {i}" for i in issues)
        )

Run this as a scheduled check or as a CI step when the mapping file changes. When it fails, you know exactly which edge is broken and which system is the source of the problem.

Why this matters

Without a bridge, you have two disconnected graphs. With a convention-only bridge, you have a handshake that breaks silently. With a registry, a stitching service, and validation, you have a system that tells you when it is wrong instead of letting you discover it during an incident.


Ontology: Protégé for building meaning

Protégé is the industry-standard open source ontology editor, maintained by Stanford. It lets you define classes, properties, and relationships in OWL (Web Ontology Language).

Where lineage answers “how did this data get here?”, ontology answers “what does this data represent?”

Most data teams skip ontology entirely and rely on tribal knowledge. That works until someone leaves, a new team onboards, or two dashboards define “active customer” differently. Ontology prevents those problems by making domain meaning explicit, versioned, and machine-readable.

Building your first domain ontology in Protégé

Protégé gives you a visual editor, but the output is a structured OWL file. Start by identifying the core entities in your domain and the relationships between them.

Class: Customer
  hasProperty: customer_id (xsd:string)
  hasProperty: email (xsd:string)
  hasProperty: signup_date (xsd:date)
  hasRelation: placesOrder → Order
  hasRelation: belongsToSegment → CustomerSegment

Class: CustomerSegment
  hasProperty: segment_name (xsd:string)
  hasProperty: definition (xsd:string)

Class: Order
  hasProperty: order_id (xsd:string)
  hasProperty: order_date (xsd:date)
  hasProperty: order_total (xsd:decimal)
  hasProperty: status (xsd:string)
  hasRelation: belongsTo → Customer
  hasRelation: containsItem → OrderItem

Class: OrderItem
  hasProperty: item_id (xsd:string)
  hasProperty: quantity (xsd:integer)
  hasProperty: unit_price (xsd:decimal)
  hasRelation: refersTo → Product

Class: Product
  hasProperty: product_id (xsd:string)
  hasProperty: product_name (xsd:string)
  hasRelation: belongsToCategory → ProductCategory

This is a formal declaration of what a Customer, Order, and Product are in your domain. It is not a schema. It is a semantic model.

A schema says “this table has these columns.” An ontology says “a Customer is an entity that places Orders, and an Order contains OrderItems that refer to Products.”

The practical value shows up in real conversations. When a product manager says “active customer,” the ontology can include a formal definition: a Customer with at least one Order where status = 'completed' in the last 90 days. That definition lives in the ontology, not in someone’s head.

Organizing OWL files for a real project

Protégé exports OWL files in XML (RDF/XML) or Turtle format. Store them in version control alongside your pipeline code and organize by domain boundary.

ontology/
  core/
    customer.owl
    order.owl
    product.owl
  definitions/
    metrics.owl        # formal definitions for KPIs and business metrics
    segments.owl       # customer/product segmentation logic
  mappings/
    customer-to-uc.yaml  # maps ontology classes to Unity Catalog tables
    order-to-uc.yaml
  README.md

The mappings/ directory is what connects the abstract ontology to the physical data. Each mapping file declares which Unity Catalog tables and columns correspond to which ontology classes and properties.

# ontology/mappings/customer-to-uc.yaml
ontology_class: Customer
unity_catalog_table: main.silver.customers
property_mappings:
  customer_id: customer_id
  email: email_address
  signup_date: created_at
relations:
  placesOrder:
    target_class: Order
    join_key: customer_id
    target_table: main.silver.orders_clean

When the schema or the ontology changes, the mapping is the first thing that should be reviewed.

Validating schemas against the ontology

Once you have ontology files and mappings, you can build automated validation. This catches drift: cases where the physical data no longer matches the domain model.

import yaml
from owlready2 import get_ontology

def load_mapping(path):
    with open(path) as f:
        return yaml.safe_load(f)

def validate_ontology_alignment(mapping_path, spark):
    mapping = load_mapping(mapping_path)
    onto = get_ontology(f"file://ontology/core/{mapping['ontology_class'].lower()}.owl").load()
    ontology_class = getattr(onto, mapping["ontology_class"])

    expected = {p.name for p in ontology_class.get_properties()}
    mapped = set(mapping["property_mappings"].keys())
    actual = set(spark.table(mapping["unity_catalog_table"]).columns)

    unmapped_properties = expected - mapped
    missing_columns = set(mapping["property_mappings"].values()) - actual

    issues = []
    if unmapped_properties:
        issues.append(f"Ontology properties without column mapping: {unmapped_properties}")
    if missing_columns:
        issues.append(f"Mapped columns missing from table: {missing_columns}")

    return issues

Run this per-table as part of your pipeline or as a scheduled CI job.

issues = validate_ontology_alignment("ontology/mappings/customer-to-uc.yaml", spark)
if issues:
    for issue in issues:
        print(f"WARN: {issue}")

Using the ontology to generate documentation

An OWL file is machine-readable, which means you can generate human-readable catalog descriptions automatically.

from owlready2 import get_ontology

onto = get_ontology("file://ontology/core/customer.owl").load()

for cls in onto.classes():
    print(f"## {cls.name}")
    if cls.comment:
        print(f"{cls.comment[0]}")
    for prop in cls.get_properties():
        print(f"  - {prop.name}: {prop.range}")
    print()

This output can feed into your data catalog, your wiki, or your onboarding docs. The point is that the ontology is the single source for domain definitions, and everything else is derived from it.

Why most teams skip ontology (and why they regret it)

Ontology feels academic. The tooling has roots in semantic web research, and the terminology (OWL, RDF, triples) does not sound like data engineering.

But the problem ontology solves is concrete: conflicting definitions. Every team that has two dashboards showing different revenue numbers has an ontology problem. Every migration that breaks because “customer” means something different in two systems has an ontology problem.

You do not need a perfect ontology on day one. Start with the five entities that cause the most confusion, define them formally, map them to tables, and validate the mapping. That alone will prevent a class of incidents that lineage and contracts cannot catch.


Data contracts: Great Expectations as the enforcement layer

Data contracts are the runtime promises your data makes to its consumers. Lineage tells you where data came from. Ontology tells you what it means. Contracts tell you whether it is correct right now.

Great Expectations is the contract enforcement tool in this stack. You define expectations (the contract), attach them to datasets, and validate on every pipeline run.

import great_expectations as gx

context = gx.get_context()

datasource = context.data_sources.add_spark("databricks_spark")
asset = datasource.add_dataframe_asset("orders_clean")

batch = asset.add_batch_definition_whole_dataframe("full").get_batch(
    batch_parameters={"dataframe": spark.table("silver.orders_clean").toPandas()}
)

expectations = context.suites.add(
    gx.ExpectationSuite(name="orders_clean_contract")
)
expectations.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
expectations.add_expectation(
    gx.expectations.ExpectColumnValuesToBeGreaterThan(column="order_total", value=0)
)

results = batch.validate(expectations)

If a contract fails, the pipeline should stop before bad data propagates. That is what makes it a contract and not just a test.

Where contracts meet lineage and ontology

The three systems reinforce each other:

  • Ontology says: “An Order must have an order_total that is a positive decimal.”
  • Contract says: “Validate that order_total is not null and greater than zero.”
  • Lineage says: “If this contract fails, here is every downstream asset that would have been affected.”

Without lineage, a contract failure is a local problem. With lineage, a contract failure is a prevented incident with a known blast radius.


Putting it all together

Here is the full lifecycle for a single dataset:

  1. RDS holds the transactional order data.
  2. Ingestion job reads from RDS, writes to S3, emits lineage to Marquez.
  3. Databricks ingests from S3 into bronze tables. Unity Catalog starts tracking lineage.
  4. Great Expectations validates the bronze-to-silver transform against the data contract.
  5. Silver and gold tables are built. Unity Catalog tracks every edge.
  6. Protégé ontology defines what an Order is. Automated checks validate schema alignment.
  7. A dashboard consumes the gold table. Lineage traces from the dashboard back to RDS.

No single tool covers all of this. The discipline is in making the tools talk to each other and treating lineage, ontology, and contracts as first-class infrastructure.


Final thought

If lineage is the map, ontology is the legend, and contracts are the guardrails. You need all three to move fast without breaking trust.

Thanks for reading — and thanks for following the series.

Cheers!
Jason

Blog Logo

Jason Rich


Published

Image

NADEBlg!

Back to Overview