Common Data Lineage Mistakes (and How to Fix Them)

I wrote the first post in this series after a messy incident. I wrote this second post after the third time I saw the same lineage mistakes repeat at a different company. The pattern is always the same: lineage is treated as a dashboard, not as part of the pipeline itself. The result is predictable: broken trust, long incident calls, and slow delivery. Eventually someone has to explain why the data isn’t right, why the pipelines are not catching the issues, and what we are going to do to fix it. And of course, when the fix will be ready for production!

This post focuses on where companies make mistakes in data lineage, how to spot the warning signs early, and what actually fixes them.

Mistake 1: Treating lineage as a catalog-only feature

If your lineage only shows up inside a catalog UI, it is probably out of date. This happens when lineage is inferred only from scans and not emitted by the pipelines that do the work. Treating data lineage as a second- or third-class citizen within your data platform is analogous to relying on flashing red light emojis in Slack to alert you to a problem.

How to recognize it

Lineage graphs look “thin” or incomplete.
The catalog shows a table, but not the job that created it.
Dashboards or ML models are missing from the graph entirely.
Lineage starts at ingress, not at the app that created the data.

How to fix it

Instrument pipelines to emit lineage events. Even a basic event payload creates a durable trail. Build the source applications to emit lineage events as well, so the trail starts at the app that created the data.

{
  "eventType": "START",
  "job": { "namespace": "prod", "name": "revenue_daily" },
  "inputs": [{ "namespace": "warehouse", "name": "orders_clean" }],
  "outputs": [{ "namespace": "warehouse", "name": "revenue_daily" }]
}

Once pipelines emit lineage, the catalog becomes a viewer of the truth instead of its only source.
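
As a minimal sketch of what "emitting" could look like inside a job, the wrapper below writes a START event before the work runs and a COMPLETE or FAIL event afterwards. The function names, the `runId` and `eventTime` fields, the `FAIL` event type, and the JSON-lines sink are all assumptions made for illustration; they are not the API of any particular lineage tool.

```python
import json
import uuid
from datetime import datetime, timezone


def build_lineage_event(event_type, job_name, inputs, outputs, run_id, namespace="prod"):
    """Build a lineage event dict shaped like the payload above.

    runId and eventTime are assumed additions so that events from the
    same run can be correlated and ordered.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
    }


def emit(event, sink):
    """Serialize the event as one JSON line; `sink` is any object with write()."""
    sink.write(json.dumps(event) + "\n")


def run_with_lineage(job_name, inputs, outputs, work, sink):
    """Emit START, run the job, then emit COMPLETE (or FAIL and re-raise)."""
    run_id = str(uuid.uuid4())
    emit(build_lineage_event("START", job_name, inputs, outputs, run_id), sink)
    try:
        result = work()
    except Exception:
        emit(build_lineage_event("FAIL", job_name, inputs, outputs, run_id), sink)
        raise
    emit(build_lineage_event("COMPLETE", job_name, inputs, outputs, run_id), sink)
    return result
```

In practice the sink would likely be a message queue or an HTTP client rather than a file, but the point is the same: the job that does the work is the one that records it.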

Mistake 2: Lineage stops at the warehouse boundary

Many teams have decent lineage inside the warehouse but none for APIs, reverse ETL, dashboards, or ML features. That breaks the end-to-end story and sows doubt about the quality of the data among customers and leadership.

How to recognize it

The lineage graph ends at the final mart or “gold” table.
No edges reach BI dashboards, data apps, or model features.
Product teams still ask, “Which reports will break?”

How to fix it

Connect lineage to downstream tools (BI, ML, and API layers) and treat them as first-class citizens.

# Example of recording a BI asset as a downstream node
asset:
  type: "dashboard"
  name: "Revenue Overview"
  dependsOn:
    - "warehouse.revenue_daily"

If your tooling doesn’t support downstream assets, your lineage will never be end-to-end.
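
Once downstream assets are registered, the question “Which reports will break?” becomes a graph walk. The sketch below assumes the `dependsOn` records have been loaded into a plain dictionary; the asset names other than `warehouse.revenue_daily` and `warehouse.orders_clean` are hypothetical, and `downstream_of` is an illustrative helper, not part of any catalog product.

```python
from collections import deque

# Hypothetical edge list: each asset maps to the assets it depends on.
DEPENDS_ON = {
    "warehouse.revenue_daily": ["warehouse.orders_clean"],
    "dashboard.Revenue Overview": ["warehouse.revenue_daily"],
    "ml.churn_features": ["warehouse.orders_clean"],
}


def downstream_of(asset, depends_on):
    """Return every asset that directly or transitively depends on `asset`."""
    # Invert the dependsOn edges so we can walk from upstream to downstream.
    children = {}
    for node, parents in depends_on.items():
        for parent in parents:
            children.setdefault(parent, []).append(node)

    # Breadth-first walk over the inverted edges.
    seen, queue = set(), deque([asset])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)
```

Calling `downstream_of("warehouse.orders_clean", DEPENDS_ON)` returns the mart, the dashboard, and the feature set. If the dashboard had never been registered, it would simply be missing from the answer, which is exactly the failure this mistake describes.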

Mistake 3: Untracked manual steps

The fastest way to break lineage is a spreadsheet or a one-off script that never gets recorded. Those manual hops become the silent source of “why is this number different?”

How to recognize it

Important metrics come from “the Excel version.”
A pipeline finishes, then someone “tweaks” a file before publishing.
The last mile is a copy/paste step no one owns.

How to fix it

If the data product requires analyst manipulation after the pipeline finishes, fix the product so that it meets the customer’s requirements. Then add that fix to the pipeline so it is recorded as a step in the lineage.

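# Replace manual spreadsheet logic with a small transform step
import pandas as pd

# Input path is illustrative; load the cleaned orders produced upstream
orders_clean = pd.read_parquet("warehouse/orders_clean")

# Flag high-value orders in code so the rule is versioned and traceable
df = orders_clean.assign(is_high_value=lambda x: x.order_total > 250)
df.to_parquet("warehouse/orders_enriched")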

If you cannot automate it, create a documented lineage exception and review it weekly.
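
A documented exception can be as small as a record with an owner and a review date. The sketch below is one way to flag exceptions that have missed their weekly review; the `LineageException` fields and the seven-day default are assumptions chosen to match the advice above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class LineageException:
    """A manual step that cannot be automated yet (all fields are illustrative)."""
    dataset: str
    manual_step: str
    owner: str
    last_reviewed: date


def overdue_exceptions(exceptions, today, max_age_days=7):
    """Return exceptions whose last review is older than the review window."""
    cutoff = today - timedelta(days=max_age_days)
    return [e for e in exceptions if e.last_reviewed < cutoff]
```

Running a check like this on a schedule turns "review it weekly" from an intention into something that can page an owner.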

Mistake 4: Schema drift without lineage awareness

Schemas change. That is normal. The problem is silent drift that breaks downstream logic without warning.

How to recognize it

Downstream failures appear as “null spikes” or subtle metric shifts.
Jobs run successfully, but dashboards disagree.
The change shows up in a diff after the incident, not before.

How to fix it

Capture column-level lineage and validate against expected schemas.

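-- Guardrail: assert expected columns before publishing
-- Assumes expected_columns has (table_name, column_name);
-- a non-zero count should block the publish
SELECT
  COUNT(*) AS missing_cols
FROM expected_columns e
LEFT JOIN information_schema.columns c
  ON e.table_name = c.table_name
 AND e.column_name = c.column_name
WHERE c.column_name IS NULL;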

If you can’t do column-level lineage yet, start by tracking schema versions in metadata.
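
One lightweight way to track schema versions is to hash the column names and types into a fingerprint and store it alongside each run. The sketch below assumes schemas are available as `{column_name: type}` dictionaries; `schema_fingerprint` and `diff_schemas` are hypothetical helper names.

```python
import hashlib
import json


def schema_fingerprint(columns):
    """Hash a {column_name: type} mapping into a short, stable version id."""
    # Sort the items so the fingerprint does not depend on column order.
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def diff_schemas(expected, actual):
    """Describe drift between two {column_name: type} mappings."""
    return {
        "missing": sorted(set(expected) - set(actual)),
        "added": sorted(set(actual) - set(expected)),
        "retyped": sorted(
            c for c in set(expected) & set(actual) if expected[c] != actual[c]
        ),
    }
```

If the fingerprint recorded for the latest run differs from the previous one, `diff_schemas` says what changed, so the change shows up before the incident rather than in a diff afterwards.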

Mistake 5: No owner, no accountability

Lineage that belongs to “the platform team” in the abstract belongs to no one. When no one owns the graph, no one keeps it accurate.

How to recognize it

Lineage breaks and no team feels responsible.
Fixes are done during incidents but never institutionalized.
There is no on-call path for metadata issues.

How to fix it

Make lineage part of pipeline ownership and add it to the definition of done. If a pipeline ships, it should emit lineage.

# Example checklist embedded in a pipeline PR template
lineage:
  events_emitted: true
  downstream_assets_registered: true
  owner: "data-platform"
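
A checklist like this only helps if something enforces it. As a hedged sketch, a CI step could parse the `lineage:` block and fail the build when a box is unchecked. The function below assumes the block has already been parsed into a dictionary; how it gets parsed, and the function name itself, are left as assumptions.

```python
REQUIRED_FLAGS = ("events_emitted", "downstream_assets_registered")


def lineage_checklist_errors(checklist):
    """Return a list of problems with a parsed `lineage:` checklist block."""
    errors = []
    for flag in REQUIRED_FLAGS:
        # Require an explicit boolean true, not just any truthy value.
        if checklist.get(flag) is not True:
            errors.append(f"{flag} must be true")
    owner = checklist.get("owner")
    if not isinstance(owner, str) or not owner.strip():
        errors.append("owner must be a non-empty team or person")
    return errors
```

An empty list means the pull request meets the definition of done; anything else is a reason to block the merge.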

Ownership is not paperwork. It is how lineage stays alive. Data owners are among the most important members of the team, and great care should be taken when deciding who owns the data. The data owner should not be the most junior associate on staff. It should be someone who understands the business, the purpose of the data (e.g., the data model), and how the data flows through the pipelines.

How to recognize a healthy lineage system

You know it is working when:

you can start at a metric and trace upstream in minutes
you can see the job, schedule, and code that created a dataset
you can see downstream impacts before changing a table
the graph stays accurate after deploys
data owners take pride in the quality of the lineage and the fidelity of the data

If the honest answer to any of these is “sometimes,” you have work to do.

Final thought

Lineage is not a dashboard. It is the operating system of data trust. When companies fix the mistakes above, they stop treating data incidents as mysteries and start treating them as engineering. They find the root cause of data issues quickly and implement robust fixes that continue to build trust within the organization.

Thanks for reading.

Cheers!
Jason

Written by

Jason Rich