What Framework Catches the Data Errors That Tests Miss?
Why Do Data Tests Give a False Sense of Security?
Every data team has experienced this: the CI pipeline is green, all tests pass, the PR gets merged — and then someone reports that the numbers are wrong. The data was structurally valid the entire time. The failure was semantic, not structural.
Semantic data failures are errors where the output conforms to every schema constraint and test assertion but is meaningfully wrong for the business. A revenue metric that doubled because of a JOIN fan-out. A customer count that dropped because a filter was too aggressive. A conversion rate that looks plausible but uses the wrong denominator.
These failures pass tests because tests check structure. They miss meaning. And for data teams, the cost of semantic failures — lost trust, wasted investigation time, bad business decisions — is far higher than the cost of a null value in a column.
What Types of Data Errors Do Tests Miss?
Understanding the categories of failure is the first step toward building a framework that catches them. The relationship between passing tests and wrong data comes down to a fundamental mismatch between what tests check and what can go wrong:
| Error Category | Example | Why Tests Miss It |
|---|---|---|
| Logic errors | Revenue calculated with gross instead of net column | Both columns are numeric and non-null, so type checks pass |
| Filter mistakes | WHERE clause excludes valid records | Remaining records still pass uniqueness and type checks |
| JOIN fan-out | One-to-many join produces duplicate rows | Each row is individually valid; row count tests may not exist |
| Upstream drift | Source column’s meaning changes | Format and type remain the same |
| Aggregation bugs | SUM applied where COUNT was intended | Result is a valid number |
| Temporal errors | Date filter off by one day | Dates are valid; range is just wrong |
Each of these produces output that is structurally indistinguishable from correct data. The only way to catch them is to check the actual values, not just the constraints.
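To make the JOIN fan-out case concrete, here is a minimal SQL sketch (the `orders` and `payments` tables and their columns are hypothetical) of output that passes type and not-null checks while double-counting revenue:

```sql
-- Hypothetical one-to-many JOIN: orders has one row per order,
-- payments may have several rows per order (e.g. split payments).
SELECT
    o.order_id,
    o.order_total   -- repeated once per matching payment row
FROM orders o
JOIN payments p
    ON p.order_id = o.order_id;

-- Every output row is individually valid: order_id is non-null and
-- order_total is a well-typed number. Yet SUM(order_total) over this
-- result double-counts any order with more than one payment.
```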
What Does a Production Data Validation Framework Look Like?
A practical framework operates in four layers, each catching a different class of error. The layers are ordered from cheapest (most automated, least context needed) to most expensive (requires human judgment):
Layer 1: Structural Tests
This is what most teams already have — dbt tests, schema checks, null constraints. These are essential but insufficient. They form the base of the pyramid.
What they catch: Missing columns, null values, duplicate keys, invalid references. What they miss: Everything that is structurally valid but logically wrong.
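One way to express this layer is a dbt singular test: a SQL file that fails the build if the query returns any rows. A minimal sketch, assuming a hypothetical `dim_customers` model with a `customer_id` key:

```sql
-- tests/assert_customer_id_unique.sql
-- dbt fails this singular test if the query returns any rows.
SELECT
    customer_id,
    COUNT(*) AS occurrences
FROM {{ ref('dim_customers') }}
GROUP BY customer_id
HAVING COUNT(*) > 1  -- duplicate keys are structural violations
```

The equivalent generic `unique` test declared in YAML is more common in practice; the SQL form just makes the mechanics visible.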
Layer 2: Statistical Validation
Compare column-level statistics between your development environment and production. Profile diffs check whether the distribution of values has changed meaningfully — means, medians, min/max values, null percentages, and cardinality.
What they catch: Subtle shifts in data shape that indicate logic changes. A column that used to average 150 now averages 300? Something changed. What they miss: Changes that are statistically plausible but semantically wrong — for example, a 2% shift in a metric that should have been 0%.
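A profile diff can be a plain SQL query. A sketch, assuming the same model is built into hypothetical `dev` and `prod` schemas (the median is omitted because its syntax varies by warehouse):

```sql
-- Compare column-level statistics for the same model in two environments.
WITH profile AS (
    SELECT
        'dev' AS env,
        AVG(amount)                 AS mean_amount,
        MIN(amount)                 AS min_amount,
        MAX(amount)                 AS max_amount,
        AVG(CASE WHEN amount IS NULL THEN 1.0 ELSE 0.0 END) AS null_pct,
        COUNT(DISTINCT customer_id) AS customer_cardinality
    FROM dev.fct_orders

    UNION ALL

    SELECT
        'prod',
        AVG(amount),
        MIN(amount),
        MAX(amount),
        AVG(CASE WHEN amount IS NULL THEN 1.0 ELSE 0.0 END),
        COUNT(DISTINCT customer_id)
    FROM prod.fct_orders
)
SELECT * FROM profile;
```

A thin wrapper script can compare the two rows and fail CI when any statistic drifts past a threshold.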
Layer 3: Semantic Validation
This is where data diffs come in. Compare actual data values between environments at the row level for critical models. Check whether specific metrics, counts, and aggregations match expected values.
What they catch: The errors that statistical checks miss — specific rows that changed, specific values that shifted, specific models where the output diverged from the baseline. What they miss: Novel business logic that has no production baseline to compare against (new models, new metrics).
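A minimal row-level diff is a FULL OUTER JOIN on the primary key. A sketch, again assuming hypothetical `dev` and `prod` builds of the same model:

```sql
-- Surface rows that were added, removed, or changed between environments.
SELECT
    COALESCE(d.order_id, p.order_id) AS order_id,
    d.revenue AS dev_revenue,
    p.revenue AS prod_revenue
FROM dev.fct_revenue d
FULL OUTER JOIN prod.fct_revenue p
    ON d.order_id = p.order_id
WHERE p.order_id IS NULL                     -- row added in dev
   OR d.order_id IS NULL                     -- row removed in dev
   OR d.revenue IS DISTINCT FROM p.revenue;  -- value changed
-- IS DISTINCT FROM treats NULLs as comparable values; warehouses
-- without it need a COALESCE-based comparison instead.
```

Zero rows returned means the change is a no-op for this model; anything else is a diff the PR author should be able to explain.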
Layer 4: Human Review
Domain experts review high-stakes changes where the cost of being wrong is significant and automated checks cannot fully validate correctness. This is the most expensive layer and should be reserved for changes with the highest impact.
What they catch: Business context violations that no automated system can detect — “this metric should never exceed X” or “these two segments should never overlap.” What they miss: Nothing, in theory — but human attention is scarce and cannot scale to every PR.
How Do You Implement This Framework in Practice?
Implementation does not require building everything at once. Start with the highest-leverage additions to your existing pipeline:
Step 1: Identify Critical Models
Not every model needs all four layers. Map your DAG to identify models that feed customer-facing reports, financial calculations, or ML pipelines. These are your critical models — the ones where being wrong is expensive.
Step 2: Add Automated Diffs to CI
For critical models, add schema diffs, row count diffs, and profile diffs to your CI pipeline. These run automatically on every PR, adding minutes to CI time but saving hours of incident response. This is the core of what dbt CI should check beyond tests.
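Both of the simpler diffs are cheap queries. A sketch with hypothetical schema and table names, assuming the warehouse exposes `information_schema`:

```sql
-- 1. Row count diff between dev and prod builds of a critical model.
SELECT
    (SELECT COUNT(*) FROM dev.fct_revenue)  AS dev_rows,
    (SELECT COUNT(*) FROM prod.fct_revenue) AS prod_rows;

-- 2. Schema diff: columns added, dropped, or retyped in dev.
WITH dev_cols AS (
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'dev' AND table_name = 'fct_revenue'
),
prod_cols AS (
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'prod' AND table_name = 'fct_revenue'
)
SELECT
    COALESCE(d.column_name, p.column_name) AS column_name,
    d.data_type AS dev_type,
    p.data_type AS prod_type
FROM dev_cols d
FULL OUTER JOIN prod_cols p
    ON d.column_name = p.column_name
WHERE d.column_name IS NULL          -- column dropped in dev
   OR p.column_name IS NULL          -- column added in dev
   OR d.data_type <> p.data_type;    -- column retyped
```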
Step 3: Scope Reviews with Lineage
Use column-level lineage to determine which downstream models are affected by a change. This prevents two failure modes: reviewing too much (wasting time on unaffected models) and reviewing too little (missing affected models that aren’t obvious from the code change).
Step 4: Establish Review Triggers
Define clear criteria for when human review is required. Examples:
- Changes to models that feed financial reports
- Changes that affect more than N downstream models
- Changes where profile diffs show more than X% shift in key metrics (see the sketch after this list)
- Changes to models that have caused incidents in the past
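The threshold trigger lends itself to automation. A sketch, assuming hypothetical `dev` and `prod` builds of a revenue model and a 2% threshold; a returned row means the PR gets routed to a domain expert:

```sql
-- Flag the PR for human review when total revenue shifts more than 2%.
WITH dev_total AS (
    SELECT SUM(revenue) AS total FROM dev.fct_revenue
),
prod_total AS (
    SELECT SUM(revenue) AS total FROM prod.fct_revenue
)
SELECT
    d.total AS dev_revenue,
    p.total AS prod_revenue,
    ABS(d.total - p.total) / NULLIF(p.total, 0) AS pct_shift
FROM dev_total d
CROSS JOIN prod_total p
WHERE ABS(d.total - p.total) / NULLIF(p.total, 0) > 0.02;
```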
How Does This Framework Compare to Data Observability?
Data observability tools monitor production data for anomalies after deployment. The validation framework described here operates before deployment — at the PR level. They are complementary, not competing:
| Approach | When | What It Catches | Limitation |
|---|---|---|---|
| dbt tests | Build time | Structural violations | Misses semantic errors |
| Validation framework | PR time | Semantic errors via comparison | Requires production baseline |
| Data observability | Post-deployment | Anomalies in live data | Catches issues after users are affected |
The ideal setup uses all three. But if a team can only add one new layer, PR-level data validation has the highest return — it prevents issues rather than detecting them after the fact.
What Results Should Teams Expect?
Teams that implement systematic pre-merge data validation consistently report fewer production incidents, faster PR reviews (because reviewers can see what changed in the data, not just the code), and improved trust from stakeholders. In the vaidukt case study, a three-person team reported a 70% reduction in data complaints.
The framework is not about replacing tests. Tests remain essential for structural validation. The framework adds the semantic layer that tests cannot provide — checking that the data is not just valid, but correct.
Frequently Asked Questions
- Why do data tests pass but data is still wrong?
- Data tests validate structural properties — schema conformance, null constraints, uniqueness, and referential integrity. They do not validate semantic correctness: whether calculations produce the right results, whether filters include the right records, or whether business logic matches real-world rules. Semantic errors produce structurally valid output that passes all tests while being meaningfully wrong.
- What is a semantic data failure?
- A semantic data failure is a data error where the output is structurally valid (correct types, no nulls, proper relationships) but logically incorrect for the business. Examples include a revenue calculation that uses the wrong column, a filter that silently excludes valid customers, or a join that produces duplicated rows. These failures pass all automated tests because the tests check structure, not meaning.
- How do you validate data correctness before production?
- Validate data correctness by comparing development data against production baselines before merging. Use data diffs to check row counts, column distributions, and value-level changes. Apply impact analysis via column-level lineage to scope which models are affected. Automate these checks in CI on critical models, and involve domain experts for high-stakes changes.
- What should a data validation framework include?
- A comprehensive data validation framework should include four layers: structural tests (schema, nulls, uniqueness), statistical validation (distribution checks, profile diffs), semantic validation (business logic verification through data diffs against production), and human review (domain expert verification of high-impact changes). Each layer catches a different class of errors.