
How Do You Turn Ad-Hoc Data Checks into Automated Institutional Knowledge?

March 31, 2026 · data-quality · workflows · best-practices · dbt

The Knowledge Problem Behind Data Validation

Every data team has the same hidden vulnerability: critical validation knowledge locked inside the heads of senior engineers. One team lead described the problem this way: “I reviewed three PRs in one day. Each touched core metrics but each developer checked different things. One caught a revenue metric issue, another completely missed it. Not because they weren’t good, they just didn’t know what they don’t know.”

This is not a process problem. It is a knowledge problem. The checks a senior engineer runs are not just validation steps. They are artifacts of domain knowledge, learned over time through incidents, bugs, and experience with how the data pipeline behaves.

Why Do Ad-Hoc Checks Dominate Data Validation?

Data developers validate their work during development using whatever tools are at hand: a profile diff here, a custom query there, a value diff on a specific column. Each of these validations is a check that encodes knowledge about what matters for a given change.
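In practice, such a check is often just a few lines of SQL run by hand against the warehouse. A minimal sketch of an ad-hoc row count diff, with illustrative schema names:

```sql
-- Ad-hoc check: compare row counts between the production schema
-- and a dev build of the same model. Schema names are illustrative.
select 'prod' as env, count(*) as row_count
from analytics_prod.customers
union all
select 'dev' as env, count(*) as row_count
from analytics_dev.customers;
```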

The problem is twofold. First, developers know how to validate what they changed but often do not know how to validate the impacts of what they changed. If you modified the CLV calculation in the customers model, you might verify the row counts and schema are stable. But do you know that the marketing team cares specifically about the high/medium/low customer segment distribution? That knowledge lives with the reviewer, not the developer.
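That missing knowledge often amounts to a single query the developer never thinks to run. A sketch of the segment-distribution check, assuming an illustrative value_segment column and schema names:

```sql
-- The check only a knowledgeable reviewer knows to run: compare the
-- high/medium/low segment distribution between prod and dev builds.
select
    coalesce(p.value_segment, d.value_segment) as value_segment,
    p.customer_count as prod_count,
    d.customer_count as dev_count
from (
    select value_segment, count(*) as customer_count
    from analytics_prod.customers
    group by value_segment
) as p
full outer join (
    select value_segment, count(*) as customer_count
    from analytics_dev.customers
    group by value_segment
) as d
    on p.value_segment = d.value_segment;
```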

Second, the validation results stay siloed. Developers described their manual process to us:

  1. Prepare two datasets for stakeholders
  2. Export results to spreadsheets or screenshots
  3. Explain over Slack why the changes are expected
  4. Hope that the next developer touching the same model does similar checks

This workflow produces correct validations that immediately vanish. The same checks get reinvented from scratch by different developers on different PRs, with varying levels of thoroughness.

Why Do Checklists Fail Without Collaboration?

Many teams try to solve the knowledge problem with checklists. The logic is sound: document your validation steps so others can follow them. In practice, checklists fail when there is no collaboration layer.

One developer explained the disconnect: “Why should I add this validation into a checklist? I validate on my local machine. If the stakeholder wants to see the result, I just paste a screenshot into Slack.”

This approach felt good enough until the same developer started asking how to crop screenshots so the results would read more clearly. Data engineers were spending time perfecting screenshots instead of validating data.

| Validation Approach | Knowledge Preserved? | Collaboration Possible? | Automated? |
|---|---|---|---|
| Local checks + screenshots | No (lost when Slack scrolls) | Minimal (async, lossy) | No |
| Local checklists | Partially (list exists, results don't) | No (requires tool access) | No |
| Cloud checklists with shared links | Yes (checks + results preserved) | Yes (stakeholders click a link) | Partially |
| Preset checks | Yes (accumulated at project level) | Yes (run automatically) | Yes |

The breakthrough happens when checklists move from a local solo activity to a shared collaboration surface. When the same teams that ignored checklists locally moved to Recce Cloud, they described checklists as “a really powerful feature” because suddenly a reviewer or stakeholder could click a link and see the full validation context without installing anything.

What Are Preset Checks and How Do They Capture Knowledge?

Preset checks are the mechanism that transforms individual validation knowledge into team-wide institutional knowledge. The concept is straightforward: when you run a check during PR validation and think “every PR that touches this area should verify this,” you mark it as a preset check. That check then runs automatically across every future PR in the project.

For example, during a PR that modified CLV calculations, you might discover that the marketing team relies on a top-k diff of customer_segments.value_segments to understand segment distribution. You mark that check as a preset. From that point forward, any PR that impacts CLV-related models will automatically run that check, regardless of who authored the PR or how much experience they have.
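Concretely, Recce preset checks are defined in a recce.yml file at the project root, so they are versioned alongside the dbt project. A sketch of what this check might look like; the shape follows Recce's documented preset-check format, but treat the exact keys and the k value as assumptions to verify against the current docs:

```yaml
# recce.yml -- preset checks run automatically for every future PR.
checks:
  - name: Top-k diff of customer value segments
    description: >
      Marketing relies on the high/medium/low segment distribution.
      Any PR that impacts CLV logic must show this breakdown is expected.
    type: top_k_diff
    params:
      model: customer_segments
      column_name: value_segments
      k: 10
```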

This matters most in three scenarios: new developers who do not yet know what they don't know, senior engineers who change roles or leave and take their validation knowledge with them, and recurring changes where the same checks would otherwise be reinvented with varying thoroughness on every PR.

How Do Preset Checks Differ from dbt Tests?

Preset checks and dbt tests serve complementary purposes. Understanding when to use which is part of effective data review best practices.

| Aspect | dbt Tests | Preset Checks |
|---|---|---|
| What they validate | Schema constraints, referential integrity, accepted values | Data impact: how values changed, distribution shifts, metric movements |
| When they run | Every build | Every PR (scoped to impacted models) |
| What they catch | Hard failures (nulls, duplicates, broken references) | Soft changes that pass tests but produce wrong data |
| Who defines them | Analytics engineers in YAML | Anyone who identifies an important validation pattern |
| Knowledge source | Schema documentation | Production experience and stakeholder feedback |

dbt tests answer “is the data structurally valid?” Preset checks answer “did the data change in ways the team expects and stakeholders can accept?”
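To make the contrast concrete, here is a standard dbt generic test; it enforces structural validity but says nothing about whether the distribution across those accepted values shifted (model and column names are illustrative):

```yaml
# models/schema.yml -- dbt tests catch hard failures at build time.
models:
  - name: customer_segments
    columns:
      - name: value_segments
        tests:
          - not_null
          - accepted_values:
              values: ['high', 'medium', 'low']
```

This test still passes if most customers silently move from high to low; surfacing that distribution shift is exactly what a preset check like the top-k diff above is for.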

Building a Knowledge Accumulation System

The vision beyond individual preset checks is a system where validation knowledge accumulates over time at the project level. Each PR becomes an opportunity to capture a new check that strengthens the safety net for future changes.

Recce Cloud stores preset checks at the data project level. For a team lead, this is an ever-growing safety net: no matter how experienced a team member is, the PRs they create run through the accumulated wisdom of every past validation the team has deemed important.

The next evolution is context-aware automation: rules like “when model X is impacted, run check Y.” This connects impact radius analysis with preset checks, ensuring that checks run when they are relevant rather than on every PR. For teams building their dbt CI pipeline beyond basic tests, preset checks become the data-aware layer that CI has always been missing.
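As a loose sketch only: such a rule might be expressed as configuration along these lines. This is hypothetical syntax for a feature the article describes as a next evolution, not something Recce ships today:

```yaml
# Hypothetical context-aware rule (illustrative syntax, not a real
# Recce feature): run a preset check only when a model is in the
# PR's impact radius.
rules:
  - when_impacted: customer_segments
    run_checks:
      - Top-k diff of customer value segments
```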

Frequently Asked Questions

What is institutional knowledge in data validation?
Institutional knowledge in data validation refers to the accumulated understanding of which metrics need care, which models are critical, and which validation checks should be performed when specific columns or models are modified. This knowledge typically exists only in the heads of senior data engineers.
What are preset checks in Recce?
Preset checks are validation checks that have been marked as reusable and are automatically run across every PR in a data project. When a team member validates something and decides every future PR should include that check, they mark it as a preset check, creating an automated safety net.
Why do checklists fail for local data validation?
Checklists fail locally because there is no collaboration layer. Data engineers validating on their local machine have no reason to document their checks in a checklist when sharing results means taking screenshots and pasting them into Slack. Checklists only become useful when they enable real-time collaboration with reviewers and stakeholders.
How do you preserve validation knowledge when team members leave?
Preserve validation knowledge by capturing each ad-hoc check as a preset check stored at the project level in a cloud platform. When a senior engineer identifies an important validation, it becomes part of the automated pipeline rather than disappearing when that person changes roles or leaves the team.