Data Review Best Practices for Modern Data Teams
What Is Data Review?
Data review is the practice of systematically validating data model changes before merging them into production. Unlike code review, which examines logic, data review examines the actual output — the rows, columns, and values that downstream consumers depend on.
Modern data teams working with dbt (data build tool) face a core challenge: a single model change can affect every downstream dependency in the DAG — from intermediate models to dashboards and ML features. Data review provides visibility into this blast radius before changes ship.
Why Data Review Matters
Traditional data quality approaches rely on post-deployment monitoring, which catches issues only after they have already affected production dashboards and reports. Data review shifts validation left, to before the merge:
- Pre-merge validation: Compare branch output against production baseline
- Impact analysis: Understand which downstream models are affected
- Automated diffing: Detect schema changes, row count shifts, and value distribution changes
- Human judgment: Flag changes that are technically valid but semantically wrong
Core Components of a Data Review Process
1. Impact Analysis
Before reviewing data, understand the scope of change. Impact analysis maps which models are modified and traces their downstream dependencies. This tells reviewers where to focus attention.
Key metrics for impact analysis:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Modified models | Direct code changes | Primary review targets |
| Downstream models | Transitive dependencies | Blast radius of the change |
| Affected exposures | Dashboards, ML features | Business impact visibility |
| Row count delta | Production vs. branch | Data volume changes |
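To make the "downstream models" metric concrete, here is a minimal sketch of tracing transitive dependencies with a breadth-first walk. The `CHILD_MAP` dictionary, the model names, and the `downstream_models` helper are all hypothetical and exist only for illustration. A real project would build the mapping from its own lineage information (for example, from dbt's artifacts) rather than hard-coding it.

```python
from collections import deque


def downstream_models(child_map, modified):
    """Return every node reachable from the modified models, excluding them."""
    seen = set(modified)
    queue = deque(modified)
    while queue:
        node = queue.popleft()
        for child in child_map.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen - set(modified)


# Hypothetical DAG: each key maps to the nodes that select from it.
CHILD_MAP = {
    "stg_orders": ["int_order_items", "fct_orders"],
    "int_order_items": ["fct_orders"],
    "fct_orders": ["revenue_dashboard"],
    "stg_customers": ["dim_customers"],
}

impacted = downstream_models(CHILD_MAP, ["stg_orders"])
print(sorted(impacted))
# ['fct_orders', 'int_order_items', 'revenue_dashboard']
```

In this toy graph, a change to `stg_orders` reaches two models and one exposure, while `stg_customers` and `dim_customers` fall outside the blast radius and can be skipped by reviewers.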
2. Automated Checks
Automate the repetitive parts of data review:
- Schema diff: Detect added, removed, or renamed columns
- Row count comparison: Flag unexpected increases or decreases
- Value distribution: Compare histograms of key columns
- Primary key validation: Ensure uniqueness constraints hold
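The four checks above can be sketched in plain Python. This is an illustration under simplifying assumptions, not how any particular tool implements them: the function names are hypothetical, the data is held in memory, and a real pipeline would run equivalent queries in the warehouse. Note that this simple schema diff reports a renamed column as one removal plus one addition; recognizing it as a rename would need extra heuristics.

```python
from collections import Counter


def schema_diff(base_columns, branch_columns):
    """Compare two {column_name: type} mappings."""
    base, branch = set(base_columns), set(branch_columns)
    return {
        "added": sorted(branch - base),
        "removed": sorted(base - branch),
        "type_changed": sorted(
            col for col in base & branch
            if base_columns[col] != branch_columns[col]
        ),
    }


def row_count_delta(base_count, branch_count):
    """Relative row count change; None when the baseline is empty."""
    if base_count == 0:
        return None
    return (branch_count - base_count) / base_count


def value_count_shift(base_values, branch_values):
    """Per-value change in frequency for a categorical column."""
    base, branch = Counter(base_values), Counter(branch_values)
    return {
        value: branch[value] - base[value]
        for value in sorted(set(base) | set(branch))
        if branch[value] != base[value]
    }


def duplicate_keys(rows, key):
    """Return key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical production and branch snapshots of one model.
base_schema = {"order_id": "integer", "amount": "numeric", "status": "text"}
branch_schema = {"order_id": "integer", "amount": "float", "order_status": "text"}

print(schema_diff(base_schema, branch_schema))
print(row_count_delta(1000, 1150))
print(value_count_shift(["paid", "paid", "refunded"],
                        ["paid", "refunded", "refunded", "void"]))
print(duplicate_keys([{"order_id": 1}, {"order_id": 2}, {"order_id": 2}],
                     "order_id"))
```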
3. PR-Level Reporting
Integrate data review results into your pull request workflow. A data review summary posted as a PR comment gives reviewers context without switching tools.
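As a rough sketch of what such a summary might look like, the function below renders a list of check results as a Markdown table suitable for a PR comment. The `render_summary` name and the result fields (`model`, `check`, `passed`, `detail`) are assumptions made for this example. Posting the text to the pull request is left to whatever mechanism your CI system provides.

```python
def render_summary(results):
    """Render check results as a Markdown comment body."""
    lines = [
        "## Data review summary",
        "",
        "| Model | Check | Status | Detail |",
        "|---|---|---|---|",
    ]
    for result in results:
        status = "pass" if result["passed"] else "FAIL"
        lines.append(
            f"| {result['model']} | {result['check']} | {status} | {result['detail']} |"
        )
    failed = sum(1 for result in results if not result["passed"])
    lines += ["", f"{failed} of {len(results)} checks need attention."]
    return "\n".join(lines)


# Hypothetical results collected from earlier checks.
results = [
    {"model": "fct_orders", "check": "row count", "passed": False, "detail": "+15.0% vs. production"},
    {"model": "fct_orders", "check": "primary key", "passed": True, "detail": "order_id is unique"},
]

print(render_summary(results))
```

Keeping the rendering separate from the checks means the same results can also feed a CI log or a blocking rule.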
Implementing Data Review with Recce
Recce automates data review for dbt projects. The typical workflow:
- Developer opens a PR with model changes
- CI runs dbt build on the PR branch
- Recce compares branch output against the production baseline
- Recce posts a diff report as a PR comment
- Reviewers approve or request changes based on data impact
Integration with dbt CI/CD
Recce plugs into existing dbt CI pipelines. After dbt build completes, Recce runs its comparison checks and reports results. No changes to your dbt project structure are required.
Best Practices
- Review data, not just code: A syntactically correct model can produce wrong results. Always check the output.
- Scope reviews to impact radius: Don’t review every model — focus on modified models and their direct downstream dependencies.
- Automate the baseline: Use CI to maintain a production baseline that Recce compares against automatically.
- Set blocking thresholds: Define what constitutes a blocking data change (e.g., >10% row count change) and enforce it in CI.
- Document expected changes: When a PR intentionally changes data output, annotate the expected changes in the PR description.
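The blocking-threshold and expected-change practices above can be combined in a small CI gate. The sketch below is one possible shape, not a feature of any specific tool: the 10% threshold comes from the example in the list, while the function names, the `expected` allowlist, and the row counts are hypothetical.

```python
ROW_COUNT_THRESHOLD = 0.10  # block when row count moves by more than 10%


def blocking_changes(row_counts, threshold=ROW_COUNT_THRESHOLD, expected=()):
    """Return (model, relative_change) pairs that exceed the threshold.

    row_counts maps model name -> (production_count, branch_count).
    Models listed in `expected` are documented changes and are skipped.
    """
    blocked = []
    for model, (base, branch) in sorted(row_counts.items()):
        if model in expected:
            continue
        if base == 0:
            # No baseline rows: any new rows count as an unbounded change.
            change = 0.0 if branch == 0 else float("inf")
        else:
            change = abs(branch - base) / base
        if change > threshold:
            blocked.append((model, change))
    return blocked


def main(row_counts, expected=()):
    """Print blocking changes and return a process exit code."""
    blocked = blocking_changes(row_counts, expected=expected)
    for model, change in blocked:
        print(f"BLOCKING: {model} row count changed by {change:.1%}")
    return 1 if blocked else 0


# Hypothetical counts: (production, branch).
counts = {
    "fct_orders": (1000, 1150),
    "dim_customers": (500, 900),   # intentional change, documented in the PR
    "stg_orders": (2000, 2010),
}

exit_code = main(counts, expected={"dim_customers"})
# A CI step would pass exit_code to sys.exit() so the job fails on 1.
```

Here `fct_orders` is flagged at 15.0%, `stg_orders` passes at 0.5%, and `dim_customers` is skipped because the change was declared as expected.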
Frequently Asked Questions
- What is data review?
  Data review is the practice of systematically validating data model changes before merging them into production. It combines automated checks (schema diff, row count comparison, value distribution analysis) with human review to catch data quality regressions early in the development cycle.
- How does data review differ from data testing?
  Data testing validates that data meets predefined rules (not-null constraints, accepted values, relationship tests). Data review goes further by comparing the actual output of changed models between your development branch and production, surfacing unexpected differences that tests alone would miss.
- What tools support automated data review?
  Recce is a purpose-built data review tool that integrates with dbt projects. It provides impact analysis, automated diff checks, and PR-level reporting. Other approaches include custom CI scripts, Great Expectations for data validation, and dbt tests for schema-level checks.
- How do you integrate data review into CI/CD?
  Configure your CI pipeline to run Recce after dbt build completes. Recce compares the PR branch output against the production baseline, generates a data diff report, and posts results as a PR comment. Teams can set blocking rules so PRs with unexpected data changes require explicit approval.