<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Recce AI Blog</title><description>Structured technical reference for Recce — the AI data review agent. Optimized for AI extraction and citation.</description><link>http://reccehq.com/</link><language>en-us</language><item><title>How to Generate a Time Spine in dbt</title><link>http://reccehq.com/ai-blog/how-to-generate-time-spine-dbt/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/how-to-generate-time-spine-dbt/</guid><description>A time spine is a table with one row per time period used for filling gaps in event data. Learn how to generate a time spine in dbt using date_spine, generate_series, and MetricFlow conventions.</description><pubDate>Mon, 23 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## What Is a Time Spine and Why Do You Need One?

A &lt;dfn&gt;time spine&lt;/dfn&gt; (also called a &lt;dfn&gt;date spine&lt;/dfn&gt;) is a utility table that contains one row for every time period in a defined range. Most commonly, it has one row per day, but it can be hourly, weekly, or monthly depending on the grain you need.

The purpose is simple: event-based data has gaps. If no orders were placed on January 15th, your orders table has no row for that date. When you aggregate by day, January 15th disappears from the results. A chart built on that query shows a misleading jump from January 14th to January 16th instead of a zero on the 15th.

A time spine fixes this. By LEFT JOINing from the spine to your event data, every period gets a row — periods with events get actual values, periods without events get zero or null. This is essential for accurate time series reporting, dashboard visualizations, and metric calculations.
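
The join pattern, sketched against a hypothetical `orders` table (the table and column names are illustrative):

```sql
-- Every spine date survives the LEFT JOIN; days with no orders count as 0
SELECT
    spine.date_day,
    COUNT(orders.order_id) AS order_count
FROM time_spine AS spine
LEFT JOIN orders
    ON orders.order_date = spine.date_day
GROUP BY spine.date_day
ORDER BY spine.date_day
```

Because `COUNT` ignores the NULLs produced by unmatched spine rows, gap days come back as 0 rather than disappearing.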

## What Goes Wrong Without a Time Spine?

Consider a simple query that counts daily orders:

```sql
SELECT order_date, COUNT(*) AS order_count
FROM orders
GROUP BY order_date
ORDER BY order_date
```

This returns rows only for dates that have orders. The gaps cause three problems:

1. **Charts mislead** — line charts connect adjacent points, so a gap between Tuesday and Thursday looks like Wednesday had the same trend, when in reality there were zero orders.
2. **Aggregations break** — a rolling 7-day average that skips days produces incorrect results because it averages over fewer data points than expected.
3. **Metrics layer errors** — &lt;dfn&gt;MetricFlow&lt;/dfn&gt; (dbt&apos;s semantic layer) requires a time spine to correctly calculate cumulative and derived metrics. Without one, metric queries fail or return incorrect values.

## How Do You Generate a Time Spine with dbt-utils?

The most portable approach uses the &lt;dfn&gt;date_spine&lt;/dfn&gt; macro from &lt;dfn&gt;dbt-utils&lt;/dfn&gt;. Create a model in your project:

```sql
-- models/utilities/time_spine.sql
{{ config(materialized=&apos;table&apos;) }}

WITH spine AS (
    {{ dbt_utils.date_spine(
        datepart=&quot;day&quot;,
        start_date=&quot;cast(&apos;2020-01-01&apos; as date)&quot;,
        end_date=&quot;cast(&apos;2026-12-31&apos; as date)&quot;
    ) }}
)

SELECT
    date_day
FROM spine
```

This generates one row per day from January 1, 2020 up to, but not including, December 31, 2026: the macro treats `end_date` as exclusive, so extend the end date by one day if you need it in the spine. The `date_spine` macro works across all dbt-supported warehouses because it uses cross-joins or recursive logic internally rather than database-specific functions.

## Can You Use generate_series Instead?

For databases that support it, the native `generate_series` function is more performant:

```sql
-- Postgres / DuckDB
SELECT generate_series(
    &apos;2020-01-01&apos;::date,
    &apos;2026-12-31&apos;::date,
    &apos;1 day&apos;::interval
)::date AS date_day
```

&lt;dfn&gt;generate_series&lt;/dfn&gt; is a SQL function that produces a set of values from a start to an end at a specified interval. It runs natively in the database engine, making it faster than the macro approach. The tradeoff is portability — this syntax does not work on Snowflake, BigQuery, or Databricks.

## What About Databases Without Native Series Generation?

For databases that lack both `generate_series` and efficient cross-join support, a &lt;dfn&gt;recursive CTE&lt;/dfn&gt; approach works:

```sql
-- DATEADD syntax as on Snowflake; swap in your dialect&apos;s date arithmetic
WITH RECURSIVE date_series AS (
    SELECT CAST(&apos;2020-01-01&apos; AS DATE) AS date_day
    UNION ALL
    SELECT DATEADD(day, 1, date_day)
    FROM date_series
    WHERE date_day &lt; &apos;2026-12-31&apos;
)

SELECT date_day FROM date_series
```

This starts with a seed date and adds one day at a time until it reaches the end date. It is the least performant approach for large ranges, and two portability caveats apply: the date-increment function varies by dialect (`DATEADD` on Snowflake and SQL Server, `+ INTERVAL &apos;1 day&apos;` on Postgres), and some engines cap recursion depth by default (SQL Server&apos;s `MAXRECURSION` defaults to 100), so long ranges may require raising the limit. Recursive CTEs themselves, though, are supported by virtually every modern SQL database.

## Which Approach Works Best for Each Warehouse?

| Warehouse      | Recommended Method                              | Notes                                                                         |
| -------------- | ----------------------------------------------- | ----------------------------------------------------------------------------- |
| **Snowflake**  | `dbt_utils.date_spine`                          | No native `generate_series`; macro uses `GENERATOR` table function internally |
| **BigQuery**   | `GENERATE_DATE_ARRAY` or `dbt_utils.date_spine` | BigQuery has `GENERATE_DATE_ARRAY` which is native and performant             |
| **Postgres**   | `generate_series`                               | Native function, most performant option                                       |
| **DuckDB**     | `generate_series`                               | Same syntax as Postgres                                                       |
| **Redshift**   | `generate_series` or `dbt_utils.date_spine`     | Redshift supports `generate_series` but with some limitations on large ranges |
| **Databricks** | `SEQUENCE` or `dbt_utils.date_spine`            | `SEQUENCE` function generates arrays that can be exploded into rows           |

If your project runs on a single warehouse, use the native approach for that platform. If you need portability across warehouses, `dbt_utils.date_spine` is the safest choice.
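
As an example of the native route, BigQuery&apos;s `GENERATE_DATE_ARRAY` builds an array of dates that `UNNEST` expands into rows:

```sql
-- BigQuery: one row per day, inclusive of both endpoints
SELECT date_day
FROM UNNEST(GENERATE_DATE_ARRAY(&apos;2020-01-01&apos;, &apos;2026-12-31&apos;, INTERVAL 1 DAY)) AS date_day
ORDER BY date_day
```

Note that unlike `dbt_utils.date_spine`, this includes the end date, so row counts differ by one between the two methods.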

## What Does MetricFlow Require for the Time Spine?

dbt&apos;s &lt;dfn&gt;semantic layer&lt;/dfn&gt; (powered by MetricFlow) requires a time spine table to calculate cumulative metrics, derived metrics, and gap-free time series (filling periods that have no events). The required format is specific:

```sql
-- models/utilities/metricflow_time_spine.sql
{{ config(materialized=&apos;table&apos;) }}

WITH spine AS (
    {{ dbt_utils.date_spine(
        datepart=&quot;day&quot;,
        start_date=&quot;cast(&apos;2020-01-01&apos; as date)&quot;,
        end_date=&quot;cast(&apos;2026-12-31&apos; as date)&quot;
    ) }}
)

SELECT
    date_day,
    DATE_TRUNC(&apos;week&apos;, date_day) AS date_week,
    DATE_TRUNC(&apos;month&apos;, date_day) AS date_month,
    DATE_TRUNC(&apos;quarter&apos;, date_day) AS date_quarter,
    DATE_TRUNC(&apos;year&apos;, date_day) AS date_year
FROM spine
```

MetricFlow looks for a model registered as a time spine in your project&apos;s YAML configuration. The model must have `date_day` as the primary time column (on older dbt versions, the model itself had to be named `metricflow_time_spine`). Adding truncated columns for other grains allows MetricFlow to aggregate metrics at weekly, monthly, or quarterly levels without additional models.
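
A sketch of the registration in model YAML properties (dbt 1.9+ style; verify the exact schema against your dbt version&apos;s documentation):

```yaml
# models/utilities/_utilities.yml (illustrative path)
models:
  - name: metricflow_time_spine
    time_spine:
      standard_granularity_column: date_day
    columns:
      - name: date_day
        granularity: day
```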

## Where Should the Time Spine Live in Your Project?

Most teams place the time spine in a `utilities` or `staging` directory:

```
models/
  staging/
    ...
  utilities/
    time_spine.sql
  intermediate/
    ...
  marts/
    ...
```

The time spine is not a staging model (it does not clean raw data) and not a mart (it is not business-facing). It is infrastructure — a utility that other models reference. Placing it in a dedicated `utilities` folder makes its purpose clear in your [DAG](/ai-blog/what-is-dbt-dag-lineage/).

**Grain selection**: daily is the most common grain because it balances storage and utility. Hourly spines are large but necessary for real-time dashboards. Weekly or monthly spines are rare as standalone models — it is simpler to truncate a daily spine.

**Range management**: hardcoded start and end dates are simple but require manual updates. Dynamic ranges using `MIN(date)` from source data and `CURRENT_DATE` are more maintainable but add a dependency on source freshness.
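
Since the macro&apos;s `start_date` and `end_date` arguments accept arbitrary SQL expressions, a dynamic end can be expressed inline. A sketch, where the one-year lookahead is an arbitrary choice:

```sql
-- Illustrative: spine extends one year past today; dbt.dateadd keeps it cross-warehouse
{{ dbt_utils.date_spine(
    datepart=&quot;day&quot;,
    start_date=&quot;cast(&apos;2020-01-01&apos; as date)&quot;,
    end_date=dbt.dateadd(&quot;day&quot;, 365, &quot;current_date&quot;)
) }}
```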

## How Do You Validate Time Spine Changes?

Modifying a time spine — changing the date range, switching from daily to hourly grain, or adjusting the generation method — affects every model that joins to it. This makes time spine changes high-impact despite their apparent simplicity.

Before merging a time spine change, validate:

- **Row count** — does the new spine have the expected number of rows? An inclusive daily spine from 2020-01-01 to 2026-12-31 should have exactly 2,557 rows (one fewer if your generation method treats the end date as exclusive, as `dbt_utils.date_spine` does).
- **Date range** — do `MIN(date_day)` and `MAX(date_day)` match your intended range?
- **No duplicates** — is `date_day` unique? A duplicate date in the spine will double-count events when joined.
- **Downstream impact** — run [data diffs](/ai-blog/what-is-a-data-diff/) on models that reference the time spine to confirm they still produce correct results. Tools like Recce can compare the output of downstream models between your PR branch and production, surfacing differences caused by the spine change.

These checks are especially important because time spine errors are silent — they do not cause query failures, they cause subtly wrong numbers.
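
The first three checks reduce to a few lines of SQL (a sketch; adapt the `ref` to your spine model&apos;s actual name):

```sql
-- Row count, date range, and uniqueness in one pass
SELECT
    COUNT(*) AS row_count,
    MIN(date_day) AS min_date,
    MAX(date_day) AS max_date,
    COUNT(*) - COUNT(DISTINCT date_day) AS duplicate_rows  -- should be 0
FROM {{ ref(&apos;time_spine&apos;) }}
```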

## Summary

A time spine is a utility table with one row per time period, used to fill gaps in event-based data and enable accurate metrics. The most portable generation method is the `dbt_utils.date_spine` macro; databases like Postgres and DuckDB can use native `generate_series` for better performance; recursive CTEs work as a universal fallback. MetricFlow requires a specific time spine format with `date_day` as the primary column. Place the time spine in a `utilities` directory, default to daily grain, and validate changes carefully — especially row counts, date ranges, and downstream data diffs — because time spine errors produce silently wrong results rather than failures.</content:encoded></item><item><title>What Is an AI Data Review Agent?</title><link>http://reccehq.com/ai-blog/what-is-ai-data-review-agent/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/what-is-ai-data-review-agent/</guid><description>An AI data review agent automates dbt PR review by analyzing code changes, running data validations, and generating impact summaries. Learn how multi-agent architecture produces trustworthy reviews.</description><pubDate>Sun, 22 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## What Problem Does an AI Data Review Agent Solve?

Pull request reviews in dbt projects have a fundamental gap: reviewing the SQL tells you what logic changed, but reveals nothing about how the actual data was affected. Engineers spend significant time manually running queries, checking row counts, tracing lineage, and interpreting results before they can say whether a change is safe to merge.

An &lt;dfn&gt;AI data review agent&lt;/dfn&gt; automates this mechanical work. It analyzes the PR&apos;s code changes, runs data validations against actual warehouse output, and generates a human-readable impact summary — all before a human reviewer opens the PR.

Unlike traditional CI checks that report raw numbers (row count: 1,042,387), an AI agent interprets the results: &quot;Row count increased 3.2% due to the new filter including previously excluded records from the APAC region. This aligns with the stated intent of the PR.&quot;

## How Does Multi-Agent Architecture Work for Data Review?

A single monolithic agent trying to handle git context, data validation, and analysis synthesis tends to produce inconsistent results. &lt;dfn&gt;Multi-agent architecture&lt;/dfn&gt; solves this by delegating specific tasks to specialized subagents, each with a narrow scope and focused toolset.

A typical data review agent system uses an orchestrator pattern:

| Agent                    | Responsibility                                                                    | Tools Available          |
| ------------------------ | --------------------------------------------------------------------------------- | ------------------------ |
| PR Analysis Orchestrator | Coordinates the review workflow, delegates to subagents, synthesizes final report | Task delegation only     |
| Git Context Agent        | Extracts PR metadata, changed files, commit messages, and modified model names    | Git and GitHub API tools |
| Recce Analysis Agent     | Runs data validations — lineage diff, schema diff, row count diff, profile diff   | MCP tools (Recce server) |
| Synthesis Agent          | Combines raw data from other agents into a structured, human-readable summary     | Text generation only     |

Each subagent runs in an isolated context. The git context agent cannot access warehouse data. The Recce analysis agent cannot modify files. This &lt;dfn&gt;tool constraint&lt;/dfn&gt; principle — restricting each agent to only the tools it needs — dramatically improves reliability.

## What Role Does MCP Play in AI Data Review?

&lt;dfn&gt;MCP&lt;/dfn&gt; (Model Context Protocol) provides a standardized interface for AI agents to invoke external tools. In data review, an MCP server exposes Recce&apos;s validation capabilities as callable tools:

- `lineage_diff` — compare DAG structure between environments
- `schema_diff` — detect column additions, removals, and type changes
- `row_count_diff` — compare row counts across modified models
- `profile_diff` — compare column-level statistics
- `query_diff` — run arbitrary SQL comparisons

This matters because it lets agents work with real data rather than guessing at impact from code alone. An agent that can only read SQL might guess that a filter change reduces row counts. An agent with MCP access can confirm the row count dropped 12.4% and report which downstream models are affected.
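
Under the hood, an MCP tool invocation is a JSON-RPC 2.0 message. A sketch of what a call to the `row_count_diff` tool might look like (the argument shape here is an assumption for illustration, not Recce&apos;s documented schema):

```json
{
  &quot;jsonrpc&quot;: &quot;2.0&quot;,
  &quot;id&quot;: 42,
  &quot;method&quot;: &quot;tools/call&quot;,
  &quot;params&quot;: {
    &quot;name&quot;: &quot;row_count_diff&quot;,
    &quot;arguments&quot;: { &quot;node_names&quot;: [&quot;dim_customers&quot;, &quot;fct_orders&quot;] }
  }
}
```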

## What Design Principles Make AI Agents Reliable?

Building a reliable AI data review agent requires more than connecting an LLM to tools. Several design principles distinguish agents that produce trustworthy output from those that hallucinate or miss critical issues.

**Specialized agents over general-purpose.** Narrow scope produces consistent output. An agent that only extracts git context will do that well every time. An agent that tries to extract context, run validations, and write analysis in one pass will cut corners under token pressure.

**Show your work.** Require agents to output raw data before generating diagrams or summaries. If a lineage diagram is generated from raw edge data that the orchestrator can verify, hallucinated edges are caught. If the diagram is generated directly, there&apos;s no ground truth to check against.

**Negative constraints.** Explicit &quot;do NOT&quot; instructions are surprisingly effective. Telling an agent &quot;do NOT infer column relationships from naming patterns — only report relationships confirmed by lineage_diff output&quot; prevents a common hallucination mode.

**Required output structure.** Mark critical sections of the output format with `[REQUIRED]` markers. Agents are more likely to include sections that are explicitly labeled as non-optional than sections that are merely listed in a template.

## What Makes a Good AI Data Review?

Not all AI-generated reviews are equally useful. Here is a framework for evaluating whether an AI data review agent is producing trustworthy output:

| Quality Criterion                    | Good Review                                                                                       | Poor Review                                        |
| ------------------------------------ | ------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| Grounded in data                     | Cites specific numbers from actual diffs (e.g., &quot;row count increased from 50,412 to 51,823&quot;)      | Makes vague claims (&quot;row counts may have changed&quot;) |
| Scoped to the change                 | Focuses on models modified in the PR and their direct downstream dependencies                     | Reports on the entire DAG regardless of relevance  |
| Distinguishes intent from regression | Identifies which changes align with the PR description and which are unexpected                   | Treats all differences as equally noteworthy       |
| Actionable next steps                | Suggests specific checks a reviewer should run (&quot;verify the APAC region filter in dim_customers&quot;) | Ends with generic advice (&quot;review carefully&quot;)      |
| Transparent about limitations        | States what it could not check (&quot;no primary key available for value diff on this model&quot;)          | Silently skips validations without noting the gap  |
| Reproducible                         | Another run with the same inputs produces the same conclusions                                    | Output varies significantly between runs           |

### AI-Assisted vs. Fully Manual Review

The practical impact of an AI data review agent becomes clear when comparing workflows:

| Aspect                | Fully Manual Review                                   | AI-Assisted Review                                                       |
| --------------------- | ----------------------------------------------------- | ------------------------------------------------------------------------ |
| **Trigger**           | Reviewer opens the PR and starts from scratch         | Agent runs automatically on PR open                                      |
| **Context gathering** | Read code diff, manually trace downstream models      | Agent extracts PR metadata, changed files, and impact scope              |
| **Data validation**   | Open a SQL editor, write and run comparison queries   | Agent runs lineage diffs, schema checks, and profile comparisons via MCP |
| **Interpretation**    | Reviewer interprets raw query results                 | Agent generates a structured summary with cited numbers                  |
| **Human focus**       | Everything — from mechanical checks to judgment calls | Business context, edge cases, and final approval                         |
| **Typical time**      | 30–90 minutes per complex PR                          | 5–15 minutes of human attention per complex PR                           |

The agent does not remove the human from the loop. It removes the mechanical work that precedes human judgment. The reviewer still decides whether the change is correct — they just start from an informed position rather than a blank screen.

## How Does This Connect to Data Review Best Practices?

An AI data review agent implements many [data review best practices](/ai-blog/data-review-best-practices/) automatically: scoping the review to the impact radius, running structural checks before drilling into values, and documenting what was checked. The agent&apos;s output becomes the first draft of the review checklist that teams refine and approve.

The key insight is that AI agents work best when they have access to real validation data — not just code. Combining [data diffs](/ai-blog/what-is-a-data-diff/) with AI interpretation bridges the gap between raw numbers and actionable review.

## Summary

An AI data review agent automates the mechanical work of dbt PR review: extracting context, running data validations via MCP, and generating structured impact summaries. Multi-agent architecture with specialized subagents produces more reliable output than monolithic approaches. Design principles like tool constraints, required output structure, and negative constraints improve consistency. The result is not a replacement for human review but an informed starting point — reducing review time from hours to minutes while keeping human judgment in the loop for business context and edge cases.</content:encoded></item><item><title>What Is the dbt DAG? A Guide to Lineage and Dependencies</title><link>http://reccehq.com/ai-blog/what-is-dbt-dag-lineage/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/what-is-dbt-dag-lineage/</guid><description>The dbt DAG is a directed acyclic graph that maps dependencies between your data models. Learn how to read the DAG, use lineage for impact analysis, and understand the difference between static and diff-aware lineage views.</description><pubDate>Sat, 21 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## What Is a Directed Acyclic Graph?

A &lt;dfn&gt;directed acyclic graph&lt;/dfn&gt; (DAG) is a structure made up of nodes and edges, where each edge has a direction and no path leads back to its starting node. In plain terms: things depend on other things, and those dependencies never form a loop.

In the context of dbt, each &lt;dfn&gt;node&lt;/dfn&gt; is a model, source, seed, snapshot, or exposure. Each &lt;dfn&gt;edge&lt;/dfn&gt; is a dependency created by a `ref()` or `source()` call in your SQL. The &quot;directed&quot; part means the relationship has a direction — model B depends on model A, not the other way around. The &quot;acyclic&quot; part means circular dependencies are impossible: if B depends on A, then A cannot also depend on B (directly or through any chain of intermediate models).

## How Does dbt Use the DAG?

Every time you write `{{ ref(&apos;some_model&apos;) }}` in a dbt model, you create an edge in the DAG. dbt parses all models, resolves all `ref()` calls, and constructs the complete graph before running anything. This graph determines **build order** — dbt executes models in topological order so that every model runs only after its upstream dependencies are complete.
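
For example, a mart model that selects from a staging model adds one edge to the graph (model names here are illustrative):

```sql
-- models/marts/fct_orders.sql
-- The ref() below creates the edge stg_orders -> fct_orders
SELECT *
FROM {{ ref(&apos;stg_orders&apos;) }}
```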

This is why dbt can parallelize builds: models that do not depend on each other can run simultaneously, while dependent models wait for their parents to finish.

## How Do You Read the dbt DAG?

A typical dbt DAG flows left to right through several layers:

| Layer            | Description                             | Examples                       |
| ---------------- | --------------------------------------- | ------------------------------ |
| **Sources**      | Raw data ingested from external systems | `source(&apos;stripe&apos;, &apos;payments&apos;)` |
| **Staging**      | Cleaned, renamed, lightly transformed   | `stg_stripe__payments`         |
| **Intermediate** | Business logic, joins, aggregations     | `int_orders__joined`           |
| **Marts**        | Final business-facing models            | `fct_orders`, `dim_customers`  |
| **Exposures**    | Downstream consumers (dashboards, ML)   | `exposure: revenue_dashboard`  |

Reading the DAG from left to right tells you the story of your data: where it comes from, how it is transformed, and where it ends up. Reading right to left tells you the lineage of any specific model — which upstream models contributed to it.

## Why Does DAG Complexity Grow Over Time?

A new dbt project with ten models has a DAG you can understand at a glance. An enterprise project with 500 models has a DAG that looks like a dense web of interconnections. This growth is natural — as a business adds use cases, the DAG accumulates models, cross-references, and shared intermediate logic.

The problem is not complexity itself but the &lt;dfn&gt;review burden&lt;/dfn&gt; it creates. When you modify a model in a 500-model DAG, understanding which downstream models are affected requires tracing paths through a graph that no human can hold in working memory. This is where tooling becomes essential.

## What Is the Difference Between Static and Diff-Aware Lineage?

&lt;dfn&gt;Static lineage&lt;/dfn&gt; shows the current state of your DAG — all models and their dependencies as they exist right now. This is what you see in `dbt docs generate` and the dbt Cloud IDE. It answers the question: &quot;What does my project look like?&quot;

&lt;dfn&gt;Lineage diff&lt;/dfn&gt; compares the DAG between two states — typically your PR branch and production — and highlights what changed. It answers a different and more actionable question: &quot;What did my changes affect?&quot;

| Aspect       | Static Lineage (dbt docs)     | Lineage Diff                     |
| ------------ | ----------------------------- | -------------------------------- |
| **Shows**    | Current state of all models   | Difference between two states    |
| **Purpose**  | Exploration and documentation | Impact analysis and PR review    |
| **Scope**    | Entire DAG                    | Only changed and affected models |
| **Use case** | Understanding the project     | Reviewing a specific change      |

Static lineage is valuable for onboarding and documentation. But for day-to-day PR review, it forces you to mentally filter hundreds of unchanged models to find the ones that matter. A lineage diff does that filtering for you.

## What Is the Modified+ View?

The &lt;dfn&gt;modified+ view&lt;/dfn&gt; shows the modified models in your PR plus all their downstream dependents. This represents the potential [impact radius](/ai-blog/what-is-impact-radius/) of your changes — every model that could be affected by what you changed.

Consider an example: you modify `int_orders__joined`. The modified+ view shows that model plus the five mart models and two exposures downstream of it. Instead of scanning the entire DAG, you now have a focused set of eight nodes to validate.

This scoping is the default starting point for data impact analysis. You examine each model in the modified+ set — checking schema diffs, row counts, and profile diffs — to confirm the change behaved as expected and did not introduce unintended side effects.
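
dbt&apos;s node selection syntax expresses this set directly: `state:modified+` selects every modified node plus its downstream dependents, compared against a saved production manifest. A sketch (the artifacts path is illustrative):

```shell
# List the impact radius of the current branch against production state
dbt ls --select state:modified+ --state path/to/prod-artifacts
```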

## How Does Lineage Diff Help with PR Review?

When reviewing a dbt pull request, the first question is always: &quot;What is the blast radius?&quot; A lineage diff answers this immediately by showing:

- **Which models were directly modified** — the ones the author changed
- **Which models are downstream** — the ones that could be indirectly affected
- **Which models were added or removed** — structural changes to the DAG itself

This information scopes the review. Instead of reading every SQL file in the diff, the reviewer focuses on the modified models and their downstream dependents. For each model in the impact radius, the reviewer checks whether the data changed as expected using [data diffs](/ai-blog/what-is-a-data-diff/) — schema comparisons, row count checks, and value-level validation.

Without a lineage diff, reviewers either check too little (only the directly modified models, missing downstream breakage) or too much (the entire project, wasting time on unaffected models).

## How Does Column-Level Lineage Add Granularity?

Model-level lineage tells you that model B depends on model A. But if you changed only one column in model A, you may not need to review all of model B — only the columns that depend on the one you changed.

[Column-level lineage](/ai-blog/what-is-column-level-lineage/) provides this precision. It traces individual columns through transformations, showing exactly which downstream columns are derived from your changed column. On large DAGs, this can reduce the review scope from dozens of models to a handful of specific columns.

Column-level lineage is the next level of granularity beyond the model-level DAG. It does not replace model-level lineage — it refines it.

## DAG Complexity and the Case for Tooling

The relationship between DAG size and review effort is not linear — it is combinatorial. A 50-model DAG might have a few dozen dependency paths. A 500-model DAG can have thousands. Manual impact analysis at that scale is slow, error-prone, and inconsistent between team members.

This is why diff-aware lineage tools exist. dbt Cloud provides lineage visualization in its IDE. Recce provides lineage diff with integrated data validation — comparing the DAG between your PR branch and production, highlighting modified models, and letting you run data diffs directly from the lineage view. The goal is the same: reduce the cognitive load of understanding how a change propagates through a complex graph.

## Summary

The dbt DAG is a directed acyclic graph that maps every dependency in your project. It determines build order, enables parallelism, and — most importantly — defines how changes propagate. Reading the DAG tells you the story of your data from sources through marts to exposures. As projects grow, static lineage (dbt docs) becomes insufficient for PR review; lineage diffs that compare two states of the DAG are essential for scoping impact analysis. The modified+ view focuses review on the models most likely affected by a change. For even finer granularity, column-level lineage traces individual columns through transformations. The bigger your DAG, the more you need tooling that makes its complexity manageable.</content:encoded></item><item><title>Recce vs Datafold: Which Data Validation Tool?</title><link>http://reccehq.com/ai-blog/recce-vs-datafold/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/recce-vs-datafold/</guid><description>A comparison of Recce and Datafold for dbt data validation. Covers validation philosophy, CI/CD integration, pricing, and when to choose each tool.</description><pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## Why Compare Recce and Datafold?

Both &lt;dfn&gt;Recce&lt;/dfn&gt; and &lt;dfn&gt;Datafold&lt;/dfn&gt; help data teams validate dbt model changes before merging to production. They solve the same core problem — SQL changes alone don&apos;t reveal how the actual data was affected — but take fundamentally different approaches to getting there. Understanding where they diverge helps you pick the tool that fits your team&apos;s workflow.

## What Is Each Tool&apos;s Validation Philosophy?

The biggest difference between Recce and Datafold is not features but philosophy.

**Recce: validate what matters.** Recce treats data diffing as one tool among several, not the default starting point. You begin with lineage and metadata — understanding what changed and what&apos;s downstream — then drill into targeted diffs where the signal warrants it. Data validation is selective and human-in-the-loop.

**Datafold: automate everything.** Datafold runs &lt;dfn&gt;cross-environment diffs&lt;/dfn&gt; across all modified models on every PR by default. The goal is comprehensive coverage — catch every difference, then let reviewers triage. Its &lt;dfn&gt;Slim Diff&lt;/dfn&gt; feature reduces volume but selects at the model level, not by business relevance.

This philosophical split shapes every downstream decision: what runs in CI, what gets reported, and how much compute you burn.

## How Do the Features Compare?

| Capability               | Recce                                              | Datafold                                 |
| ------------------------ | -------------------------------------------------- | ---------------------------------------- |
| Lineage Diff             | Yes — visual DAG comparison between environments   | Limited — model-level dependency view    |
| Breaking Change Analysis | Yes — detects schema and contract-breaking changes | No dedicated feature                     |
| Column-Level Lineage     | Yes — traces column transformations across models  | Yes — column-level tracking              |
| Schema Diff              | Yes                                                | Yes                                      |
| Row Count Diff           | Yes                                                | Yes                                      |
| Profile Diff             | Yes — column-level statistics comparison           | No direct equivalent                     |
| Value Diff               | Yes — per-column match percentage with primary key | Yes — row-level data diff                |
| Top-K Diff               | Yes — categorical distribution comparison          | No direct equivalent                     |
| Histogram Diff           | Yes — overlaid distribution visualization          | No direct equivalent                     |
| Query Diff               | Yes — arbitrary SQL comparison                     | No direct equivalent                     |
| CI Integration           | Opt-in, scoped via recce.yml                       | Auto-diff all changed models by default  |
| Open Source              | Yes — free CLI, public pricing for Cloud           | No — original data-diff tool sunset      |
| Pricing                  | Public pricing, free tier available                | Commercial, pricing behind sales process |
| Self-Serve Setup         | Yes — install and configure independently          | Requires sales engagement                |

## How Does CI/CD Integration Differ?

Recce&apos;s CI is opt-in and scoped. You define which checks to automate in your `recce.yml` configuration file, choosing from schema diffs, row count checks, profile comparisons, or custom queries. Only checks you&apos;ve first validated manually are promoted to CI. This means your automated checks reflect real review experience, not a generic &quot;diff everything&quot; rule.

Datafold auto-diffs all changed models on every PR by default. Slim Diff reduces the volume by selecting only models that were directly modified, but the selection is at the model level — it doesn&apos;t distinguish between a cosmetic column rename and a revenue-critical calculation change. Every diff gets the same treatment.

For teams working on large DAGs, this distinction matters. A single upstream change can propagate through the entire dependency chain, touching models that the author never intended to affect. Recce lets you focus CI on the models where being wrong is expensive. Datafold reports on everything and leaves triage to the reviewer.

## Why Do Teams Switch From Datafold?

Common reasons teams evaluate alternatives to Datafold:

- **Setup friction** — Datafold requires a sales process and onboarding. Teams wanting to evaluate quickly find the barrier high.
- **Noisy results** — auto-diffing every model on every PR generates alert fatigue. Reviewers learn to skim or ignore the reports.
- **Limited control** — you can&apos;t easily scope what gets diffed based on business context or risk level.
- **Compute costs** — comprehensive diffing triggers heavy warehouse queries. On large datasets, auto-diff budgets add up fast.
- **Pricing opacity** — without public pricing, teams can&apos;t plan costs or compare options independently.

These aren&apos;t flaws in Datafold&apos;s design — they&apos;re tradeoffs of a coverage-first philosophy. Teams that prefer targeted, context-driven validation often find Recce a better fit.

That said, Datafold has legitimate strengths. Its automated cross-environment diffing requires minimal configuration — once connected, every PR gets coverage without any per-model setup. For large-scale migrations (warehouse moves, dbt version upgrades), exhaustive row-level comparison across hundreds of models is exactly what you need. And teams with dedicated data quality engineers who can triage alerts effectively may prefer the comprehensive approach over manual drill-down.

## How Should You Decide Between Them?

Use this decision framework based on your team&apos;s priorities:

| Criterion              | Choose Recce                                                    | Choose Datafold                                        |
| ---------------------- | --------------------------------------------------------------- | ------------------------------------------------------ |
| Validation approach    | You want to validate selectively based on business context      | You want comprehensive automated coverage              |
| Team size              | Small to mid-size teams that value signal over volume           | Larger teams with dedicated data quality roles         |
| DAG complexity         | Large DAGs where diffing everything is impractical or expensive | Manageable DAGs where full coverage is feasible        |
| Budget sensitivity     | Need public pricing and predictable costs                       | Budget is flexible and sales engagement is acceptable  |
| CI philosophy          | Prefer opt-in checks that you curate over time                  | Prefer out-of-the-box automated diffing                |
| Migration use case     | Day-to-day PR validation and iterative development              | Large-scale migrations requiring exhaustive comparison |
| Open-source preference | Want an open-source foundation with optional cloud              | Commercial-only is acceptable                          |
| Review workflow        | Drill-down: lineage first, then targeted diffs with checklist   | Top-down: see all diffs, then triage and dismiss       |

Neither tool is universally better. The choice depends on whether your team&apos;s bottleneck is coverage (you miss things because nothing checks them) or noise (you miss things because everything is flagged).

## How Do They Fit Into the Broader dbt Ecosystem?

Both tools complement dbt&apos;s built-in testing. dbt tests validate structure and constraints; [data diffs](/ai-blog/what-is-a-data-diff/) validate actual output against a known-good baseline. The question is how much automation and scope you want around that diffing.

Other tools in the ecosystem include `dbt-audit-helper` for lightweight relation comparison, SQLMesh with built-in table diff, and custom CI scripts. Recce and Datafold sit at the more capable end of this spectrum — the difference is in how they wield that capability.

For teams building a structured review process, combining Recce&apos;s selective diffing with [CI checks beyond dbt tests](/ai-blog/what-should-dbt-ci-check-beyond-tests/) provides a practical middle ground: automate what you&apos;ve validated, investigate everything else with context.

## Summary

Recce and Datafold solve the same problem — validating data changes before they reach production — with opposite philosophies. Recce is selective and human-in-the-loop, starting with lineage and drilling into targeted diffs. Datafold is comprehensive and automated, diffing all changed models by default. Choose Recce when signal-to-noise ratio and cost control matter most. Choose Datafold when exhaustive coverage and large-scale migration support are the priority. Both are stronger than no data validation at all.</content:encoded></item><item><title>Data Review Best Practices for Modern Data Teams</title><link>http://reccehq.com/ai-blog/data-review-best-practices/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/data-review-best-practices/</guid><description>A structured guide to implementing data review processes that catch data quality issues before they reach production. Covers impact analysis, automated checks, and CI/CD integration for dbt projects.</description><pubDate>Thu, 19 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## What Is Data Review?

&lt;dfn&gt;Data review&lt;/dfn&gt; is the practice of systematically validating data model changes before merging them into production. Unlike code review, which examines logic, data review examines the actual output — the rows, columns, and values that downstream consumers depend on.

Modern data teams working with &lt;dfn&gt;dbt&lt;/dfn&gt; (data build tool) face a core challenge: a single model change can affect every downstream dependency in the DAG — from intermediate models to dashboards and ML features. Data review provides visibility into this blast radius before changes ship.

## Why Data Review Matters

Traditional data quality approaches rely on post-deployment monitoring — catching issues after they&apos;ve already affected production dashboards and reports. Data review shifts this left:

- **Pre-merge validation**: Compare branch output against production baseline
- **Impact analysis**: Understand which downstream models are affected
- **Automated diffing**: Detect schema changes, row count shifts, and value distribution changes
- **Human judgment**: Flag changes that are technically valid but semantically wrong

## Core Components of a Data Review Process

### 1. Impact Analysis

Before reviewing data, understand the scope of change. &lt;dfn&gt;Impact analysis&lt;/dfn&gt; maps which models are modified and traces their downstream dependencies. This tells reviewers where to focus attention.

Key metrics for impact analysis:

| Metric             | What It Measures        | Why It Matters             |
| ------------------ | ----------------------- | -------------------------- |
| Modified models    | Direct code changes     | Primary review targets     |
| Downstream models  | Transitive dependencies | Blast radius of the change |
| Affected exposures | Dashboards, ML features | Business impact visibility |
| Row count delta    | Production vs. branch   | Data volume changes        |

### 2. Automated Checks

Automate the repetitive parts of data review:

- **Schema diff**: Detect added, removed, or renamed columns
- **Row count comparison**: Flag unexpected increases or decreases
- **Value distribution**: Compare histograms of key columns
- **Primary key validation**: Ensure uniqueness constraints hold

### 3. PR-Level Reporting

Integrate data review results into your pull request workflow. A data review summary posted as a PR comment gives reviewers context without switching tools.

## Implementing Data Review with Recce

Recce automates data review for dbt projects. The typical workflow:

1. Developer opens a PR with model changes
2. CI runs `dbt build` on the PR branch
3. Recce compares branch output against the production baseline
4. Recce posts a diff report as a PR comment
5. Reviewers approve or request changes based on data impact

### Integration with dbt CI/CD

Recce plugs into existing dbt CI pipelines. After `dbt build` completes, Recce runs its comparison checks and reports results. No changes to your dbt project structure are required.

## Best Practices

1. **Review data, not just code**: A syntactically correct model can produce wrong results. Always check the output.
2. **Scope reviews to impact radius**: Don&apos;t review every model — focus on modified models and their direct downstream dependencies.
3. **Automate the baseline**: Use CI to maintain a production baseline that Recce compares against automatically.
4. **Set blocking thresholds**: Define what constitutes a blocking data change (e.g., &gt;10% row count change) and enforce it in CI.
5. **Document expected changes**: When a PR intentionally changes data output, annotate the expected changes in the PR description.</content:encoded></item><item><title>What Is Impact Radius in Data Modeling?</title><link>http://reccehq.com/ai-blog/what-is-impact-radius/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/what-is-impact-radius/</guid><description>Impact radius measures how far a data model change propagates through your DAG. Learn how to calculate, visualize, and use impact radius to scope data reviews and reduce production risk.</description><pubDate>Thu, 19 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## Defining Impact Radius

&lt;dfn&gt;Impact radius&lt;/dfn&gt; measures how far a data model change propagates through your data pipeline. When you modify a model in a &lt;dfn&gt;directed acyclic graph (DAG)&lt;/dfn&gt;, the impact radius is the complete set of downstream nodes that depend — directly or transitively — on that model.

Understanding impact radius is critical for data review: it tells you exactly where to look when validating a change.

## How Impact Radius Works

Consider a simplified dbt DAG:

```
raw_orders → stg_orders → fct_orders → mart_revenue
                                      → mart_customer_ltv
                        → fct_order_items → mart_product_performance
```

If you modify `stg_orders`, the impact radius includes:

- `fct_orders` (direct dependent)
- `fct_order_items` (direct dependent)
- `mart_revenue` (transitive via fct_orders)
- `mart_customer_ltv` (transitive via fct_orders)
- `mart_product_performance` (transitive via fct_order_items)

The impact radius is 5 models. The modification touches 1 model but affects 5.

## Calculating Impact Radius

Impact radius calculation is a graph traversal problem. Starting from each modified node, perform a breadth-first or depth-first traversal of all downstream edges.

| Input                            | Output                               |
| -------------------------------- | ------------------------------------ |
| Set of modified models           | All transitively dependent models    |
| Modified model + depth limit     | Dependents within N hops             |
| Modified model + exposure filter | Only affected dashboards/ML features |

### Depth-Bounded Impact Radius

For large DAGs, full transitive impact radius can be overwhelming. Depth-bounded analysis limits traversal to N hops downstream:

- **Depth 1**: Direct dependents only — the models that SELECT FROM the modified model
- **Depth 2**: Dependents of dependents — one additional layer of propagation
- **Depth N**: Full transitive closure once N reaches the length of the longest downstream path from the modified model
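
The traversal, including the depth bound, can be sketched in a few lines of Python. The `impact_radius` helper is illustrative rather than any tool&apos;s API; the adjacency list mirrors the example DAG above:

```python
from collections import deque

def impact_radius(dag, start, max_depth=None):
    """Collect all downstream dependents of `start` via breadth-first search.

    dag maps each model to the models that directly depend on it.
    max_depth=None means full transitive closure.
    """
    seen = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        # Stop expanding once the depth bound is reached.
        if max_depth is not None and depth >= max_depth:
            continue
        for child in dag.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen

# The DAG from the example above:
dag = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_orders", "fct_order_items"],
    "fct_orders": ["mart_revenue", "mart_customer_ltv"],
    "fct_order_items": ["mart_product_performance"],
}

print(len(impact_radius(dag, "stg_orders")))                  # full radius: 5 models
print(sorted(impact_radius(dag, "stg_orders", max_depth=1)))  # direct dependents only
```

BFS is a natural fit here because the depth bound falls out of the queue order for free; depth-first traversal works equally well for the unbounded case.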

## Why Impact Radius Matters for Data Review

### Scoping Reviews

Without impact radius, reviewers must manually trace dependencies or review everything. Impact radius automatically identifies the relevant subset of models to check.

### Risk Assessment

Larger impact radius means higher risk. A change to a foundational staging model that affects 30 downstream models requires more scrutiny than a change to a leaf mart model with no dependents.

### CI/CD Gating

Teams can set CI rules based on impact radius:

- Impact radius &lt; 5: Auto-approve with standard checks
- Impact radius 5–15: Require data review approval
- Impact radius &gt; 15: Require senior review + stakeholder notification
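
Such a gating rule is straightforward to encode. The thresholds below mirror the example tiers above and are illustrative, not recommendations:

```python
def required_review(radius):
    """Map an impact radius to a review gate (illustrative thresholds)."""
    if radius > 15:
        return "senior review plus stakeholder notification"
    if radius >= 5:
        return "data review approval"
    return "auto-approve with standard checks"

print(required_review(3))   # auto-approve with standard checks
print(required_review(12))  # data review approval
print(required_review(30))  # senior review plus stakeholder notification
```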

## Reducing Impact Radius

Design patterns that limit propagation:

1. **Interface layers**: Insert stable interface models between raw/staging and mart layers. These absorb schema changes.
2. **Model modularity**: Break large models into focused components. A change to one component doesn&apos;t propagate through unrelated paths.
3. **Schema contracts**: Use dbt contracts to enforce column-level stability. Downstream models depend on the contract, not the implementation.
4. **Incremental isolation**: Design incremental models so that logic changes affect only the incremental window, not the full table.

## Impact Radius in Recce

Recce calculates impact radius automatically from your dbt project manifest. When a PR modifies models, Recce:

1. Parses the project DAG from `manifest.json`
2. Identifies all modified models from the git diff
3. Traverses downstream edges to compute full impact radius
4. Filters to relevant exposures and metrics
5. Reports the impact radius in the PR comment with visual lineage diff
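
The first three steps can be sketched against a pared-down manifest. dbt&apos;s `manifest.json` includes a `child_map` (node id to direct dependents); everything else here, including the `model.proj.*` ids, is hypothetical and greatly simplified:

```python
import json

# A tiny stand-in for manifest.json; real manifests contain many more keys.
manifest = json.loads("""{
  "child_map": {
    "model.proj.stg_orders": ["model.proj.fct_orders"],
    "model.proj.fct_orders": ["exposure.proj.revenue_dashboard"]
  }
}""")

def downstream(child_map, modified):
    """All nodes reachable from the modified set (iterative depth-first search)."""
    seen, stack = set(), list(modified)
    while stack:
        node = stack.pop()
        for child in child_map.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

impacted = downstream(manifest["child_map"], ["model.proj.stg_orders"])
# Filter to exposures for business-impact reporting:
exposures = {n for n in impacted if n.startswith("exposure.")}
print(sorted(impacted))
print(sorted(exposures))
```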

This gives reviewers immediate visibility into the scope of every change without manual DAG tracing.</content:encoded></item><item><title>What Should a dbt CI Pipeline Check Beyond Tests?</title><link>http://reccehq.com/ai-blog/what-should-dbt-ci-check-beyond-tests/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/what-should-dbt-ci-check-beyond-tests/</guid><description>dbt tests check structure, not data impact. Learn what additional checks — schema diffs, row counts, profile diffs, and automated preset checks — your CI pipeline should run to catch issues before merging.</description><pubDate>Wed, 18 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## What Gap Do dbt Tests Leave in CI?

Most dbt CI pipelines follow the same pattern: run `dbt build`, execute tests, and merge if everything is green. The problem is that &lt;dfn&gt;dbt tests&lt;/dfn&gt; validate constraints — not-null, uniqueness, accepted values, referential integrity — but they do not validate data output.

A model can pass every test and still produce [wrong results](/ai-blog/why-dbt-data-wrong-when-tests-pass/). An incorrect filter silently drops records. A bad JOIN fans out rows. A calculation uses the wrong column. The tests pass because the output is structurally valid. The data is just wrong.

This is the gap: dbt tests check structure, but your CI pipeline needs to check data impact.

## What Additional Checks Should CI Include?

Beyond dbt tests, a robust CI pipeline should compare PR branch output against the production baseline using four types of &lt;dfn&gt;data validation checks&lt;/dfn&gt;:

| Check Type     | What It Detects                           | When It Matters                                      |
| -------------- | ----------------------------------------- | ---------------------------------------------------- |
| Schema diff    | Column additions, removals, type changes  | Every PR — schema changes break downstream consumers |
| Row count diff | Data loss, duplication, unexpected growth | Every PR — row count shifts signal logic errors      |
| Profile diff   | Distribution changes in column values     | PRs touching business logic or filters               |
| Value diff     | Specific value-level differences          | PRs affecting critical models                        |

### Schema Diff

A &lt;dfn&gt;schema diff&lt;/dfn&gt; compares the column structure of a model between your PR branch and production. It catches added columns, removed columns, renamed columns, and type changes. Schema changes are especially important because they can silently break downstream models, dashboards, and reverse ETL pipelines that depend on specific column names or types.

### Row Count Diff

A &lt;dfn&gt;row count diff&lt;/dfn&gt; compares the number of rows in a model between environments. A significant drop often indicates a filter bug or failed JOIN. A significant increase may indicate duplication or an overly permissive filter. Either way, it is a signal that something changed beyond what the code alone would suggest.

### Profile Diff

A &lt;dfn&gt;profile diff&lt;/dfn&gt; compares statistical summaries of columns — min, max, mean, median, null percentage, distinct count — between environments. It catches distribution shifts that row counts miss. For example, if the average order value drops by 30% because a filter excluded high-value transactions, the row count might barely change, but the profile diff will flag it immediately.
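
As a sketch of what a profile comparison computes, the helper below summarizes one column in plain Python and reports only the statistics that differ. In practice these statistics are computed in the warehouse; the values here are invented for illustration:

```python
from statistics import mean, median

def profile(values):
    """Summary statistics for one column (nulls modelled as None)."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": round(mean(present), 2),
        "median": median(present),
        "null_pct": round(100 * (len(values) - len(present)) / len(values), 1),
        "distinct": len(set(present)),
    }

prod = profile([120, 95, None, 4800, 130])
branch = profile([120, 95, None, None, 130])  # high-value order filtered out
# Keep only the statistics that changed between environments:
diff = {k: (prod[k], branch[k]) for k in prod if prod[k] != branch[k]}
print(diff)
```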

### Value Diff

A &lt;dfn&gt;value diff&lt;/dfn&gt; compares actual row-level data between environments. This is the most granular and most expensive check. Reserve it for critical models where being wrong has direct business impact — revenue tables, customer-facing metrics, models feeding ML pipelines.

## What Are Critical Models and How Do You Identify Them?

Not every model in your dbt project needs the same level of CI validation. &lt;dfn&gt;Critical models&lt;/dfn&gt; are the ones where being wrong has significant business consequences.

Critical models typically share one or more of these characteristics:

- **Customer-facing** — powers dashboards, reports, or APIs that stakeholders see directly
- **Revenue-related** — feeds billing, financial reporting, or pricing logic
- **ML pipeline input** — serves as a feature table for machine learning models
- **High fan-out** — has many downstream dependencies, amplifying the impact of errors

Identify them through domain knowledge. Ask your team: &quot;Which models trigger a stakeholder call when something goes wrong?&quot; Those are your critical models.

## How Do You Automate Data Checks in CI?

Automation turns data checks from a manual practice into a repeatable process. The key concept is &lt;dfn&gt;preset checks&lt;/dfn&gt; — a configuration file that defines which models to check and what types of diffs to run.

A typical preset configuration (in a file like `recce.yml`) specifies:

- **Target models** — which models to run checks on (usually critical models)
- **Check types** — schema diff, row count, profile diff, or value diff for each target
- **Thresholds** — what constitutes a blocking difference (e.g., row count change &gt; 5%)

### A Practical CI Workflow

Here is the end-to-end workflow for a dbt CI pipeline with data checks:

1. **`dbt build`** — build and test the PR branch against a CI environment
2. **Run preset checks** — execute the configured data checks, comparing PR output against the production baseline
3. **Compare to baseline** — evaluate results against defined thresholds
4. **Post summary to PR** — format the results as a [structured PR comment](/ai-blog/how-to-write-a-good-dbt-pull-request/) so reviewers have immediate context
5. **Block on mismatches** — if critical model checks detect unexpected differences, mark the CI check as failed

This workflow integrates naturally with GitHub Actions or any CI provider. The data checks run after `dbt build` completes and report results back to the pull request.
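
Step 3, the threshold comparison, is the simplest to make concrete. A minimal sketch for a row-count check, assuming a 5% blocking threshold on a critical model (the function name and shape are illustrative, not recce.yml syntax):

```python
def row_count_check(max_pct_change, prod_count, branch_count):
    """Pass when the row-count change stays within the threshold."""
    if prod_count == 0:
        # An empty baseline only matches an empty branch.
        return branch_count == 0
    pct = abs(branch_count - prod_count) / prod_count * 100
    return max_pct_change >= pct

print(row_count_check(5.0, 142_387, 138_201))  # True: change is about 2.9%
print(row_count_check(5.0, 100_000, 80_000))   # False: a 20% drop blocks the merge
```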

## What Does &quot;All Signal, No Noise&quot; Mean for CI Checks?

A common failure mode for data checks is alert fatigue. If every PR generates a wall of &quot;check passed&quot; messages, reviewers stop reading them. The &lt;dfn&gt;all signal, no noise&lt;/dfn&gt; philosophy means: only report when something is different.

In practice this means:

- **Do not report passing checks** — if the schema matches, the row count is stable, and the profile is unchanged, there is nothing to report
- **Report only mismatches** — surface the specific models and columns where differences were detected
- **Provide context with the signal** — show the actual values (e.g., &quot;row count changed from 142,387 to 138,201&quot;) so reviewers can assess severity immediately
- **Distinguish blocking from informational** — some checks should block the merge; others are advisory

This approach keeps the CI output actionable. When a reviewer sees a data check result, it means something actually changed and needs attention.
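
The reporting rules above amount to a filter that emits a line only when prod and branch disagree. A minimal sketch, with a hypothetical result shape:

```python
def summarize(results):
    """Keep only mismatches; passing checks produce no output lines."""
    lines = []
    for r in results:
        if r["prod"] != r["branch"]:
            lines.append(
                f"{r['model']}: {r['check']} changed from {r['prod']} to {r['branch']}"
            )
    return lines

results = [
    {"model": "fct_orders", "check": "row_count", "prod": 142_387, "branch": 138_201},
    {"model": "dim_customers", "check": "row_count", "prod": 50_112, "branch": 50_112},
]
# Only the fct_orders mismatch is reported; the passing check is silent.
print("\n".join(summarize(results)) or "nothing to report")
```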

## How Do CI Checks and PR Review Work Together?

CI checks and human PR review serve complementary roles:

|                | CI Checks                                                 | Human Review                                  |
| -------------- | --------------------------------------------------------- | --------------------------------------------- |
| **Speed**      | Runs in minutes on every PR                               | Requires scheduling and availability          |
| **Scope**      | Predefined checks on critical models                      | Can explore any model or question             |
| **Strength**   | Catches obvious regressions consistently                  | Evaluates intent, context, and business logic |
| **Limitation** | Cannot judge whether a change is correct for the business | Cannot run on every PR at full depth          |

CI catches the obvious and repetitive — schema breaks, row count drops, distribution shifts. Human review handles the nuanced — &quot;Is this metric change expected given the business context?&quot; Together, they form a layered defense against data errors.

## How Does This Prevent &quot;Tests Pass but Data Is Wrong&quot;?

The [tests-pass-data-wrong scenario](/ai-blog/why-dbt-data-wrong-when-tests-pass/) happens when a change is structurally valid but semantically incorrect. dbt tests confirm the structure is fine. CI data checks close the gap by comparing actual output:

- A buggy filter that drops records? **Row count diff catches it.**
- A calculation error that shifts averages? **Profile diff catches it.**
- A column rename that breaks downstream? **Schema diff catches it.**
- A JOIN change that subtly inflates metrics? **Value diff on the critical model catches it.**

Without these checks, the only defense is a human reviewer manually running queries — which happens inconsistently at best. Automated CI checks make this validation happen on every single PR.

## Summary

dbt tests validate structural constraints, but your CI pipeline needs to go further. Add data-level checks — comparing structure, volume, distributions, and values between environments — to catch issues that tests miss. Focus validation on critical models — the ones where being wrong triggers stakeholder calls. Automate checks using preset configurations that define what to check and what thresholds to enforce. Follow the &quot;all signal, no noise&quot; philosophy: only report differences, not passing checks. CI handles the obvious regressions; human review handles the nuanced judgment calls. Together, they close the gap between &quot;tests pass&quot; and &quot;data is correct.&quot;</content:encoded></item><item><title>How to Write a Good dbt Pull Request</title><link>http://reccehq.com/ai-blog/how-to-write-a-good-dbt-pull-request/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/how-to-write-a-good-dbt-pull-request/</guid><description>A structured guide to writing dbt pull requests that include data validation, not just code changes. Covers PR templates, data impact documentation, and review workflows.</description><pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## Why Are Data PRs Different from Code PRs?

In a typical software project, a pull request tells a clear story: here is the code that changed, here is what it does, here are the tests that prove it works. Reviewers can read the diff and reason about correctness.

&lt;dfn&gt;dbt pull requests&lt;/dfn&gt; are fundamentally different. The code — SQL or Jinja — is visible, but its output is not. A one-line change to a `WHERE` clause can silently shift revenue numbers, customer counts, or ML feature values across every model downstream of the change. You can read the SQL and understand the intent; you cannot read it and know whether the data is correct.

This is the core problem: code is visible, but data is a black box. A good dbt PR must open that box.

## What Should a Good dbt Pull Request Include?

A &lt;dfn&gt;dbt PR comment template&lt;/dfn&gt; standardizes the information that every pull request should contain. Without a template, PRs tend toward vague descriptions like &quot;updated customer model&quot; — leaving reviewers to guess at the scope and impact.

A structured template should include these sections:

### Type of Change

Classify the change so reviewers know what to expect:

| Type            | Description                             | Review Focus                                     |
| --------------- | --------------------------------------- | ------------------------------------------------ |
| New model       | Adds a new model to the project         | Schema design, naming conventions, test coverage |
| Bugfix          | Corrects incorrect logic                | Data diff against production, downstream impact  |
| Refactor        | Restructures without changing output    | Confirm output is identical to production        |
| Breaking change | Intentionally changes output            | Full data validation, stakeholder notification   |
| Source change   | Updates source definitions or freshness | Upstream dependency review                       |

### Description and Motivation

Explain _why_ the change exists, not just _what_ it does. A good description answers: What business problem does this solve? What triggered this change? What alternatives were considered?

### Related Issues

Link to the issue tracker. This creates traceability between business requests and data changes.

### Lineage Diff

Show which models are directly modified and which downstream models are impacted. A &lt;dfn&gt;lineage diff&lt;/dfn&gt; visualizes the blast radius of your change — the set of models, exposures, and dashboards that could be affected.

### Data Validation Results

This is the section most dbt PRs lack entirely. Include the results of [data diffs](/ai-blog/what-is-a-data-diff/) on impacted models:

- Did columns change? (schema comparison)
- Did the volume of data change? (row count comparison)
- Did the statistical distribution of key columns shift? (profile comparison)
- For critical models, how do specific values compare between dev and prod? (value-level comparison)

### dbt Test Results

Confirm that all dbt tests pass on the PR branch. This is the baseline — necessary but not sufficient.

### Impact Considerations

Note any downstream consumers that should be aware: dashboards, reverse ETL pipelines, ML features, or other teams&apos; models.

### Reviewer Checklist

Provide a checklist of items for the reviewer to verify, such as: naming conventions followed, tests added for new models, data validation reviewed, breaking changes communicated.

## How Do You Perform a Data Impact Assessment?

A &lt;dfn&gt;data impact assessment&lt;/dfn&gt; compares the actual data output of your PR branch against the production baseline. The goal is to answer: &quot;Did the data change the way I expected, and only the way I expected?&quot;

The process follows a funnel — start broad, narrow down:

1. **Run lineage analysis** — identify all models in the impact radius of your change
2. **Check schema diffs** — confirm no unintended column changes
3. **Compare row counts** — catch data loss or unexpected duplication
4. **Run profile diffs** — check that distributions on key columns look reasonable
5. **Drill into value diffs** — on critical models, compare actual values between dev and prod

This layered approach is efficient. Most models will pass the first two checks and need no further investigation. You only invest deep review time where the data shows something unexpected.
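
The funnel can be sketched as an ordered list of checks, cheapest first, with an early exit at the first mismatch. The check functions here are stand-ins for real environment comparisons:

```python
def review_funnel(model, checks):
    """Run ordered checks, cheapest first; stop at the first mismatch.

    checks: list of (name, fn) where fn returns True when prod and branch match.
    """
    for name, fn in checks:
        if not fn(model):
            return f"{model}: investigate starting at the {name} layer"
    return f"{model}: no differences detected"

# Hypothetical check functions; real ones would query both environments.
checks = [
    ("schema diff", lambda m: True),
    ("row count", lambda m: m != "fct_orders"),  # pretend fct_orders shifted
    ("profile diff", lambda m: True),
]
print(review_funnel("dim_customers", checks))
print(review_funnel("fct_orders", checks))
```

Ordering the checks by cost means most models exit after the cheap structural comparisons, which is exactly why the funnel is efficient.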

## What Are the Benefits of Structured PR Templates?

Structured templates deliver three distinct benefits:

**Define your own work.** Writing a structured PR forces the author to think through the impact of their change. You cannot fill in a &quot;data validation results&quot; section without actually running the validation. The template makes thoroughness the default.

**Help your reviewers.** Reviewers should not have to reverse-engineer the purpose and impact of a change from a code diff alone. A structured PR gives them the context they need to review efficiently and ask the right questions.

**Create a historical record.** Six months from now, when someone asks &quot;why did the revenue model change in February?&quot;, the PR is the source of truth. A well-documented PR with data validation evidence is far more useful than a one-line commit message.

## How Do Teams Use Structured PR Review in Practice?

Teams across industries have adopted structured data PR review. Municipal government data teams use PR templates to document changes to public-facing datasets, where data errors can erode public trust. The structured format ensures that every change to a critical model includes validation evidence and a clear explanation of intent.

In the dbt community, the Jaffle Shop demo project demonstrates how even a small project benefits from documenting data impact alongside code changes. The pattern scales: what works for a demo project works for a 500-model production project.

The common thread is that teams who adopt structured PR review catch more issues before production and spend less time debugging after.

## How Do Teams Automate PR Validation?

Manual validation works, but it depends on the author remembering to run diffs, format results, and paste them into the PR comment. Consistency drops when the process is entirely manual.

Tools like Recce automate this by generating &lt;dfn&gt;PR-ready validation checklists&lt;/dfn&gt; that export directly to GitHub comments. After analyzing the PR branch against the production baseline, the tool runs structural and statistical checks on impacted models and formats the results as a checklist. Every PR gets the same level of validation — not just the ones where the author was thorough.

For teams looking to extend this into CI, [preset checks in your CI pipeline](/ai-blog/what-should-dbt-ci-check-beyond-tests/) can run these validations on every pull request automatically, closing the gap between manual best practice and repeatable process.

## Summary

A good dbt pull request goes beyond code changes to include data validation evidence. Use a structured template with sections for change type, description, lineage diff, data validation results, test results, and a reviewer checklist. Perform data impact assessments by comparing dev output against production — checking structure, volume, distributions, and values at increasing levels of granularity. Structured templates define your work, help reviewers, and create a historical record. Tools like Recce automate the generation of PR-ready validation checklists, making thorough data review the default rather than the exception.</content:encoded></item><item><title>Why Is My dbt Data Wrong Even When Tests Pass?</title><link>http://reccehq.com/ai-blog/why-dbt-data-wrong-when-tests-pass/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/why-dbt-data-wrong-when-tests-pass/</guid><description>dbt tests validate structure, not meaning. Learn why data can pass all tests and still be wrong, and what practices catch the semantic errors that automated testing misses.</description><pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## The False Sense of Security

Your dbt CI pipeline is green. All tests pass — not-null, unique, accepted_values, relationships. You merge the PR. Two days later, a stakeholder messages: &quot;The revenue numbers look wrong.&quot;

This scenario is more common than most data teams admit. &lt;dfn&gt;Data correctness&lt;/dfn&gt; — whether the data is right for the business — is fundamentally different from &lt;dfn&gt;data quality&lt;/dfn&gt;, which measures structural integrity like completeness, format, and uniqueness. Data can be high quality and still be wrong.

## What dbt Tests Actually Check

dbt&apos;s built-in tests validate structural properties:

| Test Type         | What It Checks                       | What It Misses                                     |
| ----------------- | ------------------------------------ | -------------------------------------------------- |
| `not_null`        | Column has no null values            | Whether the non-null values are correct            |
| `unique`          | No duplicate values in a column      | Whether the values themselves are right            |
| `accepted_values` | Values fall within a defined set     | Whether the distribution across values makes sense |
| `relationships`   | Foreign keys reference valid parents | Whether the join logic produces correct results    |

Custom tests and packages like `dbt-expectations` extend this to statistical checks (e.g., column means within bounds), but they still validate against predefined rules. They cannot catch a bug you didn&apos;t anticipate.
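For example, a singular test can enforce a statistical bound, but only a bound you predicted in advance. The model name, file path, and thresholds below are hypothetical:

```sql
-- Hypothetical singular test (tests/assert_reasonable_avg_order_value.sql).
-- In dbt, any row returned by a singular test marks the test as failed.
-- The bounds are predefined: a bug that keeps the average inside them passes.
SELECT AVG(order_total) AS avg_order_total
FROM {{ ref(&apos;orders&apos;) }}
HAVING AVG(order_total) NOT BETWEEN 20 AND 200
```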

## Why Semantic Errors Slip Through

&lt;dfn&gt;Semantic errors&lt;/dfn&gt; are logical mistakes that produce structurally valid but meaningfully wrong data. They pass all tests because the output looks fine at a structural level.

Common examples:

- **Incorrect filter logic** — a `WHERE` clause that silently excludes valid records. Row counts and uniqueness are fine, but the data is incomplete.
- **Wrong JOIN condition** — a join that fans out rows or drops records. The output has the right columns and no nulls, but metrics are inflated or deflated.
- **Calculation bugs** — business logic that uses the wrong column, wrong aggregation, or wrong date range. The result is a number, just not the right number.
- **Upstream changes** — a column&apos;s meaning changes in a source system. Your tests still pass because the format hasn&apos;t changed, but the semantics have.

In each case, the data &quot;passes the test but fails the business.&quot;
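The JOIN fan-out case can be made concrete with a short sketch; the `orders` and `payments` tables here are hypothetical:

```sql
-- Fan-out sketch: one order can have several payment rows (split payments).
-- The output has non-null, unique order_id values, so not_null and unique
-- tests pass -- but order_total is summed once per payment row, inflating
-- revenue for any order with more than one payment.
SELECT
    o.order_id,
    SUM(o.order_total) AS revenue
FROM orders o
JOIN payments p
    ON p.order_id = o.order_id
GROUP BY o.order_id
```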

## A Real-World Example

A data team pushed a change to a core model that fed a reverse ETL pipeline powering marketing automation. The model was fully tested — schema checks, null checks, uniqueness constraints. But the bug was a logical one: an incorrect filter that subtly changed which records were included in a calculation.

The corrupted data reached the experimentation platform and wasn&apos;t discovered for almost a week. The aftermath:

- Pipeline halted, financial data updates paused
- Downstream consumers had to delete bad data and re-ingest
- Model fix took a full day; platform cleanup took several more
- The experimentation team was blocked the entire time

Nothing looked obviously broken at a glance. It only became apparent when someone calculated metrics and noticed patterns that didn&apos;t make sense. The monetary cost was significant, but the loss of trust was immeasurable.

## How to Catch What Tests Miss

The gap between testing and correctness requires a different approach: comparing actual data output against a known-good baseline.

### Cross-Reference Against Production

Before merging, compare your development environment&apos;s data against production. If historical metrics changed when they shouldn&apos;t have, something is wrong. This is the &lt;dfn&gt;historical context test&lt;/dfn&gt; — trusted production data serves as your benchmark for correctness.
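One way to sketch this check, assuming hypothetical `dev` and `prod` schemas that both contain a `monthly_revenue` model (exact null-safe comparison syntax varies by warehouse):

```sql
-- Historical-context check: closed months should produce identical revenue
-- in dev and prod. Any row returned is a historical metric that moved
-- when it should not have.
SELECT
    p.order_month,
    d.monthly_revenue AS dev_revenue,
    p.monthly_revenue AS prod_revenue
FROM prod.monthly_revenue p
LEFT JOIN dev.monthly_revenue d USING (order_month)
WHERE p.order_month &lt; DATE_TRUNC(&apos;month&apos;, CURRENT_DATE)
  AND d.monthly_revenue IS DISTINCT FROM p.monthly_revenue
```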

### Use Data Diffs at the Right Granularity

[Data diffs](/ai-blog/what-is-a-data-diff/) compare datasets between environments. Start with cheap structural checks (has the schema changed? did row counts shift?), move to statistical checks (are column distributions still reasonable?), and drill into row-level comparisons only where the signal warrants deeper investigation.

For the example above, a profile diff would have shown the distribution of the affected customer lifetime value column shifting. A value diff would have quantified that 99% of rows changed in that column while all other columns matched 100%.

### Automate Checks on Critical Models

Every project has models where being wrong is expensive — customer-facing tables, revenue metrics, models that feed ML pipelines. Identify these critical models and automate [data checks in your CI pipeline](/ai-blog/what-should-dbt-ci-check-beyond-tests/) so they run on every PR, not just when someone remembers.

### Involve Domain Experts Strategically

Not every change needs human review. Focus human attention on changes where the cost of being wrong is high and detection time is slow. For a marketing-critical model, ask: &quot;Would a stakeholder notice if these numbers shifted by 5%?&quot; If the answer is &quot;not until next month&apos;s report,&quot; that model deserves a human-in-the-loop review.

## Building a Data Correctness Workflow

A practical workflow combines automated checks with targeted human review:

1. **dbt tests** — catch structural issues (they&apos;re still essential)
2. **Automated data diffs in CI** — schema, row count, and profile checks on critical models
3. **Manual exploration on high-risk changes** — use lineage to scope impact, run targeted diffs, check distributions
4. **PR documentation** — record what you checked, what you found, and why the change is safe to merge
5. **Domain review on high-stakes changes** — get a second pair of eyes when the business impact is significant

This layered approach follows the [data review best practices](/ai-blog/data-review-best-practices/) that catch issues at the right level. Tests are the foundation, but they&apos;re not the whole building.

## Summary

dbt tests validate that data meets structural constraints — not that it&apos;s correct for the business. Semantic errors (wrong calculations, incorrect filters, unexpected upstream changes) pass all tests while producing wrong results. To catch these issues, compare development data against production baselines using data diffs, automate checks on critical models in CI, and involve domain experts on high-stakes changes. The goal is not to replace testing but to complement it with data-level validation that catches what tests inherently cannot.</content:encoded></item><item><title>What Is Column-Level Lineage and Why Does It Matter?</title><link>http://reccehq.com/ai-blog/what-is-column-level-lineage/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/what-is-column-level-lineage/</guid><description>Column-level lineage tracks how individual columns flow through your data pipeline. Learn how CLL works, its three core use cases, and how it compares across dbt ecosystem tools.</description><pubDate>Sun, 15 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## What Is Column-Level Lineage?

&lt;dfn&gt;Column-level lineage&lt;/dfn&gt; (CLL) tracks how individual columns flow through transformations across your data pipeline. While &lt;dfn&gt;model-level lineage&lt;/dfn&gt; shows that model B depends on model A, column-level lineage shows that `B.total_revenue` is derived from `A.price * A.quantity`.

This granularity matters because not every column in a model is affected by every change. When you modify a calculation in an upstream model, CLL tells you exactly which downstream columns are impacted — and which are safe to ignore.

## Three Core Use Cases for Column-Level Lineage

### Source Exploration

During development, CLL helps you understand how a column is derived. When you encounter a column like `customer_lifetime_value` in a downstream mart, CLL traces it back through intermediate models to the original source columns, showing each transformation along the way.

### Impact Analysis

When modifying column logic, CLL lets you assess potential impact across the entire [DAG](/ai-blog/what-is-dbt-dag-lineage/). Instead of manually checking every downstream model after changing a column definition, you trace the column forward to see exactly which models and columns depend on it. This scopes your [data review](/ai-blog/data-review-best-practices/) to the affected columns rather than entire models.

### Root Cause Analysis

When a downstream metric looks wrong, CLL helps identify the possible source of the error. Trace the problematic column backward through the lineage to find where a transformation may have introduced the issue.

## How Column-Level Lineage Works

CLL is typically constructed by parsing SQL and analyzing how columns are referenced, transformed, and projected through each model. The process involves:

1. **Parsing SQL into an abstract syntax tree (AST)** — tools like &lt;dfn&gt;SQLGlot&lt;/dfn&gt; parse each model&apos;s SQL into a tree structure that can be traversed programmatically.
2. **Traversing scopes** — each CTE, subquery, and root query is a scope with locally available columns. The parser walks through each scope to resolve column references.
3. **Classifying transformations** — for each output column, the parser determines how it relates to its input columns.
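As a concrete illustration, here is a toy model with two scopes, annotated with the lineage a parser would record for each output column (table and column names are invented):

```sql
-- Scope 1: the CTE `line_items`; Scope 2: the root query.
WITH line_items AS (
    SELECT
        order_id,                     -- pass-through from raw_order_lines
        price * quantity AS amount    -- derived from price, quantity
    FROM raw_order_lines
)
SELECT
    order_id,                         -- pass-through from line_items.order_id
    SUM(amount) AS total_revenue      -- derived (aggregation of line_items.amount)
FROM line_items
GROUP BY order_id
```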

## Column Transformation Types

Understanding how a column was transformed is as important as knowing which upstream columns it depends on.

| Transformation | Description                                             | Example                                  |
| -------------- | ------------------------------------------------------- | ---------------------------------------- |
| Pass-through   | Column selected as-is, no modification                  | `SELECT user_id FROM {{ ref(&apos;users&apos;) }}`       |
| Renamed        | Single upstream column with an alias change             | `SELECT user_id AS id FROM {{ ref(&apos;users&apos;) }}` |
| Derived        | Result of an expression, calculation, or aggregation    | `SELECT price * quantity AS total`       |
| Source         | Not based on any upstream column (literal or function)  | `SELECT CURRENT_TIMESTAMP AS created_at` |
| Unknown        | Parsing failed or logic involves unsupported constructs | Ambiguous references in complex JOINs    |

&lt;dfn&gt;Derived columns&lt;/dfn&gt; usually deserve the closest attention during review because they introduce business logic. A pass-through column is unlikely to be the source of a data issue, but a derived column that aggregates or transforms data is where bugs hide.

## The WHERE Clause Caveat

An important limitation of most CLL implementations: columns used only in `WHERE` clauses, `JOIN` conditions, or `GROUP BY` expressions, and not also selected, typically do not appear in the column-level lineage graph. These clauses filter or group data but don&apos;t directly produce output columns.

For example, if a model filters on `order_status = &apos;completed&apos;`, the `order_status` column won&apos;t appear in the CLL for any output column — even though changing its values would affect the model&apos;s output. This is a &lt;dfn&gt;model-to-column relationship&lt;/dfn&gt; rather than a column-to-column dependency.
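A minimal sketch of the caveat, using a hypothetical `orders` model:

```sql
-- `order_status` shapes the result set but never appears as an output
-- column, so most CLL graphs record no column-to-column edge for it --
-- even though changing its values changes every downstream metric.
SELECT
    order_id,
    order_total
FROM orders
WHERE order_status = &apos;completed&apos;
```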

This means CLL is powerful for tracing data flow but should be complemented with model-level lineage and [data diffs](/ai-blog/what-is-a-data-diff/) for full coverage.

## Comparing CLL Tools in the dbt Ecosystem

Each CLL implementation differs in meaningful ways:

| Tool                    | CLL Display                    | Key Characteristics                                                                                                                                        |
| ----------------------- | ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Recce**               | Integrated on lineage DAG diff | CLL shown directly on the main lineage view; click columns to trace dependencies; shows transformation types (pass-through, renamed, derived); open source |
| **dbt Cloud Explorer**  | Separate column-level view     | Enterprise only; each column is a distinct node; requires navigating in/out of model detail pages                                                          |
| **Power User (VSCode)** | Integrated in editor lineage   | Open source; CLL shown within the model lineage panel; stays in your editor; requires beta UX toggle                                                       |
| **SQLMesh**             | Native feature                 | Open source; works with dbt projects; shows CTEs as nodes (can cause info overload); upstream-only; each click refreshes lineage                           |

The best choice depends on your workflow. If you want CLL as part of a broader [data validation workflow](/ai-blog/data-review-best-practices/), tools that integrate CLL with diff capabilities (like Recce) reduce context-switching. If you primarily need CLL for exploration during development, editor-integrated tools work well.

## Column-Level Lineage and Impact Radius

CLL directly improves how you calculate [impact radius](/ai-blog/what-is-impact-radius/). Without CLL, a change to any column in a model means you must consider all downstream models as potentially impacted. With CLL, you can narrow the impact radius to only the downstream models and columns that actually depend on your changed column.

This precision matters on large DAGs where a single model might have hundreds of downstream dependents. CLL turns a potentially overwhelming review into a targeted one.

## Summary

Column-level lineage tracks how individual columns flow and transform through your data pipeline. Its three core use cases — source exploration, impact analysis, and root cause analysis — make it essential for efficient data review on complex DAGs. Most CLL implementations work by parsing SQL into an AST and classifying each column&apos;s transformation type. While CLL has limitations (notably the WHERE clause caveat), it significantly reduces the effort needed to validate data model changes when combined with model-level lineage and data diffs.</content:encoded></item><item><title>What Is a Data Diff and When Should You Use One?</title><link>http://reccehq.com/ai-blog/what-is-a-data-diff/</link><guid isPermaLink="true">http://reccehq.com/ai-blog/what-is-a-data-diff/</guid><description>A data diff compares datasets across two environments to surface what changed. Learn the types of data diffs, when each is useful, and how to avoid the hidden costs of diff-everything approaches.</description><pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate><content:encoded>## What Is a Data Diff?

&lt;dfn&gt;Data diff&lt;/dfn&gt; is the practice of comparing two versions of a dataset to identify what changed between them. In dbt workflows, this typically means comparing the output of a model in your development environment against the same model in production. The goal is to understand the actual data impact of your code changes before merging to production.

Unlike a code diff, which shows you what lines of SQL changed, a data diff shows you what happened to the rows, columns, and values that downstream consumers depend on. This distinction matters because a one-line code change can ripple through your entire DAG — affecting models, dashboards, and metrics in ways that are invisible from the SQL alone.

## Types of Data Diffs

Not all diffs operate at the same granularity. The right type depends on what you&apos;re trying to validate.

| Diff Type      | What It Compares                                   | Cost        | Best For                              |
| -------------- | -------------------------------------------------- | ----------- | ------------------------------------- |
| Schema Diff    | Column names, types, ordering                      | Very low    | Catching structural breaking changes  |
| Row Count Diff | Total row counts per model                         | Very low    | Detecting data loss or duplication    |
| Profile Diff   | Column-level statistics (min, max, avg, null rate) | Low         | Spotting distribution shifts          |
| Histogram Diff | Value distributions overlaid on shared axes        | Medium      | Visualizing how distributions shifted |
| Top-K Diff     | Most frequent categorical values                   | Medium      | Comparing category distributions      |
| Value Diff     | Per-column match percentage using a primary key    | Medium-high | Quantifying exact change scope        |
| Query Diff     | Row-by-row comparison of arbitrary queries         | High        | Fine-grained spot-checks              |

The key insight is that these types form a natural funnel. Start cheap and broad, then drill down where the signal warrants it.

## When Should You Use a Data Diff?

Data diffs serve two distinct validation modes:

**Impact analysis** — when you expect data to change and want to verify the change is correct. For example, fixing a `customer_lifetime_value` calculation to only include completed orders. You expect CLV to decrease. A value diff confirms that CLV changed while other columns stayed the same.

**Regression testing** — when you expect data to remain unchanged. For example, refactoring a model&apos;s SQL without changing its logic. A profile diff or row count diff can quickly confirm nothing shifted.

For both modes, the drill-down approach works best:

1. **Start with lineage** — identify which models were impacted and scope your review to the [impact radius](/ai-blog/what-is-impact-radius/)
2. **Check structure** — &lt;dfn&gt;schema diff&lt;/dfn&gt; and row count catch the obvious issues
3. **Check distributions** — &lt;dfn&gt;profile diff&lt;/dfn&gt;, histogram overlay, and top-k reveal statistical shifts
4. **Spot-check values** — &lt;dfn&gt;value diff&lt;/dfn&gt; and &lt;dfn&gt;query diff&lt;/dfn&gt; confirm specific rows when needed
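Step 4 can be sketched as a single query. The `dev`/`prod` schema names and the `customer_id` key are assumptions for illustration, and null-safe comparison syntax varies by warehouse:

```sql
-- Value diff sketch: the share of customers whose lifetime_value matches
-- between dev and prod, keyed on customer_id. A match rate below 1.0
-- quantifies exactly how much of the column your change touched.
SELECT
    COUNT(*) AS compared_rows,
    AVG(CASE WHEN d.lifetime_value IS NOT DISTINCT FROM p.lifetime_value
             THEN 1.0 ELSE 0.0 END) AS match_rate
FROM dev.customers d
FULL OUTER JOIN prod.customers p USING (customer_id)
```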

## When a Data Diff Is Not Enough

A data diff shows you _what_ changed, but not _why_ or _what to do next_. Not all differences are problems. Without context, diffing generates false alerts that demand attention but not action.

The hidden costs of a &quot;diff everything&quot; approach include:

- **Compute cost** — diffing two full tables triggers heavy queries on large datasets. Auto-diffing every model on every PR drains warehouse budgets.
- **Noise** — a small upstream change cascades through the DAG, creating downstream diffs that mostly don&apos;t matter. Teams learn to ignore the alerts.
- **Configuration burden** — accurate row-level diffs require primary keys or unique identifiers, which aren&apos;t always available or documented.

Better alternatives often exist for the first pass. Data profiling (null rates, distributions), group-based aggregation (counts and sums by dimension), and [column-level lineage](/ai-blog/what-is-column-level-lineage/) can tell you where to focus before you start diffing.
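Group-based aggregation, for instance, can be sketched as a set difference. The `dev`/`prod` schema names are illustrative, and some warehouses spell the operator `EXCEPT DISTINCT`:

```sql
-- Cheap first pass before any row-level diff: aggregate by a dimension in
-- each environment and keep only the groups that differ. An empty result
-- means the grouped metrics are identical across environments.
SELECT order_status, COUNT(*) AS orders, SUM(order_total) AS revenue
FROM dev.orders
GROUP BY order_status
EXCEPT
SELECT order_status, COUNT(*), SUM(order_total)
FROM prod.orders
GROUP BY order_status
```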

## How Data Diffs Fit Into dbt PR Review

The most effective teams use data diffs as part of a structured [data review process](/ai-blog/data-review-best-practices/). The workflow looks like:

1. **Explore** — use lineage diff to scope the blast radius
2. **Validate** — run targeted diffs on the models that matter
3. **Document** — add diff results to a checklist with notes explaining what you checked and why
4. **Share** — export the checklist to your PR comment for reviewers

This approach treats diffing as a tool in a larger toolkit, not as the goal itself. The goal is understanding — confirming that your change did what you intended and nothing else.

## Data Diff Tools in the dbt Ecosystem

Several tools support data diffing for dbt projects:

- **Recce** — open-source toolkit with lineage diff, profile diff, value diff, top-k diff, histogram overlay, query diff, and a checklist workflow for PR review. Supports selective, human-in-the-loop validation.
- **dbt-audit-helper** — dbt package for comparing relations with `compare_relations` and `compare_column_values` macros. Lightweight but manual.
- **Datafold** — commercial platform with automated cross-environment diffing on every PR. Full coverage but can generate noise on large DAGs.
- **SQLMesh** — dbt alternative with built-in table diff capabilities.

The right choice depends on your team&apos;s workflow. If you want targeted validation with business context, tools like Recce that support a drill-down approach work well. If you need comprehensive automated coverage, full-table diff tools may fit better.

## Summary

A data diff compares datasets between environments to surface what changed. Use structural diffs (schema, row count) for quick sanity checks, statistical diffs (profile, histogram, top-k) for distribution insight, and row-level diffs (value, query) for fine-grained confirmation. Scope your diffs to the models that matter rather than diffing everything, and combine diffs with lineage analysis and business context for effective data review.</content:encoded></item></channel></rss>