How Did Recce Build an AI Data Review Agent?
Why Did a Single Prompt Approach Fail?
The first approach seemed obvious: gather all the context for a code change, write one prompt, review the output, iterate. Using Anthropic’s API, the team fed PR details, code diffs, lineage changes, row counts, and schema diffs into a single large prompt.
The first PR looked good. The second PR looked good. Then the edge cases arrived.
The summary format varied between runs. The output missed information randomly. The model stopped following prompt instructions carefully. The root cause: cramming everything into one API call pushed against prompt size limits. Information got cut off or the model could not process it all properly.
More critically, the LLM could only work with pre-selected data. Reviewing the output, the team would think “why didn’t it check the row count for model X?” — then realize the answer was simple: because that data was not included in the prompt. The model could not go get data on its own.
The single prompt approach works for simple PRs. It breaks down as complexity increases.
How Did the Agent Architecture Change Everything?
Moving to an agent architecture meant the LLM could call tools during its reasoning. Instead of pre-selecting all context, the agent could explore:
- Check the lineage diff, notice model X changed
- Follow the lineage, check downstream impacts
- Run row counts for model X, see a 50% drop
- Investigate why — check code diff, schema changes, write custom queries
The agent explores like a human reviewer: discover something, dig deeper, follow the trail.
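Conceptually, that exploration is a tool-use loop: the model asks for a tool, the harness runs it, and the result goes back into the conversation until the model stops asking. A minimal sketch using the Anthropic Python SDK is below; the tool names and schemas are illustrative, not the actual Recce tool set.

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative tool definitions -- names and schemas are hypothetical.
TOOLS = [
    {
        "name": "get_lineage_diff",
        "description": "Return changed models and their downstream dependencies.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "query_row_count",
        "description": "Return base vs. current row counts for a model.",
        "input_schema": {
            "type": "object",
            "properties": {"model": {"type": "string"}},
            "required": ["model"],
        },
    },
]

def run_tool(name: str, args: dict) -> str:
    """Dispatch a tool call to Recce / the warehouse (stubbed for the sketch)."""
    return f"stub result for {name}({args})"

messages = [{"role": "user", "content": "Review this pull request: ..."}]
while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # the agent has finished exploring and produced its review
    # Execute every tool the model asked for and feed the results back.
    results = [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_tool(block.name, block.input),
        }
        for block in response.content
        if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
```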
The first agent architecture wrapped Recce as an MCP tool (giving the agent access to lineage diffs, row counts, query results, and schema changes), used gh CLI for PR context, and let the agent decide which tools to call and when.
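For reference, exposing a capability as an MCP tool with the official Python MCP SDK looks roughly like this; the tool name and payload shape are assumptions for the sketch, not Recce’s actual interface.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("recce")

@mcp.tool()
def get_row_count_diff(model: str) -> dict:
    """Return base vs. current row counts for one dbt model (illustrative)."""
    # The real server would query both environments through Recce;
    # this payload is a placeholder for the sketch.
    return {"model": model, "base": 1_000_000, "current": 500_000}

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio for the agent to call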
Results improved immediately. Then dogfooding on more complex internal PRs revealed new failure modes.
What Token Limits Constrain AI Agent Design?
Three limits shape what an AI agent can realistically do:
| Limit | Value | Impact |
|---|---|---|
| Model context window | 200k tokens (Claude) | Total information the agent can hold |
| Single prompt limit | ~90k tokens (practical) | Maximum per API call |
| MCP tool response | 25k tokens per call | Maximum data returned by any single tool |
The MCP tool limit hit first. A lineage diff for a PR with only 5 changed models easily exceeded 25k tokens when returning the full API payload — every node detail, all dependencies, complete diff information. The tool call failed outright.
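One defensive pattern (a sketch, not necessarily what Recce shipped) is to estimate a payload’s size before returning it, so the agent gets an actionable error rather than a failed call:

```python
import json

MCP_TOOL_TOKEN_LIMIT = 25_000  # per-call ceiling on tool responses

def estimate_tokens(payload: dict) -> int:
    # Rough heuristic: ~4 characters per token for JSON-ish English text.
    return len(json.dumps(payload)) // 4

def guard_response(payload: dict) -> dict:
    """Refuse oversized payloads with a hint instead of failing outright."""
    if estimate_tokens(payload) > MCP_TOOL_TOKEN_LIMIT:
        return {"error": "response too large; narrow the request (e.g. fewer models)"}
    return payload
```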
PR context fetching was similarly expensive. The agent needed 5-10 back-and-forth gh CLI calls to gather details, code diffs, comments, and metadata. Each round-trip added latency and burned tokens.
How Do Subagents Solve the Context Problem?
With a single agent hitting both context window and prompt limits, the team needed to distribute the analysis load. Subagents — specialist agents each with their own 200k context window — provided the solution:
- Main agent: Orchestrates the overall job, receives summaries (not full details) from subagents
- pr-analyzer subagent: Extracts and interprets PR context, returns a summary
- recce-analyzer subagent: Explores data with Recce MCP tools, returns findings
The main agent does not need full exploration details — just the summaries. By delegating deep analysis to specialists, available context capacity effectively triples. This is the same multi-agent pattern that makes the production system reliable.
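In pseudocode, the orchestration looks roughly like this; the names are illustrative, not the Claude Agent SDK API.

```python
# Hypothetical orchestration sketch. Each subagent gets its own 200k-token
# context window and hands back only a compact summary.

def run_subagent(name: str, instructions: str, task: str) -> str:
    """Run a specialist agent loop (like the one sketched earlier) in an
    isolated context and return only its final summary text."""
    ...  # stub: spawn the agent, let it call its tools, collect the summary

def review_pull_request(pr_url: str) -> str:
    # pr-analyzer: reads the full PR (title, diff, comments), returns a summary.
    pr_summary = run_subagent(
        name="pr-analyzer",
        instructions="Summarize the intent, scope, and changed models of this PR.",
        task=pr_url,
    )
    # recce-analyzer: explores lineage diffs, row counts, schema changes via MCP tools.
    findings = run_subagent(
        name="recce-analyzer",
        instructions="Validate the data impact of the changes described below.",
        task=pr_summary,
    )
    # The main agent only ever sees the two summaries, never the raw exploration.
    return f"## Data review\n\n{pr_summary}\n\n{findings}"
```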
How Did the Team Optimize MCP Tool Responses?
Staying under the 25k MCP tool limit required several optimizations:
Return only changed and downstream nodes. In a 200-model dbt project, most models are unchanged upstream dependencies. For a PR with 5 changed models, keep those 5 plus their downstream impacts, filter out the ~150 unchanged upstream models.
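A sketch of that filter, walking the manifest’s child_map outward from the changed nodes (field names follow dbt’s manifest.json convention):

```python
from collections import deque

def changed_and_downstream(changed: set[str], child_map: dict[str, list[str]]) -> set[str]:
    """Keep only changed models plus everything downstream of them.

    child_map maps a node ID to its direct children, as in dbt's manifest.json.
    Unchanged upstream dependencies never enter the result.
    """
    keep, queue = set(changed), deque(changed)
    while queue:
        node = queue.popleft()
        for child in child_map.get(node, []):
            if child not in keep:
                keep.add(child)
                queue.append(child)
    return keep
```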
Use dataframes instead of key-value objects. Key-value entries repeat keys for every record, wasting tokens. Dataframe format reduces duplication dramatically.
Use numeric indices instead of long node IDs. Replacing model.my_project.customer_orders with compact integers (1, 2, 3) substantially reduced token counts.
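A sketch combining both ideas: row-count records packed into a column-oriented payload keyed by small integer indices, with a legend mapping indices back to full node IDs (the field names are assumptions, not Recce’s schema):

```python
def to_dataframe(records: list[dict], id_field: str = "node_id") -> dict:
    """Pack key-value records into a compact, column-oriented tool response."""
    # Assign each long node ID a small integer index (1, 2, 3, ...).
    node_index = {rec[id_field]: i + 1 for i, rec in enumerate(records)}
    return {
        "nodes": {v: k for k, v in node_index.items()},  # legend: index -> node ID
        "columns": ["node", "base_rows", "current_rows"],
        "data": [
            [node_index[rec[id_field]], rec["base_rows"], rec["current_rows"]]
            for rec in records
        ],
    }
```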
Wrap PR fetching in a single GraphQL call. Instead of 5-10 gh CLI round-trips, one custom MCP tool using a single GitHub GraphQL call fetches complete PR context. This reduced latency, token usage, and the risk of the agent losing track of partially fetched information.
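A sketch of the single-round-trip idea, using a trimmed-down GitHub GraphQL query; the real tool fetches more fields (reviews, commit metadata, and so on):

```python
import requests

PR_QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      title
      body
      baseRefName
      headRefName
      files(first: 100) { nodes { path additions deletions } }
      comments(first: 100) { nodes { author { login } body } }
    }
  }
}
"""

def fetch_pr_context(owner: str, name: str, number: int, token: str) -> dict:
    """Fetch PR metadata, changed files, and comments in one GraphQL call."""
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": PR_QUERY,
              "variables": {"owner": owner, "name": name, "number": number}},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["repository"]["pullRequest"]
```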
How Was the Lineage Graph Accuracy Problem Solved?
Generated lineage graphs were sometimes incorrect, showing wrong connections between models. The original response format followed manifest.json structure with a parent_map — a nested mapping the agent had to reason over carefully to produce correct Mermaid diagrams.
The fix was changing the tool output to match Mermaid’s native edge representation. Instead of:
```json
{
  "parent_map": {
    "node_1": ["node_2", "node_3"]
  }
}
```
The response became:
```json
{
  "edges": {
    "columns": ["from", "to"],
    "data": [
      ["node_2", "node_1"],
      ["node_3", "node_1"]
    ]
  }
}
```
With explicit edge pairs, the agent no longer needs to infer relationships from nested mappings. The generated diagrams became significantly more stable. This lesson — match the output format to how the consumer will use the data — applies broadly to MCP tool design.
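With that shape, rendering the diagram becomes a direct mapping, as in this sketch:

```python
def edges_to_mermaid(edges: dict) -> str:
    """Render an edge-pair payload as a Mermaid flowchart definition."""
    lines = ["graph TD"]
    for src, dst in edges["data"]:
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)

# edges_to_mermaid({"columns": ["from", "to"],
#                   "data": [["node_2", "node_1"], ["node_3", "node_1"]]})
# -> graph TD
#        node_2 --> node_1
#        node_3 --> node_1
```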
What Happened When the Agent Read Its Own Previous Output?
An unexpected failure: the agent consumed its own previous PR comment summaries as part of the PR context. When it read all PR comments to understand the discussion, it treated old summaries as new information. This caused two problems:
- It failed to follow updated prompts, thinking the old summary was the correct format
- It mistook old analysis as current context
The fix was filtering out agent-generated comments by their signature before feeding PR context to the agent. A simple but non-obvious solution to a problem that only surfaces during real usage.
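The filter itself is short once the agent’s comments carry a known marker; the signature string here is hypothetical:

```python
# Hypothetical marker that the agent appends to every comment it posts.
AGENT_SIGNATURE = "<!-- recce-ai-review -->"

def strip_agent_comments(comments: list[dict]) -> list[dict]:
    """Drop the agent's own previous summaries before building PR context."""
    return [c for c in comments if AGENT_SIGNATURE not in c.get("body", "")]
```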
Architectural Evolution: A Timeline
| Stage | Architecture | What Broke |
|---|---|---|
| 1 | Single prompt with API calls | Prompt size limits, no autonomous exploration |
| 2 | Agent with Claude CLI + MCP tools | Token limits, unreliable lineage, slow PR fetching |
| 3 | Subagents + optimized MCP responses | Scaling needs, multi-platform support requirements |
| 4 | Claude Agent SDK | Production infrastructure for ongoing iteration |
Each stage was driven by dogfooding on real PRs of increasing complexity. What looks like a simple prompting problem turned out to be an infrastructure and observability challenge that took two months of iteration.
What Did the Team Learn?
Domain-specific AI agents require deep iteration on token limits, context management, tool design, output formats, and edge cases discovered only through real usage. The gap between a weekend prototype and a production system is wider than most teams expect.
The system now works on GitHub PRs and GitLab MRs, uses six Recce MCP tools for data validation, and understands context across PR metadata, Recce analysis, and the data warehouse. But every feature in that list represents a specific failure mode that was discovered, diagnosed, and solved through iteration on real data.
Frequently Asked Questions
- How did Recce build its AI data review agent?
- Recce iterated through four architectural stages: a single prompt with API calls, an agent architecture with Claude CLI, custom MCP tools and subagents for token limits, and finally migration to Claude Agent SDK for production infrastructure. Each stage was driven by real failures discovered through dogfooding on internal PRs of increasing complexity.
- Why did a single prompt approach fail for AI data reviews?
- A single prompt approach failed because cramming PR context, code diffs, lineage changes, row counts, and schema diffs into one API call exceeded prompt size limits. Information got cut off or the model could not process it properly. The LLM could only work with pre-selected data and could not explore context independently — it could not go check row counts for a model unless that data was already included.
- What token limits affect AI agent architecture?
- Three token limits constrain AI agent design: the model context window (200k tokens for Claude), the single prompt limit (approximately 90k tokens in practice), and the MCP tool response limit (25k tokens per tool call). A lineage diff for just 5 changed models can exceed the 25k MCP limit when returning full API payloads. These limits drove the move to subagent architecture and MCP response optimization.
- How do subagents solve AI agent context limits?
- Subagents solve context limits by distributing the analysis across multiple isolated 200k context windows. A main orchestrator agent delegates PR understanding to a pr-analyzer subagent and data exploration to a recce-analyzer subagent. Each specialist works with full context in its domain, then returns a summary to the main agent. This effectively triples available context capacity.