QA Agent Infrastructure: Trust But Verify

Last Updated: 2026-01-29
Related Slack Discussion: "Trust + Verify, QA Agent" (AI Channel)
Status: ✅ Production - Fully Implemented


Overview

Seer's QA Agent Infrastructure provides automated quality validation for AI-generated deliverables. This system ensures data accuracy, proper citation, and brand compliance before client delivery.

Core Components

| Component | Purpose | Location |
| --- | --- | --- |
| Trust But Verify Playbook | Practitioner guide for data validation | docs/wiki/how-to/trust-but-verify.md |
| /qa-check Command | On-demand quality validation | plugins/core-dependencies/commands/qa-check.md |
| Stop Hook (QA Check) | Auto-validation before completion | plugins/core-dependencies/scripts/stop-qa-check.sh |
| Quality Standards Skill | Auto-activated QA standards | plugins/core-dependencies/skills/quality-standards/ |
| PostToolUse Hook | File change tracking for verification | plugins/core-dependencies/scripts/post-tool-use.sh |

How It Works

1. Automatic Quality Gates (Stop Hook)

When Claude attempts to complete work, the Stop hook automatically:

  • Detects completion claims ("done", "finished", "works")
  • Triggers QA validation checks
  • Blocks completion if critical issues found
  • Provides actionable feedback for fixes

What it checks:

  • ✅ Test coverage for code changes
  • ✅ Verification of completion claims
  • ✅ Commit checkpoints for multi-file edits
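The detection step can be sketched in a few lines (illustrative Python; the production logic lives in stop-qa-check.sh, and the claim list and function name here are assumptions):

```python
import re

# Hypothetical sketch of the Stop hook's claim detection.
COMPLETION_CLAIMS = re.compile(r"\b(done|finished|works|completed?)\b", re.IGNORECASE)

def should_trigger_qa(final_message: str, edited_files: list[str]) -> bool:
    """Trigger QA validation when a completion claim accompanies file edits."""
    return bool(COMPLETION_CLAIMS.search(final_message)) and len(edited_files) > 0
```

The key design point is the conjunction: a completion claim with no edits in the session is not worth blocking on, and edits without a claim mean work is still in progress.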

2. On-Demand Validation (/qa-check)

Runs comprehensive checks on the current deliverable:

/qa-check

Output:

QA CHECK RESULTS
================

Deliverable Type: slide-deck

Action Titles: PASS
  ✓ All titles are conclusions

Data Sources: FAIL
  ✗ Line 42: "Traffic increased 34%" - no source cited
  ✗ Line 67: "Competitor rankings improved" - no data reference

Brand Compliance: WARN
  ⚠ Line 89: "We recommend" - prefer "The data shows"

Completeness: PASS
  ✓ No placeholders found

Overall: NEEDS REVISION

3. Quality Standards Skill (Auto-Activated)

Activates automatically when working with:

  • Data analysis or metrics
  • Projections/forecasts
  • Recommendations based on data
  • Client-facing deliverables

Core Principles:

  1. Explicit is better than implicit - State assumptions, cite sources
  2. Evidence-based recommendations - Every claim backed by data
  3. Conservative estimates - Under-promise, over-deliver
  4. Sanity checks - Validate math, logic, conclusions

Quality Check Categories

1. Action Titles (Presentations)

Rule: Slide titles must be conclusions, not labels.

| ❌ Bad (Label) | ✅ Good (Conclusion) |
| --- | --- |
| Traffic Overview | Organic traffic increased 34% YoY |
| Performance Summary | Mobile conversion rate improved despite traffic decline |
| Q3 Results | Competitive gap closed by 15 positions |

2. Data Source Verification

Rule: All metrics must cite sources with date ranges.

Required format:

"Based on BigQuery OrganicRankings_Daily table (March 1-31, 2024), 
traffic increased 18% (12,400 → 14,616 sessions)."

Acceptable sources:

  • BigQuery / Seer Signals
  • Google Analytics 4
  • Google Search Console
  • DataForSEO
  • Platform APIs (Google Ads, Meta Ads, etc.)
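The citation rule can be approximated with a simple check (a sketch only - the metric pattern and source list are illustrative, not the actual /qa-check rules):

```python
import re

# Accepted source names, per the list above.
ACCEPTED_SOURCES = ("BigQuery", "Seer Signals", "Google Analytics 4", "GA4",
                    "Google Search Console", "GSC", "DataForSEO")

# Rough metric detector: percentages or comma-grouped counts (assumed pattern).
METRIC = re.compile(r"\d+(\.\d+)?%|\d{1,3}(,\d{3})+")

def flag_uncited_metric(sentence: str) -> bool:
    """True when a sentence contains a metric but names no accepted source."""
    has_metric = bool(METRIC.search(sentence))
    has_source = any(src in sentence for src in ACCEPTED_SOURCES)
    return has_metric and not has_source
```

For example, "Traffic increased 34%" is flagged, while the required format shown above passes because it names BigQuery and a date range.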

3. Projection Methodology

Rule: Projections must show calculation + assumptions.

Example:

Potential traffic lift: 180-220 clicks/month

Methodology:
- Current position: 7
- Target position: 3
- Monthly search volume: 2,400
- CTR improvement: 4.5% → 9.2% (AWR 2024 CTR Study)
- Calculation: 2,400 × (0.092 - 0.045) = 113 clicks (base)
- Scenarios: Best (+50%), Better (+30%), Good (+10%)

Assumptions:
1. Competitive landscape remains stable
2. Content implemented within 30 days
3. Technical SEO issues addressed
4. No major algorithm updates
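The calculation above can be reproduced in a few lines (the figures and scenario multipliers come from the example; the function name is ours):

```python
def projected_clicks(volume: int, ctr_target: float, ctr_current: float) -> int:
    """Base monthly click lift from a CTR improvement at a given search volume."""
    return round(volume * (ctr_target - ctr_current))

# Figures from the example: 2,400 searches/month, CTR 4.5% -> 9.2%
base = projected_clicks(2_400, 0.092, 0.045)  # 113 clicks (base)

# Scenario bands from the methodology: Good +10%, Better +30%, Best +50%
scenarios = {label: round(base * mult)
             for label, mult in [("Good", 1.10), ("Better", 1.30), ("Best", 1.50)]}
```

Keeping the arithmetic in code rather than prose makes the sanity check trivial: anyone can re-run it against the stated volume and CTR figures.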

4. Brand Compliance (Seer Voice)

Issues flagged:

  • Passive voice where active is better
  • Jargon without explanation
  • "We recommend" (prefer "The data shows" or "Testing confirmed")
  • Overly formal language
  • Guarantee language ("will", "guaranteed", "100% will")

5. Completeness

Issues flagged:

  • [TODO], [TBD], [PLACEHOLDER] markers
  • Empty sections in templates
  • Incomplete sentences or bullet points
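The brand-compliance and completeness categories lend themselves to simple pattern checks. A minimal sketch (the patterns are illustrative, not the actual /qa-check implementation):

```python
import re

# Assumed patterns for placeholder markers and guarantee language.
PLACEHOLDERS = re.compile(r"\[(TODO|TBD|PLACEHOLDER)\]", re.IGNORECASE)
GUARANTEES = re.compile(r"\b(guaranteed|100% will|will increase)\b", re.IGNORECASE)

def completeness_issues(text: str) -> list[str]:
    """Return one issue string per flagged line, with line numbers."""
    issues = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if PLACEHOLDERS.search(line):
            issues.append(f"Line {lineno}: placeholder marker")
        if GUARANTEES.search(line):
            issues.append(f"Line {lineno}: guarantee language")
    return issues
```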

QA Tiers (When Peer Review Required)

Auto-Ship (No Peer Review)

  • Data extraction queries
  • Keyword research (volume, difficulty, SERP features)
  • Competitive research (what competitors are doing)
  • Traffic/ranking reports

Peer Review Required

  • Strategic recommendations
  • Content differentiation strategies
  • Client-facing deliverables (audits, outlines, analyses)
  • ROI projections and Expected Outcome tables
  • Priority recommendations and roadmaps

Shadow Mode (Senior Review + Validation)

  • New methodologies not yet proven
  • Experimental approaches
  • High-stakes client deliverables (>$100K projected impact)
  • Sensitive competitive positioning

Decision rule:

  • If deliverable includes projections/recommendations → Peer Review
  • If deliverable goes directly to client → Peer Review
  • If stakes are high (revenue, relationship) → Shadow Mode
  • If just data extraction → Auto-Ship
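The decision rule reduces to a small function (tier names follow the doc; the boolean flags are hypothetical inputs):

```python
def qa_tier(has_projections: bool, client_facing: bool,
            high_stakes: bool, data_extraction_only: bool) -> str:
    """Map deliverable attributes to a QA tier, checking highest stakes first."""
    if high_stakes:
        return "Shadow Mode"
    if has_projections or client_facing:
        return "Peer Review"
    if data_extraction_only:
        return "Auto-Ship"
    return "Peer Review"  # default to the safer tier when unclassified
```

Ordering matters: high stakes wins over everything else, and anything unclassified falls back to Peer Review rather than Auto-Ship.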

Verification Paths by Data Source

BigQuery / Seer Signals

What to verify:

-- Re-run query to confirm data freshness
SELECT * FROM `project.dataset.table`
WHERE org_name = 'ClientName'
  AND date >= '2024-03-01'
  AND date <= '2024-03-31'

Check:

  • ✅ org_name filter matches client
  • ✅ Date range matches deliverable
  • ✅ No outliers or anomalies

Google Analytics 4

Where to verify:

  • GA4 UI → Reports → Acquisition → Traffic acquisition
  • GA4 UI → Reports → Engagement → Pages and screens
  • GA4 UI → Explore → Free form

Check:

  • ✅ Date range matches exactly
  • ✅ Segment filters correct (device, geography, user type)
  • ✅ Cross-check conversions with CRM (expect 5-15% variance)

Google Search Console

Where to verify:

  • GSC UI → Performance → Search results
  • GSC UI → Performance → Queries tab (keyword data)
  • GSC UI → Performance → Pages tab (landing page data)

Check:

  • ✅ Date range matches (GSC has 2-3 day lag)
  • ✅ Filter matches (device, country, search type)
  • ✅ Compare GSC clicks with GA4 organic (expect 10-20% variance)

Why variance exists:

  • GSC tracks Google-only; GA4 includes all search engines
  • GSC counts clicks; GA4 counts sessions
  • Bot traffic filtered differently

DataForSEO (SERP Analysis)

What to verify:

  • Manual Google search - Confirm rankings/SERP features
  • SEMrush/Ahrefs - Cross-check keyword volumes
  • Multiple browsers/locations - Account for personalization

Check:

  • ✅ Rankings can fluctuate daily - note snapshot date
  • ✅ SERP features are dynamic - verify current state
  • ✅ Competitor analysis is point-in-time

The Rule of Five (Self-Review Protocol)

Agent outputs are first drafts, not final deliverables.
Self-review at least 5 times before delivery:

Pass 1: Data Accuracy

  • Are all metrics cited with sources?
  • Are calculations correct?
  • Do date ranges make sense?
  • Is sample size adequate?

Pass 2: Logic & Reasoning

  • Do recommendations follow from data?
  • Are there alternative explanations?
  • Did I consider external factors (seasonality, algorithm updates)?
  • Are assumptions reasonable?

Pass 3: Client Context

  • Does this align with client's business model?
  • Is language client-appropriate (no jargon)?
  • Are recommendations actionable for their team?
  • Is tone suitable for relationship stage?

Pass 4: Deliverable Quality

  • Is formatting clean and consistent?
  • Do links work?
  • Are tables and charts clear?
  • Is document structure logical?

Pass 5: Final QA Gate

  • Run /qa-check for automated validation
  • Review Quality Standards
  • Review the QA Review Checklist (quality-standards skill resources)
  • Confirm peer review if required

Why 5 passes? Each pass catches a different class of issue: the first catches data problems, while the last catches subtle tone or framing issues.


Common QA Block Scenarios

| Block Message | What It Means | How to Fix |
| --- | --- | --- |
| "Metrics cited without data source" | Numbers without citation | Add (Source: GA4, March 2024) or (BigQuery OrganicRankings_Daily) |
| "Completion claimed but no verification" | Said "done" without proof | Run tests/build commands and confirm they pass |
| "Action titles required" | Slide titles are labels | Change to conclusions ("Traffic grew 34% YoY") |
| "Unsupported claim detected" | "Studies show..." without citation | Cite the specific study or rephrase as analysis |
| "Overpromising language detected" | "Will increase", "guaranteed" | Use qualified language: "could increase", "may improve" |

Integration with Core Infrastructure

Hook Execution Flow

  1. User completes work
  2. Stop hook fires automatically
  3. Hook reads edit-log.txt (from PostToolUse)
  4. Hook checks for:
       - TDD violations (source edited without tests)
       - Verification claims ("done", "works")
       - Commit reminders (3+ files edited)
  5. Hook provides gentle reminders:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
💭 Remember: Write tests for changed code
✅ Before claiming complete: Run tests and verify
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
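In Python terms, the flow above might look like this (illustrative only - the production hook is stop-qa-check.sh, and the log path, claim list, and threshold are assumptions):

```python
from pathlib import Path

def stop_hook_reminders(edit_log: Path, final_message: str) -> list[str]:
    """Build context-aware reminders from the session edit log and final message."""
    edited = ([l for l in edit_log.read_text().splitlines() if l.strip()]
              if edit_log.exists() else [])
    reminders = []
    # Verification claims: remind rather than hard-block.
    if any(claim in final_message.lower() for claim in ("done", "works", "finished")):
        reminders.append("✅ Before claiming complete: Run tests and verify")
    # Commit reminder once 3+ files have been edited this session.
    if len(edited) >= 3:
        reminders.append(f"💾 {len(edited)} files edited - consider a commit checkpoint")
    return reminders
```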

File Change Tracking (PostToolUse Hook)

When: After every Edit/Write/MultiEdit operation

What it tracks:

  • File path and edit type
  • Division categorization (SEO, PDM, Analytics, etc.)
  • Session cache (.claude/edit-cache/)
  • Cross-hook state (context/edit-log.txt)

Why: Enables the Stop hook to provide context-aware reminders (e.g., "Remember to test the 3 files you edited")
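A minimal sketch of the tracking step (the real hook is post-tool-use.sh; the timestamped, tab-separated log format here is an assumption):

```python
from datetime import datetime, timezone
from pathlib import Path

def record_edit(log_path: Path, file_path: str, edit_type: str) -> None:
    """Append one edit record so the Stop hook can reason about the session."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log_path.open("a") as log:
        log.write(f"{stamp}\t{edit_type}\t{file_path}\n")
```

Appending one line per operation keeps the cross-hook state dead simple: the Stop hook only needs to count lines and read file paths.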


Practitioner Shortcuts

"Does this number look right?"

Sanity check questions:

  • Would a 500% traffic increase actually be realistic?
  • Do these rankings align with competitive landscape?
  • Is this CTR projection reasonable for this keyword type?
  • Does conversion rate match client's historical average?

Quick verification:

  • Compare to prior period (does trend make sense?)
  • Check against industry benchmarks (within 2x of normal?)
  • Cross-reference with another data source (GA4 vs. GSC)
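The quick checks above can be expressed as small helpers (the 2x benchmark factor follows the bullet list; the function names are ours):

```python
def within_benchmark(value: float, benchmark: float, factor: float = 2.0) -> bool:
    """Is the value within `factor`x of the industry benchmark, in either direction?"""
    return benchmark / factor <= value <= benchmark * factor

def period_change_pct(current: float, prior: float) -> float:
    """Percent change vs the prior period, for a quick trend gut-check."""
    return (current - prior) / prior * 100
```

For instance, the 12,400 → 14,616 session example from earlier works out to roughly an 18% lift, which is easy to eyeball against the cited figure.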

"Where do I verify this metric?"

| Metric Type | Primary Source | Backup Source |
| --- | --- | --- |
| Organic traffic, rankings | Google Search Console | GA4 organic sessions |
| Sessions, conversions | Google Analytics 4 | Client CRM/backend |
| SERP features, competitors | DataForSEO | Manual Google search |
| Paid campaign performance | Google Ads / Meta Ads | Platform UI |
| Keyword volumes | Seer Signals / DataForSEO | SEMrush / Ahrefs |

"What if sources conflict?"

Variance tolerance guidelines:

| Comparison | Expected Variance | Action if Exceeded |
| --- | --- | --- |
| GA4 vs. CRM conversions | 5-15% | Investigate attribution, tracking lag |
| GSC clicks vs. GA4 organic | 10-20% | Note in deliverable (different definitions) |
| DataForSEO rankings vs. manual | ±2 positions | Use manual as source of truth |
| Backend conversions vs. pixel | >15% | Flag for CAPI/Redundant Event Pipeline |
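The tolerance bands translate directly into a check (a sketch; the keys and upper bounds mirror the guidelines above, expressed as fractions):

```python
# Upper tolerance per comparison, as a fraction of the first source's value.
TOLERANCES = {
    "ga4_vs_crm": 0.15,          # GA4 vs. CRM conversions: expect 5-15%
    "gsc_vs_ga4_organic": 0.20,  # GSC clicks vs. GA4 organic: expect 10-20%
}

def variance(a: float, b: float) -> float:
    """Absolute variance between two sources, relative to the first."""
    return abs(a - b) / a

def exceeds_tolerance(comparison: str, a: float, b: float) -> bool:
    """True when observed variance is beyond the expected band for this pair."""
    return variance(a, b) > TOLERANCES[comparison]
```

When the check returns True, follow the "Action if Exceeded" column rather than silently picking one number.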

Core QA Infrastructure

For Builders

The full QA skill resources (qa-review.md, fact-checking.md, quality.md) are in the plugin source at plugins/core-dependencies/skills/quality-standards/resources/.

Implementation Notes

Source: Slack Discussion Context

This document synthesizes QA infrastructure knowledge from:

  • Thread: "Trust + Verify, QA Agent" (AI Slack Channel, ~Jan 2026)
  • Implemented components in plugins/core-dependencies/
  • Practitioner feedback and quality gate patterns

Production Status

✅ Fully Operational:

  • Stop hook with gentle reminders
  • /qa-check command with structured validation
  • Quality Standards skill auto-activation
  • PostToolUse file tracking
  • Trust But Verify practitioner playbook

🚧 Enhancement Opportunities (from Slack discussion):

  • Override mechanism for /qa-check --override "reason"
  • Integration with /slide-deck (auto-run QA before output)
  • Division-specific quality checks (SEO, Analytics, PDM)
  • Automated peer review routing based on QA tier

Key Takeaways

  1. QA is automatic - Stop hook provides gentle reminders without being intrusive
  2. QA is on-demand - /qa-check provides detailed validation anytime
  3. QA is context-aware - Skills activate based on content type and division
  4. QA is practitioner-friendly - Trust But Verify playbook provides verification paths
  5. QA is non-blocking - Gentle reminders, not hard stops (unless critical)

Philosophy: Trust AI outputs, but verify before client delivery. The QA Agent Infrastructure makes verification systematic, not burdensome.