QA Agent Infrastructure: Trust But Verify¶
Last Updated: 2026-01-29
Related Slack Discussion: "Trust + Verify, QA Agent" (AI Channel)
Status: ✅ Production - Fully Implemented
Overview¶
Seer's QA Agent Infrastructure provides automated quality validation for AI-generated deliverables. This system ensures data accuracy, proper citation, and brand compliance before client delivery.
Core Components¶
| Component | Purpose | Location |
|---|---|---|
| Trust But Verify Playbook | Practitioner guide for data validation | docs/wiki/how-to/trust-but-verify.md |
| /qa-check Command | On-demand quality validation | plugins/core-dependencies/commands/qa-check.md |
| Stop Hook (QA Check) | Auto-validation before completion | plugins/core-dependencies/scripts/stop-qa-check.sh |
| Quality Standards Skill | Auto-activated QA standards | plugins/core-dependencies/skills/quality-standards/ |
| PostToolUse Hook | File change tracking for verification | plugins/core-dependencies/scripts/post-tool-use.sh |
How It Works¶
1. Automatic Quality Gates (Stop Hook)¶
When Claude attempts to complete work, the Stop hook automatically:
- Detects completion claims ("done", "finished", "works")
- Triggers QA validation checks
- Blocks completion if critical issues found
- Provides actionable feedback for fixes
What it checks:
- ✅ Test coverage for code changes
- ✅ Verification of completion claims
- ✅ Commit checkpoints for multi-file edits
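For intuition, here is a minimal sketch of the completion-claim check, assuming the hook receives the assistant's final message as plain text. The patterns and the Python framing are illustrative only; the shipped hook is the stop-qa-check.sh shell script, whose actual detection rules may differ:

```python
import re

# Illustrative claim patterns; the shipped hook's rules may be broader.
COMPLETION_CLAIMS = re.compile(r"\b(done|finished|complete[d]?|works)\b", re.IGNORECASE)

def should_trigger_qa(final_message: str, verification_seen: bool) -> bool:
    """Trigger QA validation when work is declared complete without evidence."""
    return bool(COMPLETION_CLAIMS.search(final_message)) and not verification_seen

print(should_trigger_qa("Refactor is done and everything works.", verification_seen=False))
# True -> block completion and ask for verification
```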
2. On-Demand Validation (/qa-check)¶
Runs comprehensive checks on the current deliverable.
Output:

```
QA CHECK RESULTS
================
Deliverable Type: slide-deck

Action Titles: PASS
✓ All titles are conclusions

Data Sources: FAIL
✗ Line 42: "Traffic increased 34%" - no source cited
✗ Line 67: "Competitor rankings improved" - no data reference

Brand Compliance: WARN
⚠ Line 89: "We recommend" - prefer "The data shows"

Completeness: PASS
✓ No placeholders found

Overall: NEEDS REVISION
```
3. Quality Standards Skill (Auto-Activated)¶
Activates automatically when working with:
- Data analysis or metrics
- Projections/forecasts
- Recommendations based on data
- Client-facing deliverables
Core Principles:
- Explicit is better than implicit - State assumptions, cite sources
- Evidence-based recommendations - Every claim backed by data
- Conservative estimates - Under-promise, over-deliver
- Sanity checks - Validate math, logic, conclusions
Quality Check Categories¶
1. Action Titles (Presentations)¶
Rule: Slide titles must be conclusions, not labels.
| ❌ Bad (Label) | ✅ Good (Conclusion) |
|---|---|
| Traffic Overview | Organic traffic increased 34% YoY |
| Performance Summary | Mobile conversion rate improved despite traffic decline |
| Q3 Results | Competitive gap closed by 15 positions |
2. Data Source Verification¶
Rule: All metrics must cite sources with date ranges.
Required format:
"Based on BigQuery OrganicRankings_Daily table (March 1-31, 2024),
traffic increased 18% (12,400 → 14,616 sessions)."
Acceptable sources:
- BigQuery / Seer Signals
- Google Analytics 4
- Google Search Console
- DataForSEO
- Platform APIs (Google Ads, Meta Ads, etc.)
3. Projection Methodology¶
Rule: Projections must show calculation + assumptions.
Example:
Potential traffic lift: 180-220 clicks/month
Methodology:
- Current position: 7
- Target position: 3
- Monthly search volume: 2,400
- CTR improvement: 4.5% → 9.2% (AWR 2024 CTR Study)
- Calculation: 2,400 × (0.092 - 0.045) = 113 clicks (base)
- Scenarios: Best (+50%), Better (+30%), Good (+10%)
Assumptions:
1. Competitive landscape remains stable
2. Content implemented within 30 days
3. Technical SEO issues addressed
4. No major algorithm updates
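The same calculation as a worked sketch; the inputs match the example above, and the `projected_click_lift` helper name and rounding choices are illustrative:

```python
def projected_click_lift(volume: int, ctr_current: float, ctr_target: float) -> dict:
    """Base monthly click lift from a CTR improvement, plus hedged scenarios."""
    base = volume * (ctr_target - ctr_current)
    return {
        "base": round(base),
        "good": round(base * 1.10),    # +10%
        "better": round(base * 1.30),  # +30%
        "best": round(base * 1.50),    # +50%
    }

# Example from above: position 7 -> 3, 2,400 searches/month, CTR 4.5% -> 9.2%
print(projected_click_lift(2400, 0.045, 0.092))
# {'base': 113, 'good': 124, 'better': 147, 'best': 169}
```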
4. Brand Compliance (Seer Voice)¶
Issues flagged:
- Passive voice where active is better
- Jargon without explanation
- "We recommend" (prefer "The data shows" or "Testing confirmed")
- Overly formal language
- Guarantee language ("will", "guaranteed", "100% will")
5. Completeness¶
Issues flagged:
- [TODO], [TBD], [PLACEHOLDER] markers
- Empty sections in templates
- Incomplete sentences or bullet points
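Both the brand compliance and completeness categories lend themselves to pattern-based checks. A hedged sketch follows; these regexes are assumptions for illustration, not the actual rules in qa-check.md:

```python
import re

# Illustrative patterns; the real /qa-check rules are likely broader.
CHECKS = {
    "placeholder": re.compile(r"\[(TODO|TBD|PLACEHOLDER)\]", re.IGNORECASE),
    "guarantee": re.compile(r"\b(guaranteed|100% will|will increase)\b", re.IGNORECASE),
    "soft_recommend": re.compile(r"\bwe recommend\b", re.IGNORECASE),
}

def flag_lines(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, check_name, line) for every flagged line."""
    flags = []
    for i, line in enumerate(text.splitlines(), start=1):
        for name, pattern in CHECKS.items():
            if pattern.search(line):
                flags.append((i, name, line.strip()))
    return flags

draft = "Traffic will increase by 34%.\n[TODO] add chart\nWe recommend a new hub page."
for line_no, check, line in flag_lines(draft):
    print(f"Line {line_no}: {check}: {line}")
```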
QA Tiers (When Peer Review Required)¶
Auto-Ship (No Peer Review)¶
- Data extraction queries
- Keyword research (volume, difficulty, SERP features)
- Competitive research (what competitors are doing)
- Traffic/ranking reports
Peer Review Required¶
- Strategic recommendations
- Content differentiation strategies
- Client-facing deliverables (audits, outlines, analyses)
- ROI projections and Expected Outcome tables
- Priority recommendations and roadmaps
Shadow Mode (Senior Review + Validation)¶
- New methodologies not yet proven
- Experimental approaches
- High-stakes client deliverables (>$100K projected impact)
- Sensitive competitive positioning
Decision rule:
- If deliverable includes projections/recommendations → Peer Review
- If deliverable goes directly to client → Peer Review
- If stakes are high (revenue, relationship) → Shadow Mode
- If just data extraction → Auto-Ship
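The decision rule reads as a precedence order: Shadow Mode wins over Peer Review, which wins over Auto-Ship. A minimal sketch, with field names chosen for illustration rather than taken from any actual schema:

```python
from dataclasses import dataclass

@dataclass
class Deliverable:
    # Field names are illustrative, not an actual schema.
    has_projections: bool
    client_facing: bool
    high_stakes: bool  # e.g. >$100K projected impact or sensitive positioning

def qa_tier(d: Deliverable) -> str:
    """Apply the decision rule in precedence order: highest scrutiny first."""
    if d.high_stakes:
        return "shadow-mode"      # senior review + validation
    if d.has_projections or d.client_facing:
        return "peer-review"
    return "auto-ship"            # pure data extraction

print(qa_tier(Deliverable(has_projections=True, client_facing=False, high_stakes=False)))
# peer-review
```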
Verification Paths by Data Source¶
BigQuery / Seer Signals¶
What to verify:
```sql
-- Re-run query to confirm data freshness
SELECT * FROM `project.dataset.table`
WHERE org_name = 'ClientName'
  AND date >= '2024-03-01'
  AND date <= '2024-03-31'
```
Check:
- ✅ `org_name` filter matches client
- ✅ Date range matches deliverable
- ✅ No outliers or anomalies
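One way to make the outlier check concrete: pull daily totals and flag days far from the period median. A sketch assuming the google-cloud-bigquery client library is installed and authenticated; the table and column names follow the example query above, and the 3x threshold is an arbitrary illustration:

```python
from statistics import median
from google.cloud import bigquery  # assumes an authenticated environment

SQL = """
SELECT date, COUNT(*) AS row_count
FROM `project.dataset.table`
WHERE org_name = 'ClientName'
  AND date BETWEEN '2024-03-01' AND '2024-03-31'
GROUP BY date
ORDER BY date
"""

client = bigquery.Client()
daily = [(row.date, row.row_count) for row in client.query(SQL).result()]

# Flag days above 3x (or below 1/3 of) the period median as possible anomalies.
mid = median(count for _, count in daily)
for day, count in daily:
    if count > 3 * mid or count < mid / 3:
        print(f"Possible anomaly on {day}: {count} rows (median {mid})")
```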
Google Analytics 4¶
Where to verify:
- GA4 UI → Reports → Acquisition → Traffic acquisition
- GA4 UI → Reports → Engagement → Pages and screens
- GA4 UI → Explore → Free form
Check:
- ✅ Date range matches exactly
- ✅ Segment filters correct (device, geography, user type)
- ✅ Cross-check conversions with CRM (expect 5-15% variance)
Google Search Console¶
Where to verify:
- GSC UI → Performance → Search results
- GSC UI → Performance → Queries tab (keyword data)
- GSC UI → Performance → Pages tab (landing page data)
Check:
- ✅ Date range matches (GSC data lags 2-3 days)
- ✅ Filter matches (device, country, search type)
- ✅ Compare GSC clicks with GA4 organic (expect 10-20% variance)
Why variance exists:
- GSC tracks Google-only; GA4 includes all search engines
- GSC counts clicks; GA4 counts sessions
- Bot traffic filtered differently
DataForSEO (SERP Analysis)¶
What to verify:
- Manual Google search - Confirm rankings/SERP features
- SEMrush/Ahrefs - Cross-check keyword volumes
- Multiple browsers/locations - Account for personalization
Check:
- ✅ Rankings can fluctuate daily - note snapshot date
- ✅ SERP features are dynamic - verify current state
- ✅ Competitor analysis is point-in-time
The Rule of Five (Self-Review Protocol)¶
Agent outputs are first drafts, not final deliverables.
Self-review at least 5 times before delivery:
Pass 1: Data Accuracy¶
- Are all metrics cited with sources?
- Are calculations correct?
- Do date ranges make sense?
- Is sample size adequate?
Pass 2: Logic & Reasoning¶
- Do recommendations follow from data?
- Are there alternative explanations?
- Did I consider external factors (seasonality, algorithm updates)?
- Are assumptions reasonable?
Pass 3: Client Context¶
- Does this align with client's business model?
- Is language client-appropriate (no jargon)?
- Are recommendations actionable for their team?
- Is tone suitable for relationship stage?
Pass 4: Deliverable Quality¶
- Is formatting clean and consistent?
- Do links work?
- Are tables and charts clear?
- Is document structure logical?
Pass 5: Final QA Gate¶
- Run `/qa-check` for automated validation
- Review Quality Standards
- Review the QA Review Checklist (quality-standards skill resources)
- Confirm peer review if required
Why 5 passes? Each pass catches a different class of issues: the first surfaces data problems, while the last catches subtle tone and framing issues.
Common QA Block Scenarios¶
| Block Message | What It Means | How to Fix |
|---|---|---|
| "Metrics cited without data source" | Numbers without citation | Add (Source: GA4, March 2024) or (BigQuery OrganicRankings_Daily) |
| "Completion claimed but no verification" | Said "done" without proof | Run tests/build commands and confirm pass |
| "Action titles required" | Slide titles are labels | Change to conclusions ("Traffic grew 34% YoY") |
| "Unsupported claim detected" | "Studies show..." without citation | Cite specific study or rephrase as analysis |
| "Overpromising language detected" | "Will increase", "guaranteed" | Use qualified language: "could increase", "may improve" |
Integration with Core Infrastructure¶
Hook Execution Flow¶
```
User completes work
  ↓
Stop Hook fires automatically
  ↓
Reads edit-log.txt (from PostToolUse)
  ↓
Checks for:
  - TDD violations (source edited without tests)
  - Verification claims ("done", "works")
  - Commit reminders (3+ files edited)
  ↓
Provides gentle reminders:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
💭 Remember: Write tests for changed code
✅ Before claiming complete: Run tests and verify
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
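A hedged Python sketch of this flow; the shipped hook is a shell script, so the log format and thresholds below are assumptions, kept consistent with the PostToolUse sketch in the next section:

```python
from pathlib import Path

EDIT_LOG = Path("context/edit-log.txt")  # written by the PostToolUse hook

def stop_hook_reminders() -> list[str]:
    """Assemble gentle reminders from the cross-hook edit log."""
    if not EDIT_LOG.exists():
        return []
    # Assumed log format: tool, division, file path per line, tab-separated.
    paths = [ln.split("\t")[-1] for ln in EDIT_LOG.read_text().splitlines() if ln.strip()]
    reminders = []
    if paths and not any("test" in p for p in paths):
        reminders.append("💭 Remember: Write tests for changed code")
    if len(paths) >= 3:
        reminders.append(f"📝 {len(paths)} files edited - consider a commit checkpoint")
    reminders.append("✅ Before claiming complete: Run tests and verify")
    return reminders

for reminder in stop_hook_reminders():
    print(reminder)
```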
File Change Tracking (PostToolUse Hook)¶
When: After every Edit/Write/MultiEdit operation
What it tracks:
- File path and edit type
- Division categorization (SEO, PDM, Analytics, etc.)
- Session cache (`.claude/edit-cache/`)
- Cross-hook state (`context/edit-log.txt`)
Why: Enables Stop hook to provide context-aware reminders (e.g., "Remember to test the 3 files you edited")
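And a minimal sketch of the tracking side, assuming the hook receives the tool name and file path. The division keyword map and tab-separated log format are illustrative, matching the Stop hook sketch above:

```python
from pathlib import Path

# Illustrative division keyword map; the real hook's categorization may differ.
DIVISIONS = {"seo": "SEO", "paid": "PDM", "analytics": "Analytics"}

def log_edit(tool: str, file_path: str, log: Path = Path("context/edit-log.txt")) -> None:
    """Append one line per Edit/Write/MultiEdit so the Stop hook has context."""
    division = next((d for k, d in DIVISIONS.items() if k in file_path.lower()), "General")
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a") as f:
        f.write(f"{tool}\t{division}\t{file_path}\n")

log_edit("Edit", "clients/acme/seo/audit.md")
```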
Practitioner Shortcuts¶
"Does this number look right?"¶
Sanity check questions:
- Would a 500% traffic increase actually be realistic?
- Do these rankings align with competitive landscape?
- Is this CTR projection reasonable for this keyword type?
- Does conversion rate match client's historical average?
Quick verification:
- Compare to prior period (does trend make sense?)
- Check against industry benchmarks (within 2x of normal?)
- Cross-reference with another data source (GA4 vs. GSC)
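The prior-period and "within 2x of normal" checks can be mechanized. A sketch with illustrative thresholds and names:

```python
def sanity_check(current: float, prior: float, benchmark: float) -> list[str]:
    """Flag metrics that jump implausibly vs. the prior period or a benchmark."""
    warnings = []
    if prior > 0 and current / prior > 5:  # e.g. a 500% increase deserves scrutiny
        warnings.append(f"Jump vs. prior period: {current / prior:.1f}x - verify the data")
    if benchmark > 0 and not (0.5 <= current / benchmark <= 2):  # within 2x of normal?
        warnings.append(f"Outside 2x of benchmark ({benchmark}) - cross-check sources")
    return warnings

# Example: sessions went 12,400 -> 74,000 against a ~15,000 benchmark
for w in sanity_check(74000, 12400, 15000):
    print(w)
```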
"Where do I verify this metric?"¶
| Metric Type | Primary Source | Backup Source |
|---|---|---|
| Organic traffic, rankings | Google Search Console | GA4 organic sessions |
| Sessions, conversions | Google Analytics 4 | Client CRM/backend |
| SERP features, competitors | DataForSEO | Manual Google search |
| Paid campaign performance | Google Ads / Meta Ads | Platform UI |
| Keyword volumes | Seer Signals / DataForSEO | SEMrush / Ahrefs |
"What if sources conflict?"¶
Variance tolerance guidelines:
| Comparison | Expected Variance | Action if Exceeded |
|---|---|---|
| GA4 vs. CRM conversions | 5-15% | Investigate attribution, tracking lag |
| GSC clicks vs. GA4 organic | 10-20% | Note in deliverable (different definitions) |
| DataForSEO rankings vs. manual | ±2 positions | Use manual as source of truth |
| Backend conversions vs. pixel | >15% | Flag for CAPI/Redundant Event Pipeline |
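These tolerances can also be enforced mechanically. A sketch using the bands from the table above; the function names and the symmetric variance formula are illustration choices:

```python
def pct_variance(a: float, b: float) -> float:
    """Symmetric percent variance between two sources."""
    return abs(a - b) / max(a, b) * 100

# Upper bounds (in percent) from the variance tolerance table above.
TOLERANCES = {
    ("GA4", "CRM"): 15,
    ("GSC", "GA4"): 20,
}

def check_sources(pair: tuple[str, str], value_a: float, value_b: float) -> str:
    v = pct_variance(value_a, value_b)
    limit = TOLERANCES[pair]
    status = "OK" if v <= limit else "INVESTIGATE"
    return f"{pair[0]} vs. {pair[1]}: {v:.1f}% variance (limit {limit}%) -> {status}"

print(check_sources(("GSC", "GA4"), 10400, 8650))
# GSC vs. GA4: 16.8% variance (limit 20%) -> OK
```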
Related Resources¶
Core QA Infrastructure¶
- Trust But Verify Playbook - Comprehensive practitioner guide
- Quality Standards Skill - Auto-activated standards
- `/qa-check` Command - On-demand validation
For Builders
The full QA skill resources (qa-review.md, fact-checking.md, quality.md) are in the plugin source at plugins/core-dependencies/skills/quality-standards/resources/.
Division-Specific Guidance¶
- SEO Workflows - SEO-specific verification patterns
- Analytics Workflows - Analytics verification patterns
- PDM Workflows - Paid media verification patterns
Development Context¶
- Best Practices - Working with agents effectively
- Troubleshooting - Common issues and solutions
Implementation Notes¶
Source: Slack Discussion Context¶
This document synthesizes QA infrastructure knowledge from:
- Thread: "Trust + Verify, QA Agent" (AI Slack Channel, ~Jan 2026)
- Implemented components in `plugins/core-dependencies/`
- Practitioner feedback and quality gate patterns
Production Status¶
✅ Fully Operational:
- Stop hook with gentle reminders
- `/qa-check` command with structured validation
- Quality Standards skill auto-activation
- PostToolUse file tracking
- Trust But Verify practitioner playbook
🚧 Enhancement Opportunities (from Slack discussion):
- Override mechanism for `/qa-check --override "reason"`
- Integration with `/slide-deck` (auto-run QA before output)
- Division-specific quality checks (SEO, Analytics, PDM)
- Automated peer review routing based on QA tier
Key Takeaways¶
- QA is automatic - Stop hook provides gentle reminders without being intrusive
- QA is on-demand - `/qa-check` provides detailed validation anytime
- QA is context-aware - Skills activate based on content type and division
- QA is practitioner-friendly - Trust But Verify playbook provides verification paths
- QA is non-blocking - Gentle reminders, not hard stops (unless critical)
Philosophy: Trust AI outputs, but verify before client delivery. The QA Agent Infrastructure makes verification systematic, not burdensome.