
Quality Gates Integration Guide

This guide explains how quality gates are integrated into the AutoBuild workflow, providing automated quality enforcement at key phases of task implementation.

Table of Contents

  1. Overview
  2. Pre-Loop Quality Gates
  3. Loop Quality Gates
  4. Failure Scenarios
  5. Best Practices
  6. Troubleshooting

Overview

GuardKit's quality gates provide automated quality enforcement at critical points in the task workflow, preventing broken code from reaching production.

What Are Quality Gates?

Quality gates are automated checkpoints that verify:

  - Architectural quality (SOLID, DRY, YAGNI principles)
  - Test coverage (line, branch, function coverage)
  - Implementation fidelity (matches the approved plan)
  - Code quality (maintainability, complexity)

When Do Quality Gates Execute?

Quality gates execute at three stages:

  1. Pre-Loop (Phases 1.6-2.8): Before implementation begins
  2. Loop (Player-Coach turns): During implementation
  3. Post-Loop (Finalization): After implementation completes

Pre-Loop Quality Gates

Pre-loop quality gates execute before the Player-Coach adversarial loop, ensuring that only well-designed tasks proceed to implementation.

Phase 1.6: Clarifying Questions

Purpose: Reduce rework by confirming user intent before planning.

Complexity Gating:

  - Complexity 1-2: Skipped
  - Complexity 3-4: Quick questions (15s timeout)
  - Complexity 5+: Full questions (blocking)

Example:

/task-work TASK-a3f8

Phase 1.6: Clarifying Questions
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Q1. Implementation Scope
    [M]inimal, [S]tandard, [C]omplete?
    Your choice: S

Q2. Testing Strategy
    [U]nit, [I]ntegration, [F]ull coverage?
    Your choice: I
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Flags:

  - --no-questions: Skip clarification
  - --with-questions: Force clarification
  - --answers="1:S 2:I": Inline answers (parsed as sketched below)
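
GuardKit's exact parsing of --answers is internal, but a minimal sketch of how such a string might be interpreted (the helper name and normalization are assumptions, not the real CLI code):

def parse_inline_answers(raw: str) -> dict[int, str]:
    """Parse an --answers string like "1:S 2:I" into {1: "S", 2: "I"}.

    Hypothetical helper; the real CLI may parse this differently.
    """
    answers = {}
    for pair in raw.split():
        number, _, choice = pair.partition(":")
        answers[int(number)] = choice.upper()
    return answers

# parse_inline_answers("1:S 2:I") -> {1: "S", 2: "I"}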

Phase 2: Implementation Planning

Purpose: Generate detailed implementation plan before coding.

Outputs:

  - Files to create/modify
  - Estimated lines of code
  - Implementation phases
  - Dependencies

Saved To: .claude/task-plans/{TASK-ID}-implementation-plan.md

Phase 2.5B: Architectural Review

Purpose: Verify SOLID, DRY, and YAGNI compliance before implementation.

Scoring:

  - 85-100: Excellent architecture, proceed
  - 60-84: Acceptable with minor concerns
  - <60: Blocked, redesign required

Quality Metrics:

SOLID:
  single_responsibility: 9/10
  open_closed: 8/10
  liskov_substitution: 9/10
  interface_segregation: 8/10
  dependency_inversion: 9/10

DRY:
  code_duplication: 9/10
  abstraction_level: 8/10

YAGNI:
  feature_necessity: 9/10
  complexity_justification: 8/10

Overall Score: 88/100
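
How the per-metric scores combine into the overall 0-100 score is not spelled out here. Below is a sketch of one plausible aggregation, a weighted mean of category averages; the weights are assumptions and will not necessarily reproduce the 88/100 shown above:

# Hypothetical weights, not GuardKit's actual formula.
CATEGORY_WEIGHTS = {"SOLID": 0.5, "DRY": 0.25, "YAGNI": 0.25}

def overall_score(metrics: dict[str, dict[str, int]]) -> float:
    """Average each category's 0-10 metrics, then take a weighted mean scaled to 0-100."""
    total = 0.0
    for category, weight in CATEGORY_WEIGHTS.items():
        scores = metrics[category].values()
        total += weight * (sum(scores) / len(scores))
    return total * 10  # scale 0-10 to 0-100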

Blocking Behavior:

if architectural_score < 60:
    raise QualityGateBlocked(
        gate_name="architectural_review",
        score=architectural_score,
        threshold=60,
        message="Architecture needs redesign before implementation"
    )
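
QualityGateBlocked is raised by several gates in this guide; here is a sketch of the shape such an exception might have, with fields inferred from the usages above (not the library's actual definition):

class QualityGateBlocked(Exception):
    """Raised when a quality gate falls below its blocking threshold."""

    def __init__(self, gate_name: str, message: str,
                 score: int | None = None, threshold: int | None = None):
        # score/threshold are optional: the test-enforcement gate omits them
        self.gate_name = gate_name
        self.score = score
        self.threshold = threshold
        super().__init__(f"[{gate_name}] {message}")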

Phase 2.7: Complexity Evaluation

Purpose: Assess task complexity and determine max_turns for the adversarial loop.

Complexity Scale (0-10):

  - 1-3 (Simple): Basic CRUD, config changes, straightforward features
  - 4-6 (Medium): Multi-file features, API integration, moderate logic
  - 7-10 (Complex): State machines, parallel execution, architectural changes

Complexity-to-Max-Turns Mapping:

COMPLEXITY_TURNS_MAP = {
    (1, 3): 3,   # Simple: Quick completion
    (4, 6): 5,   # Medium: Standard iterations
    (7, 10): 7,  # Complex: Extended iterations
}

Factors Evaluated (summed in the sketch below):

  - File count (0-3 points)
  - Pattern familiarity (0-2 points)
  - Risk assessment (0-3 points)
  - Dependencies (0-2 points)
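
Putting the pieces together, a sketch of how the factor points could sum to a 0-10 score and resolve to max_turns through COMPLEXITY_TURNS_MAP above (the scoring function is an assumption; the map is from this guide):

def evaluate_complexity(file_count_pts: int, pattern_pts: int,
                        risk_pts: int, dependency_pts: int) -> int:
    """Sum the four factor scores (0-3, 0-2, 0-3, 0-2) into a 0-10 complexity score."""
    return file_count_pts + pattern_pts + risk_pts + dependency_pts

def max_turns_for(complexity: int) -> int:
    """Resolve a complexity score to max_turns via the range-keyed map."""
    for (low, high), turns in COMPLEXITY_TURNS_MAP.items():
        if low <= complexity <= high:
            return turns
    return 3 if complexity < 1 else 7  # clamp scores outside the mapped ranges

# evaluate_complexity(2, 1, 2, 1) -> 6, and max_turns_for(6) -> 5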

Phase 2.8: Human Checkpoint

Purpose: Pause for human review on complex tasks before implementation.

Trigger Conditions:

  - Complexity >= 7
  - High-risk changes (security, breaking, schema)
  - Explicit --design-first flag

Checkpoint Options:

  - [A]pprove: Proceed to implementation
  - [M]odify: Edit plan and re-run Phase 2.8
  - [S]implify: Reduce scope and recalculate complexity
  - [R]eject: Abort task (return to backlog)
  - [P]ostpone: Save state for later

Example:

Phase 2.8: Human Checkpoint (Complexity 8/10)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Implementation Plan Ready

Estimated Effort: 8-12 hours
Files to Create: 10
Estimated LOC: 500
Architectural Score: 92/100

[A]pprove   [M]odify   [S]implify   [R]eject   [P]ostpone

Your choice: A
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
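
A sketch of how the single-letter checkpoint prompt might be normalized and validated (names here are illustrative, not GuardKit's implementation):

VALID_CHOICES = {"A": "approve", "M": "modify", "S": "simplify",
                 "R": "reject", "P": "postpone"}

def prompt_checkpoint() -> str:
    """Prompt until a valid single-letter checkpoint choice is entered."""
    while True:
        choice = input("Your choice: ").strip().upper()
        if choice in VALID_CHOICES:
            return VALID_CHOICES[choice]
        print(f"Invalid choice {choice!r}; expected one of {sorted(VALID_CHOICES)}")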


Loop Quality Gates

Loop quality gates execute during the Player-Coach adversarial loop, validating each implementation iteration.

Player Agent Responsibilities

The Player agent implements code according to the approved plan and reports results:

Player Report Structure:

{
  "task_id": "TASK-a3f8",
  "turn": 1,
  "files_modified": ["src/auth.py"],
  "files_created": ["src/oauth.py", "tests/test_oauth.py"],
  "tests_written": ["tests/test_oauth.py"],
  "tests_run": true,
  "tests_passed": true,
  "test_output_summary": "12 tests passed, 0 failed",
  "implementation_notes": "Implemented OAuth2 with token refresh",
  "concerns": [],
  "requirements_addressed": ["OAuth2 support", "Token refresh"],
  "requirements_remaining": []
}
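
A sketch of how a consumer might load and sanity-check such a report (the dataclass mirrors the JSON above; the loader itself is an assumption):

import json
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class PlayerReport:
    """Typed view of a Player turn report; field set mirrors the JSON above."""
    task_id: str
    turn: int
    tests_run: bool
    tests_passed: bool
    files_modified: list[str] = field(default_factory=list)
    files_created: list[str] = field(default_factory=list)
    requirements_remaining: list[str] = field(default_factory=list)

    @classmethod
    def load(cls, path: Path) -> "PlayerReport":
        raw = json.loads(path.read_text())
        known = {k: v for k, v in raw.items() if k in cls.__dataclass_fields__}
        return cls(**known)

report = PlayerReport.load(Path(".guardkit/autobuild/TASK-a3f8/player_turn_1.json"))
if not report.tests_passed or report.requirements_remaining:
    print(f"Turn {report.turn} needs Coach feedback")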

Coach Agent Responsibilities

The Coach agent validates Player implementation via two mechanisms:

  1. Task-Work Results Validation (Lightweight, Preferred):
     - Reads Phase 4.5 test results
     - Reads Phase 5 code review scores
     - Reads Phase 5.5 plan audit results
     - Runs independent test verification

  2. Full SDK Invocation (Fallback):
     - Used if task-work results unavailable
     - Runs complete validation independently

Coach Decision Logic:

def validate_implementation(player_report, task):
    # Read task-work quality gate results
    test_results = read_phase_4_5_results()
    code_review = read_phase_5_results()
    plan_audit = read_phase_5_5_results()

    # Run independent tests (trust but verify)
    independent_tests = run_tests()

    # Validate requirements
    requirements = check_requirements(task.acceptance_criteria)

    # Quality gate checks
    if not test_results.all_passing:
        return "feedback", f"Tests failing: {test_results.failed_tests}"

    if not independent_tests.all_passing:
        return "feedback", "Independent test verification failed"

    if test_results.coverage.lines < 80:
        return "feedback", f"Line coverage {test_results.coverage.lines}% below 80% threshold"

    if plan_audit.scope_creep_detected:
        return "feedback", f"Scope creep: {plan_audit.loc_variance_percent}% LOC variance"

    if not requirements.all_met:
        return "feedback", f"Requirements not satisfied: {requirements.missing}"

    return "approve", "Implementation meets all quality gates"

Coach Decision Outcomes:

  - approve: Implementation ready for human review
  - feedback: Issues detected, Player must address

Coach Decision Structure:

{
  "task_id": "TASK-a3f8",
  "turn": 1,
  "decision": "approve",
  "quality_gates": {
    "tests": {"passed": true, "score": 100},
    "coverage": {"passed": true, "score": 87},
    "architectural_review": {"passed": true, "score": 92},
    "plan_audit": {"passed": true, "score": 95}
  },
  "independent_test_result": {
    "tests_run": true,
    "tests_passed": true,
    "command": "pytest tests/",
    "output_summary": "All tests passed"
  },
  "requirements_validation": {
    "all_met": true,
    "met": ["OAuth2 support", "Token refresh"],
    "missing": []
  },
  "rationale": "Implementation meets all quality gates"
}

Phase 4.5: Test Enforcement Loop

Purpose: Ensure a 100% test pass rate before moving the task to IN_REVIEW.

Auto-Fix Behavior:

for attempt in range(1, 4):  # Max 3 attempts
    test_results = run_tests()

    if test_results.all_passing:
        break

    if attempt < 3:
        # Auto-fix: Analyze failures and regenerate
        failures = analyze_failures(test_results)
        apply_fixes(failures)
    else:
        # Block task after 3 failed attempts
        raise QualityGateBlocked(
            gate_name="test_enforcement",
            message=f"Tests still failing after {attempt} attempts"
        )

Quality Thresholds (checked together in the sketch below):

  - Compilation: 100% (must compile)
  - Tests passing: 100% (zero failures tolerated)
  - Line coverage: ≥80%
  - Branch coverage: ≥75%
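
As a sketch, these thresholds could be checked together like this (the threshold values are from this guide; the function and the attribute names on results are illustrative):

def check_quality_thresholds(results) -> list[str]:
    """Return a list of threshold violations; an empty list means all gates pass."""
    violations = []
    if not results.compiled:
        violations.append("Compilation failed (100% required)")
    if results.tests_failed > 0:
        violations.append(f"{results.tests_failed} failing tests (zero tolerated)")
    if results.coverage.lines < 80:
        violations.append(f"Line coverage {results.coverage.lines}% below 80%")
    if results.coverage.branches < 75:
        violations.append(f"Branch coverage {results.coverage.branches}% below 75%")
    return violations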

Phase 5.5: Plan Audit

Purpose: Detect scope creep by comparing implementation to approved plan.

Audit Checks:

def audit_implementation(plan, implementation):
    checks = {
        "files_match": compare_files(plan.files, implementation.files),
        "loc_variance": calculate_variance(plan.loc, implementation.loc),
        "scope_creep": detect_scope_creep(plan, implementation),
        "implementation_complete": check_completeness(plan, implementation)
    }

    # Acceptable variance thresholds
    if abs(checks["loc_variance"]) > 20:  # outside the ±20% tolerance
        checks["scope_creep"] = True

    return checks

Plan Audit Results:

files_match: true
implementation_completeness: 100%
scope_creep_detected: false
loc_variance_percent: 5
duration_variance_percent: 10


Failure Scenarios

Understanding how quality gates handle failures helps debug blocked tasks.

Test Failures

Scenario: Player reports test failures.

Auto-Fix Loop:

Turn 1: Player implements → Tests fail
        Coach analyzes failures
        Coach provides specific feedback

Turn 2: Player applies fixes → Tests fail
        Coach re-analyzes
        Coach provides updated feedback

Turn 3: Player applies fixes → Tests pass
        Coach approves

Blocking Condition: Tests fail after 3 auto-fix attempts.

Resolution:

# Review failure logs
cat .guardkit/autobuild/TASK-a3f8/coach_turn_3.json

# Inspect worktree
cd .guardkit/worktrees/TASK-a3f8
pytest tests/ -v  # Run tests manually

# Fix issues and re-run task-work
/task-work TASK-a3f8 --implement-only

Scope Creep Detection

Scenario: Implementation significantly larger than plan.

Detection Logic:

plan_loc = 250
implementation_loc = 400
variance = (implementation_loc - plan_loc) / plan_loc * 100  # 60%

if variance > 20:
    scope_creep_detected = True

Coach Feedback:

Implementation 60% larger than planned (400 LOC vs 250 LOC).

Possible causes:
- Additional features added beyond acceptance criteria
- Plan underestimated complexity
- Refactoring expanded scope

Recommendations:
1. Review if new features are necessary (YAGNI)
2. Update plan to reflect actual scope
3. Split task if too large

Resolution:

# Option 1: Update plan to match implementation
/task-refine TASK-a3f8 --update-plan

# Option 2: Remove extra features
/task-work TASK-a3f8 --simplify

# Option 3: Split into multiple tasks
/task-create "Extract feature X to separate task" parent:TASK-a3f8

Low Architectural Score

Scenario: Architectural review score < 60.

Blocking Behavior:

Phase 2.5B: Architectural Review
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Quality Gate Blocked

Architectural Score: 45/100 (threshold: 60)

Issues:
- SOLID violations: Tight coupling in auth module
- DRY violations: Duplicate validation logic
- YAGNI concerns: Premature optimization in caching

Status: BLOCKED (pre_loop_blocked)
Worktree: Preserved for review
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Resolution:

# Review architectural issues
cat .guardkit/autobuild/TASK-a3f8/architectural-review.md

# Redesign implementation plan
/task-work TASK-a3f8 --design-only

# Address architectural concerns
# - Decouple auth module
# - Extract common validation logic
# - Remove premature optimizations

# Re-run with updated plan
/task-work TASK-a3f8 --implement-only

Max Turns Exceeded

Scenario: Task reaches max_turns without approval.

Outcome:

Turn 5: Player implements → Coach provides feedback
        Max turns (5) reached
        Status: max_turns_exceeded
        Worktree preserved for inspection

Resolution:

# Review turn history
cat .guardkit/autobuild/TASK-a3f8/player_turn_*.json
cat .guardkit/autobuild/TASK-a3f8/coach_turn_*.json

# Check last Coach feedback
cat .guardkit/autobuild/TASK-a3f8/coach_turn_5.json

# Options:
# 1. Manual intervention
cd .guardkit/worktrees/TASK-a3f8
# Fix issues manually
git add . && git commit -m "Manual fixes"

# 2. Increase max_turns and resume
/task-work TASK-a3f8 --resume --max-turns=10

# 3. Split into smaller tasks
/task-create "Subtask 1" parent:TASK-a3f8
/task-create "Subtask 2" parent:TASK-a3f8


Best Practices

When to Use AutoBuild

Good Candidates:

  - Well-defined requirements
  - Clear acceptance criteria
  - Standard implementation patterns
  - Low-medium risk changes

Bad Candidates:

  - Exploratory work
  - Unclear requirements
  - Novel architectures
  - High-risk changes requiring human judgment

Complexity Guidelines

Simple Tasks (1-3):

  - Single file changes
  - Config updates
  - Bug fixes
  - Documentation

Medium Tasks (4-6):

  - Multi-file features
  - API integrations
  - Authentication flows
  - Standard CRUD operations

Complex Tasks (7-10):

  - State machines
  - Parallel execution
  - Architectural changes
  - Multi-system integration

Quality Gate Configuration

For Simple Tasks:

autobuild:
  enabled: true
  max_turns: 3
  base_branch: main

For Medium Tasks:

autobuild:
  enabled: true
  max_turns: 5
  base_branch: main
  pre_loop:
    no_questions: false  # Allow clarification

For Complex Tasks:

autobuild:
  enabled: true
  max_turns: 7
  base_branch: main
  pre_loop:
    with_questions: true  # Force clarification
    docs: comprehensive
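
A sketch of reading such a block from a config file (the file path and loader are assumptions; the key layout follows the examples above):

import yaml  # PyYAML

def load_autobuild_config(path: str = ".guardkit/config.yaml") -> dict:
    """Read the autobuild block; the config path here is hypothetical."""
    with open(path) as fh:
        config = yaml.safe_load(fh) or {}
    autobuild = config.get("autobuild", {})
    return {
        "enabled": autobuild.get("enabled", False),
        "max_turns": autobuild.get("max_turns", 5),
        "base_branch": autobuild.get("base_branch", "main"),
        "pre_loop": autobuild.get("pre_loop", {}),
    }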


Troubleshooting

Quality Gate Not Executing

Symptom: Pre-loop gates skipped, immediate loop execution.

Cause: enable_pre_loop: false in orchestrator config.

Fix:

orchestrator = AutoBuildOrchestrator(
    repo_root=Path.cwd(),
    max_turns=5,
    enable_pre_loop=True,  # Enable pre-loop gates
)

Checkpoint Not Triggering

Symptom: Complex task proceeds without human checkpoint.

Cause: Complexity evaluation incorrect or checkpoint bypassed.

Fix:

# Force design-first workflow
/task-work TASK-a3f8 --design-only

# Review and approve plan manually
cat .claude/task-plans/TASK-a3f8-implementation-plan.md

# Proceed to implementation
/task-work TASK-a3f8 --implement-only

Tests Keep Failing

Symptom: Tests fail repeatedly, auto-fix loop exhausted.

Cause: Test assumptions incorrect or environment issues.

Debug Steps:

# 1. Review test failures
cd .guardkit/worktrees/TASK-a3f8
pytest tests/ -v --tb=short

# 2. Check test assumptions
cat tests/test_feature.py

# 3. Verify environment
python -m pytest --version
pip list | grep -i test

# 4. Run tests individually
pytest tests/test_feature.py::test_specific_case -v

# 5. Fix and re-run
/task-work TASK-a3f8 --implement-only

Worktree Not Preserved

Symptom: Worktree deleted after orchestration.

Cause: None. Worktrees are always preserved by design, per the architectural review.

Verify:

ls -la .guardkit/worktrees/TASK-a3f8
git worktree list

Coach Always Provides Feedback

Symptom: Coach never approves, continuous feedback loop.

Cause: Quality thresholds too strict or legitimate quality issues.

Debug:

# Review Coach decisions
cat .guardkit/autobuild/TASK-a3f8/coach_turn_*.json

# Check quality gate scores
cat .guardkit/autobuild/TASK-a3f8/task_work_results.json

# Adjust thresholds if needed (in task frontmatter)
quality_thresholds:
  line_coverage: 75  # Lower from 80%
  branch_coverage: 70  # Lower from 75%



Summary

Quality gates provide automated quality enforcement at three stages:

  1. Pre-Loop (Phases 1.6-2.8): Validate design before implementation
  2. Loop (Player-Coach turns): Validate each implementation iteration
  3. Post-Loop (Finalization): Preserve worktree for human review

Key benefits:

  - Prevents broken code from reaching production
  - Automates quality verification (tests, coverage, architecture)
  - Reduces rework through early detection of issues
  - Provides transparency via detailed turn history and reports

Remember: Quality gates are enablers, not blockers. They catch issues early when they're cheaper to fix, reducing total development time.