AutoBuild Instrumentation Guide¶
Version: 1.1.0 | Last Updated: 2026-04-11 | Compatibility: GuardKit v1.0+, AutoBuild with instrumentation (FEAT-INST) | Document Type: Technical Reference
Table of Contents¶
- Overview
- Architecture
- Event Types Reference
- File Locations
- How to Use
- Prompt Profiles and Digests
- Template Pattern Context
- Adaptive Concurrency
- vLLM / Local Backend Specifics
- Secret Redaction
- Bootstrap Hard-Fail Gate (`bootstrap_failure_mode`)
- Troubleshooting
  - If AutoBuild stalls immediately
Overview¶
AutoBuild instrumentation provides structured observability across the entire Player-Coach pipeline. Every LLM call, tool execution, task lifecycle event, wave completion, and Graphiti query is captured as a typed event and persisted to a local JSONL file.
Why instrumentation exists:
- Cost tracking — Token counts per call let you calculate spend per task and per feature.
- Performance analysis — Latency metrics (`latency_ms`, `ttft_ms`) reveal bottlenecks in LLM calls and tool executions.
- A/B prompt comparison — The `prompt_profile` field tags each call so you can compare digest-only vs digest+rules_bundle strategies.
- Failure diagnosis — Structured `failure_category` fields enable grouping and triaging failures without reading logs.
- Adaptive concurrency — Wave completion events feed the `ConcurrencyController`, which auto-tunes worker count based on rate limits and p95 latency.
Relationship to the AutoBuild pipeline:
Feature Orchestrator → Waves → Tasks → Player-Coach Turns → LLM Calls + Tool Executions
│ │ │ │
wave.completed task.started task.completed/failed llm.call + tool.exec
Every layer of the pipeline emits events. The instrumentation is always-on JSONL — there is zero runtime cost when the data is not being analysed. Events are written append-only under an async lock and never block the LLM critical path.
Architecture¶
Event Emission Flow¶
┌─────────────────────────────────────────────────────────────────────┐
│ EVENT SOURCES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ AutoBuildOrchestrator ──► task.started / task.completed / task.failed
│ AgentInvoker ──────────► llm.call / tool.exec │
│ FeatureOrchestrator ───► wave.completed │
│ GraphitiContextLoader ─► graphiti.query │
│ │
├─────────────────────────────────────────────────────────────────────┤
│ MIDDLEWARE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ tool.exec events ──► SecretRedactor ──► redacted tool.exec │
│ (cmd, stdout_tail, stderr_tail fields are scrubbed) │
│ │
├─────────────────────────────────────────────────────────────────────┤
│ EVENT EMITTER (Protocol) │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ EventEmitter Protocol: │
│ async emit(event: BaseEvent) -> None │
│ async flush() -> None │
│ async close() -> None │
│ │
├─────────────────────────────────────────────────────────────────────┤
│ BACKENDS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ CompositeBackend (fan-out) │
│ ├── JSONLFileBackend (always-on) → events.jsonl │
│ └── NATSBackend (optional) → NATS subject │
│ │
└─────────────────────────────────────────────────────────────────────┘
Backend Options¶
| Backend | Purpose | Always On | Notes |
|---|---|---|---|
| JSONLFileBackend | Local JSONL persistence | Yes | One JSON object per line, async-lock protected |
| NATSBackend | Async publish to NATS | No | Logs warning on failure, never raises |
| CompositeBackend | Fan-out to multiple backends | — | If one backend fails, others still receive the event |
| NullEmitter | Testing | — | Optionally captures events in .events list |
Protocol-Based Injection¶
All orchestrators accept an EventEmitter via constructor injection. In production, a CompositeBackend wrapping JSONLFileBackend (+ optionally NATSBackend) is injected. In tests, NullEmitter(capture=True) is injected for assertion-based testing without file I/O.
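A minimal sketch of that wiring (illustrative only; actual constructor signatures live in `emitter.py`, and this assumes events are dataclasses):

```python
import asyncio
import json
from dataclasses import asdict
from pathlib import Path

class JSONLFileBackend:
    """Append-only JSONL persistence: one JSON object per line, lock-protected."""

    def __init__(self, path: Path):
        self._path = path
        self._lock = asyncio.Lock()

    async def emit(self, event) -> None:
        line = json.dumps(asdict(event))        # assumes dataclass events
        async with self._lock:                  # serialise concurrent writers
            with self._path.open("a", encoding="utf-8") as f:
                f.write(line + "\n")            # append-only, never rewrites history

    async def flush(self) -> None:
        pass                                    # each emit writes through immediately

    async def close(self) -> None:
        pass
```

In tests, the same constructor slot takes `NullEmitter(capture=True)`, so assertions can inspect the captured `.events` list instead of reading a file.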
Event Types Reference¶
All events extend BaseEvent which provides these common fields:
| Field | Type | Description |
|---|---|---|
| `run_id` | `str` | Unique identifier for the AutoBuild run |
| `feature_id` | `str?` | Optional feature identifier |
| `task_id` | `str` | Task identifier (e.g., TASK-001) |
| `agent_role` | `AgentRole` | `player`, `coach`, `resolver`, or `router` |
| `attempt` | `int` | 1-indexed attempt number |
| `timestamp` | `str` | ISO 8601 formatted timestamp |
| `schema_version` | `str` | Default `1.0.0` for forward compatibility |
Controlled Vocabularies¶
AgentRole: player | coach | resolver | router
FailureCategory: knowledge_gap | context_missing | spec_ambiguity | test_failure | env_failure | dependency_issue | rate_limit | timeout | tool_error | other
PromptProfile: digest_only | digest+graphiti | digest+rules_bundle | digest+graphiti+rules_bundle
LLMProvider: anthropic | openai | local-vllm
LLMCallStatus: ok | error
LLMCallErrorType: rate_limited | timeout | tool_error | other
GraphitiQueryType: context_loader | nearest_neighbours | adr_lookup
GraphitiStatus: ok | error
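Putting the common fields and vocabularies together, the implied shape looks roughly like this (a sketch; the real `schemas.py` may use Pydantic models rather than dataclasses):

```python
from dataclasses import dataclass
from typing import Literal, Optional

AgentRole = Literal["player", "coach", "resolver", "router"]

@dataclass
class BaseEvent:
    run_id: str                       # unique per AutoBuild run
    task_id: str                      # e.g. "TASK-001"
    agent_role: AgentRole
    attempt: int                      # 1-indexed attempt number
    timestamp: str                    # ISO 8601
    feature_id: Optional[str] = None  # the only optional common field
    schema_version: str = "1.0.0"     # forward compatibility
```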
llm.call¶
Emitted for every LLM API call made by the Player or Coach.
| Field | Type | Description |
|---|---|---|
| `provider` | `LLMProvider` | `anthropic`, `openai`, or `local-vllm` |
| `model` | `str` | Model identifier (e.g., `claude-sonnet-4-20250514`) |
| `input_tokens` | `int` | Number of input tokens |
| `output_tokens` | `int` | Number of output tokens |
| `latency_ms` | `float` | Total call latency in milliseconds |
| `ttft_ms` | `float?` | Time-to-first-token (streaming only) |
| `prefix_cache_hit` | `bool?` | Whether vLLM prefix cache was hit |
| `prefix_cache_estimated` | `bool` | Whether cache hit was estimated |
| `context_bytes` | `int?` | Context size in bytes |
| `prompt_profile` | `PromptProfile` | Which prompt assembly strategy was used |
| `status` | `LLMCallStatus` | `ok` or `error` |
| `error_type` | `LLMCallErrorType?` | Error classification when status is `error` |
Example:
{
"run_id": "run-a1b2c3d4",
"feature_id": "FEAT-INST",
"task_id": "TASK-INST-005b",
"agent_role": "player",
"attempt": 1,
"timestamp": "2026-03-08T10:15:30.123Z",
"schema_version": "1.0.0",
"provider": "anthropic",
"model": "claude-sonnet-4-20250514",
"input_tokens": 12500,
"output_tokens": 3200,
"latency_ms": 8450.2,
"ttft_ms": 1230.5,
"prefix_cache_hit": null,
"prefix_cache_estimated": false,
"context_bytes": 48000,
"prompt_profile": "digest+rules_bundle",
"status": "ok",
"error_type": null
}
tool.exec¶
Emitted for every tool invocation (Bash, Read, Write, Edit, etc.). Fields are redacted for security.
| Field | Type | Description |
|---|---|---|
| `tool_name` | `str` | Sanitised tool name |
| `cmd` | `str` | Command string (secrets redacted) |
| `exit_code` | `int` | Process exit code |
| `latency_ms` | `float` | Execution latency in milliseconds |
| `stdout_tail` | `str` | Truncated tail of stdout (secrets redacted) |
| `stderr_tail` | `str` | Truncated tail of stderr (secrets redacted) |
Example:
{
"run_id": "run-a1b2c3d4",
"feature_id": null,
"task_id": "TASK-INST-005b",
"agent_role": "player",
"attempt": 1,
"timestamp": "2026-03-08T10:16:02.456Z",
"schema_version": "1.0.0",
"tool_name": "Bash",
"cmd": "pytest tests/orchestrator/instrumentation/ -v --tb=short",
"exit_code": 0,
"latency_ms": 3200.1,
"stdout_tail": "12 passed in 2.8s",
"stderr_tail": ""
}
task.started¶
Emitted when an AutoBuild task begins execution.
No additional fields beyond BaseEvent.
Example:
{
"run_id": "run-a1b2c3d4",
"feature_id": null,
"task_id": "TASK-INST-005b",
"agent_role": "player",
"attempt": 1,
"timestamp": "2026-03-08T10:15:00.000Z",
"schema_version": "1.0.0"
}
task.completed¶
Emitted when a task finishes successfully.
| Field | Type | Description |
|---|---|---|
| `turn_count` | `int` | Number of player-coach turns taken |
| `diff_stats` | `str` | Summary of code changes (e.g., `+50 -10`) |
| `verification_status` | `str` | Coach verification outcome |
| `prompt_profile` | `PromptProfile` | Profile used for the task |
Example:
{
"run_id": "run-a1b2c3d4",
"feature_id": "FEAT-INST",
"task_id": "TASK-INST-005b",
"agent_role": "player",
"attempt": 1,
"timestamp": "2026-03-08T10:25:00.000Z",
"schema_version": "1.0.0",
"turn_count": 3,
"diff_stats": "+180 -12",
"verification_status": "pass",
"prompt_profile": "digest+rules_bundle"
}
task.failed¶
Emitted when a task fails after exhausting retries.
| Field | Type | Description |
|---|---|---|
| `failure_category` | `FailureCategory` | Categorised reason for failure |
Example:
{
"run_id": "run-a1b2c3d4",
"feature_id": "FEAT-INST",
"task_id": "TASK-INST-005b",
"agent_role": "player",
"attempt": 3,
"timestamp": "2026-03-08T10:35:00.000Z",
"schema_version": "1.0.0",
"failure_category": "test_failure"
}
wave.completed¶
Emitted after each parallel wave in a feature build completes.
| Field | Type | Description |
|---|---|---|
| `wave_id` | `str` | Unique identifier for the wave |
| `worker_count` | `int` | Number of concurrent workers |
| `queue_depth_start` | `int` | Queue depth at wave start |
| `queue_depth_end` | `int` | Queue depth at wave end |
| `tasks_completed` | `int` | Number of completed tasks |
| `task_failures` | `int` | Number of failed tasks |
| `rate_limit_count` | `int` | Number of rate-limit events |
| `p95_task_latency_ms` | `float?` | 95th percentile task latency |
Example:
{
"run_id": "run-a1b2c3d4",
"feature_id": "FEAT-INST",
"task_id": "wave-2",
"agent_role": "router",
"attempt": 1,
"timestamp": "2026-03-08T10:40:00.000Z",
"schema_version": "1.0.0",
"wave_id": "wave-2",
"worker_count": 3,
"queue_depth_start": 5,
"queue_depth_end": 2,
"tasks_completed": 3,
"task_failures": 0,
"rate_limit_count": 0,
"p95_task_latency_ms": 45000.0
}
graphiti.query¶
Emitted for each Graphiti knowledge graph query.
| Field | Type | Description |
|---|---|---|
| `query_type` | `GraphitiQueryType` | `context_loader`, `nearest_neighbours`, or `adr_lookup` |
| `items_returned` | `int` | Number of items returned |
| `tokens_injected` | `int` | Number of tokens injected into context |
| `latency_ms` | `float` | Query latency in milliseconds |
| `status` | `GraphitiStatus` | `ok` or `error` |
Example:
{
"run_id": "run-a1b2c3d4",
"task_id": "TASK-INST-005b",
"agent_role": "player",
"attempt": 1,
"timestamp": "2026-03-08T10:15:05.000Z",
"schema_version": "1.0.0",
"query_type": "context_loader",
"items_returned": 4,
"tokens_injected": 2800,
"latency_ms": 320.5,
"status": "ok"
}
File Locations¶
| Resource | Path |
|---|---|
| Events file | .guardkit/autobuild/{task_id}/events.jsonl |
| Player reports | .guardkit/autobuild/{task_id}/player_turn_{turn}.json |
| Coach decisions | .guardkit/autobuild/{task_id}/coach_turn_{turn}.json |
| Task-work results | .guardkit/autobuild/{task_id}/task_work_results.json |
| Digest files | .guardkit/digests/{role}.md (player, coach, resolver, router) |
| Schemas module | guardkit/orchestrator/instrumentation/schemas.py |
| Emitter module | guardkit/orchestrator/instrumentation/emitter.py |
| Redaction module | guardkit/orchestrator/instrumentation/redaction.py |
| Digests module | guardkit/orchestrator/instrumentation/digests.py |
| Prompt profiles | guardkit/orchestrator/instrumentation/prompt_profile.py |
| Concurrency | guardkit/orchestrator/instrumentation/concurrency.py |
| LLM helpers | guardkit/orchestrator/instrumentation/llm_instrumentation.py |
| Template pattern loader | guardkit/knowledge/template_pattern_loader.py |
| Template resolver | guardkit/templates/resolver.py |
| Context loader (pattern wiring) | guardkit/knowledge/autobuild_context_loader.py |
| Tests (instrumentation) | tests/orchestrator/instrumentation/ |
| Tests (template patterns) | tests/unit/test_template_pattern_loader.py |
How to Use¶
Viewing Events¶
# Pretty-print all events for a task
cat .guardkit/autobuild/TASK-001/events.jsonl | jq .
# Count events by type
jq -r '.event_type' .guardkit/autobuild/TASK-001/events.jsonl | sort | uniq -c | sort -rn
Filtering by Event Type¶
# All LLM calls
jq 'select(.event_type == "llm.call")' .guardkit/autobuild/TASK-001/events.jsonl
# All tool executions
jq 'select(.event_type == "tool.exec")' .guardkit/autobuild/TASK-001/events.jsonl
# All failures
jq 'select(.event_type == "task.failed")' .guardkit/autobuild/TASK-001/events.jsonl
Token Usage Summary¶
# Token counts per LLM call
jq 'select(.event_type == "llm.call") | {model, input_tokens, output_tokens, latency_ms}' \
.guardkit/autobuild/TASK-001/events.jsonl
# Total tokens for a task
jq -s '[.[] | select(.event_type == "llm.call")] |
{total_input: (map(.input_tokens) | add),
total_output: (map(.output_tokens) | add),
call_count: length}' \
.guardkit/autobuild/TASK-001/events.jsonl
Comparing Prompt Profiles¶
# Group LLM calls by prompt profile
jq 'select(.event_type == "llm.call") | {prompt_profile, input_tokens, latency_ms}' \
.guardkit/autobuild/TASK-001/events.jsonl
# Average tokens by profile (for A/B comparison)
jq -s '[.[] | select(.event_type == "llm.call")] | group_by(.prompt_profile) | map({
profile: .[0].prompt_profile,
avg_input: (map(.input_tokens) | add / length),
avg_latency: (map(.latency_ms) | add / length),
count: length
})' .guardkit/autobuild/TASK-001/events.jsonl
Identifying Slow Tasks¶
# LLM calls sorted by latency (slowest first)
jq -s 'map(select(.event_type == "llm.call")) | sort_by(-.latency_ms) | .[:5] |
.[] | {task_id, model, latency_ms, input_tokens}' \
.guardkit/autobuild/TASK-001/events.jsonl
# Tool executions sorted by latency
jq -s 'map(select(.event_type == "tool.exec")) | sort_by(-.latency_ms) | .[:5] |
.[] | {tool_name, cmd: .cmd[:80], latency_ms, exit_code}' \
.guardkit/autobuild/TASK-001/events.jsonl
Heartbeat Labels¶
Heartbeat log lines emitted by `async_heartbeat` (`[<task_id>] <phase> in progress... (<N>s elapsed)`) use distinct phase labels so operators can disambiguate which agent is running:

- `task-work implementation` — the actual `task-work` Player invocation (the real implementation work).
- `Player invocation` / `Coach invocation` — direct `_invoke_with_role` calls without a delegation wrapper (legacy call sites).
- `specialist:<name> invocation` — orchestrator-driven specialists run via `run_specialist` (`specialist:test-orchestrator invocation` for Phase 4, `specialist:code-reviewer invocation` for Phase 5). These run as `agent_type=player`/`coach` under the hood but surface a distinct label so they aren't conflated with the actual Player. (TASK-ABSR-DIAG.)
Failure Analysis¶
# All failures grouped by category
jq 'select(.event_type == "task.failed") | {task_id, failure_category, attempt}' \
.guardkit/autobuild/TASK-001/events.jsonl
# Errors in LLM calls
jq 'select(.event_type == "llm.call" and .status == "error") |
{task_id, error_type, model, latency_ms}' \
.guardkit/autobuild/TASK-001/events.jsonl
Prompt Profiles and Digests¶
What Prompt Profiles Are¶
A prompt profile describes which context sources are assembled into the system prompt for each agent invocation. The prompt_profile field on llm.call and task.completed events enables A/B comparison of different prompt strategies.
| Profile | Content | Tokens (typical) |
|---|---|---|
| `digest_only` | Role-specific digest only | 300-600 |
| `digest+graphiti` | Digest + Graphiti knowledge context | 600-1200 |
| `digest+rules_bundle` | Digest + full rules bundle | 1500-3000 |
| `digest+graphiti+rules_bundle` | All sources combined | 2000-4000 |
Current default: digest+rules_bundle (Phase 1 baseline).
Where Digest Files Live¶
Digest files are role-specific condensed instructions at .guardkit/digests/:
| File | Role | Purpose |
|---|---|---|
| `player.md` | Player | Implementation rules, output contract, quality expectations |
| `coach.md` | Coach | Validation rules, output contract, anti-stub enforcement |
| `resolver.md` | Resolver | Conflict resolution rules |
| `router.md` | Router | Task routing and wave management rules |
Token Budget¶
- Target range: 300-600 tokens per digest
- Hard limit: 700 tokens (enforced by `DigestValidator`)
- Token counting: Uses `tiktoken` (`cl100k_base` encoding) with a word-based fallback (~0.75 tokens per word)
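A sketch of that counting scheme (illustrative; the real `DigestValidator` internals may differ):

```python
def count_tokens(text: str) -> int:
    """cl100k_base count via tiktoken, with the documented word-based fallback."""
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except ImportError:
        # Documented fallback heuristic: ~0.75 tokens per word
        return round(len(text.split()) * 0.75)
```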
Validating Digests¶
from guardkit.orchestrator.instrumentation.digests import DigestValidator
from pathlib import Path
validator = DigestValidator(digest_dir=Path(".guardkit/digests"))
# Validate a single role
result = validator.validate("player")
print(f"Tokens: {result.token_count}, Warning: {result.warning}")
# Validate all roles
results = validator.validate_all()
for r in results:
    print(f"{r.role}: {r.token_count} tokens")
Prompt Profile Migration Phases¶
| Phase | Profile | Description |
|---|---|---|
| Phase 1 (baseline) | `digest+rules_bundle` | Digest alongside full rules bundle; measures baseline token usage |
| Phase 2 | `digest_only` | Digest only; measures token savings and quality delta |
| Phase 3 | `digest+graphiti` | Digest with Graphiti context; measures knowledge-enhanced quality |
| Phase 4 | `digest+graphiti+rules_bundle` | All sources; measures combined effectiveness |
Compare phases by filtering events on prompt_profile and comparing token counts, latency, and task success rates.
Template Pattern Context¶
Overview¶
Template pattern context injects stack-specific .template files into the Player's prompt at build time, giving the AI canonical code shapes to follow when generating code. This was implemented in FEAT-TPL-PLAYER (tasks TASK-TPL-001 through TASK-TPL-004).
Why template patterns exist:
- Consistency — The Player sees the same patterns the template author intended, producing code that matches the project's established architecture.
- Reduced hallucination — Instead of inferring patterns from agent prose descriptions, the Player reads actual parameterised code templates.
- Stack-specific guidance — Each builtin template ships its own `.template` files (FastAPI routers, .NET endpoints, React components, etc.). Only relevant files are loaded per task.
Data Flow¶
guardkit init <template>
│
├── Copies .claude/ config layer → project
└── Records template name in .claude/manifest.json
│
... (development happens) ...
│
/feature-build TASK-XXX
│
▼
AutoBuildContextLoader._append_template_patterns()
│
├── 1. Read .claude/manifest.json → extract `name` field
│
├── 2. resolve_template_source_dir(name)
│ → installer/core/templates/{name}/
│ → fallback: ~/.guardkit/templates/{name}/
│
├── 3. load_template_patterns(manifest_path)
│ → enumerate templates/*.template files
│
├── 4. select_patterns(context, tech_stack, file_path_hints)
│ → file-path hint matching (priority 1)
│ → tech-stack keyword fallback (priority 2)
│ → alphabetical fallback (priority 3)
│ → cap: max 5 files, ~3000 tokens
│
├── 5. format_pattern_block(context, file_contents)
│ → markdown block with fenced code blocks
│
└── 6. Append to AutoBuildContextResult.prompt_text
Key Components¶
| Component | File Path | Purpose |
|---|---|---|
| Template pattern loader | `guardkit/knowledge/template_pattern_loader.py` | Reads manifest, resolves template dir, enumerates `.template` files |
| Template resolver | `guardkit/templates/resolver.py` | Resolves template source directory from name |
| Context integration | `guardkit/knowledge/autobuild_context_loader.py` | `_append_template_patterns()` wires loader into Player context |
| Tests | `tests/unit/test_template_pattern_loader.py` | Unit tests for loader, selector, and formatter |
Manifest Requirement¶
The .claude/manifest.json file must contain a name field matching a known template:
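For example (a minimal manifest; the real file may carry additional fields, but only `name` matters for pattern loading):

```json
{
  "name": "fastapi-python"
}
```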
This is written automatically by guardkit init <template>. Projects without a manifest (or with a missing/invalid name field) degrade gracefully — no patterns are loaded, no errors are raised, and the Player proceeds without pattern context.
Pattern Selection Rules¶
Selection follows a priority cascade (sketched below):

- File-path hints — If the task touches files in `app/api/`, match template files under an `api/` subdirectory.
- Tech-stack keywords — If no file-path matches, use the tech stack (e.g., `"python"` → look for `api/`, `core/`, `config/`, `models/` subdirectories).
- Alphabetical fallback — If nothing else matches, take the first 3 files alphabetically.
- Budget enforcement — Maximum 5 files, approximately 3000 tokens. Files that would exceed the budget are skipped with a warning.
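A condensed sketch of the cascade (hypothetical structure; the real `select_patterns()` signature may differ, and token costs here use a rough 4-chars-per-token heuristic):

```python
from pathlib import Path

MAX_FILES, MAX_TOKENS = 5, 3000
STACK_DIRS = {"python": {"api", "core", "config", "models"}}  # illustrative

def select_patterns(candidates: list[Path], tech_stack: str,
                    file_path_hints: list[str]) -> list[Path]:
    # Priority 1: file-path hints (a task touching app/api/ prefers api/ templates)
    hint_dirs = {part for h in file_path_hints for part in Path(h).parts[:-1]}
    chosen = [c for c in candidates if set(c.parts[:-1]) & hint_dirs]
    # Priority 2: tech-stack keyword fallback
    if not chosen:
        chosen = [c for c in candidates
                  if set(c.parts[:-1]) & STACK_DIRS.get(tech_stack, set())]
    # Priority 3: alphabetical fallback, first 3 files
    if not chosen:
        chosen = sorted(candidates)[:3]
    # Budget enforcement: max 5 files, ~3000 tokens total
    selected, spent = [], 0
    for path in sorted(chosen)[:MAX_FILES]:
        cost = len(path.read_text(errors="ignore")) // 4
        if spent + cost > MAX_TOKENS:
            continue  # the real loader logs a budget warning here
        selected.append(path)
        spent += cost
    return selected
```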
Graceful Degradation¶
The template pattern pipeline never raises exceptions. Each failure point produces a warning and returns empty context:
| Condition | Behaviour |
|---|---|
| No `worktree_path` configured | Skip silently (debug log) |
| `.claude/manifest.json` missing | Skip silently (debug log) |
| `name` field missing or invalid | Warning logged, empty context |
| Template source dir not found | Warning logged, empty context |
| No `templates/` subdirectory | Warning logged, empty context |
| No files match selection criteria | Skip silently (debug log) |
| File read error | Individual file skipped, others still loaded |
Logging¶
Template pattern loading emits structured log messages at info and warning levels:
[TemplatePattern] Appended pattern block: 3 files, ~750 tokens (api/router.py.template, models/base.py.template, crud/crud_base.py.template)
[TemplatePattern] Skipped db/session.py.template: adding 400 tokens would exceed budget (2800/3000)
Token Impact¶
Typical token costs by template type:
| Template | Files Available | Files Selected (typical) | Token Cost |
|---|---|---|---|
| fastapi-python | 8-12 | 3-5 | ~500-800 |
| dotnet-railway-fastendpoints | 18-22 | 3-5 | ~600-1000 |
| react-typescript | 6-10 | 3-5 | ~400-700 |
| python-library | 4-6 | 3-4 | ~300-500 |
| No manifest / unknown template | — | 0 | 0 |
The ~500-1000 token overhead is comparable to a single Graphiti context block and is well within the Player's context budget.
Relationship to Other Context Sources¶
Template patterns are additive — they do not replace any existing context source:
Player Context Assembly:
├── Graphiti knowledge context (similar outcomes, patterns, ADRs)
├── Role constraints and quality gates
├── Turn states (cross-turn learning)
├── Implementation modes
├── Template pattern context ◄── NEW (FEAT-TPL-PLAYER)
└── Task description and acceptance criteria
The pattern block is appended to prompt_text after all other context sources. It appears in the Player's prompt as a "## Stack Pattern Reference" section.
Adaptive Concurrency¶
The ConcurrencyController uses wave.completed events to automatically adjust the number of parallel workers during feature builds.
How It Works¶
- The `FeatureOrchestrator` emits a `wave.completed` event after each wave.
- The `ConcurrencyController.on_wave_completed()` method evaluates the event.
- A `ConcurrencyDecision` is returned with `action` (`maintain`, `reduce`, `increase`) and `new_workers`.
Policy Rules¶
Rules are evaluated in order. First match wins.
| Priority | Trigger | Action | Detail |
|---|---|---|---|
| 1 | `rate_limit_count > 0` | Reduce by 50% | `new = max(1, current // 2)` |
| 2 | p95 latency > threshold | Reduce by 50% | Threshold = `baseline * (1 + p95_threshold_pct / 100)` |
| 3 | Stable for N minutes + reduced | Increase by +1 | `new = min(current + 1, initial)` |
| 4 | Default | Maintain | No change needed |
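A first-match-wins sketch of the table (simplified: stability tracking is reduced to a boolean that the real controller derives from wave timestamps):

```python
from dataclasses import dataclass

@dataclass
class WaveStats:                      # subset of wave.completed used by the policy
    rate_limit_count: int
    p95_task_latency_ms: float | None

@dataclass
class ConcurrencyDecision:
    action: str                       # "maintain" | "reduce" | "increase"
    new_workers: int

def on_wave_completed(ev: WaveStats, current: int, initial: int,
                      baseline_p95: float, p95_threshold_pct: float,
                      stable_and_reduced: bool) -> ConcurrencyDecision:
    # Rule 1: any rate limiting -> halve workers (floor of 1)
    if ev.rate_limit_count > 0:
        return ConcurrencyDecision("reduce", max(1, current // 2))
    # Rule 2: p95 above baseline * (1 + pct/100) -> halve workers
    threshold = baseline_p95 * (1 + p95_threshold_pct / 100)
    if ev.p95_task_latency_ms is not None and ev.p95_task_latency_ms > threshold:
        return ConcurrencyDecision("reduce", max(1, current // 2))
    # Rule 3: stable long enough after a reduction -> recover by +1, capped at initial
    if stable_and_reduced and current < initial:
        return ConcurrencyDecision("increase", min(current + 1, initial))
    # Rule 4: default
    return ConcurrencyDecision("maintain", current)
```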
Configuration¶
from guardkit.orchestrator.instrumentation.concurrency import ConcurrencyController
controller = ConcurrencyController(
initial_workers=4, # Starting (and maximum) worker count
p95_threshold_pct=100.0, # 100% above baseline triggers reduction
stability_minutes=5.0 # Minutes of stability before recovery
)
Relevance to Local/vLLM Backends¶
On local inference backends, GPU contention is the primary bottleneck. The adaptive concurrency system is particularly valuable here because:
- Rate limits manifest differently — local vLLM servers queue requests rather than returning 429s, causing p95 latency spikes instead.
- Recovery is safe — the +1 gradual recovery avoids re-saturating the GPU.
- Default `max_parallel=1` — for local backends (per TASK-VPT-001), the controller starts conservative and adapts.
vLLM / Local Backend Specifics¶
Provider Detection¶
The detect_provider() function identifies the LLM backend from the base URL:
| Base URL Pattern | Detected Provider |
|---|---|
| `None` or empty | `anthropic` |
| Contains `api.anthropic.com` | `anthropic` |
| Contains `openai` | `openai` |
| Contains `localhost` or `vllm` | `local-vllm` |
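A sketch of the matching order implied by the table (the real `detect_provider()` may handle edge cases differently):

```python
def detect_provider(base_url: str | None) -> str:
    """Map an LLM base URL onto a provider label, per the table above."""
    if not base_url:
        return "anthropic"            # None or empty -> default Anthropic API
    url = base_url.lower()
    if "api.anthropic.com" in url:
        return "anthropic"
    if "openai" in url:
        return "openai"
    if "localhost" in url or "vllm" in url:
        return "local-vllm"
    return "anthropic"                # assumption: fall back to the default
```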
Prefix Cache Hit Detection¶
For vLLM backends, the x-vllm-prefix-cache-hit response header is inspected:
- Header value `"true"` → `prefix_cache_hit=True`
- Header value `"false"` → `prefix_cache_hit=False`
- Header absent → `prefix_cache_hit=None`
The prefix_cache_estimated field is always False for direct header detection.
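One way to derive the tri-state field from response headers (a hypothetical helper, not necessarily the library's code):

```python
def parse_prefix_cache_hit(headers: dict[str, str]) -> bool | None:
    """Tri-state parse of the x-vllm-prefix-cache-hit response header."""
    value = headers.get("x-vllm-prefix-cache-hit")
    if value is None:
        return None                    # header absent -> unknown
    return value.lower() == "true"     # "true" -> True, "false" -> False

# Direct header detection is authoritative, so prefix_cache_estimated stays False.
```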
Key Metrics to Watch¶
| Metric | Field | Why It Matters |
|---|---|---|
| Time to first token | `ttft_ms` | Measures prefill latency; high values indicate context too large for GPU |
| Total latency | `latency_ms` | End-to-end call time including generation |
| Input tokens | `input_tokens` | Prefill cost; digest-only profiles reduce this significantly |
| Prefix cache hit | `prefix_cache_hit` | Cache hits reduce prefill time; monitor hit rate |
vLLM Defaults (from TASK-VPT-001)¶
| Setting | Value | Reason |
|---|---|---|
| `max_parallel` | 1 | Prevents GPU contention on single-GPU setups |
| `timeout_multiplier` | 3.0 | Local inference is slower than API calls |
jq Queries for vLLM Analysis¶
# All local-vllm calls with cache status
jq 'select(.event_type == "llm.call" and .provider == "local-vllm") |
{model, input_tokens, ttft_ms, latency_ms, prefix_cache_hit}' \
.guardkit/autobuild/TASK-001/events.jsonl
# Average TTFT for cached vs uncached
jq -s 'map(select(.event_type == "llm.call" and .provider == "local-vllm")) |
group_by(.prefix_cache_hit) | map({
cache_hit: .[0].prefix_cache_hit,
avg_ttft: (map(.ttft_ms // 0) | add / length),
count: length
})' .guardkit/autobuild/TASK-001/events.jsonl
See also: Local Backend AutoBuild Guide for vLLM server setup and tuning.
Secret Redaction¶
What Is Redacted Automatically¶
The SecretRedactor applies pattern-based redaction to tool.exec events only. The following patterns are scrubbed and replaced with [REDACTED]:
| Pattern | What It Catches |
|---|---|
| `sk-[A-Za-z0-9_-]{10,}` | OpenAI-style secret keys |
| `AKIA[A-Z0-9]{12,}` | AWS access key IDs |
| `gh[ps]_[A-Za-z0-9]{10,}` | GitHub tokens (personal + server) |
| `bearer [token]` (case-insensitive) | Bearer authentication tokens |
| `PASSWORD=...`, `PASS=...`, `SECRET=...` | Environment variable secrets |
| `token=...`, `api_key=...` | Generic token/key parameters |
| `://user:pass@host` | URL-embedded credentials |
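A minimal sketch of the scrubbing mechanic (two of the documented patterns shown; the real `SecretRedactor` compiles the full set and may scope case-insensitivity per pattern):

```python
import re

DEFAULT_PATTERNS = [
    r"sk-[A-Za-z0-9_-]{10,}",   # OpenAI-style secret keys
    r"AKIA[A-Z0-9]{12,}",       # AWS access key IDs
]

def redact(text: str, patterns: list[str] = DEFAULT_PATTERNS) -> str:
    """Replace any matched secret with the [REDACTED] placeholder."""
    for pattern in patterns:
        text = re.sub(pattern, "[REDACTED]", text, flags=re.IGNORECASE)
    return text

print(redact("export OPENAI_KEY=sk-abc123def456ghi"))
# -> export OPENAI_KEY=[REDACTED]
```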
Redacted Fields¶
Only tool.exec events are redacted. The affected fields are:
- `cmd` — the command string
- `stdout_tail` — truncated stdout output
- `stderr_tail` — truncated stderr output
Additionally, tool_name is sanitised to remove shell metacharacters (;, |, &, $, `, >, <, (, )).
Adding Custom Redaction Patterns¶
from guardkit.orchestrator.instrumentation.redaction import SecretRedactor
custom_redactor = SecretRedactor(patterns=[
r"sk-[A-Za-z0-9_-]{10,}", # Keep defaults
r"AKIA[A-Z0-9]{12,}",
r"my-custom-prefix-[A-Za-z0-9]+", # Add your own
])
Pass the custom redactor to redact_tool_exec_event() to apply it.
Bootstrap Hard-Fail Gate (bootstrap_failure_mode)¶
When AutoBuild creates a shared worktree it runs an environment bootstrap (detect pyproject.toml / package.json / *.csproj / etc., run the matching install). By default this phase is non-blocking — if every install fails the orchestrator prints a yellow-⚠ line and proceeds into Wave 1 anyway.
For the forge/GB10 case this turned into a foot-gun: forge required Python ≥3.13 but the GB10 host ran 3.12, every pip install -e . failed, and AutoBuild still marched into Wave 1. Review TASK-REV-E4F5 recommendation R4a asked for an opt-in gate that turns this into a loud error instead of a silent warning. TASK-FIX-7A04 implements it.
Configuration¶
Set in .guardkit/config.yaml (feature-level), override per-invocation on the CLI:
# .guardkit/config.yaml
autobuild:
bootstrap:
failure_mode: warn # "warn" (default) or "block"
optional_stacks: [] # stacks that should NOT count toward the essential-stack check
# CLI override — wins over the yaml value
guardkit autobuild feature FEAT-XXX --bootstrap-failure-mode block
Modes¶
| Mode | Behavior when 0/N installs succeed | Behavior when partial success | Behavior when all succeed |
|---|---|---|---|
| `warn` (default) | yellow-⚠ line, continue to Wave 1 | yellow-⚠ line, continue | green-✓ line, continue |
| `block` | raise `FeatureOrchestrationError` before Wave 1 | yellow-⚠ line, continue | green-✓ line, continue |
`block` fires only on total failure (`installs_failed == installs_attempted`) of at least one essential stack (i.e. a detected stack not listed in `optional_stacks`). Partial success is never blocking — the orchestrator is still making progress.
When to use block¶
- Dedicated AutoBuild hosts (e.g. GB10-class boxes running `claude-agent-sdk`) where a silently broken environment will just produce Coach-runs-against-wrong-interpreter failures a wave later. The loud error at `_bootstrap_environment` beats a cryptic pytest failure two turns in.
- `requires-python` constraints that don't match the host (the forge/GB10 case). The hard-fail message carries the `requires-python` value and the PEP-668 stderr tail, so the fix is usually a one-line host upgrade.
- CI / scheduled feature builds where an unattended run with a broken venv is worse than a hard failure that gets picked up by the next agent.
When to stay on warn¶
- Developer machines where you expect to manually reconcile a partial install (e.g. you're iterating on a new stack and the install genuinely may not work yet).
- Features that intentionally span optional stacks. Use `optional_stacks:` to downgrade specific stacks rather than flipping the whole feature back to `warn`.
Error message shape¶
On hard-fail the error includes the attempted stacks, the manifest path, requires-python (when declared), a PEP-668 marker (when the install tripped externally-managed-environment), the stderr tail, and a hint:
FeatureOrchestrationError: Bootstrap hard-fail: 0/1 install(s) succeeded for essential stack(s): python.
Manifest: /worktree/pyproject.toml
Manifest requires-python: >=3.13
Detected PEP 668 externally-managed-environment failure.
Install stderr (tail):
error: externally-managed-environment
…
Hint: set `bootstrap_failure_mode: warn` in .guardkit/config.yaml (or pass `--bootstrap-failure-mode warn`) to downgrade this to a non-blocking warning.
requires-python pre-check¶
The MacBook Pro jarvis FEAT-J002 run surfaced a neighboring failure mode: pip's error for an interpreter/requires-python mismatch is opaque (Package 'jarvis' requires a different Python: 3.14.2 not in '<3.13,>=3.12') and easy to miss in the install log. To cut that diagnostic round-trip the orchestrator runs a pre-pip requires-python check (amendment from TASK-REV-JMBP Workstream E).
How it works:
- For every Python manifest, read `requires-python` from `[project]` (PEP 621) or `[tool.poetry].python` (legacy Poetry).
- Compare against the active interpreter's version via `packaging.specifiers.SpecifierSet`.
- If any manifest fails the check, act based on `bootstrap_failure_mode`:
  - `warn` — emit a structured `⚠ Python X does not satisfy requires-python=Y` line per mismatch, then continue to pip (pip remains authoritative).
  - `block` — raise before pip runs, with a multi-line remediation hint that names `uv`, `pyenv`, and `conda` install commands for a compatible minor version.
The pre-check is a silent no-op when:
- A manifest declares no `requires-python` constraint.
- The specifier string is malformed (pip stays authoritative).
- `packaging.specifiers` is unavailable at import time.
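A sketch of the PEP 621 path of that check (assumes stdlib `tomllib`, Python 3.11+; the real pre-check also reads legacy Poetry manifests):

```python
import sys
import tomllib
from packaging.specifiers import InvalidSpecifier, SpecifierSet
from packaging.version import Version

def requires_python_ok(pyproject_path: str) -> bool | None:
    """True/False if checkable; None when the pre-check must stay a no-op."""
    with open(pyproject_path, "rb") as f:
        data = tomllib.load(f)
    spec = data.get("project", {}).get("requires-python")
    if spec is None:
        return None  # no constraint declared -> silent no-op
    try:
        specifier = SpecifierSet(spec)
    except InvalidSpecifier:
        return None  # malformed specifier -> pip stays authoritative
    active = Version(".".join(map(str, sys.version_info[:3])))
    return active in specifier
```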
Example block-mode error:
FeatureOrchestrationError: Bootstrap requires-python mismatch (pre-pip).
Python 3.14.2 does not satisfy requires-python=`<3.13,>=3.12` for /worktree/pyproject.toml.
Install a compatible interpreter with one of:
• uv python install 3.12
• pyenv install 3.12 && pyenv local 3.12
• conda create -n <name> python=3.12 && conda activate <name>
Hint: set `bootstrap_failure_mode: warn` in .guardkit/config.yaml (or pass `--bootstrap-failure-mode warn`) to downgrade this to a non-blocking warning.
Related work¶
- Wave 2 task TASK-FIX-7A05 wires the bootstrap venv into the Coach pytest interpreter, closing the rest of the GB10 class-of-defect (venv-was-built-but-Coach-ignored-it).
- Review: TASK-REV-E4F5 findings F6 and F7.
- Amendment: TASK-REV-JMBP Workstream E (requires-python pre-check).
Troubleshooting¶
If AutoBuild stalls immediately¶
Symptom: AutoBuild exits or stalls within the first few seconds — no task.started event emitted, or the final-summary classification reads player_invocation_stall rather than a mid-loop coach-feedback stall.
This class-of-defect — Player invocation systematically errored before any work happened, and the orchestrator misnamed the problem at summary-time — has been observed twice:
- TASK-REV-8A08 on FEAT-486D / TASK-AD-004 (SDK stream timeout)
- TASK-REV-E4F5 on FEAT-FORGE-002 (SDK auth + version skew)
Use this triage table to self-diagnose before opening a review:
| Symptom (from summary) | Likely cause | Quick check |
|---|---|---|
| `player_invocation_stall` + auth error | Not logged into Claude on this host | `claude` CLI login |
| `player_invocation_stall` + "Unknown message type" | SDK version skew (e.g. `rate_limit_event`) | `pip show claude-agent-sdk` — compare to working host |
| `player_invocation_stall` + stream/timeout | Network or endpoint config | `ANTHROPIC_BASE_URL` + `vllm-serve.sh` — see TASK-REV-8A08 |
| `unrecoverable_stall [environment_stall]` | Worktree environment is broken — Player gates pass but independent tests fail with `failure_classification == "infrastructure"` for ≥3 consecutive turns (e.g. interpreter mismatch with `requires-python`) | Read the diagnostic line — it names the bootstrap state, the active interpreter, and the manifest's `requires-python`. Set `bootstrap_failure_mode: block` in `.guardkit/config.yaml` so the next run hard-fails at preflight, then install a compatible interpreter via `uv python install <X>` / `pyenv install <X>` / `conda create -n <name> python=<X>`. See TASK-ABSR-C3D4. |
Where the signal lives: player_result.error in the Player turn artefact, and recovery_metadata on synthetic reports. The orchestrator currently captures the signal at the call site but does not consult it at final-summary time; treat the summary's stall category as a starting guess, not an authoritative diagnosis, and open player_result.error first.
For environment_stall, the signal lives in the trailing coach_turn_*.json files: each turn carries validation_results.quality_gates.all_gates_passed == True, validation_results.independent_tests.tests_passed == False, and a test_verification issue with failure_classification == "infrastructure". The summary renderer reads <worktree>/.guardkit/bootstrap_state.json and the worktree's pyproject.toml (via DetectedManifest.get_requires_python()) to enrich the diagnostic; missing or corrupt state files are tolerated. Non-Python worktrees fall through to the generic stall message.
Related fix work: TASK-FIX-7A01 pins claude-agent-sdk to a known-good band and logs the version at startup, which removes the SDK-skew failure mode from this table. Until that lands, the pip show claude-agent-sdk check is the fastest way to confirm or rule out version skew. The companion task TASK-ABSR-A1B2 makes bootstrap_failure_mode: block the smart-default when requires-python is declared, eliminating the environment_stall precondition for new runs.
Environment-class conditional approval (environment_conditional_approval)¶
When a user explicitly opts into bootstrap_failure_mode: warn and the bootstrap install silently fails, Coach's independent pytest run will hit ImportError / ModuleNotFoundError even though Player's gates passed inside the (broken) venv. Without intervention this presents as a feedback loop on a purely environmental fault — Player has nothing to fix.
The fifth conditional-approval clause in CoachValidator.validate covers this case. It fires when all of the following are true:
- `failure_class == "infrastructure"` and `failure_confidence == "ambiguous"` (the `_INFRA_AMBIGUOUS` patterns: `ImportError`, `ModuleNotFoundError`, `No module named`)
- `gates_status.all_gates_passed == True` (Player did everything right inside the venv it had)
- `task.requires_infrastructure` is empty / unset (the existing `requires_infrastructure` + Docker unavailable branch already handles the declared-deps case)
- `<worktree>/.guardkit/bootstrap_state.json` exists and reports `success: False` (read by the `_bootstrap_likely_broken` helper — a missing file or `success: true` is treated as "unknown / can't prove env failure" and the clause does not fire)
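A sketch of what that last check has to do (a hypothetical helper body; the real `_bootstrap_likely_broken` may differ):

```python
import json
from pathlib import Path

def bootstrap_likely_broken(worktree: Path) -> bool:
    """True only when bootstrap_state.json exists and proves the install failed."""
    state_file = worktree / ".guardkit" / "bootstrap_state.json"
    try:
        state = json.loads(state_file.read_text())
    except (OSError, json.JSONDecodeError):
        return False  # missing/corrupt file -> "can't prove env failure"
    return state.get("success") is False  # success: true -> clause must not fire
```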
When the clause fires, the CoachValidationResult is tagged with environment_conditional_approval: True (visible in the JSON payload's top-level environment_conditional_approval key, alongside the longer-standing approved_without_independent_tests flag). Logs emit at WARNING level:
Conditional approval for {task_id}: environment-class infrastructure failure
(infrastructure/ambiguous) on a known-broken bootstrap; all Player gates passed.
Marking approved with environment flag.
The clause is deliberately narrow:
- A real `ImportError` on a healthy bootstrap (Player imported a non-existent module) does NOT match — `bootstrap_state.json` reports `success: True`, so the helper returns False, the clause doesn't fire, and the existing feedback path runs as before.
- A code defect (`failure_class == "code"`) on a broken bootstrap also doesn't match — only the infrastructure/ambiguous classification triggers the clause.
- It pairs with the smart-default `bootstrap_failure_mode: block` work (TASK-ABSR-A1B2) and the `environment_stall` sub-type detection (TASK-ABSR-C3D4): when this clause fires, neither stall happens; when the clause doesn't fire (e.g. `block` mode aborted the run at preflight), neither does the stall.
Origin: belt-and-braces layer for TASK-ABSR-A1B2's smart-default, motivated by the FEAT-J004-702C / TASK-J004-004 stall. See TASK-ABSR-2468.
Environment variable tunables¶
| Env var | Default | Purpose |
|---|---|---|
| `GUARDKIT_MIN_TURN_BUDGET` | `600` | Overrides `MIN_TURN_BUDGET_SECONDS` (the wall-clock floor required before `_loop_phase` starts a new turn). Lower it (e.g. `300`) on tasks where Coach feedback is small and short turn-2/3 retries are still useful; the orchestrator will exit with `timeout_budget_exhausted` only when the remaining budget falls below this floor. Read once at module load. See TASK-ABSR-MTBC. |
No Events in File¶
Symptom: .guardkit/autobuild/{task_id}/events.jsonl is empty or missing.
Cause: The emitter is not wired into the orchestrator. This was implemented in TASK-INST-013 (CLI emitter wiring).
Fix: Verify that the AutoBuild CLI creates a JSONLFileBackend and injects it into the orchestrators. Check guardkit/cli/autobuild.py for the emitter setup.
DigestLoadError¶
Symptom: DigestLoadError: Digest file not found for role 'player'
Cause: Digest files are missing from .guardkit/digests/. This was implemented in TASK-INST-014 (digest file creation).
Fix: Ensure all four digest files exist:
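```
.guardkit/digests/player.md
.guardkit/digests/coach.md
.guardkit/digests/resolver.md
.guardkit/digests/router.md
```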
Events File Growing Large¶
The events file is append-only with no automatic rotation. For long-running feature builds, the file can grow large.
Safe to delete between runs:
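```bash
rm .guardkit/autobuild/TASK-001/events.jsonl
```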
A new file will be created on the next run. No data loss occurs for the current run — only historical events are removed.
NATS Backend Warnings¶
Symptom: WARNING: NATS publish failed: ...
Cause: The NATS backend is optional. If the NATS server is unreachable, warnings are logged but events are still persisted to JSONL.
Fix: No action required. JSONL persistence is always-on and unaffected by NATS failures. To silence warnings, remove the NATSBackend from the CompositeBackend configuration.
Events with status: "error"¶
Filter for error events to diagnose API issues:
# LLM call errors
jq 'select(.event_type == "llm.call" and .status == "error") |
{task_id, error_type, model, provider}' \
.guardkit/autobuild/TASK-001/events.jsonl
# Rate limit events specifically
jq 'select(.event_type == "llm.call" and .error_type == "rate_limited") |
{timestamp, model, provider}' \
.guardkit/autobuild/TASK-001/events.jsonl
Related Documentation¶
- AutoBuild Workflow Guide — Full AutoBuild architecture, Player-Coach loop, and CLI usage
- Local Backend AutoBuild Guide — vLLM server setup, model alignment, and performance tuning
- CLI vs Claude Code — When to use the CLI vs slash commands
- Template Pattern Player Context — Feature spec for template pattern injection (FEAT-TPL-PLAYER)