# Relevance Tuning Testing Guide

Purpose: Manual testing procedures for validating context retrieval relevance and quality metrics.

Task: TASK-GR6-011 | Feature: FEAT-GR-006 (Job-Specific Context Retrieval)
## Overview
This guide provides manual testing procedures for validating that relevance tuning correctly filters context retrieval results based on task characteristics. The relevance tuning system uses configurable thresholds to ensure high-quality, relevant context is returned for each task type.
## Relevance Threshold Configuration

### Default Thresholds
| Task Type | Threshold | Rationale |
|---|---|---|
| First-of-type | 0.5 | More inclusive - needs broader context for novel tasks |
| Standard | 0.6 | Balanced filtering for typical tasks |
| Refinement | 0.55 | Slightly more inclusive - needs context about similar approaches |
| AutoBuild | 0.5 | More inclusive - autonomous workflows benefit from more context |
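The scenarios below all call `RelevanceConfig.get_threshold()` with a `TaskCharacteristics` instance. The actual selection logic lives in `guardkit.knowledge.relevance_tuning`; the sketch below is only an illustration of one plausible priority order consistent with the scenarios in this guide (AutoBuild and refinement checked before first-of-type, with the standard threshold as the fallback), not the implementation itself.

```python
# Illustrative sketch only; the real logic is RelevanceConfig.get_threshold().
# The priority order shown here is an assumption inferred from the scenarios below.
def get_threshold_sketch(config, characteristics) -> float:
    if getattr(characteristics, "is_autobuild", False):  # AutoBuild takes priority
        return config.autobuild_threshold      # 0.5 by default
    if characteristics.is_refinement:
        return config.refinement_threshold     # 0.55 by default
    if characteristics.is_first_of_type:
        return config.first_of_type_threshold  # 0.5 by default
    return config.standard_threshold           # 0.6 by default
```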
### Configuration Options

```python
from guardkit.knowledge.relevance_tuning import (
    RelevanceConfig,
    default_config,
    strict_config,
    relaxed_config,
)

# Default thresholds
config = default_config()  # first_of_type=0.5, standard=0.6, refinement=0.55

# Stricter filtering (higher quality, fewer results)
config = strict_config()  # first_of_type=0.6, standard=0.7, refinement=0.65

# More inclusive (more results, potentially lower relevance)
config = relaxed_config()  # first_of_type=0.35, standard=0.45, refinement=0.4

# Custom configuration
config = RelevanceConfig(
    first_of_type_threshold=0.45,
    standard_threshold=0.65,
    refinement_threshold=0.5,
    autobuild_threshold=0.45,
)
```
## Test Scenarios

### Scenario 1: First-of-Type Task
Description: Test context retrieval for a task that is the first of its type in the project.
Setup:
```python
from guardkit.knowledge.task_analyzer import TaskCharacteristics, TaskType, TaskPhase
from guardkit.knowledge.relevance_tuning import default_config

characteristics = TaskCharacteristics(
    task_id="TASK-TEST-001",
    description="Implement new GraphQL endpoint",
    tech_stack="python",
    task_type=TaskType.IMPLEMENTATION,
    current_phase=TaskPhase.IMPLEMENT,
    complexity=5,
    is_first_of_type=True,  # First GraphQL task
    similar_task_count=0,
    feature_id="FEAT-API-001",
    is_refinement=False,
    refinement_attempt=0,
    previous_failure_type=None,
    avg_turns_for_type=3.0,
    success_rate_for_type=0.8,
)

config = default_config()
threshold = config.get_threshold(characteristics)
assert threshold == 0.5, f"Expected 0.5, got {threshold}"
```
Expected Behavior:

- Threshold should be 0.5 (lower, to allow more context)
- More results should pass filtering
- Architecture and pattern context should be emphasized

Verification Checklist:

- [ ] Threshold returns 0.5 for first-of-type tasks
- [ ] Retrieved context includes architecture guidance
- [ ] Retrieved context includes relevant patterns
- [ ] Context budget is increased by 30% for novelty (see the sketch below)
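The 30% novelty bonus in the last checklist item is stated but not demonstrated in this guide. The snippet below is a minimal sketch of how it could be spot-checked, assuming a hypothetical `base_budget` of 4000 tokens (the total used in the Metric Collection example later in this guide) and a simple multiplicative bonus; the real budget calculation may differ.

```python
# Hedged sketch: the base budget and the multiplicative rule are assumptions
# for illustration, not the documented budget-calculation API.
base_budget = 4000
novelty_bonus = 0.30 if characteristics.is_first_of_type else 0.0
expected_budget = int(base_budget * (1 + novelty_bonus))
assert expected_budget == 5200  # 4000 * 1.3
```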
### Scenario 2: Standard Implementation Task
Description: Test context retrieval for a typical implementation task.
Setup:
```python
characteristics = TaskCharacteristics(
    task_id="TASK-TEST-002",
    description="Add validation to user registration",
    tech_stack="python",
    task_type=TaskType.IMPLEMENTATION,
    current_phase=TaskPhase.IMPLEMENT,
    complexity=4,
    is_first_of_type=False,
    similar_task_count=5,  # Similar validation tasks exist
    feature_id="FEAT-USER-001",
    is_refinement=False,
    refinement_attempt=0,
    previous_failure_type=None,
    avg_turns_for_type=2.5,
    success_rate_for_type=0.85,
)

config = default_config()
threshold = config.get_threshold(characteristics)
assert threshold == 0.6, f"Expected 0.6, got {threshold}"
```
Expected Behavior:

- Threshold should be 0.6 (standard filtering)
- Only results with a relevance score above 0.6 should pass (see the filtering sketch below)
- Balanced context allocation across categories

Verification Checklist:

- [ ] Threshold returns 0.6 for standard tasks
- [ ] Low-relevance results (score < 0.6) are filtered out
- [ ] Context quality remains high (>70% relevance rate)
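The score filter itself is internal to the retriever, so the snippet below is only an illustration of the behavior described above: hypothetical result dictionaries (shaped like the ones passed to `MetricsCollector.add_result()` later in this guide) are kept only when their score meets the standard threshold.

```python
# Illustrative only: filter hypothetical result dicts by relevance score.
results = [
    {"score": 0.82, "fact": "Validation pattern from FEAT-USER-001"},
    {"score": 0.61, "fact": "Registration flow overview"},
    {"score": 0.45, "fact": "Unrelated logging note"},  # below threshold
]
threshold = config.get_threshold(characteristics)  # 0.6 for this standard task
kept = [r for r in results if r["score"] >= threshold]
assert len(kept) == 2
```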
### Scenario 3: Refinement Task
Description: Test context retrieval for a task that is a retry after a previous failure.
Setup:
```python
characteristics = TaskCharacteristics(
    task_id="TASK-TEST-003",
    description="Fix authentication middleware",
    tech_stack="python",
    task_type=TaskType.IMPLEMENTATION,
    current_phase=TaskPhase.IMPLEMENT,
    complexity=5,
    is_first_of_type=False,
    similar_task_count=3,
    feature_id="FEAT-AUTH-001",
    is_refinement=True,  # This is a retry
    refinement_attempt=2,
    previous_failure_type="circular_dependency",
    avg_turns_for_type=4.0,
    success_rate_for_type=0.7,
)

config = default_config()
threshold = config.get_threshold(characteristics)
assert threshold == 0.55, f"Expected 0.55, got {threshold}"
```
Expected Behavior:

- Threshold should be 0.55 (slightly more inclusive for refinements)
- Warning and failure-pattern context should be emphasized
- Context budget increased by 20% for the refinement bonus

Verification Checklist:

- [ ] Threshold returns 0.55 for refinement tasks
- [ ] Warnings section is prominently included
- [ ] Context about similar failures is retrieved
- [ ] Budget allocation shifts toward warnings (35%; see the sketch below)
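How the 20% refinement bonus and the 35% warnings share are applied internally is not shown in this guide. The sketch below assumes category allocation is expressed as fractions summing to 1.0 (the regression checklist at the end of this guide requires exactly that); the category names and all fractions other than the 35% warnings share are illustrative assumptions.

```python
# Hedged sketch: refinement-oriented allocation. Only the 35% warnings share
# comes from this guide; the other categories and fractions are assumptions.
refinement_allocation = {
    "warnings": 0.35,
    "similar_outcomes": 0.30,
    "feature_context": 0.20,
    "patterns": 0.15,
}
assert abs(sum(refinement_allocation.values()) - 1.0) < 1e-9

base_budget = 4000                        # assumed base budget, as above
refined_budget = int(base_budget * 1.20)  # 20% refinement bonus
warnings_tokens = int(refined_budget * refinement_allocation["warnings"])  # 1680
```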
### Scenario 4: AutoBuild Task
Description: Test context retrieval for tasks running in /feature-build autonomous mode.
Setup:
```python
characteristics = TaskCharacteristics(
    task_id="TASK-TEST-004",
    description="Implement user profile page",
    tech_stack="python",
    task_type=TaskType.IMPLEMENTATION,
    current_phase=TaskPhase.IMPLEMENT,
    complexity=5,
    is_first_of_type=False,
    similar_task_count=2,
    feature_id="FEAT-USER-001",
    is_refinement=False,
    refinement_attempt=0,
    previous_failure_type=None,
    avg_turns_for_type=3.0,
    success_rate_for_type=0.8,
    current_actor="player",
    turn_number=2,
    is_autobuild=True,  # Running in AutoBuild mode
    has_previous_turns=True,
)

config = default_config()
threshold = config.get_threshold(characteristics)
assert threshold == 0.5, f"Expected 0.5, got {threshold}"
```
Expected Behavior:

- Threshold should be 0.5 (AutoBuild takes priority)
- Role constraints context should be included
- Turn-state context from previous turns should be loaded
- Quality gate configs should be retrieved

Verification Checklist (a manual spot-check sketch follows this list):

- [ ] Threshold returns 0.5 for AutoBuild tasks
- [ ] Role constraints section is present in context
- [ ] Previous turn context is loaded
- [ ] Quality gate thresholds are specified
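There is no single documented assertion for this checklist. One low-tech way to spot-check it, once a retrieval like the integration test later in this guide has produced a `context` object for this task, is to render the prompt and scan it for the expected sections; the search strings below are assumptions, so adjust them to the headings that `to_prompt()` actually emits.

```python
# Manual spot-check sketch; the expected section keywords are assumptions.
prompt_text = context.to_prompt().lower()
for keyword in ("role constraint", "turn", "quality gate"):
    if keyword not in prompt_text:
        print(f"WARNING: expected AutoBuild section mentioning '{keyword}' not found")
```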
### Scenario 5: Review Task
Description: Test context retrieval for code review tasks.
Setup:
```python
characteristics = TaskCharacteristics(
    task_id="TASK-TEST-005",
    description="Review authentication implementation",
    tech_stack="python",
    task_type=TaskType.REVIEW,
    current_phase=TaskPhase.REVIEW,
    complexity=4,
    is_first_of_type=False,
    similar_task_count=4,
    feature_id=None,
    is_refinement=False,
    refinement_attempt=0,
    previous_failure_type=None,
    avg_turns_for_type=1.5,
    success_rate_for_type=0.9,
)

config = default_config()
threshold = config.get_threshold(characteristics)
assert threshold == 0.6, f"Expected 0.6, got {threshold}"
```
Expected Behavior:

- Threshold should be 0.6 (standard for review tasks)
- Pattern and architecture context should be emphasized
- Budget allocation shifts toward patterns (30%) and architecture (25%)

Verification Checklist:

- [ ] Threshold returns 0.6 for review tasks
- [ ] Pattern context allocation is ~30%
- [ ] Architecture context allocation is ~25%
## Quality Metrics Testing

### Metric Collection
```python
from guardkit.knowledge.relevance_tuning import MetricsCollector, ContextQualityMetrics

# Create collector with threshold and budget
collector = MetricsCollector(
    threshold=0.6,
    total_budget=4000,
    budget_per_category={"feature_context": 10, "similar_outcomes": 10},
)

# Simulate adding results during retrieval
collector.add_result({"score": 0.8, "fact": "Pattern A"}, category="similar_outcomes")
collector.add_result({"score": 0.7, "fact": "Pattern B"}, category="similar_outcomes")
collector.add_result({"score": 0.5, "fact": "Pattern C"}, category="similar_outcomes")  # Below threshold
collector.add_result({"score": 0.9, "fact": "Feature X"}, category="feature_context")

# Track budget usage
collector.add_budget_usage(2500)

# Get aggregated metrics
metrics = collector.get_metrics()

# Verify metrics
assert metrics.total_items_retrieved == 4
assert metrics.items_above_threshold == 3  # 0.8, 0.7, 0.9
assert metrics.items_below_threshold == 1  # 0.5
assert metrics.relevance_rate == 0.75  # 3/4
assert metrics.budget_utilization == 0.625  # 2500/4000
assert metrics.is_quality_acceptable()  # 0.75 >= 0.7
```
### Quality Acceptance Criteria
| Metric | Minimum Acceptable | Target | Excellent |
|---|---|---|---|
| Relevance Rate | 70% | 80% | 90%+ |
| Budget Utilization | 60% | 75% | 85-95% |
| Avg Relevance Score | 0.55 | 0.65 | 0.75+ |
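A quick way to grade collected metrics against this table is sketched below. `is_quality_acceptable()` covers the minimum bar, as shown in the Metric Collection example above; the explicit target and excellent tier checks are illustrative assumptions copied from the table, not part of the documented API.

```python
# Hedged sketch: grade metrics against the acceptance table above.
metrics = collector.get_metrics()
assert metrics.is_quality_acceptable()  # minimum bar (relevance rate >= 70%)

# Tier thresholds below are copied from the table, not from the API.
if metrics.relevance_rate >= 0.90 and metrics.budget_utilization >= 0.85:
    tier = "excellent"
elif metrics.relevance_rate >= 0.80 and metrics.budget_utilization >= 0.75:
    tier = "target"
else:
    tier = "minimum"
print(f"Context quality tier: {tier}")
```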
## Integration Testing

### Test with JobContextRetriever
```python
import asyncio

from guardkit.knowledge.job_context_retriever import JobContextRetriever
from guardkit.knowledge.relevance_tuning import default_config
from guardkit.knowledge.task_analyzer import TaskPhase


async def test_integration():
    # Create retriever with the default config
    # (mock_graphiti_client is your test double for the Graphiti client)
    retriever = JobContextRetriever(
        graphiti=mock_graphiti_client,
        relevance_config=default_config(),
    )

    # Test task data
    task = {
        "id": "TASK-INT-001",
        "description": "Add caching to API endpoints",
        "tech_stack": "python",
        "task_type": "implementation",
        "complexity": 5,
        "feature_id": "FEAT-PERF-001",
    }

    # Retrieve with metrics
    context = await retriever.retrieve(
        task=task,
        phase=TaskPhase.IMPLEMENT,
        collect_metrics=True,
    )

    # Verify context structure
    assert context.task_id == "TASK-INT-001"
    assert context.budget_used <= context.budget_total

    # Verify metrics
    metrics = context.metrics
    assert metrics.is_quality_acceptable()

    # Print context for manual review
    print(context.to_prompt())
    return context


# Run the test
context = asyncio.run(test_integration())
```
## Troubleshooting

### Issue: All Results Filtered Out
Symptom: Context is empty despite Graphiti returning results.
Diagnosis:
```python
# Check if the threshold is too high
config = RelevanceConfig(standard_threshold=0.9)  # Very strict
# Most results will be filtered

# Solution: use the relaxed config for initial testing
config = relaxed_config()
```
### Issue: Too Many Irrelevant Results
Symptom: Context includes low-quality items.
Diagnosis:
```python
# Check if the threshold is too low
config = RelevanceConfig(standard_threshold=0.3)  # Too permissive

# Solution: increase the threshold
config = strict_config()
```
### Issue: Budget Exceeded
Symptom: Retrieved context uses more tokens than allocated.
Diagnosis:
```python
# Check budget utilization
metrics = collector.get_metrics()
if metrics.budget_utilization > 1.0:
    print(f"Budget exceeded by {(metrics.budget_utilization - 1.0) * 100:.1f}%")

# Solution: reduce max_results_per_category or increase the budget
```
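Before re-running retrieval, the fix can be sanity-checked by replaying the same usage against a larger budget with `MetricsCollector`; the constructor arguments below mirror the Metric Collection example earlier in this guide, and 6000 is an arbitrary illustrative budget.

```python
# Hedged sketch: confirm that a larger budget resolves the overrun.
collector = MetricsCollector(
    threshold=0.6,
    total_budget=6000,
    budget_per_category={"feature_context": 10, "similar_outcomes": 10},
)
collector.add_result({"score": 0.8, "fact": "Pattern A"}, category="similar_outcomes")
collector.add_budget_usage(4500)  # usage that exceeded the original 4000-token budget

metrics = collector.get_metrics()
assert metrics.budget_utilization == 0.75  # 4500 / 6000
```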
## Performance Benchmarks

### Expected Performance
| Operation | Target | Maximum |
|---|---|---|
| Task Analysis | < 100ms | 200ms |
| Budget Calculation | < 10ms | 50ms |
| Context Retrieval | < 2000ms | 5000ms |
| Metrics Calculation | < 10ms | 50ms |
### Load Testing
```python
import time
import statistics


async def benchmark_retrieval(retriever, task, iterations=10):
    times = []
    for _ in range(iterations):
        start = time.time()
        await retriever.retrieve(task, TaskPhase.IMPLEMENT)
        times.append(time.time() - start)

    print(f"Average: {statistics.mean(times) * 1000:.1f}ms")
    print(f"Median: {statistics.median(times) * 1000:.1f}ms")
    print(f"P95: {sorted(times)[int(len(times) * 0.95)] * 1000:.1f}ms")
```
## Regression Testing Checklist

Before releasing changes to relevance tuning:

- [ ] All unit tests pass (`pytest tests/knowledge/test_relevance_tuning.py -v`)
- [ ] Integration tests pass (`pytest tests/knowledge/test_job_context_retriever.py -v`)
- [ ] First-of-type threshold is 0.5
- [ ] Standard threshold is 0.6
- [ ] Refinement threshold is 0.55
- [ ] AutoBuild threshold is 0.5
- [ ] Quality metrics calculation is accurate
- [ ] Budget allocation sums to 1.0
- [ ] No performance regression (retrieval < 2s)