---
name: testing-agent
description: Run goal-based evaluation tests for agents. Use when you need to verify an agent meets its goals, debug failing tests, or iterate on agent improvements based on test results.
---
# Testing Workflow
This skill provides tools for testing agents built with the building-agents skill.
## Workflow Overview
1. `mcp__agent-builder__list_tests` - Check what tests exist
2. `mcp__agent-builder__generate_constraint_tests` or `mcp__agent-builder__generate_success_tests` - Get test guidelines
3. **Write tests directly** using the Write tool with the guidelines provided
4. `mcp__agent-builder__run_tests` - Execute tests
5. `mcp__agent-builder__debug_test` - Debug failures
## How Test Generation Works
The `generate_*_tests` MCP tools return **guidelines and templates** - they do NOT generate test code via LLM.
You (Claude) write the tests directly using the Write tool based on the guidelines.
### Example Workflow
```python
# Step 1: Get test guidelines
result = mcp__agent-builder__generate_constraint_tests(
goal_id="my-goal",
goal_json='{"id": "...", "constraints": [...]}',
agent_path="exports/my_agent"
)
# Step 2: The result contains:
# - output_file: where to write tests
# - file_header: imports and fixtures to use
# - test_template: format for test functions
# - constraints_formatted: the constraints to test
# - test_guidelines: rules for writing tests
# Step 3: Write tests directly using the Write tool
Write(
file_path=result["output_file"],
content=result["file_header"] + test_code_you_write
)
# Step 4: Run tests via MCP tool
mcp__agent-builder__run_tests(
goal_id="my-goal",
agent_path="exports/my_agent"
)
# Step 5: Debug failures via MCP tool
mcp__agent-builder__debug_test(
goal_id="my-goal",
test_name="test_constraint_foo",
agent_path="exports/my_agent"
)
```
---
# Testing Agents with MCP Tools
Run goal-based evaluation tests for agents built with the building-agents skill.
**Key Principle: MCP tools provide guidelines, Claude writes tests directly**
- ✅ Get guidelines: `generate_constraint_tests`, `generate_success_tests` → returns templates and guidelines
- ✅ Write tests: Use the Write tool with the provided file_header and test_template
- ✅ Run tests: `run_tests` (runs pytest via subprocess)
- ✅ Debug failures: `debug_test` (re-runs single test with verbose output)
- ✅ List tests: `list_tests` (scans Python test files)
- ✅ Tests stored in `exports/{agent}/tests/test_*.py`
## Architecture: Python Test Files
```
exports/my_agent/
├── __init__.py
├── agent.py ← Agent to test
├── nodes/__init__.py
├── config.py
├── __main__.py
└── tests/ ← Test files written by MCP tools
├── conftest.py # Shared fixtures (auto-created)
├── test_constraints.py
├── test_success_criteria.py
└── test_edge_cases.py
```
**Tests import the agent directly:**
```python
import pytest
from exports.my_agent import default_agent
@pytest.mark.asyncio
async def test_happy_path(mock_mode):
result = await default_agent.run({"query": "test"}, mock_mode=mock_mode)
assert result.success
assert len(result.output) > 0
```
## Why This Approach
- MCP tools provide consistent test guidelines with proper imports, fixtures, and API key enforcement
- Claude writes tests directly, eliminating circular LLM dependencies in the MCP server
- `run_tests` parses pytest output into structured results for iteration
- `debug_test` provides formatted output with actionable debugging info
- File headers include conftest.py setup with proper fixtures
## Quick Start
1. **Check existing tests** - `list_tests(goal_id, agent_path)`
2. **Get test guidelines** - `generate_constraint_tests` or `generate_success_tests`
3. **Write tests** - Use the Write tool with the provided file_header and guidelines
4. **Run tests** - `run_tests(goal_id, agent_path)`
5. **Debug failures** - `debug_test(goal_id, test_name, agent_path)`
6. **Iterate** - Repeat steps 4-5 until all pass
## ⚠️ Credential Requirements for Testing
**CRITICAL: Testing requires ALL credentials the agent depends on.** This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).
### Prerequisites
Before running agent tests, you MUST collect ALL required credentials from the user.
**Step 1: LLM API Key (always required)**
```bash
export ANTHROPIC_API_KEY="your-key-here"
```
**Step 2: Tool-specific credentials (depends on agent's tools)**
Inspect the agent's `mcp_servers.json` and tool configuration to determine which tools the agent uses, then check for all required credentials:
```python
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS
creds = CredentialManager()
# Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...] # e.g., ["hubspot_search_contacts", "web_search", ...]
# Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)
```
Common tool credentials:
| Tool | Env Var | Help URL |
|------|---------|----------|
| HubSpot CRM | `HUBSPOT_ACCESS_TOKEN` | https://developers.hubspot.com/docs/api/private-apps |
| Brave Search | `BRAVE_SEARCH_API_KEY` | https://brave.com/search/api/ |
| Google Search | `GOOGLE_SEARCH_API_KEY` + `GOOGLE_SEARCH_CX` | https://developers.google.com/custom-search |
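For example, an agent that uses both HubSpot and Brave Search would need both variables exported before testing (the values below are placeholders):

```bash
# Placeholder values - substitute the real credentials for your agent's tools
export HUBSPOT_ACCESS_TOKEN="your-token-here"
export BRAVE_SEARCH_API_KEY="your-key-here"
```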
**Why ALL credentials are required:**
- Tests need to execute the agent's LLM nodes to validate behavior
- Tools with missing credentials will return error dicts instead of real data
- Mock mode bypasses everything, providing no confidence in real-world performance
- The `AgentRunner.run()` method validates credentials at startup and will fail fast if any are missing
### Mock Mode Limitations
Mock mode (`--mock` flag or `mock_mode=True`) is **ONLY for structure validation**:
✓ Validates graph structure (nodes, edges, connections)
✓ Tests that code doesn't crash on execution
✗ Does NOT test LLM message generation
✗ Does NOT test reasoning or decision-making quality
✗ Does NOT test constraint validation (length limits, format rules)
✗ Does NOT test real API integrations or tool use
✗ Does NOT test personalization or content quality
**Bottom line:** If you're testing whether an agent achieves its goal, you MUST use real credentials for ALL services.
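For reference, a structure-only smoke check looks like the sketch below (it assumes the `default_agent` export and `mock_mode` parameter shown elsewhere in this document, and verifies only that the graph executes):

```python
# Structure validation ONLY - does not test LLM behavior, tool use, or output quality
import asyncio

from exports.my_agent import default_agent

result = asyncio.run(default_agent.run({"query": "test"}, mock_mode=True))
print(result.success, result.steps_executed, result.path)  # graph reachability, not quality
```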
### Enforcing Credentials in Tests
When generating tests, **ALWAYS include credential checks for ALL required services**:
```python
import os
import pytest
from aden_tools.credentials import CredentialManager
# At the top of every test file
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1 for structure validation only."
)
@pytest.fixture(scope="session", autouse=True)
def check_credentials():
"""Ensure ALL required credentials are set for real testing."""
creds = CredentialManager()
mock_mode = os.environ.get("MOCK_MODE")
# Always check LLM key
if not creds.is_available("anthropic"):
if mock_mode:
print("\n⚠️ Running in MOCK MODE - structure validation only")
print(" This does NOT test LLM behavior or agent quality")
print(" Set ANTHROPIC_API_KEY for real testing\n")
else:
pytest.fail(
"\n❌ ANTHROPIC_API_KEY not set!\n\n"
"Real testing requires an API key. Choose one:\n"
"1. Set API key (RECOMMENDED):\n"
" export ANTHROPIC_API_KEY='your-key-here'\n"
"2. Run structure validation only:\n"
" MOCK_MODE=1 pytest exports/{agent}/tests/\n\n"
"Note: Mock mode does NOT validate agent behavior or quality."
)
# Check tool-specific credentials (skip in mock mode)
if not mock_mode:
# List the tools this agent uses - update per agent
agent_tools = [] # e.g., ["hubspot_search_contacts", "hubspot_get_contact"]
missing = creds.get_missing_for_tools(agent_tools)
if missing:
lines = ["\n❌ Missing tool credentials!\n"]
for name in missing:
spec = creds.specs.get(name)
if spec:
lines.append(f" {spec.env_var} - {spec.description}")
if spec.help_url:
lines.append(f" Setup: {spec.help_url}")
lines.append("\nSet the required environment variables and re-run.")
pytest.fail("\n".join(lines))
```
### User Communication
When the user asks to test an agent, **ALWAYS check for ALL credentials first** — not just the LLM key:
1. **Identify the agent's tools** from `agent.json` or `mcp_servers.json`
2. **Check ALL required credentials** using `CredentialManager`
3. **Ask the user to provide any missing credentials** before proceeding
```python
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS
creds = CredentialManager()
# 1. Check LLM key
missing_creds = []
if not creds.is_available("anthropic"):
missing_creds.append(("ANTHROPIC_API_KEY", "Anthropic API key for LLM calls"))
# 2. Check tool-specific credentials
agent_tools = [...] # Determined from agent config
missing_tools = creds.get_missing_for_tools(agent_tools)
for name in missing_tools:
spec = CREDENTIAL_SPECS.get(name)
if spec:
missing_creds.append((spec.env_var, spec.description))
# 3. Present ALL missing credentials to the user at once
if missing_creds:
print("⚠️ Missing credentials required by this agent:\n")
for env_var, description in missing_creds:
print(f" • {env_var} — {description}")
print()
print("Please set the missing environment variables:")
for env_var, _ in missing_creds:
print(f" export {env_var}='your-value-here'")
print()
print("Or run in mock mode (structure validation only):")
print(" MOCK_MODE=1 pytest exports/{agent}/tests/")
# Ask user to provide credentials or choose mock mode
AskUserQuestion(...)
```
**IMPORTANT:** Do NOT skip credential collection. If an agent uses HubSpot tools, the user MUST provide `HUBSPOT_ACCESS_TOKEN`. If it uses web search, the user MUST provide the appropriate search API key. Collect ALL missing credentials in a single prompt rather than discovering them one at a time during test failures.
## The Three-Stage Flow
```
┌─────────────────────────────────────────────────────────────────────────┐
│ GOAL STAGE │
│ (building-agents skill) │
│ │
│ 1. User defines goal with success_criteria and constraints │
│ 2. Goal written to agent.py immediately │
│ 3. Generate CONSTRAINT TESTS → Write to tests/ → USER APPROVAL │
│ Files created: exports/{agent}/tests/test_constraints.py │
└─────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────┐
│ AGENT STAGE │
│ (building-agents skill) │
│ │
│ Build nodes + edges, written immediately to files │
│ Constraint tests can run during development: │
│ run_tests(goal_id, agent_path, test_types='["constraint"]') │
└─────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────┐
│ EVAL STAGE (this skill) │
│ │
│ 1. Generate SUCCESS_CRITERIA TESTS → Write to tests/ → USER APPROVAL │
│ Files created: exports/{agent}/tests/test_success_criteria.py │
│ 2. Run all tests: run_tests(goal_id, agent_path) │
│ 3. On failure → debug_test(goal_id, test_name, agent_path) │
│ 4. Iterate: Edit agent code → Re-run run_tests (instant feedback) │
└─────────────────────────────────────────────────────────────────────────┘
```
## Step-by-Step: Testing an Agent
### Step 1: Check Existing Tests
**ALWAYS check first** before generating new tests:
```python
mcp__agent-builder__list_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)
```
This shows what test files already exist. If tests exist:
- Review the list to see what's covered
- Ask user if they want to add more or run existing tests
### Step 2: Get Constraint Test Guidelines (Goal Stage)
After goal is defined, get test guidelines using the MCP tool:
```python
# First, read the goal from agent.py to get the goal JSON
goal_code = Read(file_path="exports/your_agent/agent.py")
# Extract the goal definition and convert to JSON
# Get constraint test guidelines via MCP tool
result = mcp__agent-builder__generate_constraint_tests(
goal_id="your-goal-id",
goal_json='{"id": "goal-id", "name": "...", "constraints": [...]}',
agent_path="exports/your_agent"
)
```
**Response includes:**
- `output_file`: Where to write tests (e.g., `exports/your_agent/tests/test_constraints.py`)
- `file_header`: Imports, fixtures, and pytest setup to use at the top of the file
- `test_template`: Format for test functions
- `constraints_formatted`: The constraints to test
- `test_guidelines`: Rules and best practices for writing tests
- `instruction`: How to proceed
**Write tests directly** using the provided guidelines:
```python
# Write tests using the Write tool
Write(
file_path=result["output_file"],
content=result["file_header"] + "\n\n" + your_test_code
)
```
### Step 3: Get Success Criteria Test Guidelines (Eval Stage)
After agent is fully built, get success criteria test guidelines:
```python
# Get success criteria test guidelines via MCP tool
result = mcp__agent-builder__generate_success_tests(
goal_id="your-goal-id",
goal_json='{"id": "goal-id", "name": "...", "success_criteria": [...]}',
node_names="analyze_request,search_web,format_results",
tool_names="web_search,web_scrape",
agent_path="exports/your_agent"
)
```
**Write tests directly** using the provided guidelines:
```python
# Write tests using the Write tool
Write(
file_path=result["output_file"],
content=result["file_header"] + "\n\n" + your_test_code
)
```
### Step 4: Test Fixtures (conftest.py)
The `file_header` returned by the MCP tools includes proper imports and fixtures.
You should also create a conftest.py file in the tests directory with shared fixtures:
```python
# Create conftest.py with the conftest template
Write(
file_path="exports/your_agent/tests/conftest.py",
content=conftest_content # Use PYTEST_CONFTEST_TEMPLATE format
)
```
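A minimal sketch of what that conftest.py typically contains is shown below. The authoritative contents come from the conftest template returned by the MCP tools, so treat this only as an illustration of the `mock_mode` fixture and credential check used throughout this document:

```python
# exports/your_agent/tests/conftest.py - illustrative sketch; prefer the MCP-provided template
import os

import pytest
from aden_tools.credentials import CredentialManager


@pytest.fixture
def mock_mode() -> bool:
    """True when MOCK_MODE is set; tests pass this to default_agent.run()."""
    return bool(os.environ.get("MOCK_MODE"))


@pytest.fixture(scope="session", autouse=True)
def check_credentials():
    """Fail fast when the LLM key is missing and mock mode is not requested."""
    if not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"):
        pytest.fail(
            "ANTHROPIC_API_KEY not set. Set it for real testing, "
            "or run MOCK_MODE=1 pytest for structure validation only."
        )
```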
### Step 5: Run Tests
**Use the MCP tool to run tests** (not pytest directly):
```python
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)
```

**Response includes structured results:**
```json
{
"goal_id": "your-goal-id",
"overall_passed": false,
"summary": {
"total": 12,
"passed": 10,
"failed": 2,
"skipped": 0,
"errors": 0,
"pass_rate": "83.3%"
},
"test_results": [
{"file": "test_constraints.py", "test_name": "test_constraint_api_rate_limits", "status": "passed"},
{"file": "test_success_criteria.py", "test_name": "test_success_find_relevant_results", "status": "failed"}
],
"failures": [
{"test_name": "test_success_find_relevant_results", "details": "AssertionError: Expected 3-5 results..."}
]
}
```
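One way to act on this structure (a sketch that assumes the response shape shown above) is to loop over `failures` and hand each one to `debug_test`:

```python
# Sketch: drive debugging from the structured run_tests response shown above
results = mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent"
)
if not results["overall_passed"]:
    for failure in results["failures"]:
        mcp__agent-builder__debug_test(
            goal_id="your-goal-id",
            test_name=failure["test_name"],
            agent_path="exports/your_agent"
        )
```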
**Options for `run_tests`:**
```python
# Run only constraint tests
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
test_types='["constraint"]'
)
# Run with parallel workers
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
parallel=4
)
# Stop on first failure
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
fail_fast=True
)
```
### Step 6: Debug Failed Tests
**Use the MCP tool to debug** (not Bash/pytest directly):
```python
mcp__agent-builder__debug_test(
goal_id="your-goal-id",
test_name="test_success_find_relevant_results",
agent_path="exports/your_agent"
)
```
**Response includes:**
- Full verbose output from the test
- Stack trace with exact line numbers
- Captured logs and prints
- Suggestions for fixing the issue
### Step 7: Categorize Errors
When a test fails, categorize the error to guide iteration:
```python
def categorize_test_failure(test_output, agent_code):
"""Categorize test failure to guide iteration."""
# Read test output and agent code
failure_info = {
"test_name": "...",
"error_message": "...",
"stack_trace": "...",
}
# Pattern-based categorization
if any(pattern in failure_info["error_message"].lower() for pattern in [
"typeerror", "attributeerror", "keyerror", "valueerror",
"null", "none", "undefined", "tool call failed"
]):
category = "IMPLEMENTATION_ERROR"
guidance = {
"stage": "Agent",
"action": "Fix the bug in agent code",
"files_to_edit": ["agent.py", "nodes/__init__.py"],
"restart_required": False,
"description": "Code bug - fix and re-run tests"
}
elif any(pattern in failure_info["error_message"].lower() for pattern in [
"assertion", "expected", "got", "should be", "success criteria"
]):
category = "LOGIC_ERROR"
guidance = {
"stage": "Goal",
"action": "Update goal definition",
"files_to_edit": ["agent.py (goal section)"],
"restart_required": True,
"description": "Goal definition is wrong - update and rebuild"
}
elif any(pattern in failure_info["error_message"].lower() for pattern in [
"timeout", "rate limit", "empty", "boundary", "edge case"
]):
category = "EDGE_CASE"
guidance = {
"stage": "Eval",
"action": "Add edge case test and fix handling",
"files_to_edit": ["agent.py", "tests/test_edge_cases.py"],
"restart_required": False,
"description": "New scenario - add test and handle it"
}
else:
category = "UNKNOWN"
guidance = {
"stage": "Unknown",
"action": "Manual investigation required",
"restart_required": False
}
return {
"category": category,
"guidance": guidance,
"failure_info": failure_info
}
```
**Show categorization to user:**
```python
AskUserQuestion(
questions=[{
"question": f"Test failed with {category}. How would you like to proceed?",
"header": "Test Failure",
"options": [
{
"label": "Fix code directly (Recommended)" if category == "IMPLEMENTATION_ERROR" else "Update goal",
"description": guidance["description"]
},
{
"label": "Show detailed error info",
"description": "View full stack trace and logs"
},
{
"label": "Skip for now",
"description": "Continue with other tests"
}
],
"multiSelect": false
}]
)
```
### Step 8: Iterate Based on Error Category
#### IMPLEMENTATION_ERROR → Fix Agent Code
```python
# 1. Show user the exact file and line that failed
print(f"Error in: exports/{agent_name}/nodes/__init__.py:42")
print(f"Issue: 'NoneType' object has no attribute 'get'")
# 2. Read the problematic code
code = Read(file_path=f"exports/{agent_name}/nodes/__init__.py")
# 3. User can fix directly, or you suggest a fix:
Edit(
file_path=f"exports/{agent_name}/nodes/__init__.py",
old_string="if results.get('videos'):",
new_string="if results and results.get('videos'):"
)
# 4. Re-run tests immediately (instant feedback!)
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path=f"exports/{agent_name}"
)
```
#### LOGIC_ERROR → Update Goal
```python
# 1. Show user the goal definition
goal_code = Read(file_path=f"exports/{agent_name}/agent.py")
# 2. Discuss what needs to change in success_criteria or constraints
# 3. Edit the goal
Edit(
file_path=f"exports/{agent_name}/agent.py",
old_string='target="3-5 videos"',
new_string='target="1-5 videos"' # More realistic
)
# 4. May need to regenerate agent nodes if goal changed significantly
# This requires going back to building-agents skill
```
#### EDGE_CASE → Add Test and Fix
```python
# 1. Create new edge case test with API key enforcement
edge_case_test = '''
@pytest.mark.asyncio
async def test_edge_case_empty_results(mock_mode):
"""Test: Agent handles no results gracefully"""
result = await default_agent.run({{"query": "xyzabc123nonsense"}}, mock_mode=mock_mode)
# Should succeed with empty results, not crash
assert result.success or result.error is not None
if result.success:
assert result.output.get("message") == "No results found"
'''
# 2. Add to test file
Edit(
file_path=f"exports/{agent_name}/tests/test_edge_cases.py",
old_string="# Add edge case tests here",
new_string=edge_case_test
)
# 3. Fix agent to handle edge case
# Edit agent code to handle empty results
# 4. Re-run tests
```
## Test File Templates (Reference Only)
**⚠️ Do NOT copy-paste these templates directly.** Use `generate_constraint_tests` and `generate_success_tests` MCP tools to create properly structured tests with correct imports and fixtures.
These templates show the structure of generated tests for reference only.
### Constraint Test Template
```python
"""Constraint tests for {agent_name}.
These tests validate that the agent respects its defined constraints.
Requires ANTHROPIC_API_KEY for real testing.
"""
import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager
# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.mark.asyncio
async def test_constraint_{constraint_id}():
"""Test: {constraint_description}"""
# Test implementation based on constraint type
mock_mode = bool(os.environ.get("MOCK_MODE"))
result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)
# Assert constraint is respected
assert True # Replace with actual check
```
### Success Criteria Test Template
```python
"""Success criteria tests for {agent_name}.
These tests validate that the agent achieves its defined success criteria.
Requires ANTHROPIC_API_KEY for real testing - mock mode cannot validate success criteria.
"""
import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager
# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.mark.asyncio
async def test_success_{criteria_id}():
"""Test: {criteria_description}"""
mock_mode = bool(os.environ.get("MOCK_MODE"))
result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)
assert result.success, f"Agent failed: {{result.error}}"
# Verify success criterion met
# e.g., assert metric meets target
assert True # Replace with actual check
```
### Edge Case Test Template
```python
"""Edge case tests for {agent_name}.
These tests validate agent behavior in unusual or boundary conditions.
Requires ANTHROPIC_API_KEY for real testing.
"""
import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager
# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.mark.asyncio
async def test_edge_case_{scenario_name}():
"""Test: Agent handles {scenario_description}"""
mock_mode = bool(os.environ.get("MOCK_MODE"))
result = await default_agent.run({{"edge": "case_input"}}, mock_mode=mock_mode)
# Verify graceful handling
assert result.success or result.error is not None
```
## Interactive Build + Test Loop
During agent construction (Agent stage), you can run constraint tests incrementally:
```python
# After adding first node
print("Added search_node. Running relevant constraint tests...")
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path=f"exports/{agent_name}",
test_types='["constraint"]'
)
# After adding second node
print("Added filter_node. Running all constraint tests...")
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path=f"exports/{agent_name}",
test_types='["constraint"]'
)
```
This provides **immediate feedback** during development, catching issues early.
## Common Test Patterns
**Note:** All test patterns should include API key enforcement via conftest.py.
### ⚠️ CRITICAL: Framework Features You Must Know
#### OutputCleaner - Automatic I/O Cleaning (NEW!)
**The framework now automatically validates and cleans node outputs** using a fast LLM (Cerebras llama-3.3-70b) at edge traversal time. This prevents cascading failures from malformed output.
**What OutputCleaner does**:
- ✅ Validates output matches next node's input schema
- ✅ Detects JSON parsing trap (entire response in one key)
- ✅ Cleans malformed output automatically (~200-500ms, ~$0.001 per cleaning)
- ✅ Boosts success rates by 1.8-2.2x
**Impact on tests**: Tests should still use safe patterns because OutputCleaner may not catch all issues in test mode.
#### Safe Test Patterns (REQUIRED)
**❌ UNSAFE** (will cause test failures):
```python
# Direct key access - can crash!
approval_decision = result.output["approval_decision"]
assert approval_decision == "APPROVED"
# Nested access without checks
category = result.output["analysis"]["category"]
# Assuming parsed JSON structure
for issue in result.output["compliance_issues"]:
...
```
**✅ SAFE** (correct patterns):
```python
# 1. Safe dict access with .get()
output = result.output or {}
approval_decision = output.get("approval_decision", "UNKNOWN")
assert "APPROVED" in approval_decision or approval_decision == "APPROVED"
# 2. Type checking before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
category = analysis.get("category", "unknown")
# 3. Parse JSON from strings (the JSON parsing trap!)
import json
approval = "UNKNOWN"  # default if parsing fails or returns a non-dict
recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
try:
parsed = json.loads(recommendation)
if isinstance(parsed, dict):
approval = parsed.get("approval_decision", "UNKNOWN")
except json.JSONDecodeError:
approval = "UNKNOWN"
elif isinstance(recommendation, dict):
approval = recommendation.get("approval_decision", "UNKNOWN")
# 4. Safe iteration with type check
compliance_issues = output.get("compliance_issues", [])
if isinstance(compliance_issues, list):
for issue in compliance_issues:
...
```
#### Helper Functions for Safe Access
**Add to conftest.py**:
```python
import json
import re
def _parse_json_from_output(result, key):
"""Parse JSON from agent output (framework may store full LLM response as string)."""
response_text = result.output.get(key, "")
# Remove markdown code blocks if present
json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()
try:
return json.loads(json_text)
except (json.JSONDecodeError, AttributeError, TypeError):
return result.output.get(key)
def safe_get_nested(result, key_path, default=None):
"""Safely get nested value from result.output."""
output = result.output or {}
current = output
for key in key_path:
if isinstance(current, dict):
current = current.get(key)
elif isinstance(current, str):
try:
json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
parsed = json.loads(json_text)
if isinstance(parsed, dict):
current = parsed.get(key)
else:
return default
except json.JSONDecodeError:
return default
else:
return default
return current if current is not None else default
# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested
```
**Usage in tests**:
```python
# Use helper to parse JSON safely
parsed = pytest.parse_json_from_output(result, "recommendation")
if isinstance(parsed, dict):
approval = parsed.get("approval_decision", "UNKNOWN")
# Safe nested access
risk_score = pytest.safe_get_nested(result, ["analysis", "risk_score"], default=0.0)
```
#### Test Count Guidance
**Generate 8-15 tests total, NOT 30+**
- ✅ 2-3 tests per success criterion
- ✅ 1 happy path test
- ✅ 1 boundary/edge case test
- ✅ 1 error handling test (optional)
**Why fewer tests?**:
- Each test requires a real LLM call (~3 seconds, costs money)
- 30 tests = 90 seconds, $0.30+ in costs
- 12 tests = 36 seconds, $0.12 in costs
- Focus on quality over quantity
#### ExecutionResult Fields (Important!)
**`result.success=True` means NO exception, NOT goal achieved**
```python
# ❌ WRONG - assumes goal achieved
assert result.success
# ✅ RIGHT - check success AND output
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"
```
**All ExecutionResult fields**:
- `success: bool` - Execution completed without exception (NOT goal achieved!)
- `output: dict` - Complete memory snapshot (may contain raw strings)
- `error: str | None` - Error message if failed
- `steps_executed: int` - Number of nodes executed
- `total_tokens: int` - Cumulative token usage
- `total_latency_ms: int` - Total execution time
- `path: list[str]` - Node IDs traversed
- `paused_at: str | None` - Node ID if HITL pause occurred
- `session_state: dict` - State for resuming
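A hedged sketch of how these fields can back test assertions (the node id and thresholds below are illustrative, not part of any real agent):

```python
@pytest.mark.asyncio
async def test_execution_metadata(mock_mode):
    """Illustrative assertions over ExecutionResult metadata - thresholds are examples only."""
    result = await default_agent.run({"query": "test"}, mock_mode=mock_mode)
    assert result.success, f"Agent failed: {result.error}"
    assert result.steps_executed > 0            # at least one node ran
    assert "format_results" in result.path      # hypothetical node id
    assert result.paused_at is None             # no unexpected HITL pause
    assert result.total_latency_ms < 30_000     # illustrative latency budget
```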
### Happy Path Test
```python
@pytest.mark.asyncio
async def test_happy_path(mock_mode):
"""Test normal successful execution"""
result = await default_agent.run({{"query": "python tutorials"}}, mock_mode=mock_mode)
assert result.success
assert len(result.output) > 0
```
### Boundary Condition Test
```python
@pytest.mark.asyncio
async def test_boundary_minimum(mock_mode):
"""Test at minimum threshold"""
result = await default_agent.run({{"query": "very specific niche topic"}}, mock_mode=mock_mode)
assert result.success
assert len(result.output.get("results", [])) >= 1
```
### Error Handling Test
```python
@pytest.mark.asyncio
async def test_error_handling(mock_mode):
"""Test graceful error handling"""
result = await default_agent.run({{"query": ""}}, mock_mode=mock_mode) # Invalid input
assert not result.success or result.output.get("error") is not None
```
### Performance Test
```python
@pytest.mark.asyncio
async def test_performance_latency(mock_mode):
"""Test response time is acceptable"""
import time
start = time.time()
result = await default_agent.run({{"query": "test"}}, mock_mode=mock_mode)
duration = time.time() - start
assert duration < 5.0, f"Took {{duration}}s, expected <5s"
```
## Integration with building-agents
### Handoff Points
| Scenario | From | To | Action |
|----------|------|-----|--------|
| Agent built, ready to test | building-agents | testing-agent | Generate success tests |
| LOGIC_ERROR found | testing-agent | building-agents | Update goal, rebuild |
| IMPLEMENTATION_ERROR found | testing-agent | Direct fix | Edit agent files, re-run tests |
| EDGE_CASE found | testing-agent | testing-agent | Add edge case test |
| All tests pass | testing-agent | Done | Agent validated ✅ |
### Iteration Speed Comparison
| Scenario | Old Approach | New Approach |
|----------|--------------|--------------|
| **Bug Fix** | Rebuild via MCP tools (14 min) | Edit Python file, pytest (2 min) |
| **Add Test** | Generate via MCP, export (5 min) | Write test file directly (1 min) |
| **Debug** | Read subprocess logs | pdb, breakpoints, prints |
| **Inspect** | Limited visibility | Full Python introspection |
## Anti-Patterns
### Testing Best Practices
| Don't | Do Instead |
|-------|------------|
| ❌ Write tests without getting guidelines first | ✅ Use `generate_*_tests` to get proper file_header and guidelines |
| ❌ Run pytest via Bash | ✅ Use `run_tests` MCP tool for structured results |
| ❌ Debug tests with Bash pytest -vvs | ✅ Use `debug_test` MCP tool for formatted output |
| ❌ Check for tests with Glob | ✅ Use `list_tests` MCP tool |
| ❌ Skip the file_header from guidelines | ✅ Always include the file_header for proper imports and fixtures |
### General Testing
| Don't | Do Instead |
|-------|------------|
| ❌ Treat all failures the same | ✅ Use debug_test to categorize and iterate appropriately |
| ❌ Rebuild entire agent for small bugs | ✅ Edit code directly, re-run tests |
| ❌ Run tests without API key | ✅ Always set ANTHROPIC_API_KEY first |
| ❌ Write tests without understanding the constraints/criteria | ✅ Read the formatted constraints/criteria from guidelines |
## Workflow Summary
```
1. Check existing tests: list_tests(goal_id, agent_path)
→ Scans exports/{agent}/tests/test_*.py
↓
2. Get test guidelines: generate_constraint_tests, generate_success_tests
→ Returns file_header, test_template, constraints/criteria, guidelines
↓
3. Write tests: Use Write tool with the provided guidelines
→ Write tests to exports/{agent}/tests/test_*.py
↓
4. Run tests: run_tests(goal_id, agent_path)
→ Executes: pytest exports/{agent}/tests/ -v
↓
5. Debug failures: debug_test(goal_id, test_name, agent_path)
→ Re-runs single test with verbose output
↓
6. Fix based on category:
- IMPLEMENTATION_ERROR → Edit agent code directly
- LOGIC_ERROR → Update the goal definition (rebuild if it changed significantly)
- EDGE_CASE → Add an edge case test and fix the handling
↓
7. Re-run tests: run_tests(goal_id, agent_path)
↓
8. Repeat until all pass ✅
```
## MCP Tools Reference
```python
# Check existing tests (scans Python test files)
mcp__agent-builder__list_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)
# Get constraint test guidelines (returns templates and guidelines, NOT generated tests)
mcp__agent-builder__generate_constraint_tests(
goal_id="your-goal-id",
goal_json='{"id": "...", "constraints": [...]}',
agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines
# Get success criteria test guidelines
mcp__agent-builder__generate_success_tests(
goal_id="your-goal-id",
goal_json='{"id": "...", "success_criteria": [...]}',
node_names="node1,node2",
tool_names="tool1,tool2",
agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines
# Run tests via pytest subprocess
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)
# Debug a failed test (re-runs with verbose output)
mcp__agent-builder__debug_test(
goal_id="your-goal-id",
test_name="test_constraint_foo",
agent_path="exports/your_agent"
)
```
## run_tests Options
```python
# Run only constraint tests
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
test_types='["constraint"]'
)
# Run only success criteria tests
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
test_types='["success"]'
)
# Run with pytest-xdist parallelism (requires pytest-xdist)
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
parallel=4
)
# Stop on first failure
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
fail_fast=True
)
```
## Direct pytest Commands
You can also run tests directly with pytest (the MCP tools use pytest internally):
```bash
# Run all tests
pytest exports/your_agent/tests/ -v
# Run specific test file
pytest exports/your_agent/tests/test_constraints.py -v
# Run specific test
pytest exports/your_agent/tests/test_constraints.py::test_constraint_foo -vvs
# Run in mock mode (structure validation only)
MOCK_MODE=1 pytest exports/your_agent/tests/ -v
```
---
**MCP tools provide the test guidelines; Claude writes the tests to Python files and runs them with pytest via `run_tests` and `debug_test`.**