APM

>Agent Skill

@majiayu000/testing-agent

skilldevelopment

Run goal-based evaluation tests for agents. Use when you need to verify an agent meets its goals, debug failing tests, or iterate on agent improvements based on test results.

pythontesting
apm::install
$apm install @majiayu000/testing-agent
apm::skill.md
---
name: testing-agent
description: Run goal-based evaluation tests for agents. Use when you need to verify an agent meets its goals, debug failing tests, or iterate on agent improvements based on test results.
---

# Testing Workflow

This skill provides tools for testing agents built with the building-agents skill.

## Workflow Overview

1. `mcp__agent-builder__list_tests` - Check what tests exist
2. `mcp__agent-builder__generate_constraint_tests` or `mcp__agent-builder__generate_success_tests` - Get test guidelines
3. **Write tests directly** using the Write tool with the guidelines provided
4. `mcp__agent-builder__run_tests` - Execute tests
5. `mcp__agent-builder__debug_test` - Debug failures

## How Test Generation Works

The `generate_*_tests` MCP tools return **guidelines and templates** - they do NOT generate test code via LLM.
You (Claude) write the tests directly using the Write tool based on the guidelines.

### Example Workflow

```python
# Step 1: Get test guidelines
result = mcp__agent-builder__generate_constraint_tests(
    goal_id="my-goal",
    goal_json='{"id": "...", "constraints": [...]}',
    agent_path="exports/my_agent"
)

# Step 2: The result contains:
# - output_file: where to write tests
# - file_header: imports and fixtures to use
# - test_template: format for test functions
# - constraints_formatted: the constraints to test
# - test_guidelines: rules for writing tests

# Step 3: Write tests directly using the Write tool
Write(
    file_path=result["output_file"],
    content=result["file_header"] + test_code_you_write
)

# Step 4: Run tests via MCP tool
mcp__agent-builder__run_tests(
    goal_id="my-goal",
    agent_path="exports/my_agent"
)

# Step 5: Debug failures via MCP tool
mcp__agent-builder__debug_test(
    goal_id="my-goal",
    test_name="test_constraint_foo",
    agent_path="exports/my_agent"
)
```

---

# Testing Agents with MCP Tools

Run goal-based evaluation tests for agents built with the building-agents skill.

**Key Principle: MCP tools provide guidelines, Claude writes tests directly**
- ✅ Get guidelines: `generate_constraint_tests`, `generate_success_tests` → returns templates and guidelines
- ✅ Write tests: Use the Write tool with the provided file_header and test_template
- ✅ Run tests: `run_tests` (runs pytest via subprocess)
- ✅ Debug failures: `debug_test` (re-runs single test with verbose output)
- ✅ List tests: `list_tests` (scans Python test files)
- ✅ Tests stored in `exports/{agent}/tests/test_*.py`

## Architecture: Python Test Files

```
exports/my_agent/
├── __init__.py
├── agent.py              ← Agent to test
├── nodes/__init__.py
├── config.py
├── __main__.py
└── tests/                ← Test files written by MCP tools
    ├── conftest.py       # Shared fixtures (auto-created)
    ├── test_constraints.py
    ├── test_success_criteria.py
    └── test_edge_cases.py
```

**Tests import the agent directly:**
```python
import pytest
from exports.my_agent import default_agent


@pytest.mark.asyncio
async def test_happy_path(mock_mode):
    result = await default_agent.run({"query": "test"}, mock_mode=mock_mode)
    assert result.success
    assert len(result.output) > 0
```

## Why This Approach

- MCP tools provide consistent test guidelines with proper imports, fixtures, and API key enforcement
- Claude writes tests directly, eliminating circular LLM dependencies in the MCP server
- `run_tests` parses pytest output into structured results for iteration
- `debug_test` provides formatted output with actionable debugging info
- File headers include conftest.py setup with proper fixtures

## Quick Start

1. **Check existing tests** - `list_tests(goal_id, agent_path)`
2. **Get test guidelines** - `generate_constraint_tests` or `generate_success_tests`
3. **Write tests** - Use the Write tool with the provided file_header and guidelines
4. **Run tests** - `run_tests(goal_id, agent_path)`
5. **Debug failures** - `debug_test(goal_id, test_name, agent_path)`
6. **Iterate** - Repeat steps 4-5 until all pass

## ⚠️ Credential Requirements for Testing

**CRITICAL: Testing requires ALL credentials the agent depends on.** This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).

### Prerequisites

Before running agent tests, you MUST collect ALL required credentials from the user.

**Step 1: LLM API Key (always required)**
```bash
export ANTHROPIC_API_KEY="your-key-here"
```

**Step 2: Tool-specific credentials (depends on agent's tools)**

Inspect the agent's `mcp_servers.json` and tool configuration to determine which tools the agent uses, then check for all required credentials:

```python
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS

creds = CredentialManager()

# Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...]  # e.g., ["hubspot_search_contacts", "web_search", ...]

# Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)
```

Common tool credentials:
| Tool | Env Var | Help URL |
|------|---------|----------|
| HubSpot CRM | `HUBSPOT_ACCESS_TOKEN` | https://developers.hubspot.com/docs/api/private-apps |
| Brave Search | `BRAVE_SEARCH_API_KEY` | https://brave.com/search/api/ |
| Google Search | `GOOGLE_SEARCH_API_KEY` + `GOOGLE_SEARCH_CX` | https://developers.google.com/custom-search |

**Why ALL credentials are required:**
- Tests need to execute the agent's LLM nodes to validate behavior
- Tools with missing credentials will return error dicts instead of real data
- Mock mode bypasses everything, providing no confidence in real-world performance
- The `AgentRunner.run()` method validates credentials at startup and will fail fast if any are missing

### Mock Mode Limitations

Mock mode (`--mock` flag or `mock_mode=True`) is **ONLY for structure validation**:

✓ Validates graph structure (nodes, edges, connections)
✓ Tests that code doesn't crash on execution
✗ Does NOT test LLM message generation
✗ Does NOT test reasoning or decision-making quality
✗ Does NOT test constraint validation (length limits, format rules)
✗ Does NOT test real API integrations or tool use
✗ Does NOT test personalization or content quality

**Bottom line:** If you're testing whether an agent achieves its goal, you MUST use real credentials for ALL services.

### Enforcing Credentials in Tests

When generating tests, **ALWAYS include credential checks for ALL required services**:

```python
import os
import pytest
from aden_tools.credentials import CredentialManager

# At the top of every test file
pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1 for structure validation only."
)


@pytest.fixture(scope="session", autouse=True)
def check_credentials():
    """Ensure ALL required credentials are set for real testing."""
    creds = CredentialManager()
    mock_mode = os.environ.get("MOCK_MODE")

    # Always check LLM key
    if not creds.is_available("anthropic"):
        if mock_mode:
            print("\n⚠️  Running in MOCK MODE - structure validation only")
            print("   This does NOT test LLM behavior or agent quality")
            print("   Set ANTHROPIC_API_KEY for real testing\n")
        else:
            pytest.fail(
                "\n❌ ANTHROPIC_API_KEY not set!\n\n"
                "Real testing requires an API key. Choose one:\n"
                "1. Set API key (RECOMMENDED):\n"
                "   export ANTHROPIC_API_KEY='your-key-here'\n"
                "2. Run structure validation only:\n"
                "   MOCK_MODE=1 pytest exports/{agent}/tests/\n\n"
                "Note: Mock mode does NOT validate agent behavior or quality."
            )

    # Check tool-specific credentials (skip in mock mode)
    if not mock_mode:
        # List the tools this agent uses - update per agent
        agent_tools = []  # e.g., ["hubspot_search_contacts", "hubspot_get_contact"]
        missing = creds.get_missing_for_tools(agent_tools)
        if missing:
            lines = ["\n❌ Missing tool credentials!\n"]
            for name in missing:
                spec = creds.specs.get(name)
                if spec:
                    lines.append(f"  {spec.env_var} - {spec.description}")
                    if spec.help_url:
                        lines.append(f"    Setup: {spec.help_url}")
            lines.append("\nSet the required environment variables and re-run.")
            pytest.fail("\n".join(lines))
```

### User Communication

When the user asks to test an agent, **ALWAYS check for ALL credentials first** — not just the LLM key:

1. **Identify the agent's tools** from `agent.json` or `mcp_servers.json`
2. **Check ALL required credentials** using `CredentialManager`
3. **Ask the user to provide any missing credentials** before proceeding

```python
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS

creds = CredentialManager()

# 1. Check LLM key
missing_creds = []
if not creds.is_available("anthropic"):
    missing_creds.append(("ANTHROPIC_API_KEY", "Anthropic API key for LLM calls"))

# 2. Check tool-specific credentials
agent_tools = [...]  # Determined from agent config
missing_tools = creds.get_missing_for_tools(agent_tools)
for name in missing_tools:
    spec = CREDENTIAL_SPECS.get(name)
    if spec:
        missing_creds.append((spec.env_var, spec.description))

# 3. Present ALL missing credentials to the user at once
if missing_creds:
    print("⚠️  Missing credentials required by this agent:\n")
    for env_var, description in missing_creds:
        print(f"  • {env_var}{description}")
    print()
    print("Please set the missing environment variables:")
    for env_var, _ in missing_creds:
        print(f"  export {env_var}='your-value-here'")
    print()
    print("Or run in mock mode (structure validation only):")
    print("  MOCK_MODE=1 pytest exports/{agent}/tests/")

    # Ask user to provide credentials or choose mock mode
    AskUserQuestion(...)
```

**IMPORTANT:** Do NOT skip credential collection. If an agent uses HubSpot tools, the user MUST provide `HUBSPOT_ACCESS_TOKEN`. If it uses web search, the user MUST provide the appropriate search API key. Collect ALL missing credentials in a single prompt rather than discovering them one at a time during test failures.

## The Three-Stage Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                           GOAL STAGE                                     │
│  (building-agents skill)                                                 │
│                                                                          │
│  1. User defines goal with success_criteria and constraints             │
│  2. Goal written to agent.py immediately                                │
│  3. Generate CONSTRAINT TESTS → Write to tests/ → USER APPROVAL         │
│     Files created: exports/{agent}/tests/test_constraints.py            │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                          AGENT STAGE                                     │
│  (building-agents skill)                                                 │
│                                                                          │
│  Build nodes + edges, written immediately to files                      │
│  Constraint tests can run during development:                           │
│    run_tests(goal_id, agent_path, test_types='["constraint"]')          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                           EVAL STAGE (this skill)                        │
│                                                                          │
│  1. Generate SUCCESS_CRITERIA TESTS → Write to tests/ → USER APPROVAL   │
│     Files created: exports/{agent}/tests/test_success_criteria.py       │
│  2. Run all tests: run_tests(goal_id, agent_path)                       │
│  3. On failure → debug_test(goal_id, test_name, agent_path)             │
│  4. Iterate: Edit agent code → Re-run run_tests (instant feedback)      │
└─────────────────────────────────────────────────────────────────────────┘
```

## Step-by-Step: Testing an Agent

### Step 1: Check Existing Tests

**ALWAYS check first** before generating new tests:

```python
mcp__agent-builder__list_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent"
)
```

This shows what test files already exist. If tests exist:
- Review the list to see what's covered
- Ask user if they want to add more or run existing tests

### Step 2: Get Constraint Test Guidelines (Goal Stage)

After goal is defined, get test guidelines using the MCP tool:

```python
# First, read the goal from agent.py to get the goal JSON
goal_code = Read(file_path="exports/your_agent/agent.py")
# Extract the goal definition and convert to JSON

# Get constraint test guidelines via MCP tool
result = mcp__agent-builder__generate_constraint_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "goal-id", "name": "...", "constraints": [...]}',
    agent_path="exports/your_agent"
)
```

**Response includes:**
- `output_file`: Where to write tests (e.g., `exports/your_agent/tests/test_constraints.py`)
- `file_header`: Imports, fixtures, and pytest setup to use at the top of the file
- `test_template`: Format for test functions
- `constraints_formatted`: The constraints to test
- `test_guidelines`: Rules and best practices for writing tests
- `instruction`: How to proceed

**Write tests directly** using the provided guidelines:

```python
# Write tests using the Write tool
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + your_test_code
)
```

### Step 3: Get Success Criteria Test Guidelines (Eval Stage)

After agent is fully built, get success criteria test guidelines:

```python
# Get success criteria test guidelines via MCP tool
result = mcp__agent-builder__generate_success_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "goal-id", "name": "...", "success_criteria": [...]}',
    node_names="analyze_request,search_web,format_results",
    tool_names="web_search,web_scrape",
    agent_path="exports/your_agent"
)
```

**Write tests directly** using the provided guidelines:

```python
# Write tests using the Write tool
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + your_test_code
)
```

### Step 4: Test Fixtures (conftest.py)

The `file_header` returned by the MCP tools includes proper imports and fixtures.
You should also create a conftest.py file in the tests directory with shared fixtures:

```python
# Create conftest.py with the conftest template
Write(
    file_path="exports/your_agent/tests/conftest.py",
    content=conftest_content  # Use PYTEST_CONFTEST_TEMPLATE format
)
```

### Step 5: Run Tests

**Use the MCP tool to run tests** (not pytest directly):

```python
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent"
)

**Response includes structured results:**
```json
{
  "goal_id": "your-goal-id",
  "overall_passed": false,
  "summary": {
    "total": 12,
    "passed": 10,
    "failed": 2,
    "skipped": 0,
    "errors": 0,
    "pass_rate": "83.3%"
  },
  "test_results": [
    {"file": "test_constraints.py", "test_name": "test_constraint_api_rate_limits", "status": "passed"},
    {"file": "test_success_criteria.py", "test_name": "test_success_find_relevant_results", "status": "failed"}
  ],
  "failures": [
    {"test_name": "test_success_find_relevant_results", "details": "AssertionError: Expected 3-5 results..."}
  ]
}
```

**Options for `run_tests`:**
```python
# Run only constraint tests
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    test_types='["constraint"]'
)

# Run with parallel workers
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    parallel=4
)

# Stop on first failure
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    fail_fast=True
)
```

### Step 6: Debug Failed Tests

**Use the MCP tool to debug** (not Bash/pytest directly):

```python
mcp__agent-builder__debug_test(
    goal_id="your-goal-id",
    test_name="test_success_find_relevant_results",
    agent_path="exports/your_agent"
)
```

**Response includes:**
- Full verbose output from the test
- Stack trace with exact line numbers
- Captured logs and prints
- Suggestions for fixing the issue

### Step 7: Categorize Errors

When a test fails, categorize the error to guide iteration:

```python
def categorize_test_failure(test_output, agent_code):
    """Categorize test failure to guide iteration."""

    # Read test output and agent code
    failure_info = {
        "test_name": "...",
        "error_message": "...",
        "stack_trace": "...",
    }

    # Pattern-based categorization
    if any(pattern in failure_info["error_message"].lower() for pattern in [
        "typeerror", "attributeerror", "keyerror", "valueerror",
        "null", "none", "undefined", "tool call failed"
    ]):
        category = "IMPLEMENTATION_ERROR"
        guidance = {
            "stage": "Agent",
            "action": "Fix the bug in agent code",
            "files_to_edit": ["agent.py", "nodes/__init__.py"],
            "restart_required": False,
            "description": "Code bug - fix and re-run tests"
        }

    elif any(pattern in failure_info["error_message"].lower() for pattern in [
        "assertion", "expected", "got", "should be", "success criteria"
    ]):
        category = "LOGIC_ERROR"
        guidance = {
            "stage": "Goal",
            "action": "Update goal definition",
            "files_to_edit": ["agent.py (goal section)"],
            "restart_required": True,
            "description": "Goal definition is wrong - update and rebuild"
        }

    elif any(pattern in failure_info["error_message"].lower() for pattern in [
        "timeout", "rate limit", "empty", "boundary", "edge case"
    ]):
        category = "EDGE_CASE"
        guidance = {
            "stage": "Eval",
            "action": "Add edge case test and fix handling",
            "files_to_edit": ["agent.py", "tests/test_edge_cases.py"],
            "restart_required": False,
            "description": "New scenario - add test and handle it"
        }

    else:
        category = "UNKNOWN"
        guidance = {
            "stage": "Unknown",
            "action": "Manual investigation required",
            "restart_required": False
        }

    return {
        "category": category,
        "guidance": guidance,
        "failure_info": failure_info
    }
```

**Show categorization to user:**

```python
AskUserQuestion(
    questions=[{
        "question": f"Test failed with {category}. How would you like to proceed?",
        "header": "Test Failure",
        "options": [
            {
                "label": "Fix code directly (Recommended)" if category == "IMPLEMENTATION_ERROR" else "Update goal",
                "description": guidance["description"]
            },
            {
                "label": "Show detailed error info",
                "description": "View full stack trace and logs"
            },
            {
                "label": "Skip for now",
                "description": "Continue with other tests"
            }
        ],
        "multiSelect": false
    }]
)
```

### Step 8: Iterate Based on Error Category

#### IMPLEMENTATION_ERROR → Fix Agent Code

```python
# 1. Show user the exact file and line that failed
print(f"Error in: exports/{agent_name}/nodes/__init__.py:42")
print(f"Issue: 'NoneType' object has no attribute 'get'")

# 2. Read the problematic code
code = Read(file_path=f"exports/{agent_name}/nodes/__init__.py")

# 3. User can fix directly, or you suggest a fix:
Edit(
    file_path=f"exports/{agent_name}/nodes/__init__.py",
    old_string="if results.get('videos'):",
    new_string="if results and results.get('videos'):"
)

# 4. Re-run tests immediately (instant feedback!)
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path=f"exports/{agent_name}"
)
```

#### LOGIC_ERROR → Update Goal

```python
# 1. Show user the goal definition
goal_code = Read(file_path=f"exports/{agent_name}/agent.py")

# 2. Discuss what needs to change in success_criteria or constraints

# 3. Edit the goal
Edit(
    file_path=f"exports/{agent_name}/agent.py",
    old_string='target="3-5 videos"',
    new_string='target="1-5 videos"'  # More realistic
)

# 4. May need to regenerate agent nodes if goal changed significantly
# This requires going back to building-agents skill
```

#### EDGE_CASE → Add Test and Fix

```python
# 1. Create new edge case test with API key enforcement
edge_case_test = '''
@pytest.mark.asyncio
async def test_edge_case_empty_results(mock_mode):
    """Test: Agent handles no results gracefully"""
    result = await default_agent.run({{"query": "xyzabc123nonsense"}}, mock_mode=mock_mode)

    # Should succeed with empty results, not crash
    assert result.success or result.error is not None
    if result.success:
        assert result.output.get("message") == "No results found"
'''

# 2. Add to test file
Edit(
    file_path=f"exports/{agent_name}/tests/test_edge_cases.py",
    old_string="# Add edge case tests here",
    new_string=edge_case_test
)

# 3. Fix agent to handle edge case
# Edit agent code to handle empty results

# 4. Re-run tests
```

## Test File Templates (Reference Only)

**⚠️ Do NOT copy-paste these templates directly.** Use `generate_constraint_tests` and `generate_success_tests` MCP tools to create properly structured tests with correct imports and fixtures.

These templates show the structure of generated tests for reference only.

### Constraint Test Template

```python
"""Constraint tests for {agent_name}.

These tests validate that the agent respects its defined constraints.
Requires ANTHROPIC_API_KEY for real testing.
"""

import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager


# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.mark.asyncio
async def test_constraint_{constraint_id}():
    """Test: {constraint_description}"""
    # Test implementation based on constraint type
    mock_mode = bool(os.environ.get("MOCK_MODE"))
    result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)

    # Assert constraint is respected
    assert True  # Replace with actual check
```

### Success Criteria Test Template

```python
"""Success criteria tests for {agent_name}.

These tests validate that the agent achieves its defined success criteria.
Requires ANTHROPIC_API_KEY for real testing - mock mode cannot validate success criteria.
"""

import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager


# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.mark.asyncio
async def test_success_{criteria_id}():
    """Test: {criteria_description}"""
    mock_mode = bool(os.environ.get("MOCK_MODE"))
    result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)

    assert result.success, f"Agent failed: {{result.error}}"

    # Verify success criterion met
    # e.g., assert metric meets target
    assert True  # Replace with actual check
```

### Edge Case Test Template

```python
"""Edge case tests for {agent_name}.

These tests validate agent behavior in unusual or boundary conditions.
Requires ANTHROPIC_API_KEY for real testing.
"""

import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager


# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.mark.asyncio
async def test_edge_case_{scenario_name}():
    """Test: Agent handles {scenario_description}"""
    mock_mode = bool(os.environ.get("MOCK_MODE"))
    result = await default_agent.run({{"edge": "case_input"}}, mock_mode=mock_mode)

    # Verify graceful handling
    assert result.success or result.error is not None
```

## Interactive Build + Test Loop

During agent construction (Agent stage), you can run constraint tests incrementally:

```python
# After adding first node
print("Added search_node. Running relevant constraint tests...")
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path=f"exports/{agent_name}",
    test_types='["constraint"]'
)

# After adding second node
print("Added filter_node. Running all constraint tests...")
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path=f"exports/{agent_name}",
    test_types='["constraint"]'
)
```

This provides **immediate feedback** during development, catching issues early.

## Common Test Patterns

**Note:** All test patterns should include API key enforcement via conftest.py.

### ⚠️ CRITICAL: Framework Features You Must Know

#### OutputCleaner - Automatic I/O Cleaning (NEW!)

**The framework now automatically validates and cleans node outputs** using a fast LLM (Cerebras llama-3.3-70b) at edge traversal time. This prevents cascading failures from malformed output.

**What OutputCleaner does**:
- ✅ Validates output matches next node's input schema
- ✅ Detects JSON parsing trap (entire response in one key)
- ✅ Cleans malformed output automatically (~200-500ms, ~$0.001 per cleaning)
- ✅ Boosts success rates by 1.8-2.2x

**Impact on tests**: Tests should still use safe patterns because OutputCleaner may not catch all issues in test mode.

#### Safe Test Patterns (REQUIRED)

**❌ UNSAFE** (will cause test failures):
```python
# Direct key access - can crash!
approval_decision = result.output["approval_decision"]
assert approval_decision == "APPROVED"

# Nested access without checks
category = result.output["analysis"]["category"]

# Assuming parsed JSON structure
for issue in result.output["compliance_issues"]:
    ...
```

**✅ SAFE** (correct patterns):
```python
# 1. Safe dict access with .get()
output = result.output or {}
approval_decision = output.get("approval_decision", "UNKNOWN")
assert "APPROVED" in approval_decision or approval_decision == "APPROVED"

# 2. Type checking before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
    category = analysis.get("category", "unknown")

# 3. Parse JSON from strings (the JSON parsing trap!)
import json
recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
    try:
        parsed = json.loads(recommendation)
        if isinstance(parsed, dict):
            approval = parsed.get("approval_decision", "UNKNOWN")
    except json.JSONDecodeError:
        approval = "UNKNOWN"
elif isinstance(recommendation, dict):
    approval = recommendation.get("approval_decision", "UNKNOWN")

# 4. Safe iteration with type check
compliance_issues = output.get("compliance_issues", [])
if isinstance(compliance_issues, list):
    for issue in compliance_issues:
        ...
```

#### Helper Functions for Safe Access

**Add to conftest.py**:
```python
import json
import re

def _parse_json_from_output(result, key):
    """Parse JSON from agent output (framework may store full LLM response as string)."""
    response_text = result.output.get(key, "")
    # Remove markdown code blocks if present
    json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()

    try:
        return json.loads(json_text)
    except (json.JSONDecodeError, AttributeError, TypeError):
        return result.output.get(key)

def safe_get_nested(result, key_path, default=None):
    """Safely get nested value from result.output."""
    output = result.output or {}
    current = output

    for key in key_path:
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, str):
            try:
                json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
                parsed = json.loads(json_text)
                if isinstance(parsed, dict):
                    current = parsed.get(key)
                else:
                    return default
            except json.JSONDecodeError:
                return default
        else:
            return default

    return current if current is not None else default

# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested
```

**Usage in tests**:
```python
# Use helper to parse JSON safely
parsed = pytest.parse_json_from_output(result, "recommendation")
if isinstance(parsed, dict):
    approval = parsed.get("approval_decision", "UNKNOWN")

# Safe nested access
risk_score = pytest.safe_get_nested(result, ["analysis", "risk_score"], default=0.0)
```

#### Test Count Guidance

**Generate 8-15 tests total, NOT 30+**

- ✅ 2-3 tests per success criterion
- ✅ 1 happy path test
- ✅ 1 boundary/edge case test
- ✅ 1 error handling test (optional)

**Why fewer tests?**:
- Each test requires real LLM call (~3 seconds, costs money)
- 30 tests = 90 seconds, $0.30+ in costs
- 12 tests = 36 seconds, $0.12 in costs
- Focus on quality over quantity

#### ExecutionResult Fields (Important!)

**`result.success=True` means NO exception, NOT goal achieved**

```python
# ❌ WRONG - assumes goal achieved
assert result.success

# ✅ RIGHT - check success AND output
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"
```

**All ExecutionResult fields**:
- `success: bool` - Execution completed without exception (NOT goal achieved!)
- `output: dict` - Complete memory snapshot (may contain raw strings)
- `error: str | None` - Error message if failed
- `steps_executed: int` - Number of nodes executed
- `total_tokens: int` - Cumulative token usage
- `total_latency_ms: int` - Total execution time
- `path: list[str]` - Node IDs traversed
- `paused_at: str | None` - Node ID if HITL pause occurred
- `session_state: dict` - State for resuming

### Happy Path Test
```python
@pytest.mark.asyncio
async def test_happy_path(mock_mode):
    """Test normal successful execution"""
    result = await default_agent.run({{"query": "python tutorials"}}, mock_mode=mock_mode)
    assert result.success
    assert len(result.output) > 0
```

### Boundary Condition Test
```python
@pytest.mark.asyncio
async def test_boundary_minimum(mock_mode):
    """Test at minimum threshold"""
    result = await default_agent.run({{"query": "very specific niche topic"}}, mock_mode=mock_mode)
    assert result.success
    assert len(result.output.get("results", [])) >= 1
```

### Error Handling Test
```python
@pytest.mark.asyncio
async def test_error_handling(mock_mode):
    """Test graceful error handling"""
    result = await default_agent.run({{"query": ""}}, mock_mode=mock_mode)  # Invalid input
    assert not result.success or result.output.get("error") is not None
```

### Performance Test
```python
@pytest.mark.asyncio
async def test_performance_latency(mock_mode):
    """Test response time is acceptable"""
    import time
    start = time.time()
    result = await default_agent.run({{"query": "test"}}, mock_mode=mock_mode)
    duration = time.time() - start
    assert duration < 5.0, f"Took {{duration}}s, expected <5s"
```

## Integration with building-agents

### Handoff Points

| Scenario | From | To | Action |
|----------|------|-----|--------|
| Agent built, ready to test | building-agents | testing-agent | Generate success tests |
| LOGIC_ERROR found | testing-agent | building-agents | Update goal, rebuild |
| IMPLEMENTATION_ERROR found | testing-agent | Direct fix | Edit agent files, re-run tests |
| EDGE_CASE found | testing-agent | testing-agent | Add edge case test |
| All tests pass | testing-agent | Done | Agent validated ✅ |

### Iteration Speed Comparison

| Scenario | Old Approach | New Approach |
|----------|--------------|--------------|
| **Bug Fix** | Rebuild via MCP tools (14 min) | Edit Python file, pytest (2 min) |
| **Add Test** | Generate via MCP, export (5 min) | Write test file directly (1 min) |
| **Debug** | Read subprocess logs | pdb, breakpoints, prints |
| **Inspect** | Limited visibility | Full Python introspection |

## Anti-Patterns

### Testing Best Practices

| Don't | Do Instead |
|-------|------------|
| ❌ Write tests without getting guidelines first | ✅ Use `generate_*_tests` to get proper file_header and guidelines |
| ❌ Run pytest via Bash | ✅ Use `run_tests` MCP tool for structured results |
| ❌ Debug tests with Bash pytest -vvs | ✅ Use `debug_test` MCP tool for formatted output |
| ❌ Check for tests with Glob | ✅ Use `list_tests` MCP tool |
| ❌ Skip the file_header from guidelines | ✅ Always include the file_header for proper imports and fixtures |

### General Testing

| Don't | Do Instead |
|-------|------------|
| ❌ Treat all failures the same | ✅ Use debug_test to categorize and iterate appropriately |
| ❌ Rebuild entire agent for small bugs | ✅ Edit code directly, re-run tests |
| ❌ Run tests without API key | ✅ Always set ANTHROPIC_API_KEY first |
| ❌ Write tests without understanding the constraints/criteria | ✅ Read the formatted constraints/criteria from guidelines |

## Workflow Summary

```
1. Check existing tests: list_tests(goal_id, agent_path)
   → Scans exports/{agent}/tests/test_*.py

2. Get test guidelines: generate_constraint_tests, generate_success_tests
   → Returns file_header, test_template, constraints/criteria, guidelines

3. Write tests: Use Write tool with the provided guidelines
   → Write tests to exports/{agent}/tests/test_*.py

4. Run tests: run_tests(goal_id, agent_path)
   → Executes: pytest exports/{agent}/tests/ -v

5. Debug failures: debug_test(goal_id, test_name, agent_path)
   → Re-runs single test with verbose output

6. Fix based on category:
   - IMPLEMENTATION_ERROR → Edit agent code directly
   - ASSERTION_FAILURE → Fix agent logic or update test
   - IMPORT_ERROR → Check package structure
   - API_ERROR → Check API keys and connectivity

7. Re-run tests: run_tests(goal_id, agent_path)

8. Repeat until all pass ✅
```

## MCP Tools Reference

```python
# Check existing tests (scans Python test files)
mcp__agent-builder__list_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent"
)

# Get constraint test guidelines (returns templates and guidelines, NOT generated tests)
mcp__agent-builder__generate_constraint_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "constraints": [...]}',
    agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines

# Get success criteria test guidelines
mcp__agent-builder__generate_success_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "success_criteria": [...]}',
    node_names="node1,node2",
    tool_names="tool1,tool2",
    agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines

# Run tests via pytest subprocess
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent"
)

# Debug a failed test (re-runs with verbose output)
mcp__agent-builder__debug_test(
    goal_id="your-goal-id",
    test_name="test_constraint_foo",
    agent_path="exports/your_agent"
)
```

## run_tests Options

```python
# Run only constraint tests
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    test_types='["constraint"]'
)

# Run only success criteria tests
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    test_types='["success"]'
)

# Run with pytest-xdist parallelism (requires pytest-xdist)
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    parallel=4
)

# Stop on first failure
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    fail_fast=True
)
```

## Direct pytest Commands

You can also run tests directly with pytest (the MCP tools use pytest internally):

```bash
# Run all tests
pytest exports/your_agent/tests/ -v

# Run specific test file
pytest exports/your_agent/tests/test_constraints.py -v

# Run specific test
pytest exports/your_agent/tests/test_constraints.py::test_constraint_foo -vvs

# Run in mock mode (structure validation only)
MOCK_MODE=1 pytest exports/your_agent/tests/ -v
```

---

**MCP tools generate tests, write them to Python files, and run them via pytest.**