@greynewell/swe-bench-lite

Quick-start command to run SWE-bench Lite evaluation with sensible defaults.

apm::install
$ apm install @greynewell/swe-bench-lite
apm::skill.md
---
name: swe-bench-lite
description: Quick-start command to run SWE-bench Lite evaluation with sensible defaults.
---

# Instructions
This skill provides a streamlined way to run the SWE-bench Lite benchmark with pre-configured defaults.

## What This Skill Does

This skill runs a quick SWE-bench Lite evaluation with:
- 5 sample tasks (configurable)
- Verbose output for visibility
- Results saved to `results.json`
- Report saved to `report.md`

## Prerequisites Check

Before running, verify:

1. **Docker is running:**
   ```bash
   docker ps
   ```

2. **API key is set:**
   ```bash
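   # Confirm the key is set without printing the secret to the terminal
   [ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY is set" || echo "ANTHROPIC_API_KEY is missing"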
   ```

3. **Config file exists:**
   - Check for `mcpbr.yaml` in the current directory
   - If missing, run `mcpbr init` to generate it
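
The three checks above can be combined into one preflight step. The following is a minimal sketch, not part of mcpbr itself; the function names (`check_docker`, `check_api_key`, `check_config`, `check_prereqs`) are hypothetical helpers introduced for illustration.

```bash
# Hypothetical preflight helpers. Each returns 0 when its check passes
# and a non-zero status otherwise.

check_docker() {
    # `docker ps` only succeeds when the Docker daemon is reachable.
    docker ps >/dev/null 2>&1
}

check_api_key() {
    # Test for a non-empty value without echoing the secret.
    [ -n "${ANTHROPIC_API_KEY:-}" ]
}

check_config() {
    # Defaults to mcpbr.yaml in the current directory.
    [ -f "${1:-mcpbr.yaml}" ]
}

check_prereqs() {
    # Run every check so all problems are reported in one pass.
    status=0
    check_docker      || { echo "Docker is not running" >&2; status=1; }
    check_api_key     || { echo "ANTHROPIC_API_KEY is not set" >&2; status=1; }
    check_config "$@" || { echo "Config not found; run: mcpbr init" >&2; status=1; }
    return "$status"
}
```

Calling `check_prereqs` (optionally with a config path) before `mcpbr run` reports every failed check at once instead of stopping at the first one.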

## Default Command

The default command for SWE-bench Lite:

```bash
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md
```

## Customization Options

Users can customize the run by modifying:

- **Sample size:** Change `-n 5` to any number (or remove `-n` entirely to run the full dataset)
- **Config file:** Change `-c mcpbr.yaml` to point to a different config
- **Verbosity:** Use `-vv` for very verbose output
- **Output files:** Change `results.json` and `report.md` to different paths
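
To make these options easier to vary between runs, the command can be assembled from overridable settings. This is a sketch under stated assumptions: `build_mcpbr_cmd` and the `MCPBR_*` variable names are hypothetical and are not read by mcpbr; only the flags themselves come from the default command above.

```bash
# Hypothetical helper: prints the mcpbr command for the chosen settings.
# The MCPBR_* variables are illustrative names, not mcpbr features.
# Note: values are joined into one string, so paths must not contain spaces.
build_mcpbr_cmd() {
    config="${MCPBR_CONFIG:-mcpbr.yaml}"
    samples="${MCPBR_SAMPLES-5}"         # set to empty to omit -n (full dataset)
    verbosity="${MCPBR_VERBOSITY:--v}"   # -v or -vv
    results="${MCPBR_RESULTS:-results.json}"
    report="${MCPBR_REPORT:-report.md}"

    cmd="mcpbr run -c $config --dataset SWE-bench/SWE-bench_Lite"
    if [ -n "$samples" ]; then
        cmd="$cmd -n $samples"
    fi
    cmd="$cmd $verbosity -o $results -r $report"
    printf '%s\n' "$cmd"
}
```

With no variables set, `build_mcpbr_cmd` prints the default command unchanged; printing rather than executing keeps it safe to inspect before a paid run.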

## Example Variations

### Minimal quick test (1 task)
```bash
mcpbr run -c mcpbr.yaml -n 1 -v
```

### Full evaluation (all ~300 tasks)
```bash
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -v -o results.json
```

### MCP-only (skip baseline)
```bash
mcpbr run -c mcpbr.yaml -n 5 -M -v -o results.json
```

### Specific tasks
```bash
mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099 -v
```

## Expected Runtime & Cost

For 5 tasks with default settings:
- **Runtime:** 15-30 minutes (depends on task complexity)
- **Cost:** $2-5 (depends on task complexity and model used)
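
For sizing a larger run, the 5-task figures can be scaled by task count. The sketch below assumes cost and runtime grow linearly and that tasks run one after another; real runs may differ, so treat the output as a rough budget rather than a prediction. `estimate_run` is a hypothetical helper.

```bash
# Rough linear extrapolation from the 5-task figures above
# (15-30 minutes, $2-5). Linear scaling is an assumption.
estimate_run() {
    n="$1"
    min_lo=$(( n * 15 / 5 ))
    min_hi=$(( n * 30 / 5 ))
    # Work in cents so small task counts do not round down to $0.
    cents_lo=$(( n * 200 / 5 ))
    cents_hi=$(( n * 500 / 5 ))
    printf 'n=%d: %d-%d min, $%d.%02d-$%d.%02d\n' \
        "$n" "$min_lo" "$min_hi" \
        "$(( cents_lo / 100 ))" "$(( cents_lo % 100 ))" \
        "$(( cents_hi / 100 ))" "$(( cents_hi % 100 ))"
}
```

Under these assumptions, `estimate_run 300` prints `n=300: 900-1800 min, $120.00-$300.00`, which is worth knowing before removing `-n`.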

## What to Do If It Fails

1. **Docker not running:** Start Docker Desktop (or the Docker daemon on Linux)
2. **API key missing:** Set with `export ANTHROPIC_API_KEY="sk-ant-..."`
3. **Config missing:** Run `mcpbr init` to generate default config
4. **Config invalid:** Check that the `{workdir}` placeholder appears in the `args` array of the MCP server config
5. **MCP server fails:** Test the server command independently
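
The `{workdir}` check in item 4 can be done quickly with `grep`. This is a coarse sketch: it only confirms that the placeholder appears somewhere in the file, not that it sits inside the `args` array, so a manual look is still needed if it passes and the run fails. `config_has_workdir` is a hypothetical helper name.

```bash
# Hypothetical helper: succeeds if the config mentions {workdir} anywhere.
config_has_workdir() {
    # -F matches the pattern as a literal string; -q suppresses output.
    grep -qF '{workdir}' "${1:-mcpbr.yaml}"
}
```

For example, `config_has_workdir mcpbr.yaml || echo "no {workdir} placeholder found"` flags the most common form of this error.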

## After the Run

Once complete, you'll have:
- **results.json:** Full evaluation data with metrics, token usage, and per-task results
- **report.md:** Human-readable summary with resolution rates and comparisons
- **Console output:** Real-time progress and summary table

Review the results to see how your MCP server performed compared to the baseline!
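
Before reading the results, it can help to confirm that both output files were actually written. A minimal sketch, assuming the default file names from the command above; `outputs_ready` is a hypothetical helper.

```bash
# Hypothetical helper: succeeds only if both output files exist and are non-empty.
outputs_ready() {
    for f in "${1:-results.json}" "${2:-report.md}"; do
        # -s is true for files that exist with a size greater than zero.
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}
```

If you changed the `-o` or `-r` paths, pass them as the first and second arguments.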

## Pro Tips

- Start with `-n 1` to verify everything works before running larger evaluations
- Use `--log-dir logs/` to save detailed per-task logs for debugging
- Compare multiple runs by changing the MCP server config between runs
- Use `--baseline-results baseline.json` to detect regressions between versions
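
The first two tips can be chained so the larger run only starts once a single task has succeeded. This sketch assumes `mcpbr` exits with a non-zero status when a run fails, which is not confirmed here. `staged_run` and the `MCPBR_BIN` override are hypothetical, added so the sequence can be tried against a stub.

```bash
# Hypothetical wrapper: one-task smoke test, then the 5-task run.
# Assumes mcpbr returns non-zero on failure (not verified).
staged_run() {
    bin="${MCPBR_BIN:-mcpbr}"   # illustrative override, not an mcpbr setting

    # Smoke test with a single task; stop here if it fails.
    "$bin" run -c mcpbr.yaml -n 1 -v || return 1

    # Larger run, keeping per-task logs for debugging.
    "$bin" run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite \
        -n 5 -v -o results.json -r report.md --log-dir logs/
}
```

Stopping after the smoke test avoids paying for five tasks when the config or MCP server is broken.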