Install with `apm install @greynewell/swe-bench-lite`.
---
name: swe-bench-lite
description: Quick-start command to run SWE-bench Lite evaluation with sensible defaults.
---
# Instructions
This skill provides a streamlined way to run the SWE-bench Lite benchmark with pre-configured defaults.
## What This Skill Does
This skill runs a quick SWE-bench Lite evaluation with:
- 5 sample tasks (configurable)
- Verbose output for visibility
- Results saved to `results.json`
- Report saved to `report.md`
## Prerequisites Check
Before running, verify:
1. **Docker is running:**
```bash
docker ps
```
2. **API key is set:**
```bash
echo $ANTHROPIC_API_KEY
```
3. **Config file exists:**
- Check for `mcpbr.yaml` in the current directory
- If missing, run `mcpbr init` to generate it
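The three checks above can be rolled into a single preflight function — a sketch, assuming the config lives at `./mcpbr.yaml`:

```bash
# Preflight sketch: stops at the first failed prerequisite.
preflight() {
  # 1. Docker daemon reachable?
  docker ps >/dev/null 2>&1 || { echo "Docker is not running"; return 1; }
  # 2. API key set?
  [ -n "${ANTHROPIC_API_KEY:-}" ] || { echo "ANTHROPIC_API_KEY is not set"; return 1; }
  # 3. Config present?
  [ -f mcpbr.yaml ] || { echo "mcpbr.yaml missing; run 'mcpbr init'"; return 1; }
  echo "All preflight checks passed"
}
```

Chain it as `preflight && mcpbr run ...` so the evaluation only starts once every prerequisite is satisfied.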
## Default Command
The default command for SWE-bench Lite:
```bash
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md
```
## Customization Options
Users can customize the run by modifying:
- **Sample size:** Change `-n 5` to any number (or remove for full dataset)
- **Config file:** Change `-c mcpbr.yaml` to point to a different config
- **Verbosity:** Use `-vv` for very verbose output
- **Output files:** Change `results.json` and `report.md` to different paths
## Example Variations
### Minimal quick test (1 task)
```bash
mcpbr run -c mcpbr.yaml -n 1 -v
```
### Full evaluation (all ~300 tasks)
```bash
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -v -o results.json
```
### MCP-only (skip baseline)
```bash
mcpbr run -c mcpbr.yaml -n 5 -M -v -o results.json
```
### Specific tasks
```bash
mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099 -v
```
## Expected Runtime & Cost
For 5 tasks with default settings:
- **Runtime:** 15-30 minutes (depends on task complexity)
- **Cost:** $2-5 (depends on task complexity and model used)
## What to Do If It Fails
1. **Docker not running:** Start Docker Desktop
2. **API key missing:** Set with `export ANTHROPIC_API_KEY="sk-ant-..."`
3. **Config missing:** Run `mcpbr init` to generate default config
4. **Config invalid:** Check that the `{workdir}` placeholder appears in the `args` array
5. **MCP server fails:** Test the server command independently
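For failure 4, the authoritative schema comes from `mcpbr init`; the fragment below only illustrates the rule, and every key and package name in it is hypothetical:

```yaml
# Hypothetical shape -- regenerate the real template with `mcpbr init`.
mcp_server:
  command: npx
  args:
    - "-y"
    - "@example/my-mcp-server"   # hypothetical server package
    - "{workdir}"                # the {workdir} placeholder must appear in args
```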
## After the Run
Once complete, you'll have:
- **results.json:** Full evaluation data with metrics, token usage, and per-task results
- **report.md:** Human-readable summary with resolution rates and comparisons
- **Console output:** Real-time progress and summary table
Review the results to see how your MCP server performed compared to the baseline!
## Pro Tips
- Start with `-n 1` to verify everything works before running larger evaluations
- Use `--log-dir logs/` to save detailed per-task logs for debugging
- Compare multiple runs by changing the MCP server config between runs
- Use `--baseline-results baseline.json` to detect regressions between versions