---
name: insight-pilot
description: Frontier content tracking automation - search papers, code, and blogs, deduplicate, download PDFs, analyze and generate research reports. Supports incremental updates.
version: 0.3.0
---
# Insight-Pilot Skill
A workflow skill for frontier tracking: search papers/code/blogs, deduplicate, download, analyze, and generate incremental reports.
## 1) Setup
Run the bootstrap script first; it checks the environment, creates a venv, and installs dependencies when needed:
```bash
bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh
```
Bootstrap prefers a local editable install from this repo so that CLI behavior stays aligned with this skill doc.
Recommended preflight:
```bash
source ~/.insight-pilot-venv/bin/activate
insight-pilot --version
insight-pilot init -h | rg -- '--intent|--language'
```
## 2) CLI Quick Reference
Activate env before running commands:
```bash
source ~/.insight-pilot-venv/bin/activate
```
| Command | Purpose | Required Args | Key Optional Args |
|---|---|---|---|
| `init` | Create project + intent plan | `--topic`, `--output` | `--intent`, `--language`, `--keywords`, `--sources-config` |
| `search` | Search + merge + dedup | `--project`, `--source`, `--query` | `--limit`, `--since`, `--until` |
| `download` | Download PDFs + convert to Markdown | `--project` | - |
| `analyze` | Analyze with LLM or prepare agent-mode tasks | `--project` | `--mode`, `--config`, `--force` |
| `index` | Generate `index.md` + `reports/` | `--project` | `--template` |
| `status` | Show project status | `--project` | `--json` |
| `sources` | Manage blog sources | `--project` | `--add`, `--remove`, `--config` |
`search --since/--until` currently applies to: `arxiv`, `openalex`, `github`, `blog`.
## 3) Skill-First Principle (Mandatory)
This workflow is not script-only. Agent judgment is required.
Before large-scale search, the agent must:
1. Summarize user intent in one sentence.
2. Propose keyword sets:
- core terms (exact targets)
- expansion terms (synonyms/variants)
- benchmark terms (e.g., `WebArena`, `Mind2Web`, `BrowserGym`)
3. Propose source plan:
- default: `arxiv`, `openalex`, `github`, `blog`
- conditional: `pubmed` (only biomedical/medical topics)
- default blog presets: OpenAI, Anthropic, MiniMax, Kimi, Seed, Hacker News
4. Show the available, selected, and unselected sources to the user.
5. Ask the user to confirm any add/remove/update before searching.
Language policy:
- Persisted analysis/report language defaults to Chinese (`zh`).
- `init` stores detected input language as reference.
Suggested init command:
```bash
PROJECT=./research/webagent
insight-pilot init \
--topic "WebAgent Research" \
--intent "Track browser use, computer use, gui agent frontier progress" \
--language zh \
--keywords "browser use,computer use,gui agent" \
--output $PROJECT
```
Minimal confirmation template:
```text
Intent: <one-line>
Keywords: <k1>, <k2>, <k3>
Sources - Selected: <...>; Unselected: <...>
Blog presets - Selected: <...>; Unselected: <...>
Confirm:
1) Keep
2) Add sources/blogs: <list>
3) Remove sources/blogs: <list>
4) Update keywords: <new>
```
## 4) End-to-End Workflow
Execution rule: run CLI phases directly in sequence; no line-by-line confirmation needed.
### Phase 1: Search and Dedup
```bash
PROJECT=./research/webagent
insight-pilot init --topic "WebAgent Research" --keywords "web agent,browser agent" --output $PROJECT
insight-pilot search --project $PROJECT --source arxiv openalex github pubmed blog --query "web agent" --limit 50
```
### Phase 2: Manual Review (Agent)
After dedup, review and exclude irrelevant items.
```bash
insight-pilot status --json --project $PROJECT
```
Agent actions:
1. Read `$PROJECT/.insight/items.json`.
2. Check `title` and `abstract` for each item.
3. Mark off-topic items with:
- `status: "excluded"`
- `exclude_reason: "..."`
4. Save updated `items.json`.
Example:
```json
{
"id": "i0023",
"title": "Unrelated Paper Title",
"status": "excluded",
"exclude_reason": "Not related to web agents"
}
```
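The review pass above can be sketched in Python. The keyword filter here is only a placeholder criterion for illustration; the real pass uses agent judgment on each title/abstract:

```python
import json
from pathlib import Path

# Placeholder off-topic criteria; in practice the agent judges each item.
OFF_TOPIC = ("protein folding", "recommendation system")

def review_items(items, off_topic=OFF_TOPIC):
    """Mark items whose title/abstract hit an off-topic term as excluded."""
    for item in items:
        text = f"{item.get('title', '')} {item.get('abstract', '')}".lower()
        if any(term in text for term in off_topic):
            item["status"] = "excluded"
            item["exclude_reason"] = "Off-topic for this project"
    return items

items_path = Path("research/webagent/.insight/items.json")
if items_path.exists():
    updated = review_items(json.loads(items_path.read_text()))
    items_path.write_text(json.dumps(updated, ensure_ascii=False, indent=2))
```

Only `status` and `exclude_reason` are written; all other item fields stay untouched so later phases can still use them.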
### Phase 3: Download and Convert
```bash
insight-pilot download --project $PROJECT
```
Result semantics:
- success: `download_status="success"`, file under `papers/`
- failed: recorded in `$PROJECT/.insight/download_failed.json`
Failure record example:
```json
[
{
"id": "i0015",
"title": "Paper Title",
"url": "https://...",
"error": "Connection timeout",
"failed_at": "2026-01-17T10:30:00Z"
}
]
```
### Phase 3b: Failed-Site Recovery (Agent + `agent-browser`)
When `download_failed.json` is non-empty:
1. Read failure list.
2. Try `url`, then `alternative_urls`.
3. Use `agent-browser` to trigger download links/buttons.
4. Save success to `$PROJECT/papers/{id}.pdf`.
5. Update `$PROJECT/.insight/items.json`:
- success: `download_status="success"`, set `local_path`, clear `download_error`
- failed: keep `download_status="failed"`, update `download_error`
6. Rewrite `download_failed.json` with only remaining failures and increment `retry_count`.
Recommended command pattern:
```bash
agent-browser --session "ip-l2-i0015" open "https://target-site.example/paper"
agent-browser --session "ip-l2-i0015" wait --load networkidle
agent-browser --session "ip-l2-i0015" snapshot -i
agent-browser --session "ip-l2-i0015" find text "Download PDF" click
agent-browser --session "ip-l2-i0015" download "a[href$='.pdf']" "$PROJECT/papers/i0015.pdf"
agent-browser --session "ip-l2-i0015" close
```
If exact label is absent, try: `PDF`, `Full Text PDF`, `View PDF`, `Download`.
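The bookkeeping in steps 5-6 can be sketched as follows (field names follow the examples above; treating `retry_count` as a field on the failure record is an assumption about its shape):

```python
import json
from pathlib import Path

def reconcile_downloads(items, failures, papers_dir):
    """After manual retries, sync items.json entries and keep only remaining failures."""
    papers = Path(papers_dir)
    by_id = {item["id"]: item for item in items}
    still_failed = []
    for rec in failures:
        item = by_id.get(rec["id"])
        if item is None:
            continue
        pdf = papers / f"{rec['id']}.pdf"
        if pdf.exists():
            # retry succeeded: mark success, set local_path, clear download_error
            item["download_status"] = "success"
            item["local_path"] = str(pdf)
            item.pop("download_error", None)
        else:
            # retry failed: keep the record and increment retry_count
            item["download_status"] = "failed"
            item["download_error"] = rec.get("error", "unknown")
            rec["retry_count"] = rec.get("retry_count", 0) + 1
            still_failed.append(rec)
    return items, still_failed
```

The returned `still_failed` list is what gets written back to `download_failed.json`.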
### Phase 4: Analyze
Precondition: finish Phase 3 (and 3b if needed).
Default (LLM):
```bash
insight-pilot analyze --project $PROJECT
```
Content priority:
1. Markdown converted from PDF (`pymupdf4llm`)
2. PDF extraction (`PyMuPDF`)
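The priority order amounts to a path-resolution rule over the project layout in Section 7 (a sketch; the shipped resolver may differ in detail):

```python
from pathlib import Path

def analysis_input(project: str, item_id: str):
    """Pick the analysis input for an item: converted Markdown first, then the raw PDF."""
    md = Path(project) / ".insight" / "markdown" / item_id / f"{item_id}.md"
    if md.exists():
        return md   # 1) Markdown converted from PDF (pymupdf4llm)
    pdf = Path(project) / "papers" / f"{item_id}.pdf"
    if pdf.exists():
        return pdf  # 2) PDF extraction (PyMuPDF)
    return None     # nothing downloaded yet
```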
LLM config file: `.codex/skills/insight-pilot/llm.yaml`
```yaml
provider: openai # openai / anthropic / ollama
model: gpt-4o-mini
api_key: sk-xxx
```
Optional no-API mode:
```bash
insight-pilot analyze --project $PROJECT --mode agent
```
Generates:
- `$PROJECT/.insight/analysis_agent_tasks.json`
- `$PROJECT/.insight/analysis_inputs/{id}.md`
Final analysis output path:
- `$PROJECT/.insight/analysis/{id}.json`
#### LLM Unavailable/Failing: High-Quality Manual Analysis Required
Manual mode is mandatory for these cases:
- no LLM config
- auth errors (401/403)
- unresolved rate limit/timeout (429/timeout)
Recommended fallback command first:
```bash
insight-pilot analyze --project $PROJECT --mode agent
```
Then execute this protocol:
1. Build queue by relevance + recency; exclude off-topic items first.
2. Gather evidence per type:
- paper: abstract + markdown/pdf + IDs/links
- github: summary + topics + stars/forks + README key sections + update date
- blog: title + full text + source link + date
3. Write differentiated analysis:
- unique `brief_analysis` and `detailed_analysis` per item
- include at least 3 concrete facts (numbers/mechanisms/scope/benchmark/architecture)
- include limitations and risk notes
4. Add cross-item positioning:
- research prototype vs production framework vs benchmark/tooling
- compare adjacent approaches when evidence allows
5. Pass quality gate before index generation.
Manual Definition of Done (all checks must pass):
1. Source coverage:
- papers use markdown/pdf + abstract + metadata
- github uses metadata + README/fulltext + topics
- blogs use fulltext/summary + date/link
2. Evidence density:
- each item has >=3 concrete facts
3. Differentiation:
- no reusable template paragraphs across items
4. Risk clarity:
- each item has explicit limitations and deployment/quality risk
5. Output completeness:
- all active items have `$PROJECT/.insight/analysis/{id}.json`
If key evidence is missing:
- explicitly mark low confidence
- lower `relevance_score`
- do not invent details
Analysis JSON example:
```json
{
"id": "i0001",
"title": "Paper Title",
"summary": "One sentence summary",
"brief_analysis": "2-3 sentences brief analysis",
"detailed_analysis": "300-500 words detailed analysis",
"contributions": ["Contribution 1", "Contribution 2"],
"methodology": "Methodology description",
"key_findings": ["Finding 1", "Finding 2"],
"limitations": ["Limitation"],
"future_work": ["Future work"],
"relevance_score": 8,
"tags": ["webagent", "benchmark", "multimodal"],
"analyzed_at": "2026-01-17T12:00:00Z"
}
```
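Parts of the quality gate can be checked mechanically over `analysis/{id}.json` files. This is a minimal sketch using the field names from the example above; the word-count threshold is illustrative, and the evidence-density and differentiation checks still need human judgment:

```python
import json
from pathlib import Path

REQUIRED = ("summary", "brief_analysis", "detailed_analysis",
            "contributions", "limitations", "relevance_score")

def gate(analysis: dict) -> list:
    """Return the gate violations for one analysis record."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in analysis]
    if len(analysis.get("detailed_analysis", "").split()) < 300:
        problems.append("detailed_analysis below the 300-word target")
    if not analysis.get("limitations"):
        problems.append("no explicit limitations")
    return problems

def check_project(project: str) -> dict:
    """Map item id -> violations for every file under .insight/analysis/."""
    return {p.stem: gate(json.loads(p.read_text()))
            for p in Path(project, ".insight", "analysis").glob("*.json")}
```

An empty violation list per item is necessary but not sufficient; the differentiation check (no template paragraphs) remains a manual read.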
### Phase 5: Generate Report
```bash
insight-pilot index --project $PROJECT
```
Outputs:
- `$PROJECT/index.md` (analyzed items only)
- `$PROJECT/reports/{id}.md`
Index quality requirements:
1. only high-confidence analyzed items are listed
2. brief analyses are item-specific (no template duplicates)
3. relevance order is sensible (higher first)
4. clearly contrast research-style and implementation-style items when both exist
5. stats match analyzed/failed/excluded/total counts
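Requirement 5 can be verified by recomputing the counts from project state (a sketch assuming the `items.json` status values and `analysis/` layout described in the sections above):

```python
import json
from pathlib import Path

def report_stats(project: str) -> dict:
    """Recompute analyzed/failed/excluded/total counts for cross-checking index.md."""
    items = json.loads(Path(project, ".insight", "items.json").read_text())
    analyzed_ids = {p.stem for p in Path(project, ".insight", "analysis").glob("*.json")}
    return {
        "total": len(items),
        "excluded": sum(1 for i in items if i.get("status") == "excluded"),
        "failed": sum(1 for i in items if i.get("download_status") == "failed"),
        "analyzed": sum(1 for i in items if i["id"] in analyzed_ids),
    }
```

If these numbers disagree with the stats printed in `index.md`, regenerate the index before publishing.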
## 5) Configuration and Sources
### JSON output mode
Use `--json` whenever possible for structured parsing:
```bash
insight-pilot status --json --project ./research/myproject
```
### Blog source config (`sources.yaml`)
```yaml
blogs:
- name: "OpenAI News"
type: "rss"
url: "https://openai.com/news"
rss_url: "https://openai.com/news/rss.xml"
category: "frontier-ai"
- name: "Anthropic News"
type: "browser"
url: "https://www.anthropic.com/news"
category: "frontier-ai"
```
Manage sources:
```bash
insight-pilot sources --project ./research/webagent
```
Store blog full text during search:
```bash
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20 --blog-store-fulltext
```
Full text output:
- `$PROJECT/.insight/blog_fulltext/*.md`
- `$PROJECT/.insight/blog_fulltext/_manifest.json`
`insight-pilot analyze` automatically reads `metadata.fulltext_path` (if present) for blog/github analysis input.
### Environment variables
- `GITHUB_TOKEN` (higher GitHub API rate limit)
- `PUBMED_EMAIL` (required by NCBI)
- `OPENALEX_MAILTO` (polite usage for OpenAlex)
- `INSIGHT_PILOT_SOURCES` (override sources.yaml path)
Common source commands:
```bash
# GitHub repositories + code + issues
insight-pilot search --project $PROJECT --source github --query "agent framework" --limit 30
# PubMed (requires PUBMED_EMAIL)
insight-pilot search --project $PROJECT --source pubmed --query "clinical agents" --limit 20
# Blogs (rss/browser/auto from sources.yaml)
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20
# Blogs + full text persistence
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20 --blog-store-fulltext
```
## 6) Incremental Update Workflow
For daily/weekly updates:
```bash
# 1) Search new items
insight-pilot search --project $PROJECT --source arxiv openalex --query "web agent" --since 2026-01-17 --limit 20
# 2) Agent review new items in items.json
# 3) Download new PDFs
insight-pilot download --project $PROJECT
# 4) If needed, do Phase 3b retries via agent-browser
# 5) Analyze new items (LLM or manual fallback)
# 6) Regenerate report
insight-pilot index --project $PROJECT
```
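One way to derive the `--since` date for step 1 is from the newest `collected_at` in `items.json` (a sketch; the CLI may track this differently, e.g. in `state.json`):

```python
import json
from pathlib import Path

def since_date(project: str, default: str = "2026-01-01") -> str:
    """Return the newest collected_at date (YYYY-MM-DD) from items.json, or a default."""
    path = Path(project, ".insight", "items.json")
    if not path.exists():
        return default
    dates = [i["collected_at"][:10]
             for i in json.loads(path.read_text()) if i.get("collected_at")]
    return max(dates, default=default)
```

The result can be fed straight to `--since` in the search command above.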
## 7) Project Structure
```text
research/myproject/
├── .insight/
│   ├── config.yaml            # project config
│   ├── state.json             # workflow state
│   ├── items.json             # item metadata (status/exclude_reason)
│   ├── raw_arxiv.json         # raw source results
│   ├── raw_openalex.json
│   ├── download_failed.json   # for agent-browser retries
│   ├── analysis/              # analysis outputs
│   │   ├── i0001.json
│   │   └── ...
│   └── markdown/              # converted markdown from PDF
│       ├── i0001/
│       │   ├── i0001.md
│       │   └── metadata.json
│       └── ...
├── papers/                    # downloaded PDFs
├── reports/                   # per-item reports
└── index.md                   # current incremental report
```
## 8) Data Schema (Reference)
Item example:
```json
{
"id": "i0001",
"type": "paper",
"title": "Paper Title",
"authors": ["Author One", "Author Two"],
"date": "2026-01-15",
"abstract": "...",
"status": "active|excluded|pending",
"exclude_reason": null,
"identifiers": {
"doi": "10.1234/example",
"arxiv_id": "2601.12345",
"openalex_id": "W1234567890"
},
"urls": {
"abstract": "https://arxiv.org/abs/2601.12345",
"pdf": "https://arxiv.org/pdf/2601.12345"
},
"download_status": "success|pending|failed|unavailable",
"local_path": "./papers/i0001.pdf",
"citation_count": 42,
"source": ["arxiv", "openalex"],
"collected_at": "2026-01-17T10:00:00Z"
}
```
## 9) Error Codes (Reference)
| Code | Meaning | Retryable |
|---|---|---|
| `PROJECT_NOT_FOUND` | Project path does not exist | No |
| `NO_INPUT_FILES` | Required input files missing | No |
| `NO_ITEMS_FILE` | `items.json` missing | No |
| `INVALID_SOURCE` | Unknown source | No |
| `NETWORK_ERROR` | API/network failure | Yes |
| `RATE_LIMITED` | Rate limit hit | Yes |
| `DOWNLOAD_FAILED` | PDF download failed | Yes |
| `CONVERSION_FAILED` | PDF-to-Markdown conversion failed | Yes |
| `MISSING_DEPENDENCY` | Required package missing | No |
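The `Retryable` column can drive a simple backoff policy when wrapping CLI steps in automation (a sketch; it assumes failures surface as exceptions whose message is the error code):

```python
import time

# Codes marked "Yes" in the table above.
RETRYABLE = {"NETWORK_ERROR", "RATE_LIMITED", "DOWNLOAD_FAILED", "CONVERSION_FAILED"}

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error code, back off exponentially and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RuntimeError as err:
            if str(err) not in RETRYABLE or attempt == max_attempts:
                raise  # non-retryable, or retries exhausted
            sleep(base_delay * 2 ** (attempt - 1))
```

Non-retryable codes (`PROJECT_NOT_FOUND`, `INVALID_SOURCE`, etc.) are re-raised immediately so the agent can intervene.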
## 10) Agent Operating Rules
1. Always run bootstrap on first execution in a fresh environment.
2. Prefer `--json` outputs for machine-readable checks.
3. Run workflow phases in order; no per-command confirmation.
4. Agent must intervene in:
- Phase 0 (planning)
- Phase 2 (review/exclusion)
- Phase 3b (failed-site recovery)
- Phase 4 manual fallback (when LLM unavailable)
5. For incremental updates, process new items first and keep existing analyses.