---
name: insight-pilot
description: Frontier content tracking automation - search papers, code, and blogs, deduplicate, download PDFs, analyze and generate research reports. Supports incremental updates.
version: 0.3.0
---
# Insight-Pilot Skill
A workflow skill for frontier tracking: search papers/code/blogs, deduplicate, download, analyze, and generate incremental reports.
## 1) Setup
Run the bootstrap script first; it checks the environment, creates a venv, and installs dependencies when needed:
```bash
bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh
```
Bootstrap prefers a local editable install from this repo so that CLI behavior stays aligned with this skill doc.
Recommended preflight:
```bash
source ~/.insight-pilot-venv/bin/activate
insight-pilot --version
insight-pilot init -h | rg -- '--intent|--language'
```
## 2) CLI Quick Reference
Activate env before running commands:
```bash
source ~/.insight-pilot-venv/bin/activate
```
| Command | Purpose | Required Args | Key Optional Args |
|---|---|---|---|
| `init` | Create project + intent plan | `--topic`, `--output` | `--intent`, `--language`, `--keywords`, `--sources-config` |
| `search` | Search + merge + dedup | `--project`, `--source`, `--query` | `--limit`, `--since`, `--until` |
| `download` | Download PDFs + convert to Markdown | `--project` | - |
| `analyze` | Analyze with LLM or prepare agent-mode tasks | `--project` | `--mode`, `--config`, `--force` |
| `index` | Generate `index.md` + `reports/` | `--project` | `--template` |
| `status` | Show project status | `--project` | `--json` |
| `sources` | Manage blog sources | `--project` | `--add`, `--remove`, `--config` |
`search --since/--until` currently applies to: `arxiv`, `openalex`, `github`, `blog`.
## 3) Skill-First Principle (Mandatory)
This workflow is not script-only. Agent judgment is required.
Before large-scale search, the agent must:
1. Summarize user intent in one sentence.
2. Propose keyword sets:
- core terms (exact targets)
- expansion terms (synonyms/variants)
- benchmark terms (e.g., `WebArena`, `Mind2Web`, `BrowserGym`)
3. Propose source plan:
- default: `arxiv`, `openalex`, `github`, `blog`
- conditional: `pubmed` (only biomedical/medical topics)
- default blog presets: OpenAI, Anthropic, MiniMax, Kimi, Seed, Hacker News
4. Show the available, selected, and unselected sources to the user.
5. Ask the user to confirm any add/remove/update before searching.
Language policy:
- Persisted analysis/report language defaults to Chinese (`zh`).
- `init` stores detected input language as reference.
Suggested init command:
```bash
PROJECT=./research/webagent
insight-pilot init \
--topic "WebAgent Research" \
--intent "Track browser use, computer use, gui agent frontier progress" \
--language zh \
--keywords "browser use,computer use,gui agent" \
--output $PROJECT
```
Minimal confirmation template:
```text
Intent: <one-line>
Keywords: <k1>, <k2>, <k3>
Sources - Selected: <...>; Unselected: <...>
Blog presets - Selected: <...>; Unselected: <...>
Confirm:
1) Keep
2) Add sources/blogs: <list>
3) Remove sources/blogs: <list>
4) Update keywords: <new>
```
## 4) End-to-End Workflow
Execution rule: run CLI phases directly in sequence; no line-by-line confirmation needed.
### Phase 1: Search and Dedup
```bash
PROJECT=./research/webagent
insight-pilot init --topic "WebAgent Research" --keywords "web agent,browser agent" --output $PROJECT
insight-pilot search --project $PROJECT --source arxiv openalex github pubmed blog --query "web agent" --limit 50
```
### Phase 2: Manual Review (Agent)
After dedup, review and exclude irrelevant items.
```bash
insight-pilot status --json --project $PROJECT
```
Agent actions:
1. Read `$PROJECT/.insight/items.json`.
2. Check `title` and `abstract` for each item.
3. Mark off-topic items with:
- `status: "excluded"`
- `exclude_reason: "..."`
4. Save updated `items.json`.
Example:
```json
{
"id": "i0023",
"title": "Unrelated Paper Title",
"status": "excluded",
"exclude_reason": "Not related to web agents"
}
```
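The review pass above can be sketched in Python. The keyword filter here is only a placeholder criterion for illustration; the real pass uses agent judgment on each title/abstract:

```python
import json
from pathlib import Path

# Placeholder off-topic criteria; in practice the agent judges each item.
OFF_TOPIC = ("protein folding", "recommendation system")

def review_items(items, off_topic=OFF_TOPIC):
    """Mark items whose title/abstract hit an off-topic term as excluded."""
    for item in items:
        text = f"{item.get('title', '')} {item.get('abstract', '')}".lower()
        if any(term in text for term in off_topic):
            item["status"] = "excluded"
            item["exclude_reason"] = "Off-topic for this project"
    return items

items_path = Path("research/webagent/.insight/items.json")
if items_path.exists():
    updated = review_items(json.loads(items_path.read_text()))
    items_path.write_text(json.dumps(updated, ensure_ascii=False, indent=2))
```

Only `status` and `exclude_reason` are written; all other item fields stay untouched so later phases can still use them.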
### Phase 3: Download and Convert
```bash
insight-pilot download --project $PROJECT
```
Result semantics:
- success: `download_status="success"`, file under `papers/`
- failed: recorded in `$PROJECT/.insight/download_failed.json`
Failure record example:
```json
[
{
"id": "i0015",
"title": "Paper Title",
"url": "https://...",
"error": "Connection timeout",
"failed_at": "2026-01-17T10:30:00Z"
}
]
```
### Phase 3b: Failed-Site Recovery (Agent + `agent-browser`)
When `download_failed.json` is non-empty:
1. Read failure list.
2. Try `url`, then `alternative_urls`.
3. Use `agent-browser` to trigger download links/buttons.
4. Save success to `$PROJECT/papers/{id}.pdf`.
5. Update `$PROJECT/.insight/items.json`:
- success: `download_status="success"`, set `local_path`, clear `download_error`
- failed: keep `download_status="failed"`, update `download_error`
6. Rewrite `download_failed.json` with only remaining failures and increment `retry_count`.
Recommended command pattern:
```bash
agent-browser --session "ip-l2-i0015" open "https://target-site.example/paper"
agent-browser --session "ip-l2-i0015" wait --load networkidle
agent-browser --session "ip-l2-i0015" snapshot -i
agent-browser --session "ip-l2-i0015" find text "Download PDF" click
agent-browser --session "ip-l2-i0015" download "a[href$='.pdf']" "$PROJECT/papers/i0015.pdf"
agent-browser --session "ip-l2-i0015" close
```
If exact label is absent, try: `PDF`, `Full Text PDF`, `View PDF`, `Download`.
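The bookkeeping in steps 5-6 can be sketched as follows (field names follow the examples above; treating `retry_count` as a field on the failure record is an assumption about its shape):

```python
import json
from pathlib import Path

def reconcile_downloads(items, failures, papers_dir):
    """After manual retries, sync items.json entries and keep only remaining failures."""
    papers = Path(papers_dir)
    by_id = {item["id"]: item for item in items}
    still_failed = []
    for rec in failures:
        item = by_id.get(rec["id"])
        if item is None:
            continue
        pdf = papers / f"{rec['id']}.pdf"
        if pdf.exists():
            # retry succeeded: mark success, set local_path, clear download_error
            item["download_status"] = "success"
            item["local_path"] = str(pdf)
            item.pop("download_error", None)
        else:
            # retry failed: keep the record and increment retry_count
            item["download_status"] = "failed"
            item["download_error"] = rec.get("error", "unknown")
            rec["retry_count"] = rec.get("retry_count", 0) + 1
            still_failed.append(rec)
    return items, still_failed
```

The returned `still_failed` list is what gets written back to `download_failed.json`.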
### Phase 4: Analyze
Precondition: finish Phase 3 (and 3b if needed).
Default (LLM):
```bash
insight-pilot analyze --project $PROJECT
```
Content priority:
1. Markdown converted from PDF (`pymupdf4llm`)
2. PDF extraction (`PyMuPDF`)
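The priority order amounts to a path-resolution rule over the project layout in Section 7 (a sketch; the shipped resolver may differ in detail):

```python
from pathlib import Path

def analysis_input(project: str, item_id: str):
    """Pick the analysis input for an item: converted Markdown first, then the raw PDF."""
    md = Path(project) / ".insight" / "markdown" / item_id / f"{item_id}.md"
    if md.exists():
        return md   # 1) Markdown converted from PDF (pymupdf4llm)
    pdf = Path(project) / "papers" / f"{item_id}.pdf"
    if pdf.exists():
        return pdf  # 2) PDF extraction (PyMuPDF)
    return None     # nothing downloaded yet
```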
LLM config file: `.codex/skills/insight-pilot/llm.yaml`
```yaml
provider: openai # openai / anthropic / ollama
model: gpt-4o-mini
api_key: sk-xxx
```
Optional no-API mode:
```bash
insight-pilot analyze --project $PROJECT --mode agent
```
Generates:
- `$PROJECT/.insight/analysis_agent_tasks.json`
- `$PROJECT/.insight/analysis_inputs/{id}.md`
Final analysis output path:
- `$PROJECT/.insight/analysis/{id}.json`
#### LLM Unavailable/Failing: High-Quality Manual Analysis Required
Manual mode is mandatory for these cases:
- no LLM config
- auth errors (401/403)
- unresolved rate limit/timeout (429/timeout)
Recommended fallback command first:
```bash
insight-pilot analyze --project $PROJECT --mode agent
```
Then execute this protocol:
1. Build queue by relevance + recency; exclude off-topic items first.
2. Gather evidence per type:
- paper: abstract + markdown/pdf + IDs/links
- github: summary + topics + stars/forks + README key sections + update date
- blog: title + full text + source link + date
3. Write differentiated analysis:
- unique `brief_analysis` and `detailed_analysis` per item
- include at least 3 concrete facts (numbers/mechanisms/scope/benchmark/architecture)
- include limitations and risk notes
4. Add cross-item positioning:
- research prototype vs production framework vs benchmark/tooling
- compare adjacent approaches when evidence allows
5. Pass quality gate before index generation.
Manual Definition of Done (all checks must pass):
1. Source coverage:
- papers use markdown/pdf + abstract + metadata
- github uses metadata + README/fulltext + topics
- blogs use fulltext/summary + date/link
2. Evidence density:
- each item has >=3 concrete facts
3. Differentiation:
- no reusable template paragraphs across items
4. Risk clarity:
- each item has explicit limitations and deployment/quality risk
5. Output completeness:
- all active items have `$PROJECT/.insight/analysis/{id}.json`
If key evidence is missing:
- explicitly mark low confidence
- lower `relevance_score`
- do not invent details
Analysis JSON example:
```json
{
"id": "i0001",
"title": "Paper Title",
"summary": "One sentence summary",
"brief_analysis": "2-3 sentences brief analysis",
"detailed_analysis": "300-500 words detailed analysis",
"contributions": ["Contribution 1", "Contribution 2"],
"methodology": "Methodology description",
"key_findings": ["Finding 1", "Finding 2"],
"limitations": ["Limitation"],
"future_work": ["Future work"],
"relevance_score": 8,
"tags": ["webagent", "benchmark", "multimodal"],
"analyzed_at": "2026-01-17T12:00:00Z"
}
```
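Parts of the quality gate can be checked mechanically over `analysis/{id}.json` files. This is a minimal sketch using the field names from the example above; the word-count threshold is illustrative, and the evidence-density and differentiation checks still need human judgment:

```python
import json
from pathlib import Path

REQUIRED = ("summary", "brief_analysis", "detailed_analysis",
            "contributions", "limitations", "relevance_score")

def gate(analysis: dict) -> list:
    """Return the gate violations for one analysis record."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in analysis]
    if len(analysis.get("detailed_analysis", "").split()) < 300:
        problems.append("detailed_analysis below the 300-word target")
    if not analysis.get("limitations"):
        problems.append("no explicit limitations")
    return problems

def check_project(project: str) -> dict:
    """Map item id -> violations for every file under .insight/analysis/."""
    return {p.stem: gate(json.loads(p.read_text()))
            for p in Path(project, ".insight", "analysis").glob("*.json")}
```

An empty violation list per item is necessary but not sufficient; the differentiation check (no template paragraphs) remains a manual read.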
### Phase 5: Generate Report
```bash
insight-pilot index --project $PROJECT
```
Outputs:
- `$PROJECT/index.md` (analyzed items only)
- `$PROJECT/reports/{id}.md`
Index quality requirements:
1. only high-confidence analyzed items are listed
2. brief analyses are item-specific (no template duplicates)
3. relevance order is sensible (higher first)
4. clearly contrast research-style and implementation-style items when both exist
5. stats match analyzed/failed/excluded/total counts
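Requirement 5 can be verified by recomputing the counts from project state (a sketch assuming the `items.json` status values and `analysis/` layout described in the sections above):

```python
import json
from pathlib import Path

def report_stats(project: str) -> dict:
    """Recompute analyzed/failed/excluded/total counts for cross-checking index.md."""
    items = json.loads(Path(project, ".insight", "items.json").read_text())
    analyzed_ids = {p.stem for p in Path(project, ".insight", "analysis").glob("*.json")}
    return {
        "total": len(items),
        "excluded": sum(1 for i in items if i.get("status") == "excluded"),
        "failed": sum(1 for i in items if i.get("download_status") == "failed"),
        "analyzed": sum(1 for i in items if i["id"] in analyzed_ids),
    }
```

If these numbers disagree with the stats printed in `index.md`, regenerate the index before publishing.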
## 5) Configuration and Sources
### JSON output mode
Use `--json` whenever possible for structured parsing:
```bash
insight-pilot status --json --project ./research/myproject
```
### Blog source config (`sources.yaml`)
```yaml
blogs:
- name: "OpenAI News"
type: "rss"
url: "https://openai.com/news"
rss_url: "https://openai.com/news/rss.xml"
category: "frontier-ai"
- name: "Anthropic News"
type: "browser"
url: "https://www.anthropic.com/news"
category: "frontier-ai"
```
Manage sources:
```bash
insight-pilot sources --project ./research/webagent
```
Store blog full text during search:
```bash
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20 --blog-store-fulltext
```
Full text output:
- `$PROJECT/.insight/blog_fulltext/*.md`
- `$PROJECT/.insight/blog_fulltext/_manifest.json`
`insight-pilot analyze` automatically reads `metadata.fulltext_path` (if present) for blog/github analysis input.
### Environment variables
- `GITHUB_TOKEN` (higher GitHub API rate limit)
- `PUBMED_EMAIL` (required by NCBI)
- `OPENALEX_MAILTO` (polite usage for OpenAlex)
- `INSIGHT_PILOT_SOURCES` (override sources.yaml path)
Common source commands:
```bash
# GitHub repositories + code + issues
insight-pilot search --project $PROJECT --source github --query "agent framework" --limit 30
# PubMed (requires PUBMED_EMAIL)
insight-pilot search --project $PROJECT --source pubmed --query "clinical agents" --limit 20
# Blogs (rss/browser/auto from sources.yaml)
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20
# Blogs + full text persistence
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20 --blog-store-fulltext
```
## 6) Incremental Update Workflow
For daily/weekly updates:
```bash
# 1) Search new items
insight-pilot search --project $PROJECT --source arxiv openalex --query "web agent" --since 2026-01-17 --limit 20
# 2) Agent review new items in items.json
# 3) Download new PDFs
insight-pilot download --project $PROJECT
# 4) If needed, do Phase 3b retries via agent-browser
# 5) Analyze new items (LLM or manual fallback)
# 6) Regenerate report
insight-pilot index --project $PROJECT
```
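One way to derive the `--since` date for step 1 is from the newest `collected_at` in `items.json` (a sketch; the CLI may track this differently, e.g. in `state.json`):

```python
import json
from pathlib import Path

def since_date(project: str, default: str = "2026-01-01") -> str:
    """Return the newest collected_at date (YYYY-MM-DD) from items.json, or a default."""
    path = Path(project, ".insight", "items.json")
    if not path.exists():
        return default
    dates = [i["collected_at"][:10]
             for i in json.loads(path.read_text()) if i.get("collected_at")]
    return max(dates, default=default)
```

The result can be fed straight to `--since` in the search command above.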
## 7) Project Structure
```text
research/myproject/
├── .insight/
│   ├── config.yaml            # project config
│   ├── state.json             # workflow state
│   ├── items.json             # item metadata (status/exclude_reason)
│   ├── raw_arxiv.json         # raw source results
│   ├── raw_openalex.json
│   ├── download_failed.json   # for agent-browser retries
│   ├── analysis/              # analysis outputs
│   │   ├── i0001.json
│   │   └── ...
│   └── markdown/              # converted markdown from PDF
│       ├── i0001/
│       │   ├── i0001.md
│       │   └── metadata.json
│       └── ...
├── papers/                    # downloaded PDFs
├── reports/                   # per-item reports
└── index.md                   # current incremental report
```
## 8) Data Schema (Reference)
Item example:
```json
{
"id": "i0001",
"type": "paper",
"title": "Paper Title",
"authors": ["Author One", "Author Two"],
"date": "2026-01-15",
"abstract": "...",
"status": "active|excluded|pending",
"exclude_reason": null,
"identifiers": {
"doi": "10.1234/example",
"arxiv_id": "2601.12345",
"openalex_id": "W1234567890"
},
"urls": {
"abstract": "https://arxiv.org/abs/2601.12345",
"pdf": "https://arxiv.org/pdf/2601.12345"
},
"download_status": "success|pending|failed|unavailable",
"local_path": "./papers/i0001.pdf",
"citation_count": 42,
"source": ["arxiv", "openalex"],
"collected_at": "2026-01-17T10:00:00Z"
}
```
## 9) Error Codes (Reference)
| Code | Meaning | Retryable |
|---|---|---|
| `PROJECT_NOT_FOUND` | Project path does not exist | No |
| `NO_INPUT_FILES` | Required input files missing | No |
| `NO_ITEMS_FILE` | `items.json` missing | No |
| `INVALID_SOURCE` | Unknown source | No |
| `NETWORK_ERROR` | API/network failure | Yes |
| `RATE_LIMITED` | Rate limit hit | Yes |
| `DOWNLOAD_FAILED` | PDF download failed | Yes |
| `CONVERSION_FAILED` | PDF-to-Markdown conversion failed | Yes |
| `MISSING_DEPENDENCY` | Required package missing | No |
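The `Retryable` column can drive a simple backoff policy when wrapping CLI steps in automation (a sketch; it assumes failures surface as exceptions whose message is the error code):

```python
import time

# Codes marked "Yes" in the table above.
RETRYABLE = {"NETWORK_ERROR", "RATE_LIMITED", "DOWNLOAD_FAILED", "CONVERSION_FAILED"}

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error code, back off exponentially and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RuntimeError as err:
            if str(err) not in RETRYABLE or attempt == max_attempts:
                raise  # non-retryable, or retries exhausted
            sleep(base_delay * 2 ** (attempt - 1))
```

Non-retryable codes (`PROJECT_NOT_FOUND`, `INVALID_SOURCE`, etc.) are re-raised immediately so the agent can intervene.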
## 10) Agent Operating Rules
1. Always run bootstrap on first execution in a fresh environment.
2. Prefer `--json` outputs for machine-readable checks.
3. Run workflow phases in order; no per-command confirmation.
4. Agent must intervene in:
- Phase 0 (planning)
- Phase 2 (review/exclusion)
- Phase 3b (failed-site recovery)
- Phase 4 manual fallback (when LLM unavailable)
5. For incremental updates, process new items first and keep existing analyses.