@microsoft/image-vision — Agent Skill

---
name: image-vision
description: "Analyze images using LLM vision APIs (Anthropic Claude, OpenAI GPT-4, Google Gemini, Azure OpenAI). Use when tasks require: (1) Understanding image content, (2) Describing visual elements, (3) Answering questions about images, (4) Comparing images, (5) Extracting text from images (OCR). Provides ready-to-use scripts - no custom code needed for simple cases."
license: MIT
---

# Image Vision Analysis

## Overview

Analyze images using state-of-the-art LLM vision models. **Use the provided scripts** for most tasks - custom code is only needed for advanced scenarios.

## Workflow Decision Tree

### First time using this skill?
→ Read [`setup.md`](setup.md) for one-time environment and API key setup

### Simple image analysis (most common)
→ Use "Quick Start" canned scripts below

### Batch processing or multi-turn conversations
→ Read [`patterns.md`](patterns.md) for advanced patterns

### Something failing?
→ Check [`setup.md`](setup.md) for troubleshooting

## Quick Start (Use Wrapper Scripts)

**ALWAYS use the wrapper scripts** - they handle venv setup automatically:

```bash
# Simple analysis (auto-creates venv on first use)
./vision-analyze.sh <provider> <image_path> <prompt>

# Robust analysis (auto-fallback if provider times out)
./vision-analyze-robust.sh <image_path> <prompt> [timeout_seconds]
```

**The wrapper scripts automatically:**
- Create venv if it doesn't exist
- Install required SDKs
- Use venv Python (no manual activation needed)
- Handle errors gracefully

**Example usage:**

```bash
# Analyze a UI screenshot (Anthropic Claude)
./vision-analyze.sh anthropic screenshot.png "Describe any UI bugs or issues you see"

# Extract text (Google Gemini - fastest)
./vision-analyze.sh gemini document.jpg "Extract all text from this image"

# Robust analysis with auto-fallback (tries Gemini → Anthropic → OpenAI)
./vision-analyze-robust.sh photo.png "Describe this image in detail"

# With custom timeout (default is 60 seconds)
./vision-analyze-robust.sh large-image.png "Analyze this" 120
```

### Advanced: Direct Script Usage (Not Recommended)

If you need to call the Python scripts directly, you MUST use the venv Python:

```bash
# ❌ WRONG - uses system Python, will fail
python examples/anthropic-vision.py image.png "prompt"

# ✅ CORRECT - uses venv Python
./.venv/bin/python examples/anthropic-vision.py image.png "prompt"
```

**For agents:** Always use the wrapper scripts to avoid setup issues.

## Provider Comparison

| Provider | Model | Best For | Speed | Cost |
|----------|-------|----------|-------|------|
| **Anthropic** | claude-sonnet-4-5 | Latest, balanced quality/speed | Fast | $$ |
| **Anthropic** | claude-3-opus | Highest quality (older) | Slow | $$$ |
| **Anthropic** | claude-3-haiku | Fastest, simple tasks | Very Fast | $ |
| **OpenAI** | gpt-5 | Latest flagship model | Fast | $$$ |
| **OpenAI** | gpt-4.1 | High-volume production | Fast | $$ |
| **Gemini** | gemini-2.5-flash | Latest, excellent balance | Very Fast | $ |
| **Gemini** | gemini-2.5-pro | Large images, best quality | Medium | $$ |
| **Azure** | (deployment-based) | Enterprise, compliance | Varies | Varies |

## Supported Image Formats

- **JPEG/JPG** - Most common
- **PNG** - With transparency
- **GIF** - Static or animated
- **WEBP** - Modern format
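
A quick pre-flight check can catch unsupported files before an API call is spent on them. The sketch below is only an illustration: `is_supported_format` is a hypothetical helper that is not part of this skill, and it looks at the file extension only, not the file contents.

```bash
# is_supported_format FILE
# Hypothetical helper (not shipped with the skill): returns 0 when the
# extension of FILE is one of the formats listed above.
is_supported_format() {
    # Lowercase the name so that "PHOTO.JPG" matches as well.
    name=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *.jpg|*.jpeg|*.png|*.gif|*.webp) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_supported_format scan.tiff || echo "convert scan.tiff first"` would print the message, because TIFF is not in the list.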

**Max sizes:**
- Anthropic: 5MB per image
- OpenAI: 20MB (auto-resizes)
- Gemini: Varies by model (1.5 Pro handles very large images)
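
The limits above can be checked locally before calling a provider. This is a minimal sketch, assuming 1MB means 1024 × 1024 bytes (the providers may count differently) and using a hypothetical helper name, `check_image_size`. Providers without a fixed limit in the list above are simply allowed through.

```bash
# check_image_size PROVIDER IMAGE
# Hypothetical helper: returns 0 if IMAGE is within the limit listed above
# for PROVIDER, 1 if it is too large, 2 if the file does not exist.
check_image_size() {
    provider=$1
    image=$2
    [ -f "$image" ] || { echo "no such file: $image" >&2; return 2; }
    case "$provider" in
        anthropic) limit_mb=5 ;;
        openai)    limit_mb=20 ;;
        *)         return 0 ;;   # no fixed limit listed (e.g. Gemini, Azure)
    esac
    # Arithmetic expansion also strips the padding some wc versions print.
    size_bytes=$(( $(wc -c < "$image") ))
    # Assumption: 1MB = 1024 * 1024 bytes.
    if [ "$size_bytes" -gt $((limit_mb * 1024 * 1024)) ]; then
        echo "$image is $size_bytes bytes, over the ${limit_mb}MB limit for $provider" >&2
        return 1
    fi
    return 0
}
```

A typical use is `check_image_size anthropic screenshot.png && ./vision-analyze.sh anthropic screenshot.png "Describe this"`.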

## Common Use Cases

```bash
# UI/UX Analysis - High-level layout and spacing
./vision-analyze.sh anthropic app-screenshot.png \
  "Analyze this UI for accessibility issues and suggest improvements"

# Bug Identification (use robust for auto-fallback)
./vision-analyze-robust.sh error-state.png \
  "What's wrong with this interface? Describe any visual bugs."

# Content Moderation
./vision-analyze.sh openai user-upload.jpg \
  "Does this image contain inappropriate content? Yes or no, and explain."

# Document Understanding (Gemini is fastest)
./vision-analyze.sh gemini invoice.png \
  "Extract the total amount, date, and vendor name from this invoice"

# Design Review - Layout, color, hierarchy (not typography details)
./vision-analyze-robust.sh mockup.png \
  "Provide design feedback on this mockup. Consider layout, color hierarchy, and spacing."
```

## ⚠️ Known Limitations for Web UI Analysis

### Typography and Font Detection

Vision models **struggle with precise typography** at typical screenshot resolutions:

**❌ Unreliable for:**
- Distinguishing serif vs sans-serif fonts at small sizes (<16px)
- Identifying specific font families (Inter vs Roboto vs Arial)
- Detecting subtle weight differences (400 vs 500)
- Precise alignment measurements (<5px differences)

**✅ Reliable for:**
- High-level layout issues (spacing, hierarchy, colors)
- Large size differences (14px vs 24px heading sizes)
- Missing elements or obviously broken UI states
- Color contrast and accessibility problems

### Best Practice: Multi-Modal Investigation

**For Web UI bugs, use this hierarchy:**

```bash
# 1. Vision for TRIAGE (identify area of concern)
./vision-analyze-robust.sh screenshot.png "Are there any visual inconsistencies in the navigation?"

# 2. Browser inspection for FACTS (if typography/font suspected)
# Use Playwright or DevTools to query computed CSS:
# const styles = await page.evaluate(() => {
#   const element = document.querySelector('nav a'); // hypothetical selector
#   return { fontFamily: getComputedStyle(element).fontFamily };
# });

# 3. Code investigation for ROOT CAUSE
# grep -r ".suspicious-class" src/

# 4. Vision for VERIFICATION (after fix applied)
./vision-analyze-robust.sh fixed.png "Is the navigation font now consistent?"
```

### When to Stop Using Vision

If vision gives **contradictory results** across 2+ attempts on similar screenshots:
1. **Stop** asking vision for more detailed analysis
2. **Switch** to browser DevTools inspection (query computed styles)
3. **Use vision only** for final verification after fix is applied

Contradictory results indicate that the issue is too subtle for vision models to detect reliably.
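
One way to make "contradictory" concrete is to ask a yes/no question twice and compare the leading verdicts. The sketch below assumes the prompt tells the model to begin its answer with "Yes" or "No". Both helper names are hypothetical, and anything that does not start with a clear verdict is treated as unclear rather than guessed.

```bash
# first_verdict TEXT
# Hypothetical helper: prints "yes", "no", or "unclear" based on the first
# word of the first line of TEXT.
first_verdict() {
    word=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' \
        | sed -n '1s/^[^a-z]*\([a-z]*\).*/\1/p')
    case "$word" in
        yes|no) echo "$word" ;;
        *)      echo "unclear" ;;
    esac
}

# verdicts_agree ANSWER_A ANSWER_B
# Returns 0 only when both answers start with the same clear yes/no verdict.
verdicts_agree() {
    a=$(first_verdict "$1")
    b=$(first_verdict "$2")
    [ "$a" != "unclear" ] && [ "$a" = "$b" ]
}
```

If `verdicts_agree "$first_answer" "$second_answer"` fails, that is the signal to stop prompting and switch to browser inspection.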

### Prompt Patterns for Web UI

**Font/Typography (with caveats):**
```bash
# Be explicit about what to look for
./vision-analyze.sh anthropic ui.png \
  "Look at the navigation text. Do any items have decorative 'feet' at letter ends (serif font) 
  while others have clean straight edges (sans-serif)? Point out any font style differences."
  
# Note: Small fonts may be unreliable - verify with browser inspection
```

**Alignment (relative observations):**
```bash
# Ask for noticeable differences, not pixel precision
./vision-analyze.sh anthropic ui.png \
  "Is the bullet (•) noticeably misaligned with the text baseline? 
  Describe its vertical position relative to the text."
```

**Layout and Spacing:**
```bash
# Vision is GOOD at this
./vision-analyze.sh anthropic ui.png \
  "Compare the spacing between navigation sections. Is it consistent?"
```

## Output Format

All scripts output to stdout as plain text. The LLM's analysis is printed directly:

```bash
$ ./.venv/bin/python examples/anthropic-vision.py screenshot.png "What's in this image?"

This image shows a web application dashboard with a navigation bar at the top,
a sidebar on the left with menu items, and a main content area displaying...
```

**For structured output**, modify your prompt:

```bash
./.venv/bin/python examples/openai-vision.py data.png \
  "Extract data as JSON with keys: title, date, amount"
```
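
Because the scripts print the model's reply verbatim, a JSON answer may arrive wrapped in Markdown code-fence lines, which breaks downstream parsers. The filter below is a small sketch for that case; `strip_code_fences` is a hypothetical name, and it only removes fence lines without validating the JSON itself.

```bash
# strip_code_fences
# Hypothetical filter: reads model output on stdin and drops any line that
# starts with three backticks (optionally indented), leaving the payload.
strip_code_fences() {
    sed '/^[[:space:]]*`\{3\}/d'
}
```

It can be placed at the end of a pipeline, for example `./vision-analyze.sh openai data.png "Extract data as JSON with keys: title, date, amount" | strip_code_fences`.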

## When to Write Custom Scripts

**Use the canned scripts for:**
- ✅ Single image + single prompt analysis
- ✅ Quick one-off tasks
- ✅ Simple Q&A about images

**Write custom scripts when you need:**
- ❌ Batch processing (analyze 100 images)
- ❌ Multi-turn conversations (follow-up questions on same image)
- ❌ Custom output formatting (generate markdown reports)
- ❌ Image preprocessing (resize, crop, filter)
- ❌ Custom provider fallback logic (e.g. a different order than `vision-analyze-robust.sh` uses)

→ See [`patterns.md`](patterns.md) for custom script examples
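
For small batches, a shell loop around the wrapper may be enough before reaching for a custom Python script. The following is a rough sketch, not the approach from `patterns.md`: `batch_analyze` and the `VISION_CMD` variable are hypothetical names, and the loop assumes the wrapper exits non-zero on failure.

```bash
# Assumption: VISION_CMD is the command run for each image. It defaults to
# the robust wrapper described in this document.
VISION_CMD=${VISION_CMD:-./vision-analyze-robust.sh}

# batch_analyze PROMPT OUT_DIR IMAGE...
# Writes one <image-name>.txt per successful analysis into OUT_DIR and
# returns non-zero if any image failed.
batch_analyze() {
    prompt=$1
    out_dir=$2
    shift 2
    failed=0
    mkdir -p "$out_dir" || return 2
    for image in "$@"; do
        result_file="$out_dir/$(basename "$image").txt"
        if "$VISION_CMD" "$image" "$prompt" > "$result_file"; then
            echo "ok: $image"
        else
            # Never keep partial output from a failed call.
            rm -f "$result_file"
            echo "FAILED: $image" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

Usage might look like `batch_analyze "Describe any UI bugs" results/ screenshots/*.png`. Failed images leave no result file behind, so a missing file always means the image was not analyzed.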

## Anti-Patterns

| ❌ Don't | ✅ Do |
|----------|-------|
| Write custom script for simple analysis | Use canned scripts |
| Use low-quality compressed images | Use clear, high-res images |
| Ask vague questions | Be specific in prompts |
| Forget to set API keys | Set keys in environment variables |
| Mix up provider-specific model names | Check provider comparison table |

## Quick Reference

| Task | Command |
|------|---------|
| Analyze (single provider) | `./vision-analyze.sh anthropic img.png "prompt"` |
| Analyze (auto-fallback) | `./vision-analyze-robust.sh img.png "prompt"` |
| Extract text (OCR) | `./vision-analyze.sh gemini img.png "Extract all text"` |
| Health check | `./health-check.sh` |
| Compare images | See patterns.md for custom script |
| Batch process | See patterns.md for custom script |

## ⚠️ CRITICAL INSTRUCTIONS FOR AGENTS

**READ THIS BEFORE USING THIS SKILL:**

### 1. Always Use the Wrapper Scripts

```bash
# For AI agents (recommended) - auto-fallback on timeout
~/.amplifier/skills/image-vision/vision-analyze-robust.sh <image_path> <prompt>

# Single provider (faster if you know which to use)
~/.amplifier/skills/image-vision/vision-analyze.sh <provider> <image_path> <prompt>
```

**Examples:**
```bash
# Robust analysis (tries multiple providers if timeout)
~/.amplifier/skills/image-vision/vision-analyze-robust.sh screenshot.png "Analyze this UI"

# Specific provider
~/.amplifier/skills/image-vision/vision-analyze.sh anthropic screenshot.png "Describe this"
```

### 2. ALWAYS Check Exit Code Before Using Output

```bash
# Correct usage pattern
OUTPUT=$(~/.amplifier/skills/image-vision/vision-analyze-robust.sh image.png "Analyze this" 2>&1)
EXIT_CODE=$?

if [ $EXIT_CODE -eq 0 ]; then
    echo "Vision analysis succeeded"
    # Now you can use $OUTPUT
else
    echo "ERROR: Vision analysis failed (exit code: $EXIT_CODE)"
    echo "Error details: $OUTPUT"
    # STOP HERE - do NOT proceed
    exit 1
fi
```
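
Note that `2>&1` in the pattern above merges stderr into `$OUTPUT`, so warnings can end up mixed into the analysis text. A variant that keeps the two streams apart is sketched below. `run_vision` and `VISION_CMD` are hypothetical names, and the sketch assumes the wrapper writes the analysis to stdout and diagnostics to stderr.

```bash
# Assumption: VISION_CMD points at the robust wrapper from this section.
VISION_CMD=${VISION_CMD:-"$HOME/.amplifier/skills/image-vision/vision-analyze-robust.sh"}

# run_vision IMAGE PROMPT
# On success: prints only the analysis (the wrapper's stdout).
# On failure: prints the captured stderr to stderr, prints nothing on
# stdout, and returns the wrapper's exit code.
run_vision() {
    err_file=$(mktemp) || return 2
    if analysis=$("$VISION_CMD" "$1" "$2" 2>"$err_file"); then
        rm -f "$err_file"
        printf '%s\n' "$analysis"
    else
        status=$?
        echo "ERROR: vision analysis failed (exit code: $status)" >&2
        cat "$err_file" >&2
        rm -f "$err_file"
        return "$status"
    fi
}
```

With this helper, `OUTPUT=$(run_vision image.png "Analyze this") || exit 1` leaves `$OUTPUT` empty on failure, so partial or error text cannot be mistaken for a real observation.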

### 3. NEVER Fabricate Visual Observations

**If vision analysis fails, you MUST:**

✅ **DO:**
- Report failure explicitly to user
- Provide error details from stderr
- Ask user how to proceed (retry? different provider? skip visual analysis?)
- Wait for user direction before continuing

❌ **NEVER:**
- Write analysis documents without successfully seeing images
- Fabricate visual observations based on context/guesswork
- Guess pixel measurements or UI element details
- Pretend you analyzed screenshots you didn't actually see
- Continue with tasks that require visual inspection if vision failed

**Example of CORRECT failure handling:**

```
Agent: I attempted to analyze the first of the 3 screenshots using the image-vision skill, and every provider failed:
- screenshot-1.png: ✗ Gemini timed out (60s)
- screenshot-1.png: ✗ Anthropic timed out (60s)
- screenshot-1.png: ✗ OpenAI failed (API error)

I have NOT successfully analyzed any of the screenshots. I cannot provide visual design 
feedback without actually seeing the images.

Options:
1. Retry with different settings
2. Investigate why all providers are failing
3. Defer visual analysis until the issue is resolved

I will NOT write design analysis documents based on guesswork or context alone.
```

### 4. Timeout Considerations

Vision API calls can take anywhere from 3 to 60 seconds. Typical latencies:
- Gemini Flash: 3-10s (fastest)
- Anthropic Claude: 5-15s
- OpenAI GPT-4: 8-20s

The wrapper scripts handle timeouts with:
- 60-second default timeout (configurable)
- Auto-fallback to the next provider (robust script)
- Retry logic on transient failures

If still hitting timeouts:
- Use smaller images (resize so the longest side is at most 2000px)
- Simplify prompts
- Use faster models (Gemini Flash)
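
Another option is to retry with a longer timeout, using the optional `timeout_seconds` argument of the robust script. The sketch below is only one way to do it: `analyze_with_longer_timeout` and `VISION_CMD` are hypothetical names. Because this document does not specify a dedicated exit code for timeouts, the helper retries on any non-zero exit.

```bash
# Assumption: VISION_CMD accepts <image_path> <prompt> [timeout_seconds],
# as the robust wrapper in this document does.
VISION_CMD=${VISION_CMD:-./vision-analyze-robust.sh}

# analyze_with_longer_timeout IMAGE PROMPT [TIMEOUT...]
# Tries each timeout in turn (default: 60, then 120 seconds) and stops at
# the first successful call.
analyze_with_longer_timeout() {
    image=$1
    prompt=$2
    shift 2
    [ "$#" -gt 0 ] || set -- 60 120
    for timeout_s in "$@"; do
        if "$VISION_CMD" "$image" "$prompt" "$timeout_s"; then
            return 0
        fi
        echo "attempt with ${timeout_s}s timeout failed" >&2
    done
    return 1
}
```

For example, `analyze_with_longer_timeout large-image.png "Analyze this" 60 120 180` would make up to three attempts before giving up.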

## Environment Setup Reminder

**For interactive use:**
1. Create venv: `cd image-vision && uv venv`
2. Install SDKs: `uv pip install anthropic openai google-generativeai`
3. Set API keys: Export `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`
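
To confirm step 3 without printing any secrets, the key variables can be checked by name only. This is a minimal sketch with a hypothetical helper, `missing_api_keys`. It checks all three variables listed above, although a setup that uses a single provider may not need every key.

```bash
# missing_api_keys
# Hypothetical helper: prints the name (never the value) of each unset or
# empty key variable and returns non-zero if any are missing.
missing_api_keys() {
    missing=0
    for var in ANTHROPIC_API_KEY OPENAI_API_KEY GOOGLE_API_KEY; do
        # Indirect lookup of the variable named by $var (POSIX-compatible).
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Running `missing_api_keys || echo "set the keys listed above first"` gives a quick answer before the first real API call.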

**For agents:**
- Just use the wrapper scripts - they auto-setup on first use
- Verify health: `./health-check.sh`

→ See [`setup.md`](setup.md) for complete instructions

## See Also

- [`setup.md`](setup.md) — One-time environment setup, API keys, troubleshooting
- [`patterns.md`](patterns.md) — Advanced patterns: batch processing, multi-turn, custom output