document-converter (skill)

Install: `apm install @mattnigh/document-converter`

---
name: document-converter
description: Convert between Markdown, DOCX, and PDF formats bidirectionally. Handles text extraction from PDF/DOCX, markdown to document conversion. Use when converting document formats or extracting structured content from Word or PDF files.
allowed-tools: Bash, Read, Glob, Write
dependencies:
- pandoc>=2.0
- python3>=3.8
- markitdown (optional, recommended)
- pymupdf4llm (optional, recommended)
- google-genai (optional, for enhanced PDF conversion)
- pdf2docx (optional, for PDF to DOCX)
model: haiku-4.5
model-justification: Orchestrates external conversion tools with minimal AI reasoning required
fallback-model: sonnet-4.5
---
# Document Converter Skill
Convert documents bidirectionally between Markdown, DOCX, and PDF formats. This skill automatically detects optimal conversion tools, handles batch processing, and ensures quality output with appropriate fallback mechanisms.
## Core Capabilities
### Conversion Modes
The skill supports two conversion modes:
**Default Mode (API)**: When GEMINI_API_KEY is set, PDF-to-Markdown conversions use the Google Gemini API for significantly higher fidelity (roughly 20-30 percentage points above the local tools, per the figures in the Tool Priority Matrix). All other conversions use local tools.
**Offline Mode**: Use --no-api flag or set CONVERT_DOCS_OFFLINE=true to disable all API calls. All conversions use local tools only.
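
As a minimal sketch of how the two modes could be resolved, the function below gives `--no-api` and `CONVERT_DOCS_OFFLINE=true` precedence over a configured key. The function name `resolve_mode` and its `api`/`offline` output values are assumptions for illustration, not part of the skill's scripts.

```bash
# Hypothetical helper: decide whether PDF→Markdown may call the Gemini API.
# Prints "api" or "offline". Pass the script's command-line arguments.
resolve_mode() {
  local arg
  for arg in "$@"; do
    if [ "$arg" = "--no-api" ]; then
      echo "offline"        # explicit flag always wins
      return 0
    fi
  done
  if [ "${CONVERT_DOCS_OFFLINE:-false}" = "true" ]; then
    echo "offline"          # environment override
  elif [ -n "${GEMINI_API_KEY:-}" ]; then
    echo "api"              # key present and nothing disables it
  else
    echo "offline"          # no key: local tools only
  fi
}
```

Calling `resolve_mode "$@"` once at startup keeps the decision in a single place, so later steps only need to compare against `api` or `offline`.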
### Conversion Directions
**TO Markdown** (text extraction from documents):
- DOCX → Markdown (MarkItDown or Pandoc)
- PDF → Markdown (Gemini API, PyMuPDF4LLM, or MarkItDown)
**FROM Markdown** (document generation):
- Markdown → DOCX (Pandoc)
- Markdown → PDF (Pandoc with Typst or XeLaTeX)
**Direct Conversion**:
- PDF → DOCX (pdf2docx - direct conversion preserves layout)
### Features
- Automatic tool detection and selection
- Cascading fallback mechanisms
- Batch processing support
- Image extraction and embedding
- Filename sanitization (spaces to underscores)
- Quality validation and reporting
- Concurrent conversion support
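
The filename sanitization listed above (spaces to underscores) might look roughly like this. Both function names are hypothetical, and the real scripts may normalize more characters than spaces.

```bash
# Hypothetical helper: replace spaces with underscores in a file name.
sanitize_filename() {
  printf '%s\n' "$1" | tr ' ' '_'
}

# Hypothetical helper: derive the output file name for a new extension.
# $1 = input path, $2 = target extension without the dot.
output_name() {
  local base
  base=$(basename "$1")
  sanitize_filename "${base%.*}.$2"
}
```

For example, `output_name "/docs/Q3 Sales Report.docx" md` prints `Q3_Sales_Report.md`.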
## Tool Priority Matrix
For each conversion direction the skill picks the highest-priority tool that is installed and falls back to the next one on failure:
### PDF → Markdown (Mode-Dependent)
**Gemini Mode** (when GEMINI_API_KEY is set):
1. **Gemini API** (primary) - 95%+ fidelity
- Vision-based understanding of layout
- Semantic structure preservation
- Code block language detection
- 60 req/min, 1000 req/day free tier
2. **PyMuPDF4LLM** (fallback) - 70-75% fidelity
3. **MarkItDown** (fallback) - 65-70% fidelity
**Offline Mode** (--no-api or no API key):
1. **PyMuPDF4LLM** (primary) - 70-75% fidelity
- Zero configuration required
- Perfect Unicode preservation
- Good for simple PDFs
2. **MarkItDown** (fallback) - 65-70% fidelity
- Consistent quality across document types
- Easy to configure
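
The PDF → Markdown ordering above can be sketched as a function that prints the usable tools in priority order. The names `have`, `have_py`, and `pdf_to_md_candidates` are illustrative assumptions; the availability checks mirror the ones listed under Phase 1.

```bash
# Hypothetical availability checks.
have()    { command -v "$1" >/dev/null 2>&1; }
have_py() { python3 -c "import $1" >/dev/null 2>&1; }

# Print candidate tools for PDF→Markdown, best first, one per line.
# $1 = "api" or "offline".
pdf_to_md_candidates() {
  local mode="$1"
  if [ "$mode" = "api" ] && [ -n "${GEMINI_API_KEY:-}" ]; then
    echo "gemini"           # primary in Gemini mode
  fi
  if have_py pymupdf4llm; then
    echo "pymupdf4llm"      # primary offline, first fallback otherwise
  fi
  if have markitdown; then
    echo "markitdown"       # last fallback
  fi
}
```

An empty result means no PDF → Markdown tool is available, which corresponds to the "Tool not available" error described later.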
### PDF → DOCX
1. **pdf2docx** (only option) - 80-85% fidelity
- Direct conversion (no intermediate format)
- Preserves images and layout better than Gemini → Pandoc
- Fast processing
### DOCX → Markdown
1. **MarkItDown** (primary) - 75-80% fidelity
- Perfect table preservation (pipe-style markdown)
- Excellent Unicode/emoji support
- Fast processing
2. **Pandoc** (fallback) - 68% fidelity
- Reliable baseline conversion
- Tables converted to grid format
### Markdown → DOCX
1. **Pandoc** (only option) - 95%+ quality preservation
### Markdown → PDF
1. **Pandoc + Typst** (primary) - Fast, modern PDF engine
2. **Pandoc + XeLaTeX** (fallback) - Traditional LaTeX engine
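
A minimal sketch of the Markdown → PDF engine choice, assuming a Pandoc release recent enough to accept `typst` as a value for `--pdf-engine` (recent 3.x releases do). `pick_pdf_engine` and `md_to_pdf` are hypothetical names, not functions from the skill's scripts.

```bash
# Hypothetical availability check.
have() { command -v "$1" >/dev/null 2>&1; }

# Print the preferred PDF engine, or fail if none is installed.
pick_pdf_engine() {
  if have typst; then
    echo "typst"
  elif have xelatex; then
    echo "xelatex"
  else
    return 1
  fi
}

# Convert one Markdown file to PDF. $1 = input .md, $2 = output .pdf.
md_to_pdf() {
  local engine
  if ! engine=$(pick_pdf_engine); then
    echo "Error: no PDF engine found (need typst or xelatex)" >&2
    return 1
  fi
  pandoc "$1" -o "$2" --pdf-engine="$engine"
}
```

Usage would be along the lines of `md_to_pdf report.md report.pdf`.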
## Usage Patterns
### Basic Conversion
```bash
# Load the conversion core library (path relative to the project root)
source .claude/lib/convert/convert-core.sh
# Detect which conversion tools are installed
detect_tools
# Convert DOCX/PDF to Markdown (direction is inferred from the input files)
main_conversion /path/to/documents /path/to/output
# Convert Markdown to DOCX (same entry point, Markdown input directory)
main_conversion /path/to/markdown /path/to/output
```
### Batch Processing
The conversion core automatically processes all files in the input directory:
- Discovers all convertible files (.docx, .pdf, or .md)
- Detects conversion direction automatically
- Processes files concurrently (default: 4 parallel conversions)
- Generates conversion.log with statistics
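
One simple way to cap parallelism is to start conversions in groups and wait for each group to finish. The sketch below does that with plain background jobs; it is a group-wise scheme rather than a true worker pool, and `convert_one` and `run_batch` are placeholder names, not the skill's actual functions.

```bash
# Placeholder per-file converter; a real one would call the selected tool.
convert_one() {
  echo "[PROGRESS] Converting: $1"
}

# Run convert_one on every argument, at most N at a time,
# where N is MAX_CONCURRENT_CONVERSIONS (4 if unset).
run_batch() {
  local max="${MAX_CONCURRENT_CONVERSIONS:-4}" running=0 f
  for f in "$@"; do
    convert_one "$f" &
    running=$((running + 1))
    if [ "$running" -ge "$max" ]; then
      wait                  # drain the current group before starting more
      running=0
    fi
  done
  wait                      # wait for the final, possibly partial, group
}
```

The trade-off is that one slow file holds up its whole group; a sliding-window scheduler would keep all slots busy at the cost of more bookkeeping.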
### Progress Streaming
The conversion script emits PROGRESS markers:
```
[PROGRESS] Converting: file1.docx → file1.md
[PROGRESS] Converting: file2.pdf → file2.md (2/10)
[SUCCESS] Converted file1.docx → file1.md
[FAILED] file3.pdf: Conversion timeout after 300s
```
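
Because each marker sits at the start of a line, the output is easy to tally with `grep`. The following sketch assumes the markers were captured to a file; `summarize_markers` is a hypothetical helper and its output format is illustrative.

```bash
# Count marker lines in a captured output file. $1 = path to the capture.
summarize_markers() {
  local log="$1" ok failed timeouts
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  ok=$(grep -c '^\[SUCCESS\]' "$log" || true)
  failed=$(grep -c '^\[FAILED\]' "$log" || true)
  timeouts=$(grep -c '^\[FAILED\].*Conversion timeout' "$log" || true)
  echo "succeeded=$ok failed=$failed timeouts=$timeouts"
}
```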
## Conversion Workflow
### Phase 1: Tool Detection
- Check for MarkItDown availability (`command -v markitdown`)
- Check for Pandoc availability (`command -v pandoc`)
- Check for PyMuPDF4LLM availability (`python3 -c "import pymupdf4llm"`)
- Check for PDF engines (Typst, XeLaTeX)
- Set availability flags for tool selection
### Phase 2: File Discovery
- Scan input directory for convertible files
- Detect conversion direction (TO_MARKDOWN or FROM_MARKDOWN)
- Reject mixed-direction input (a single run cannot combine TO_MARKDOWN and FROM_MARKDOWN files)
- Create output directory structure
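
Direction detection for this phase could be sketched as follows. `detect_direction` is a hypothetical name, and scanning only the top level of the input directory is an assumption; the real scripts may recurse.

```bash
# Print TO_MARKDOWN or FROM_MARKDOWN for an input directory ($1),
# or fail when the directory is empty or mixes both directions.
detect_direction() {
  local dir="$1" to_md from_md
  to_md=$(find "$dir" -maxdepth 1 -type f \( -iname '*.docx' -o -iname '*.pdf' \) | wc -l | tr -d ' ')
  from_md=$(find "$dir" -maxdepth 1 -type f -iname '*.md' | wc -l | tr -d ' ')
  if [ "$to_md" -gt 0 ] && [ "$from_md" -gt 0 ]; then
    echo "Error: cannot mix conversion directions in one run" >&2
    return 1
  elif [ "$to_md" -gt 0 ]; then
    echo "TO_MARKDOWN"
  elif [ "$from_md" -gt 0 ]; then
    echo "FROM_MARKDOWN"
  else
    echo "Error: no convertible files found in $dir" >&2
    return 1
  fi
}
```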
### Phase 3: Conversion Execution
- Process files using optimal tool based on priority matrix
- Apply timeout limits (60s DOCX, 300s PDF, 120s MD→PDF)
- Handle collisions (overwrite or skip existing files)
- Extract/embed images appropriately
- Retry with fallback tools on failure
### Phase 4: Validation
- Verify output file exists
- Check for broken image links
- Validate document structure (headings present)
- Report any quality issues
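
The checks in this phase might be approximated as below for Markdown output. `validate_markdown` is a hypothetical helper; it only recognizes inline `![alt](target)` image links, skips remote URLs, and assumes relative targets resolve against the Markdown file's directory.

```bash
# Validate one Markdown output file ($1). Prints warnings, returns 1 if missing.
validate_markdown() {
  local md="$1" dir target path
  if [ ! -f "$md" ]; then
    echo "[FAILED] $md: output file missing"
    return 1
  fi
  # Structure check: at least one ATX heading ("# " through "###### ").
  if ! grep -q '^#\{1,6\} ' "$md"; then
    echo "[WARNING] $md: No headings found (possible conversion issue)"
  fi
  dir=$(dirname "$md")
  # Pull the target out of every ![alt](target) link, one per line.
  grep -o '!\[[^]]*]([^)]*)' "$md" | sed 's/^!\[[^]]*](\([^) ]*\).*/\1/' |
  while IFS= read -r target; do
    case "$target" in
      http://*|https://*) continue ;;     # remote images are not checked
      /*) path="$target" ;;
      *)  path="$dir/$target" ;;
    esac
    [ -e "$path" ] || echo "[WARNING] $md: broken image link: $target"
  done
  return 0
}
```

Warnings are printed rather than returned as errors, matching the rule that validation issues are reported but do not block the workflow.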
### Phase 5: Reporting
- Generate conversion.log with statistics
- Report success/failure counts by format
- List timeout occurrences
- Summarize validation issues
## Quality Considerations
### Fidelity Expectations
- DOCX→Markdown: 75-80% with MarkItDown (best tables)
- PDF→Markdown: 95%+ with Gemini, 70-75% with PyMuPDF4LLM, 65-70% with MarkItDown; results vary with PDF complexity, and scan quality is critical
- Markdown→DOCX: 95%+ with Pandoc (excellent preservation)
- Markdown→PDF: High quality with Typst/XeLaTeX engines
### Known Limitations
- **Scanned PDFs**: OCR quality depends on scan resolution
- **Complex layouts**: Multi-column or nested tables may degrade
- **Embedded fonts**: PDF fonts may affect text extraction accuracy
- **Images**: Large images may cause timeout issues
### Best Practices
- Prefer MarkItDown for DOCX, and Gemini or PyMuPDF4LLM for PDF, in line with the priority matrix above
- Allow longer timeouts for large PDFs (300s default)
- Review conversion.log for failure patterns
- Test output files for critical conversions
## Error Handling
### Common Errors
**Tool not available**:
```
Error: No conversion tool available for DOCX→Markdown
Required: markitdown or pandoc
```
→ Install required tools
**Conversion timeout**:
```
[FAILED] large_document.pdf: Conversion timeout after 300s
```
→ Increase TIMEOUT_PDF_TO_MD (or raise TIMEOUT_MULTIPLIER), or use a simpler PDF
**Validation failure**:
```
[WARNING] output.md: No headings found (possible conversion issue)
```
→ Check source document structure
### Recovery Strategies
- Failed conversions automatically retry with fallback tool
- Timeouts skip to next file (batch processing continues)
- Validation warnings don't block workflow (reported only)
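
The retry-with-fallback behaviour can be illustrated with a loop over candidate tools. This is a sketch under stated assumptions: `run_tool` and `convert_with_fallback` are hypothetical names, the `[WARNING]` line is illustrative rather than a documented log format, and the `markitdown` and `pandoc` invocations are basic CLI forms rather than the exact options the skill's scripts use.

```bash
# Hypothetical dispatcher: run one named tool. $1 = tool, $2 = input, $3 = output.
run_tool() {
  case "$1" in
    markitdown) markitdown "$2" > "$3" ;;
    pandoc)     pandoc "$2" -t gfm -o "$3" ;;
    *)          return 127 ;;
  esac
}

# Try each tool in order until one succeeds.
# $1 = input, $2 = output, remaining args = tools, best first.
convert_with_fallback() {
  local input="$1" output="$2" tool
  shift 2
  for tool in "$@"; do
    if run_tool "$tool" "$input" "$output"; then
      echo "[SUCCESS] Converted $input → $output"
      return 0
    fi
    echo "[WARNING] $tool failed for $input, trying next tool" >&2
  done
  echo "[FAILED] $input: all tools failed"
  return 1
}
```

A batch driver would ignore the non-zero return and move on to the next file, so one failure does not stop the run.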
## Configuration Options
Environment variables to tune conversion behavior:
```bash
# Timeout multiplier (scales the per-format timeouts, which are in seconds)
TIMEOUT_MULTIPLIER=1.5 # Increase all timeouts by 50%
# Disk usage limits
MAX_DISK_USAGE_GB=10 # Abort if output exceeds 10GB
MIN_FREE_SPACE_MB=500 # Require 500MB free space
# Concurrency
MAX_CONCURRENT_CONVERSIONS=4 # Parallel conversion limit
```
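
To make the effect of these variables concrete, here is a rough sketch of how the multiplier and the free-space floor might be applied. The base timeouts come from Phase 3; the function names, the format keys, and the fallback values used when a variable is unset are assumptions.

```bash
# Print the effective timeout in whole seconds for a format key ($1).
effective_timeout() {
  local base
  case "$1" in
    docx) base=60 ;;        # DOCX → Markdown
    pdf)  base=300 ;;       # PDF → Markdown
    md)   base=120 ;;       # Markdown → PDF
    *)    echo "unknown format: $1" >&2; return 1 ;;
  esac
  # awk handles the fractional multiplier; +0.5 rounds to the nearest second.
  awk -v b="$base" -v m="${TIMEOUT_MULTIPLIER:-1}" 'BEGIN { printf "%d\n", b * m + 0.5 }'
}

# Succeed if the filesystem holding $1 has at least MIN_FREE_SPACE_MB free.
has_free_space() {
  local free_mb
  free_mb=$(df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }')
  [ "$free_mb" -ge "${MIN_FREE_SPACE_MB:-500}" ]
}
```

With `TIMEOUT_MULTIPLIER=1.5`, `effective_timeout pdf` prints `450`. On systems with GNU coreutils the result could be passed to `timeout`, for example `timeout "$(effective_timeout pdf)" markitdown report.pdf > report.md`.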
## Integration Examples
### From Claude Code Agents
When working within agent contexts, the skill automatically triggers when Claude detects conversion needs:
```markdown
User: "Extract text from these PDF reports"
→ Skill auto-invokes: document-converter
→ Converts PDFs to Markdown
→ Returns structured text
```
### From Slash Commands
The `/convert-docs` command delegates to this skill when available:
```bash
/convert-docs ./documents ./output
# → Checks skill availability
# → Delegates to document-converter skill
# → Falls back to script mode if skill unavailable
```
### From Other Skills
Skills can compose with document-converter:
```yaml
# research-specialist skill
dependencies:
- document-converter # Auto-loads for PDF analysis
```
## Script Locations
The skill relies on conversion scripts in the project:
- **Core orchestration**: `.claude/lib/convert/convert-core.sh`
- **DOCX functions**: `.claude/lib/convert/convert-docx.sh`
- **PDF functions**: `.claude/lib/convert/convert-pdf.sh`
- **Gemini wrapper**: `.claude/lib/convert/convert-gemini.sh`
- **Gemini Python**: `.claude/lib/convert/convert_gemini.py`
- **Markdown utilities**: `.claude/lib/convert/convert-markdown.sh`
Scripts are symlinked in the skill's `scripts/` directory for easy access.
## Testing
Test the skill with sample conversions:
```bash
# Test DOCX→Markdown
/convert-docs ./test/sample.docx ./output
# Test PDF→Markdown
/convert-docs ./test/report.pdf ./output
# Test Markdown→DOCX
/convert-docs ./test/document.md ./output
# Batch test
/convert-docs ./test/documents ./output
```
Verify:
- conversion.log generated with statistics
- Output files created with correct extensions
- Image directories created when needed
- Quality meets expectations (check tables, formatting)
## Troubleshooting
### Skill Not Triggering
If the skill doesn't auto-invoke when expected:
- Check description includes trigger keywords (convert, document, PDF, DOCX, Markdown)
- Test with explicit skill invocation: "Use document-converter skill"
- Verify skill is in `.claude/skills/` directory (project-level)
### Tool Installation
**MarkItDown** (recommended):
```bash
pip install markitdown
```
**PyMuPDF4LLM** (optional):
```bash
pip install pymupdf4llm
```
**pdf2docx** (optional, for PDF to DOCX):
```bash
pip install pdf2docx
```
**google-genai** (optional, for Gemini API):
```bash
pip install google-genai
# Set API key (free tier available at https://aistudio.google.com/)
export GEMINI_API_KEY="your-api-key"
```
**Pandoc** (required):
```bash
# Ubuntu/Debian
sudo apt install pandoc
# macOS
brew install pandoc
```
**Typst** (optional, for MD→PDF):
```bash
# Linux (x86_64 release binary)
wget https://github.com/typst/typst/releases/latest/download/typst-x86_64-unknown-linux-musl.tar.xz
tar -xf typst-*.tar.xz && sudo mv typst-*/typst /usr/local/bin/
# macOS
brew install typst
```
### Performance Issues
If conversions are slow:
- Reduce MAX_CONCURRENT_CONVERSIONS if parallel jobs are contending for CPU, memory, or disk
- Increase timeout values for large files
- Check disk I/O (slow storage may bottleneck)
- Use PyMuPDF4LLM for simple PDFs (faster than MarkItDown)
## See Also
- [reference.md](./reference.md) - Detailed tool documentation and metrics
- [examples.md](./examples.md) - Usage examples and common patterns
- [Convert-Docs Command Guide](../../docs/guides/commands/convert-docs-command-guide.md)
- [MarkItDown Documentation](https://github.com/microsoft/markitdown)
- [Pandoc Manual](https://pandoc.org/MANUAL.html)