apm::install
$apm install @kreuzberg-dev/visitor-pattern-usage
apm::skill.md
---
name: visitor-pattern-usage
description: "Instructions for visitor pattern usage."
---

# Visitor Pattern Usage for html-to-markdown

## Overview

The visitor pattern in html-to-markdown provides extensible hooks into the HTML-to-Markdown conversion pipeline. Custom visitors can inspect, modify, or replace the default conversion behavior for any of the 60+ HTML element types.

## Architecture Philosophy

**Key Principles:**
- **Flexibility over performance**: Give users full control over conversion
- **Zero-cost when unused**: No overhead if visitor feature disabled
- **Comprehensive coverage**: All HTML element types have dedicated visitor methods
- **Pre/post hooks**: Both entry and exit points exposed for elements

## Feature Gates

The visitor pattern is conditionally compiled:

```rust
#[cfg(feature = "visitor")]
pub mod visitor;

#[cfg(feature = "async-visitor")]
pub use visitor_helpers::AsyncVisitorHandle;
```

**In Cargo.toml:**
```toml
[features]
default = ["metadata"]
visitor = []
async-visitor = ["visitor", "dep:async-trait"]
```

## Core Traits and Types

### NodeType Enumeration

`NodeType`, defined in `/crates/html-to-markdown/src/visitor.rs`, categorizes all HTML elements:

```rust
pub enum NodeType {
    // Text content
    Text,

    // Block elements
    Heading,
    Paragraph,
    Div,
    Blockquote,
    Pre,
    Hr,

    // Lists
    List,               // ul, ol
    ListItem,           // li
    DefinitionList,     // dl
    DefinitionTerm,     // dt
    DefinitionDescription, // dd

    // Tables
    Table,
    TableRow,
    TableCell,
    TableHeader,
    TableBody,
    TableHead,
    TableFoot,

    // Inline formatting
    Link,
    Image,
    Strong,
    Em,
    Code,
    Strikethrough,
    Mark,
    Sub,
    Sup,
    LineBreak,
    Ruby,

    // Semantic HTML5
    Article,
    Section,
    Nav,
    Aside,
    Header,
    Footer,
    Main,

    // Media
    Audio,
    Video,
    Picture,
    Iframe,
    Svg,

    // Forms
    Input,
    Select,
    Button,
    Textarea,
    Fieldset,

    // Other
    Form,
    Label,
    Span,
    Generic(String),  // Unknown tags
}
```

### VisitResult Enumeration

Specifies what action the conversion should take:

```rust
pub enum VisitResult {
    /// Use default conversion for this element
    Default,

    /// Skip this element entirely (no output)
    Skip,

    /// Custom markdown for this element
    Custom(String),

    // A `Wrap` variant (process children normally, then wrap them with custom
    // before/after text) could also be supported; it is not defined here.

    /// Replace element content with custom markdown
    Replace(String),
}
```
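
A minimal sketch of how a conversion step might act on each result. The `VisitResult` below is a local stand-in mirroring the variants above, and `resolve` is a hypothetical helper, not part of the crate. `Replace` is left out because the text above does not pin down how it differs from `Custom`.

```rust
/// Local stand-in for the enum described above (assumption: `Replace` omitted).
#[derive(Debug, Clone, PartialEq)]
enum VisitResult {
    Default,
    Skip,
    Custom(String),
}

/// Hypothetical helper: decide what to emit for one element.
/// `default_markdown` is what the converter would have produced on its own.
fn resolve(result: VisitResult, default_markdown: &str) -> Option<String> {
    match result {
        // Keep the built-in output untouched
        VisitResult::Default => Some(default_markdown.to_string()),
        // Emit nothing for this element
        VisitResult::Skip => None,
        // Emit the visitor's markdown instead of the built-in output
        VisitResult::Custom(markdown) => Some(markdown),
    }
}
```

Returning `Option<String>` keeps "no output" distinct from "empty output", which matters if the caller joins fragments with separators.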

### NodeContext Structure

Provides context about the current node being visited:

```rust
pub struct NodeContext {
    pub node_type: NodeType,
    pub tag_name: Option<String>,        // Actual HTML tag if element
    pub attributes: BTreeMap<String, String>,  // All HTML attributes
    pub parent_node_type: Option<NodeType>,    // Parent element type
    pub depth: usize,                    // Nesting depth
    pub position_in_parent: usize,       // Sibling index
}
```
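
A short sketch of reading a context inside a visitor method. The types are local stand-ins trimmed from the definitions above, and `has_class`, `in_blockquote`, and `sample_context` are hypothetical helpers for illustration only.

```rust
use std::collections::BTreeMap;

/// Local stand-ins mirroring the definitions above (only a few node types).
#[derive(Debug, Clone, PartialEq)]
enum NodeType {
    Paragraph,
    Blockquote,
    Link,
}

#[derive(Debug, Clone)]
struct NodeContext {
    node_type: NodeType,
    tag_name: Option<String>,
    attributes: BTreeMap<String, String>,
    parent_node_type: Option<NodeType>,
    depth: usize,
    position_in_parent: usize,
}

/// Hypothetical helper: true if the space-separated `class` attribute contains `name`.
fn has_class(ctx: &NodeContext, name: &str) -> bool {
    ctx.attributes
        .get("class")
        .is_some_and(|classes| classes.split_whitespace().any(|c| c == name))
}

/// Hypothetical helper: true if the node is a direct child of a blockquote.
fn in_blockquote(ctx: &NodeContext) -> bool {
    matches!(ctx.parent_node_type, Some(NodeType::Blockquote))
}

/// Builds an example context for `<a class="external nofollow">` inside a blockquote.
fn sample_context() -> NodeContext {
    let mut attributes = BTreeMap::new();
    attributes.insert("class".to_string(), "external nofollow".to_string());
    NodeContext {
        node_type: NodeType::Link,
        tag_name: Some("a".to_string()),
        attributes,
        parent_node_type: Some(NodeType::Blockquote),
        depth: 2,
        position_in_parent: 0,
    }
}
```

Matching whole class tokens rather than substrings avoids treating `extern` as a match for `external`.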

## HtmlVisitor Trait

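The main visitor trait, with one method per element type. Only signatures are shown here; the examples that follow override a subset of methods and rely on the rest falling back to default implementations that return `VisitResult::Default`: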

```rust
pub trait HtmlVisitor {
    // Generic element fallback
    fn visit_element(
        &mut self,
        ctx: &NodeContext,
        tag: &str,
        attributes: &BTreeMap<String, String>,
    ) -> VisitResult;

    // Text content
    fn visit_text(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;

    // Headings
    fn visit_heading(
        &mut self,
        ctx: &NodeContext,
        level: u8,  // 1-6
        text: &str,
    ) -> VisitResult;

    fn visit_paragraph(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;

    // Links and images
    fn visit_link(
        &mut self,
        ctx: &NodeContext,
        href: &str,
        text: &str,
        title: Option<&str>,
    ) -> VisitResult;

    fn visit_image(
        &mut self,
        ctx: &NodeContext,
        src: &str,
        alt: &str,
        title: Option<&str>,
    ) -> VisitResult;

    // Formatting
    fn visit_strong(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;
    fn visit_em(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;
    fn visit_code(&mut self, ctx: &NodeContext, code: &str) -> VisitResult;
    fn visit_code_block(
        &mut self,
        ctx: &NodeContext,
        code: &str,
        language: Option<&str>,
    ) -> VisitResult;
    fn visit_strikethrough(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;

    // Lists
    fn visit_list(
        &mut self,
        ctx: &NodeContext,
        ordered: bool,
        items: &[String],
    ) -> VisitResult;

    fn visit_list_item(
        &mut self,
        ctx: &NodeContext,
        content: &str,
        index: usize,
    ) -> VisitResult;

    // Tables
    fn visit_table(
        &mut self,
        ctx: &NodeContext,
        rows: &[Vec<String>],
    ) -> VisitResult;

    fn visit_table_cell(
        &mut self,
        ctx: &NodeContext,
        content: &str,
        is_header: bool,
    ) -> VisitResult;

    // ... and 40+ more visitor methods
}
```
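
A sketch of how default method bodies let an implementor override only what it needs. This is a two-method stand-in written for illustration, not the crate's actual trait definition, and the `ctx` argument is dropped for brevity.

```rust
/// Local stand-ins (assumptions): a trimmed result enum and a two-method trait.
#[derive(Debug, Clone, PartialEq)]
enum VisitResult {
    Default,
    Custom(String),
}

trait HtmlVisitor {
    /// Default body: defer to the built-in conversion.
    fn visit_text(&mut self, _text: &str) -> VisitResult {
        VisitResult::Default
    }

    /// Default body: defer to the built-in conversion.
    fn visit_link(&mut self, _href: &str, _text: &str) -> VisitResult {
        VisitResult::Default
    }
}

/// Overrides only `visit_link`; `visit_text` falls back to the default body.
struct PlainTextLinks;

impl HtmlVisitor for PlainTextLinks {
    fn visit_link(&mut self, href: &str, text: &str) -> VisitResult {
        VisitResult::Custom(format!("{text} ({href})"))
    }
}
```

With this shape, adding a new method to the trait does not break existing implementors, since they inherit the default body.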

## Basic Example: Custom Link Converter

Convert all links to plain text with the URL in parentheses:

```rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};

#[derive(Debug)]
struct PlainTextLinkVisitor;

impl HtmlVisitor for PlainTextLinkVisitor {
    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        // Convert all links to plain text with URL
        VisitResult::Custom(format!("{} ({})", text, href))
    }

    // ... implement other visitor methods as Default
}

// Usage
let html = r#"<p>Visit <a href="https://example.com">our site</a></p>"#;
let mut visitor = PlainTextLinkVisitor;
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;
// Output: Visit our site (https://example.com)
```

## Advanced Example: Custom Code Block Highlighter

Highlight code blocks with language-specific syntax:

```rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};

#[derive(Debug)]
struct HighlightingVisitor;

impl HtmlVisitor for HighlightingVisitor {
    fn visit_code_block(
        &mut self,
        _ctx: &NodeContext,
        code: &str,
        language: Option<&str>,
    ) -> VisitResult {
        match language {
            Some("python") => {
                // Custom Python highlighting
                VisitResult::Custom(format!(
                    "```python\n<!-- HIGHLIGHTED -->\n{}\n```",
                    code
                ))
            }
            Some("rust") => {
                // Custom Rust highlighting
                VisitResult::Custom(format!(
                    "```rust\n<!-- WITH SYNTAX HIGHLIGHTING -->\n{}\n```",
                    code
                ))
            }
            _ => VisitResult::Default,  // Use default for other languages
        }
    }

    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        // Links in documentation: add reference-style syntax
        VisitResult::Custom(format!("[{}][{}]", text, href))
    }

    fn visit_heading(
        &mut self,
        _ctx: &NodeContext,
        level: u8,
        text: &str,
    ) -> VisitResult {
        // Keep the heading text and append an anchor id derived from it
        let id = text.to_lowercase().replace(' ', "-");
        VisitResult::Custom(format!(
            "{} {} {{#{}}}\n",
            "#".repeat(level as usize),
            text,
            id
        ))
    }
}

// Usage
let html = r#"
<h1>Documentation</h1>
<p>See <a href="https://docs.rs">our docs</a></p>
<pre><code class="language-rust">fn main() {}</code></pre>
"#;
let mut visitor = HighlightingVisitor;
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;
```
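
Heading text often contains punctuation that a plain space-to-hyphen replacement leaves in the anchor id. The sketch below shows one way to derive a cleaner id; `slugify` and `heading_with_anchor` are hypothetical helpers, and the `{#id}` suffix is a Pandoc/kramdown-style extension rather than CommonMark, so whether it renders depends on the target Markdown processor.

```rust
/// Hypothetical helper: turn heading text into an anchor id.
/// Keeps ASCII letters and digits (lowercased) and collapses every other
/// run of characters into a single hyphen, with no leading or trailing hyphen.
fn slugify(text: &str) -> String {
    let mut slug = String::with_capacity(text.len());
    let mut pending_hyphen = false;
    for ch in text.chars() {
        if ch.is_ascii_alphanumeric() {
            // Emit the deferred hyphen only between two kept characters
            if pending_hyphen && !slug.is_empty() {
                slug.push('-');
            }
            pending_hyphen = false;
            slug.push(ch.to_ascii_lowercase());
        } else {
            pending_hyphen = true;
        }
    }
    slug
}

/// Formats an ATX heading followed by a `{#id}` anchor suffix.
fn heading_with_anchor(level: u8, text: &str) -> String {
    format!(
        "{} {} {{#{}}}\n",
        "#".repeat(level as usize),
        text,
        slugify(text)
    )
}
```

Inside `visit_heading`, the body could then reduce to `VisitResult::Custom(heading_with_anchor(level, text))`.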

## Filtering by Node Type

Visit only specific element types:

```rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};

#[derive(Debug)]
struct ImageOnlyVisitor {
    image_count: usize,
}

impl HtmlVisitor for ImageOnlyVisitor {
    fn visit_image(
        &mut self,
        _ctx: &NodeContext,
        src: &str,
        alt: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        self.image_count += 1;
        println!("Image {}: {} ({})", self.image_count, alt, src);

        // Could extract images to separate directory
        VisitResult::Custom(format!("![{}]({})", alt, src))
    }

    fn visit_text(&mut self, _ctx: &NodeContext, _text: &str) -> VisitResult {
        VisitResult::Skip  // Skip all text, only output images
    }
}

// Usage
let mut visitor = ImageOnlyVisitor { image_count: 0 };
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;
println!("Found {} images", visitor.image_count);
```

## Context-Aware Transformations

Use parent context and depth to transform based on structure:

```rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult, NodeType};

#[derive(Debug)]
struct DepthTrackingVisitor {
    current_depth: usize,
}

impl HtmlVisitor for DepthTrackingVisitor {
    fn visit_paragraph(
        &mut self,
        ctx: &NodeContext,
        text: &str,
    ) -> VisitResult {
        // Different formatting based on depth
        match ctx.depth {
            0 => VisitResult::Custom(format!("**{}**\n", text)),  // Bold at top level
            1 => VisitResult::Custom(format!("*{}*\n", text)),    // Italic nested once
            _ => VisitResult::Default,                            // Normal elsewhere
        }
    }

    fn visit_link(
        &mut self,
        ctx: &NodeContext,
        href: &str,
        text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        // Links in blockquotes get footnote style
        if let Some(NodeType::Blockquote) = ctx.parent_node_type {
            VisitResult::Custom(format!("{}[^{}]", text, href))
        } else {
            VisitResult::Default
        }
    }
}

// Usage
let mut visitor = DepthTrackingVisitor { current_depth: 0 };
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;
```
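
A footnote reference such as `text[^label]` only renders if a matching definition appears somewhere in the output. One way to handle that, sketched under the assumption that the caller appends the definitions after conversion, is to number the footnotes and keep their targets in visitor state. `FootnoteCollector` is a hypothetical type, not part of the crate.

```rust
/// Hypothetical collector: numbers footnotes and remembers their targets.
#[derive(Debug, Default)]
struct FootnoteCollector {
    targets: Vec<String>,
}

impl FootnoteCollector {
    /// Returns the inline form, e.g. `text[^1]`, and records `href` for later.
    fn reference(&mut self, text: &str, href: &str) -> String {
        self.targets.push(href.to_string());
        format!("{text}[^{}]", self.targets.len())
    }

    /// Renders the definitions to append after the converted document.
    fn definitions(&self) -> String {
        self.targets
            .iter()
            .enumerate()
            .map(|(i, href)| format!("[^{}]: {}\n", i + 1, href))
            .collect()
    }
}
```

A visitor holding this collector as a field could return `VisitResult::Custom(self.notes.reference(text, href))` for links inside blockquotes, and the caller would append `definitions()` to the converted markdown.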

## Attribute-Based Routing

Route handling based on HTML attributes:

```rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};

#[derive(Debug)]
struct AttributeRoutingVisitor;

impl HtmlVisitor for AttributeRoutingVisitor {
    fn visit_link(
        &mut self,
        ctx: &NodeContext,
        href: &str,
        text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        // Custom handling for data attributes
        if let Some(tracking_id) = ctx.attributes.get("data-tracking-id") {
            return VisitResult::Custom(format!(
                "[{}]({} \"{}\")",
                text,
                href,
                tracking_id
            ));
        }

        // Skip links marked with data-skip="true"
        if ctx.attributes.get("data-skip").map_or(false, |v| v == "true") {
            return VisitResult::Skip;
        }

        VisitResult::Default
    }

    fn visit_paragraph(
        &mut self,
        ctx: &NodeContext,
        text: &str,
    ) -> VisitResult {
        // Render paragraphs marked data-featured="true" as blockquotes
        if ctx.attributes.get("data-featured").map(String::as_str) == Some("true") {
            VisitResult::Custom(format!("> {}\n", text))
        } else {
            VisitResult::Default
        }
    }
}

// Usage
let html = r#"
<a href="/page" data-tracking-id="click-001">Track me</a>
<a href="/skip" data-skip="true">Skip me</a>
<p data-featured="true">Important paragraph</p>
"#;
let mut visitor = AttributeRoutingVisitor;
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;
```

## Async Visitor Support

For bindings to languages with native async/await (Python, TypeScript, Elixir), an async entry point is available behind the `async-visitor` feature:

```rust
#[cfg(feature = "async-visitor")]
pub async fn convert_with_async_visitor(
    html: &str,
    options: Option<ConversionOptions>,
    visitor: Option<AsyncVisitorHandle>,
) -> Result<String> { ... }
```

### Python Async Example (PyO3)

```python
import asyncio
import html_to_markdown

class AsyncSyntaxHighlighter:
    async def visit_code_block(self, ctx, code, language):
        # Call async syntax highlighting service
        highlighted = await highlight_service.highlight(code, language)
        # language may be None; emit a bare fence in that case
        return f"```{language or ''}\n{highlighted}\n```"

    async def visit_link(self, ctx, href, text, title):
        # Check external link status asynchronously
        is_valid = await check_link_validity(href)
        if is_valid:
            return f"[{text}]({href})"
        else:
            return f"~~[{text}]({href})~~ (broken)"

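# Usage: await is only valid inside a coroutine, so wrap the call
async def main():
    return await html_to_markdown.convert_with_async_visitor(
        html,
        None,
        AsyncSyntaxHighlighter()
    )

markdown = asyncio.run(main())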
```

### TypeScript Async Example (NAPI-RS)

```typescript
import { convertWithAsyncVisitor } from 'html-to-markdown';

class AsyncContentProcessor {
    async visitLink(ctx, href, text, title) {
        // Fetch metadata for link
        const metadata = await fetch(href).then(r => r.json());
        return `[${text}](${href} "${metadata.title}")`;
    }

    async visitImage(ctx, src, alt, title) {
        // Optimize image
        const optimized = await imageOptimizer.optimize(src);
        return `![${alt}](${optimized})`;
    }
}

const markdown = await convertWithAsyncVisitor(html, undefined, new AsyncContentProcessor());
```

## State Management in Visitors

Maintain state across multiple visits:

```rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};
use std::collections::HashSet;

#[derive(Debug)]
struct LinkCollectorVisitor {
    external_links: HashSet<String>,
    email_links: HashSet<String>,
    internal_links: HashSet<String>,
}

impl HtmlVisitor for LinkCollectorVisitor {
    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        _text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        if href.starts_with("mailto:") {
            self.email_links.insert(href.to_string());
        } else if href.starts_with("http://") || href.starts_with("https://") {
            self.external_links.insert(href.to_string());
        } else {
            self.internal_links.insert(href.to_string());
        }

        VisitResult::Default  // Keep default link formatting
    }
}

// Usage
let mut visitor = LinkCollectorVisitor {
    external_links: HashSet::new(),
    email_links: HashSet::new(),
    internal_links: HashSet::new(),
};

let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;

println!("External: {:?}", visitor.external_links);
println!("Email: {:?}", visitor.email_links);
println!("Internal: {:?}", visitor.internal_links);
```
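
Prefix checks on `href` are easy to get subtly wrong: scheme case varies, protocol-relative URLs start with `//`, and a relative path can happen to begin with `http`. The sketch below isolates the classification in a hypothetical `classify_link` helper so it can be tested on its own; the rule that `//` counts as external is an assumption.

```rust
/// Hypothetical classification used to decide which set a link belongs in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinkKind {
    Email,
    External,
    Internal,
}

/// Classifies an href by its scheme prefix, ignoring case and surrounding whitespace.
fn classify_link(href: &str) -> LinkKind {
    let lower = href.trim().to_ascii_lowercase();
    if lower.starts_with("mailto:") {
        LinkKind::Email
    } else if lower.starts_with("http://")
        || lower.starts_with("https://")
        || lower.starts_with("//") // protocol-relative URL (assumed external)
    {
        LinkKind::External
    } else {
        // Relative paths, fragments, and any other scheme
        LinkKind::Internal
    }
}
```

Inside `visit_link`, a `match classify_link(href)` could then pick the set to insert into.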

## Performance Considerations

### Impact on Conversion Speed

- Each visitor callback adds function-dispatch overhead per visited node
- For large documents (10K+ elements), expect roughly 5-10% overhead
- Disable the `visitor` feature to eliminate the overhead entirely when it is not needed

### Optimization Strategies

1. **Fast path for most elements:**
   ```rust
   fn visit_text(&mut self, _ctx: &NodeContext, _text: &str) -> VisitResult {
       VisitResult::Default  // Quick return for most text nodes
   }
   ```

2. **Only override when needed:**
   ```rust
   // Only override link handling
   // All other methods inherit Default implementation
   ```

3. **Minimize allocation work in the hot path:**
   ```rust
   // Costly: format! runs the formatting machinery and may reallocate on every node
   VisitResult::Custom(format!(">{}<", text))

   // Better: allocate once at the final size (or use Cow to borrow when nothing changes)
   let mut result = String::with_capacity(text.len() + 2);
   result.push('>');
   result.push_str(text);
   result.push('<');
   VisitResult::Custom(result)
   ```
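
The `Cow` option mentioned above can be sketched as follows: borrow the input when no change is needed and allocate only when the text actually has to be rewritten. `escape_angle_brackets` is a hypothetical helper chosen for illustration.

```rust
use std::borrow::Cow;

/// Hypothetical helper: escape `<` and `>` only when the text contains them.
/// Borrows the input unchanged otherwise, so the common case allocates nothing.
fn escape_angle_brackets(text: &str) -> Cow<'_, str> {
    if !text.contains(|c: char| c == '<' || c == '>') {
        return Cow::Borrowed(text);
    }
    // Slow path: copy the text, replacing the two characters as we go
    let mut escaped = String::with_capacity(text.len() + 8);
    for ch in text.chars() {
        match ch {
            '<' => escaped.push_str("&lt;"),
            '>' => escaped.push_str("&gt;"),
            other => escaped.push(other),
        }
    }
    Cow::Owned(escaped)
}
```

A visitor can check for `Cow::Borrowed` and return `VisitResult::Default` in that case, building a `Custom` string only for the nodes that changed.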

## Integration with Conversion Options

Visitors work alongside `ConversionOptions`:

```rust
use html_to_markdown_rs::{ConversionOptions, HeadingStyle};

let options = ConversionOptions {
    heading_style: HeadingStyle::AtxClosed,  // User preference
    wrap: true,
    wrap_width: 80,
    ..Default::default()
};

// Visitor can override specific behaviors
let mut visitor = CustomVisitor;
let markdown = convert_with_visitor(html, Some(options), Some(&mut visitor))?;
```

**Priority:** The visitor always takes precedence. If a visitor method returns `Custom` or `Skip`, the conversion options are bypassed for that element.

## Error Handling in Visitors

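Visitor methods return a `VisitResult` rather than a `Result`, so they cannot propagate errors directly. Validate inside the method and fall back to `Default`, `Skip`, or a safe `Custom` value instead: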

```rust
impl HtmlVisitor for SafeVisitor {
    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        // Can't return error, so validate and fallback
        if href.is_empty() {
            return VisitResult::Custom(text.to_string());  // Fallback to text
        }

        VisitResult::Default
    }
}
```
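
When the caller still needs to know that something went wrong, one option is to record problems in visitor state and inspect them after conversion. The sketch below uses a trimmed stand-in for `VisitResult` and a hypothetical `ValidatingVisitor`; the `ctx` and `title` arguments are dropped for brevity.

```rust
/// Local stand-in for the result enum (trimmed to what this sketch needs).
#[derive(Debug, Clone, PartialEq)]
enum VisitResult {
    Default,
    Custom(String),
}

/// Hypothetical visitor that records problems instead of returning them.
#[derive(Debug, Default)]
struct ValidatingVisitor {
    problems: Vec<String>,
}

impl ValidatingVisitor {
    /// Same role as `visit_link` above: fall back to plain text on a bad href,
    /// but remember why.
    fn visit_link(&mut self, href: &str, text: &str) -> VisitResult {
        if href.trim().is_empty() {
            self.problems
                .push(format!("link \"{text}\" has an empty href"));
            return VisitResult::Custom(text.to_string());
        }
        VisitResult::Default
    }

    /// Call after conversion to turn the recorded problems into a `Result`.
    fn finish(self) -> Result<(), Vec<String>> {
        if self.problems.is_empty() {
            Ok(())
        } else {
            Err(self.problems)
        }
    }
}
```

The conversion itself still completes with fallback output; the caller decides afterwards whether the recorded problems should fail the operation.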

## Testing Visitors

Visitor tests live in the Rust crate and in the binding test suites (Python, TypeScript, Ruby, PHP):

```bash
# Test visitor feature
task rust:test  # Includes visitor tests

# Binding-specific visitor tests
task python:test  # tests/test_visitor.py
task typescript:test  # packages/typescript/tests/visitor.spec.ts
task ruby:test  # packages/ruby/spec/visitor_spec.rb
```

## Implementation Location

**Core Files:**
- `/crates/html-to-markdown/src/visitor.rs` - Trait definitions and NodeType enum
- `/crates/html-to-markdown/src/visitor_helpers.rs` - VisitorHandle and async support
- `/crates/html-to-markdown/src/converter.rs` - Integration with conversion pipeline

**Binding Examples:**
- `/crates/html-to-markdown-py/src/lib.rs` - PyO3 visitor wrapping
- `/crates/html-to-markdown-node/src/lib.rs` - NAPI-RS visitor support
- `/packages/ruby/lib/visitor.rb` - Ruby visitor interface
- `/packages/php/src/Visitor.php` - PHP visitor base class

## API Pattern

```rust
// Simple visitor (sync)
pub fn convert_with_visitor(
    html: &str,
    options: Option<ConversionOptions>,
    visitor: Option<visitor::VisitorHandle>,
) -> Result<String>

// Async visitor (for languages with native async)
#[cfg(feature = "async-visitor")]
pub async fn convert_with_async_visitor(
    html: &str,
    options: Option<ConversionOptions>,
    visitor: Option<AsyncVisitorHandle>,
) -> Result<String>

// Combined with metadata (future enhancement)
// pub fn convert_with_metadata_and_visitor(...) -> Result<(String, ExtendedMetadata)>
```

## Quick Reference: Common Visitor Patterns

| Use Case | Implementation |
|----------|----------------|
| Skip certain elements | Return `VisitResult::Skip` |
| Modify element output | Return `VisitResult::Custom(new_markdown)` |
| Track state | Use `&mut self` fields to accumulate data |
| Conditional routing | Use `ctx` fields (parent, depth, attributes) |
| Preserve default | Return `VisitResult::Default` |
| Context-aware | Match on `ctx.parent_node_type`, `ctx.depth` |
| Attribute-based | Read from `ctx.attributes` map |
| Stateless transformation | Implement stateless visitor struct |