@a5c-ai/stencil-convolution — Agent Skill

---
name: stencil-convolution
description: Expert skill for optimized stencil and convolution pattern implementations on GPU. Design tiled stencil algorithms with halos, implement 2D/3D convolution kernels, optimize boundary condition handling, apply temporal blocking techniques, generate separable filter implementations, and profile stencil memory bandwidth.
allowed-tools: Bash(*) Read Write Edit Glob Grep WebFetch
metadata:
  author: babysitter-sdk
  version: "1.0.0"
  category: domain-algorithms
  backlog-id: SK-013
---

# stencil-convolution

You are **stencil-convolution** - a specialized skill for optimized stencil and convolution pattern implementations on GPU. This skill provides expert capabilities for scientific computing, image processing, and numerical simulations requiring neighborhood computations.

## Overview

This skill enables AI-powered stencil and convolution operations including:
- Designing tiled stencil algorithms with halos
- Implementing 2D/3D convolution kernels
- Optimizing boundary condition handling
- Applying temporal blocking techniques
- Generating separable filter implementations
- Configuring shared memory tiling strategies
- Profiling stencil memory bandwidth
- Supporting multi-resolution stencils

## Prerequisites

- NVIDIA CUDA Toolkit 11.0+
- GPU with compute capability 3.5+
- Understanding of memory coalescing patterns
- Nsight Compute for memory analysis
- Optional: cuDNN for optimized convolutions

## Capabilities

### 1. Basic 2D Stencil (5-Point Laplacian)

```cuda
// Naive 5-point stencil (for comparison)
__global__ void laplacian2D_naive(
    float* out, const float* in,
    int width, int height
) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= 1 && x < width - 1 && y >= 1 && y < height - 1) {
        int idx = y * width + x;
        out[idx] = -4.0f * in[idx]
                 + in[idx - 1]      // left
                 + in[idx + 1]      // right
                 + in[idx - width]  // up
                 + in[idx + width]; // down
    }
}
```

### 2. Tiled Stencil with Shared Memory and Halo

```cuda
#define TILE_X 32
#define TILE_Y 32
#define HALO 1

__global__ void laplacian2D_tiled(
    float* out, const float* in,
    int width, int height
) {
    // Shared memory with halo
    __shared__ float tile[TILE_Y + 2 * HALO][TILE_X + 2 * HALO];

    // Global coordinates
    int gx = blockIdx.x * TILE_X + threadIdx.x;
    int gy = blockIdx.y * TILE_Y + threadIdx.y;

    // Local coordinates in shared memory (offset by halo)
    int lx = threadIdx.x + HALO;
    int ly = threadIdx.y + HALO;

    // Load center tile
    if (gx < width && gy < height) {
        tile[ly][lx] = in[gy * width + gx];
    }

    // Load halo regions
    // Left halo
    if (threadIdx.x < HALO && gx >= HALO) {
        tile[ly][lx - HALO] = in[gy * width + (gx - HALO)];
    }
    // Right halo
    if (threadIdx.x >= TILE_X - HALO && gx + HALO < width) {
        tile[ly][lx + HALO] = in[gy * width + (gx + HALO)];
    }
    // Top halo
    if (threadIdx.y < HALO && gy >= HALO) {
        tile[ly - HALO][lx] = in[(gy - HALO) * width + gx];
    }
    // Bottom halo
    if (threadIdx.y >= TILE_Y - HALO && gy + HALO < height) {
        tile[ly + HALO][lx] = in[(gy + HALO) * width + gx];
    }

    // Corner halos (if needed for larger stencils)
    // ...

    __syncthreads();

    // Compute stencil using shared memory
    if (gx >= 1 && gx < width - 1 && gy >= 1 && gy < height - 1) {
        out[gy * width + gx] = -4.0f * tile[ly][lx]
                             + tile[ly][lx - 1]
                             + tile[ly][lx + 1]
                             + tile[ly - 1][lx]
                             + tile[ly + 1][lx];
    }
}
```

### 3. Generic N-Point Stencil with Configurable Radius

```cuda
template <int RADIUS>
__global__ void stencil2D_generic(
    float* out, const float* in,
    const float* weights,  // Stencil weights
    int width, int height
) {
    extern __shared__ float tile[];

    const int TILE_X = blockDim.x;
    const int TILE_Y = blockDim.y;
    const int TILE_PITCH = TILE_X + 2 * RADIUS;

    int gx = blockIdx.x * TILE_X + threadIdx.x;
    int gy = blockIdx.y * TILE_Y + threadIdx.y;

    int lx = threadIdx.x + RADIUS;
    int ly = threadIdx.y + RADIUS;

    // Load tile with halos
    // ... (similar to above but generic)

    __syncthreads();

    if (gx >= RADIUS && gx < width - RADIUS &&
        gy >= RADIUS && gy < height - RADIUS) {

        float result = 0.0f;
        int wIdx = 0;

        // Apply stencil weights
        for (int dy = -RADIUS; dy <= RADIUS; dy++) {
            for (int dx = -RADIUS; dx <= RADIUS; dx++) {
                result += weights[wIdx++] * tile[(ly + dy) * TILE_PITCH + (lx + dx)];
            }
        }

        out[gy * width + gx] = result;
    }
}
```

### 4. 2D Convolution with Arbitrary Kernel

```cuda
#define CONV_TILE_X 32
#define CONV_TILE_Y 32
#define MAX_KERNEL_RADIUS 8

// Kernel weights in constant memory for fast access
__constant__ float c_kernel[(2 * MAX_KERNEL_RADIUS + 1) * (2 * MAX_KERNEL_RADIUS + 1)];

__global__ void convolution2D(
    float* out, const float* in,
    int width, int height,
    int kernelRadius
) {
    extern __shared__ float tile[];

    int TILE_PITCH = CONV_TILE_X + 2 * kernelRadius;

    int gx = blockIdx.x * CONV_TILE_X + threadIdx.x;
    int gy = blockIdx.y * CONV_TILE_Y + threadIdx.y;

    int lx = threadIdx.x + kernelRadius;
    int ly = threadIdx.y + kernelRadius;

    // Load center
    if (gx < width && gy < height) {
        tile[ly * TILE_PITCH + lx] = in[gy * width + gx];
    } else {
        tile[ly * TILE_PITCH + lx] = 0.0f;  // Zero padding
    }

    // Load halos with boundary handling
    // Left
    if (threadIdx.x < kernelRadius) {
        int srcX = gx - kernelRadius;
        tile[ly * TILE_PITCH + (lx - kernelRadius)] =
            (srcX >= 0 && gy < height) ? in[gy * width + srcX] : 0.0f;
    }
    // Right
    if (threadIdx.x >= CONV_TILE_X - kernelRadius) {
        int srcX = gx + kernelRadius;
        tile[ly * TILE_PITCH + (lx + kernelRadius)] =
            (srcX < width && gy < height) ? in[gy * width + srcX] : 0.0f;
    }
    // Top and bottom (similar pattern)
    // ...

    __syncthreads();

    if (gx < width && gy < height) {
        float sum = 0.0f;
        int kernelSize = 2 * kernelRadius + 1;

        for (int ky = -kernelRadius; ky <= kernelRadius; ky++) {
            for (int kx = -kernelRadius; kx <= kernelRadius; kx++) {
                int kidx = (ky + kernelRadius) * kernelSize + (kx + kernelRadius);
                sum += c_kernel[kidx] * tile[(ly + ky) * TILE_PITCH + (lx + kx)];
            }
        }

        out[gy * width + gx] = sum;
    }
}
```

### 5. Separable Convolution (2-Pass for Performance)

```cuda
// Separable convolution is faster: O(2*r) vs O(r^2)
// First pass: horizontal convolution
__global__ void convolutionRow(
    float* out, const float* in,
    int width, int height, int radius
) {
    extern __shared__ float tile[];

    int gx = blockIdx.x * blockDim.x + threadIdx.x;
    int gy = blockIdx.y;

    int TILE_WIDTH = blockDim.x + 2 * radius;
    int lx = threadIdx.x + radius;

    // Load with halos
    if (gx < width) {
        tile[lx] = in[gy * width + gx];
    }
    if (threadIdx.x < radius) {
        tile[threadIdx.x] = (gx >= radius) ? in[gy * width + gx - radius] : 0.0f;
        tile[lx + blockDim.x] = (gx + blockDim.x < width) ?
            in[gy * width + gx + blockDim.x] : 0.0f;
    }

    __syncthreads();

    if (gx < width) {
        float sum = 0.0f;
        for (int k = -radius; k <= radius; k++) {
            sum += c_kernelRow[k + radius] * tile[lx + k];
        }
        out[gy * width + gx] = sum;
    }
}

// Second pass: vertical convolution
__global__ void convolutionColumn(
    float* out, const float* in,
    int width, int height, int radius
) {
    extern __shared__ float tile[];

    int gx = blockIdx.x;
    int gy = blockIdx.y * blockDim.y + threadIdx.y;

    int TILE_HEIGHT = blockDim.y + 2 * radius;
    int ly = threadIdx.y + radius;

    // Load with halos
    if (gy < height) {
        tile[ly] = in[gy * width + gx];
    }
    if (threadIdx.y < radius) {
        tile[threadIdx.y] = (gy >= radius) ? in[(gy - radius) * width + gx] : 0.0f;
        tile[ly + blockDim.y] = (gy + blockDim.y < height) ?
            in[(gy + blockDim.y) * width + gx] : 0.0f;
    }

    __syncthreads();

    if (gy < height) {
        float sum = 0.0f;
        for (int k = -radius; k <= radius; k++) {
            sum += c_kernelCol[k + radius] * tile[ly + k];
        }
        out[gy * width + gx] = sum;
    }
}
```

### 6. 3D Stencil (7-Point Laplacian)

```cuda
#define TILE_X 16
#define TILE_Y 16
#define TILE_Z 4

__global__ void laplacian3D(
    float* out, const float* in,
    int nx, int ny, int nz
) {
    __shared__ float current[TILE_Y + 2][TILE_X + 2];
    __shared__ float above[TILE_Y][TILE_X];
    __shared__ float below[TILE_Y][TILE_X];

    int gx = blockIdx.x * TILE_X + threadIdx.x;
    int gy = blockIdx.y * TILE_Y + threadIdx.y;
    int gz = blockIdx.z * TILE_Z;

    int lx = threadIdx.x + 1;
    int ly = threadIdx.y + 1;

    // Process TILE_Z planes
    for (int z = gz; z < min(gz + TILE_Z, nz - 1); z++) {
        if (z == 0) continue;

        // Load current plane with halos
        if (gx < nx && gy < ny) {
            current[ly][lx] = in[z * ny * nx + gy * nx + gx];
        }

        // Load halos
        if (threadIdx.x == 0 && gx > 0) {
            current[ly][0] = in[z * ny * nx + gy * nx + (gx - 1)];
        }
        if (threadIdx.x == TILE_X - 1 && gx < nx - 1) {
            current[ly][TILE_X + 1] = in[z * ny * nx + gy * nx + (gx + 1)];
        }
        if (threadIdx.y == 0 && gy > 0) {
            current[0][lx] = in[z * ny * nx + (gy - 1) * nx + gx];
        }
        if (threadIdx.y == TILE_Y - 1 && gy < ny - 1) {
            current[TILE_Y + 1][lx] = in[z * ny * nx + (gy + 1) * nx + gx];
        }

        // Load above and below planes
        if (gx < nx && gy < ny) {
            above[threadIdx.y][threadIdx.x] = in[(z + 1) * ny * nx + gy * nx + gx];
            below[threadIdx.y][threadIdx.x] = in[(z - 1) * ny * nx + gy * nx + gx];
        }

        __syncthreads();

        // Compute 7-point stencil
        if (gx >= 1 && gx < nx - 1 && gy >= 1 && gy < ny - 1) {
            out[z * ny * nx + gy * nx + gx] =
                -6.0f * current[ly][lx]
                + current[ly][lx - 1]   // x-1
                + current[ly][lx + 1]   // x+1
                + current[ly - 1][lx]   // y-1
                + current[ly + 1][lx]   // y+1
                + above[threadIdx.y][threadIdx.x]    // z+1
                + below[threadIdx.y][threadIdx.x];   // z-1
        }

        __syncthreads();
    }
}
```

### 7. Temporal Blocking (Multi-Timestep)

```cuda
// Process multiple timesteps before writing back to global memory
template <int TIMESTEPS>
__global__ void stencil_temporal_blocking(
    float* out, const float* in,
    int width, int height
) {
    // Larger shared memory to accommodate temporal expansion
    // Each timestep expands the halo by 1
    const int HALO = TIMESTEPS;
    extern __shared__ float smem[];

    float* current = smem;
    float* next = smem + (blockDim.y + 2 * HALO) * (blockDim.x + 2 * HALO);

    // Load initial data with expanded halo
    // ...

    __syncthreads();

    // Multiple timesteps in shared memory
    for (int t = 0; t < TIMESTEPS; t++) {
        int shrinkHalo = TIMESTEPS - t - 1;
        int validXStart = shrinkHalo;
        int validXEnd = blockDim.x + 2 * HALO - shrinkHalo;
        int validYStart = shrinkHalo;
        int validYEnd = blockDim.y + 2 * HALO - shrinkHalo;

        int lx = threadIdx.x + HALO;
        int ly = threadIdx.y + HALO;

        // Only threads in valid region compute
        if (lx >= validXStart + 1 && lx < validXEnd - 1 &&
            ly >= validYStart + 1 && ly < validYEnd - 1) {

            int PITCH = blockDim.x + 2 * HALO;
            next[ly * PITCH + lx] = -4.0f * current[ly * PITCH + lx]
                                  + current[ly * PITCH + lx - 1]
                                  + current[ly * PITCH + lx + 1]
                                  + current[(ly - 1) * PITCH + lx]
                                  + current[(ly + 1) * PITCH + lx];
        }

        __syncthreads();

        // Swap buffers
        float* temp = current;
        current = next;
        next = temp;

        __syncthreads();
    }

    // Write final result to global memory
    // ...
}
```

### 8. Boundary Condition Patterns

```cuda
// Different boundary condition strategies
enum BoundaryCondition {
    BC_ZERO,       // Zero padding
    BC_REPLICATE,  // Replicate edge values
    BC_REFLECT,    // Mirror reflection
    BC_PERIODIC    // Wrap around
};

__device__ inline int applyBoundary(int idx, int size, BoundaryCondition bc) {
    if (idx >= 0 && idx < size) return idx;

    switch (bc) {
        case BC_ZERO:
            return -1;  // Signal to use zero
        case BC_REPLICATE:
            return (idx < 0) ? 0 : size - 1;
        case BC_REFLECT:
            if (idx < 0) return -idx - 1;
            if (idx >= size) return 2 * size - idx - 1;
            return idx;
        case BC_PERIODIC:
            return ((idx % size) + size) % size;
        default:
            return idx;
    }
}

__device__ inline float loadWithBoundary(
    const float* data, int x, int y,
    int width, int height, BoundaryCondition bc
) {
    int bx = applyBoundary(x, width, bc);
    int by = applyBoundary(y, height, bc);

    if (bx < 0 || by < 0) return 0.0f;

    return data[by * width + bx];
}
```

## Best Practices

### Memory Access Patterns

| Pattern | Impact | Recommendation |
|---------|--------|----------------|
| Coalesced global reads | High | Align thread access to memory layout |
| Shared memory bank conflicts | Medium | Pad shared memory arrays |
| Halo loading efficiency | Medium | Use cooperative loading |

### Tile Size Selection

| GPU Architecture | Recommended Tile Size |
|-----------------|----------------------|
| Volta/Turing | 32x32 or 16x16 |
| Ampere | 32x32 |
| Hopper | 32x32 or 64x32 |

### Performance Tips

1. **Use constant memory** for stencil weights
2. **Maximize data reuse** in shared memory
3. **Consider separable filters** for 2D convolutions
4. **Temporal blocking** for iterative stencils
5. **Profile memory bandwidth** - stencils are memory-bound

## Process Integration

This skill integrates with the following processes:
- `stencil-computation-optimization.js` - Stencil optimization workflows
- `gpu-image-video-processing.js` - Image filtering
- `parallel-algorithm-design.js` - Algorithm patterns

## Output Format

When executing operations, provide structured output:

```json
{
  "operation": "generate-stencil",
  "status": "success",
  "stencil": {
    "type": "2D",
    "points": 5,
    "radius": 1,
    "boundary": "replicate"
  },
  "optimization": {
    "tile_size": [32, 32],
    "shared_memory_bytes": 4624,
    "halo_size": 1,
    "temporal_blocking": false
  },
  "performance": {
    "achieved_bandwidth_gbps": 850,
    "peak_bandwidth_gbps": 1555,
    "efficiency_percent": 54.7
  },
  "recommendations": [
    "Consider separable implementation for Gaussian filter",
    "Temporal blocking could reduce memory traffic by 2x"
  ],
  "artifacts": ["stencil_kernel.cu", "benchmark_results.json"]
}
```

## Constraints

- Stencils are typically memory-bandwidth bound
- Shared memory limits tile size
- Boundary handling adds complexity
- 3D stencils have higher memory requirements
- Profile to ensure memory coalescing