APM

>Agent Skill

@athola/cpu-gpu-performance

skilldevelopment

Use this skill BEFORE resource-intensive operations. Establish baselines proactively. location: plugin token_budget: 400. Use when session starts (auto-load with token-conservation), planning builds or training that could pin CPUs/GPUs for >1 minute, retrying failed resource-heavy commands. Do not use when simple operations with no resource impact. DO NOT use when: quick single-file operations.

apm::install
$apm install @athola/cpu-gpu-performance
apm::skill.md
---
name: cpu-gpu-performance
description: 'Use this skill BEFORE resource-intensive operations. Establish baselines
  proactively. location: plugin token_budget: 400. Use when session starts (auto-load
  with token-conservation), planning builds or training that could pin CPUs/GPUs for
  >1 minute, retrying failed resource-heavy commands. Do not use when simple operations
  with no resource impact. DO NOT use when: quick single-file operations.'
progressive_loading: true
dependencies:
  hub:
  - token-conservation
  modules: []
---
## Table of Contents

- [When to Use](#when-to-use)
- [When NOT to Use](#when-not-to-use)
- [Required TodoWrite Items](#required-todowrite-items)
- [Step 1: Establish Current Baseline](#step-1-establish-current-baseline)
- [Step 2: Narrow the Scope](#step-2-narrow-the-scope)
- [Step 3: Instrument Before You Optimize](#step-3-instrument-before-you-optimize)
- [Step 4: Throttle and Sequence Work](#step-4-throttle-and-sequence-work)
- [Step 5: Log Decisions and Next Steps](#step-5-log-decisions-and-next-steps)
- [Output Expectations](#output-expectations)
- [Troubleshooting](#troubleshooting)


# CPU/GPU Performance Discipline

## When to Use
- At the beginning of every session (auto-load alongside `token-conservation`).
- Whenever you plan to build, train, or test anything that could pin CPU cores
  or GPUs for more than a minute.
- Before retrying a failing command that previously consumed significant resources.

## When NOT To Use

- Simple operations with no resource impact
- Quick single-file operations

## Required TodoWrite Items
1. `cpu-gpu-performance:baseline`
2. `cpu-gpu-performance:scope`
3. `cpu-gpu-performance:instrument`
4. `cpu-gpu-performance:throttle`
5. `cpu-gpu-performance:log`

## Step 1: Establish Current Baseline
- Capture current utilization:
  - `uptime`
  - `ps -eo pcpu,cmd --sort=-pcpu | head` (sort so the busiest processes surface first; `--sort` is GNU `ps`)
  - `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv`

  Note which hosts/GPUs are already busy.
- Record any CI/cluster budgets (time quotas, GPU hours) before launching work.
- Set a per-task CPU minute / GPU minute budget that respects those limits.
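The baseline capture above can be sketched as one script. This is a minimal POSIX-shell sketch, not a prescribed implementation; the `BASELINE_LOG` path is illustrative, and the GPU query is guarded so CPU-only hosts still get a usable snapshot.

```shell
#!/bin/sh
# Snapshot current utilization before launching heavy work.
LOG="${BASELINE_LOG:-/tmp/perf-baseline.txt}"
{
  date
  uptime
  # Top 5 CPU consumers (portable ps columns, sorted numerically).
  ps -eo pcpu,comm | sort -rn | head -n 5
  # GPU stats only where an NVIDIA driver is present.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
  else
    echo "no nvidia-smi on this host; skipping GPU baseline"
  fi
} > "$LOG"
echo "baseline written to $LOG"
```

Keeping the snapshot in a file (rather than scrollback) makes it easy to diff against a second snapshot taken after the heavy job finishes.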

## Step 2: Narrow the Scope
- Avoid running "whole world" jobs after a small fix. Prefer diff-based
  or tag-based selective testing:
  - `pytest -k`
  - Bazel target patterns
  - `cargo test <module>`
- Batch low-level fixes so you can validate multiple changes with a single targeted command.
- For GPU jobs, favor unit-scale smoke inputs or lower epoch counts before
  scheduling the full training/eval sweep.
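Diff-based selection can be sketched as follows. This assumes a git checkout and pytest; `origin/main` is an illustrative base ref (override via `BASE_REF`), and the fallbacks keep the script harmless outside a repo.

```shell
#!/bin/sh
# Run only the tests whose files changed since the base ref,
# instead of the whole suite.
BASE="${BASE_REF:-origin/main}"
if git rev-parse --verify "$BASE" >/dev/null 2>&1; then
  # Collect changed Python files; word-splitting of $CHANGED is intentional.
  CHANGED=$(git diff --name-only "$BASE"...HEAD -- '*.py')
  if [ -n "$CHANGED" ] && command -v pytest >/dev/null 2>&1; then
    # -k can narrow further by test name; "not slow" deselects marked tests.
    pytest $CHANGED -k "not slow"
  else
    echo "no Python changes since $BASE (or no pytest); skipping run"
  fi
else
  echo "base ref $BASE not found; fall back to an explicit target list"
fi
```

The same shape works for `cargo test <module>` or Bazel target patterns: derive the target set from the diff, run that set, and escalate only if it passes.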

## Step 3: Instrument Before You Optimize
- Pick the right profiler/monitor:
  - CPU work:
    - `perf`
    - Intel VTune Profiler
    - `cargo flamegraph`
    - language-specific profilers
  - GPU work:
    - `nvidia-smi dmon`
    - `nsys` (Nsight Systems)
    - `nvprof` (legacy; superseded by `nsys` and Nsight Compute)
    - DLProf
    - framework timeline tracers
- Capture kernel/ops timelines, memory footprints, and data pipeline latency
  so you have evidence when throttling or parallelizing.
- Record hot paths + I/O bottlenecks in notes so future reruns can jump straight to the culprit.
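A lightweight way to capture a utilization timeline alongside a heavy command is to run the monitor in the background. This sketch assumes Linux; the log path and sample count are illustrative, and a coarse `uptime` loop stands in where no NVIDIA tooling exists.

```shell
#!/bin/sh
# Record a short utilization timeline while heavy work runs.
MONITOR_LOG="${MONITOR_LOG:-/tmp/util-timeline.log}"
if command -v nvidia-smi >/dev/null 2>&1; then
  # One sample per second: SM%, memory, power per GPU.
  nvidia-smi dmon -c 5 > "$MONITOR_LOG" 2>&1 &
  MON_PID=$!
else
  # CPU-only fallback: coarse load samples instead of GPU counters.
  ( i=0; while [ "$i" -lt 3 ]; do uptime; sleep 1; i=$((i+1)); done ) \
    > "$MONITOR_LOG" &
  MON_PID=$!
fi
# ...launch the heavy command here, then collect the monitor:
wait "$MON_PID"
echo "timeline saved to $MONITOR_LOG"
```

The resulting log is the evidence called for above: it shows whether the device was actually busy, idle, or memory-bound before you decide to throttle or parallelize.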

## Step 4: Throttle and Sequence Work
- Use `nice`, `ionice`, or Kubernetes/Slurm quotas to prevent starvation of shared nodes.
- Chain heavy tasks with guardrails:
  - Rerun only the failed test/module
  - Then (optionally) escalate to the next-wider shard
  - Reserve the full suite for the final gate
- Stagger GPU kernels (smaller batch sizes or gradient accumulation) when memory
  pressure risks eviction; prefer checkpoint/restore over restarts.
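The `nice`/`ionice` guardrail can be wrapped around any heavy command like this. A minimal sketch: `HEAVY_CMD` is an illustrative placeholder, and the `ionice` branch is guarded because macOS and minimal container images ship without it.

```shell
#!/bin/sh
# Run a heavy command at reduced CPU and IO priority so interactive
# work and co-tenants on a shared node are not starved.
HEAVY_CMD="${HEAVY_CMD:-sleep 1}"
if command -v ionice >/dev/null 2>&1; then
  # nice -n 10 lowers CPU priority; ionice -c 3 uses the idle IO class.
  nice -n 10 ionice -c 3 sh -c "$HEAVY_CMD"
else
  # No ionice available; nice alone still lowers CPU priority.
  nice -n 10 sh -c "$HEAVY_CMD"
fi
STATUS=$?
echo "heavy command exited with status $STATUS"
```

On Kubernetes or Slurm, the equivalent lever is resource requests/limits or partition QoS rather than process niceness, but the intent is the same: heavy work yields to latency-sensitive work.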

## Step 5: Log Decisions and Next Steps

Conclude by documenting the commands that were run and their resource cost
(duration, CPU%, GPU%), confirming whether they remained within the per-task
budget. If a full suite or long training run was necessary, justify why selective
or staged approaches were not feasible. Capture any follow-up tasks, such as
adding a new test marker or profiling documentation, to simplify future sessions.
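One low-friction way to make this logging habitual is an append-only markdown log. The path and field values below are illustrative (the example command is the one from Output Expectations); adapt the fields to whatever your team tracks.

```shell
#!/bin/sh
# Append a structured decision entry to a running session log.
PERF_LOG="${PERF_LOG:-/tmp/perf-decisions.md}"
cat >> "$PERF_LOG" <<'EOF'
## session entry (values illustrative)
- ran: `pytest tests/test_orders.py -k test_refund` instead of the full suite
- cost: stayed within the per-task CPU budget; GPU untouched
- follow-up: add a `slow` marker so future runs can deselect long tests
EOF
echo "decision logged to $PERF_LOG"
```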

## Output Expectations
- Brief summary covering:
  - baseline metrics
  - scope chosen
  - instrumentation captured
  - throttling tactics
  - follow-up items
- Concrete example(s) of what ran, e.g.:
  - "reran `pytest tests/test_orders.py -k test_refund` instead of `pytest -m slow`"
  - "profiled `nvidia-smi dmon` output to prove GPU idle time before scaling"
## Troubleshooting

### Common Issues

**`nvidia-smi: command not found`**
The host has no NVIDIA driver or toolkit in PATH. Skip the GPU baseline and record CPU metrics only.

**Permission errors from profilers**
`perf` typically needs a permissive `kernel.perf_event_paranoid` setting, `CAP_PERFMON`, or root; when you cannot change that, fall back to userspace or language-level profilers.

**Monitoring output looks empty or flat**
Confirm the monitor (`nvidia-smi dmon`, `perf`) was attached while the heavy command was actually running, and lengthen the sampling window before concluding the workload is idle.