@majiayu000/bio-chip-seq-motif-analysis — Agent Skill

---
name: bio-chip-seq-motif-analysis
description: De novo motif discovery and known motif enrichment analysis using HOMER and MEME-ChIP. Identify transcription factor binding motifs in ChIP-seq, ATAC-seq, or other genomic peak data. Use when finding enriched DNA motifs in peak sequences.
tool_type: cli
primary_tool: HOMER
---

# Motif Analysis

Identify DNA sequence motifs enriched in ChIP-seq or ATAC-seq peaks to discover transcription factor binding sites.

## Tool Comparison

| Tool | Strengths | Use Case |
|------|-----------|----------|
| HOMER | Fast, comprehensive, built-in databases | General motif analysis |
| MEME-ChIP | Multiple algorithms, web interface | Publication-quality |
| MEME | De novo discovery only | Simple discovery |
| FIMO | Known motif scanning | Genome-wide scanning |

## HOMER

### Installation

```bash
conda install -c bioconda homer

# Configure genome (required once)
perl /path/to/homer/configureHomer.pl -install hg38
perl /path/to/homer/configureHomer.pl -install mm10
```

### De Novo Motif Discovery

```bash
# Basic motif finding
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200

# With background regions
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200 -bg background.bed

# Specify motif lengths to search
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200 -len 8,10,12
```

### Key Options

| Option | Description |
|--------|-------------|
| `-size <#>` | Fragment size for analysis (default 200) |
| `-size given` | Use actual peak sizes |
| `-bg <file>` | Background regions (BED) |
| `-len <#,#,...>` | Motif lengths to search |
| `-mask` | Mask repeats |
| `-p <#>` | Number of CPUs |
| `-S <#>` | Number of motifs to find (default 25) |
| `-mis <#>` | Mismatches allowed (default 2) |
| `-noweight` | Don't adjust for GC content |

### Output Files

```
output_dir/
├── homerResults.html      # Main results page
├── knownResults.html      # Known motif enrichment
├── homerMotifs.all.motifs # All discovered motifs
├── knownResults.txt       # Known motif statistics
└── motif1.motif           # Individual motif files
```

### Known Motif Enrichment Only

```bash
# Skip de novo, only check known motifs
findMotifsGenome.pl peaks.bed hg38 output_dir/ -size 200 -nomotif
```

### Scan for Specific Motifs

```bash
# Find instances of motif in peaks
annotatePeaks.pl peaks.bed hg38 -m motif.motif > annotated.txt

# Scan genome for motif occurrences
scanMotifGenomeWide.pl motif.motif hg38 > motif_sites.bed
```

### Motif Comparison

```bash
# Compare discovered motifs to known database
compareMotifs.pl motifs.motif output_dir/ -known
```

### Create Custom Motif

```bash
# From consensus sequence
seq2profile.pl CACGTG 4 > MYC.motif

# From aligned sequences
cat aligned_seqs.txt | alignAndConvert.pl - > custom.motif
```

## MEME Suite

### Installation

```bash
conda install -c bioconda meme
```

### Extract Sequences from Peaks

```bash
# Get FASTA sequences under peaks
bedtools getfasta -fi genome.fa -bed peaks.bed -fo peaks.fa

# Center peaks and resize
bedtools slop -i peaks.bed -g genome.sizes -b 100 | \
    bedtools getfasta -fi genome.fa -bed - -fo peaks_centered.fa
```

### MEME (De Novo Discovery)

```bash
# Basic de novo discovery
meme peaks.fa -dna -oc meme_output -mod zoops -nmotifs 10 -minw 6 -maxw 20

# With Markov background
fasta-get-markov peaks.fa > background.model
meme peaks.fa -dna -oc meme_output -bfile background.model -mod zoops -nmotifs 10
```

### MEME Options

| Option | Description |
|--------|-------------|
| `-mod zoops` | Zero or one per sequence (default for ChIP) |
| `-mod oops` | Exactly one per sequence |
| `-mod anr` | Any number of repeats |
| `-nmotifs <#>` | Number of motifs to find |
| `-minw <#>` | Minimum motif width |
| `-maxw <#>` | Maximum motif width |
| `-revcomp` | Search both strands |
| `-bfile <file>` | Background model file |

### MEME-ChIP (Comprehensive Pipeline)

```bash
# All-in-one ChIP-seq motif analysis
meme-chip -oc meme_chip_output -db motif_database.meme peaks.fa
```

MEME-ChIP runs:
1. MEME - De novo discovery (central enrichment)
2. DREME - Short motif discovery
3. CentriMo - Central enrichment analysis
4. TOMTOM - Compare to known motifs
5. FIMO - Find motif instances

### DREME (Short Motifs)

```bash
# Find short enriched motifs
dreme -oc dreme_output -p peaks.fa -n background.fa
```

### CentriMo (Central Enrichment)

```bash
# Test for central enrichment of known motifs
centrimo -oc centrimo_output peaks.fa motif_database.meme
```

### TOMTOM (Motif Comparison)

```bash
# Compare discovered motifs to database
tomtom -oc tomtom_output discovered.meme database.meme
```

### FIMO (Motif Scanning)

```bash
# Scan sequences for motif matches
fimo --oc fimo_output motif.meme sequences.fa

# Scan genome
fimo --oc fimo_output --max-stored-scores 1000000 motif.meme genome.fa
```

## Motif Databases

### HOMER Built-in

```bash
# List available motif sets
ls /path/to/homer/data/knownTFs/

# Vertebrate, known motifs (default)
findMotifsGenome.pl peaks.bed hg38 output/ -mknown vertebrates/known.motifs
```

### JASPAR

```bash
# Download JASPAR motifs
wget https://jaspar.genereg.net/download/data/2024/CORE/JASPAR2024_CORE_vertebrates_non-redundant_pfms_meme.txt

# Use with MEME suite
meme-chip -db JASPAR2024_CORE_vertebrates_non-redundant_pfms_meme.txt peaks.fa
```

### HOCOMOCO

```bash
# Download HOCOMOCO
wget https://hocomoco11.autosome.org/final_bundle/hocomoco11/core/HUMAN/mono/HOCOMOCOv11_core_HUMAN_mono_meme_format.meme

# Use with MEME suite
tomtom discovered.meme HOCOMOCOv11_core_HUMAN_mono_meme_format.meme
```

## Python: Parse HOMER Results

```python
import pandas as pd

def parse_homer_known(results_file):
    '''Parse HOMER knownResults.txt.'''
    df = pd.read_csv(results_file, sep='\t')
    df.columns = ['Motif', 'Consensus', 'P-value', 'Log P-value',
                  'q-value', 'Targets', 'Target%', 'Background', 'Background%']
    df['P-value'] = df['P-value'].astype(float)
    return df.sort_values('P-value')

known = parse_homer_known('output_dir/knownResults.txt')
print(known[['Motif', 'P-value', 'Target%']].head(20))
```

## Python: Parse MEME Results

```python
from Bio import motifs

def parse_meme_file(meme_file):
    '''Parse MEME output file.'''
    with open(meme_file) as f:
        record = motifs.parse(f, 'meme')
    return record

record = parse_meme_file('meme_output/meme.txt')
for m in record:
    print(f'{m.name}: {m.consensus}')
    print(m.counts)
```

## Complete Workflows

### ChIP-seq Motif Analysis

```bash
#!/bin/bash
set -euo pipefail

PEAKS=$1  # narrowPeak or BED file
GENOME=$2  # hg38, mm10, etc.
OUTDIR=$3

mkdir -p $OUTDIR

# HOMER analysis
echo "Running HOMER..."
findMotifsGenome.pl $PEAKS $GENOME ${OUTDIR}/homer \
    -size 200 -p 8 -mask

# Extract sequences for MEME
echo "Extracting sequences..."
bedtools slop -i $PEAKS -g ${GENOME}.chrom.sizes -b 0 | \
    awk 'BEGIN{OFS="\t"} {center=int(($2+$3)/2); print $1,center-100,center+100}' | \
    bedtools getfasta -fi ${GENOME}.fa -bed - -fo ${OUTDIR}/peaks.fa

# MEME-ChIP analysis
echo "Running MEME-ChIP..."
meme-chip -oc ${OUTDIR}/meme_chip \
    -db /path/to/JASPAR.meme \
    ${OUTDIR}/peaks.fa

echo "Done. Results in ${OUTDIR}/"
```

### ATAC-seq Footprint Motifs

```bash
# Analyze motifs in footprint regions
findMotifsGenome.pl footprints.bed hg38 footprint_motifs/ \
    -size given -mask -p 8

# Compare to accessible regions background
findMotifsGenome.pl footprints.bed hg38 footprint_motifs/ \
    -size given -bg accessible_peaks.bed -mask -p 8
```

## Visualization

### HOMER Logo

```bash
# Generate sequence logo
motif2Logo.pl motif.motif > logo.eps
```

### Plot with Python

```python
import logomaker
import pandas as pd
import matplotlib.pyplot as plt

def plot_motif(pwm_file):
    '''Plot sequence logo from HOMER PWM.'''
    pwm = pd.read_csv(pwm_file, sep='\t', skiprows=1, header=None)
    pwm.columns = ['A', 'C', 'G', 'T']
    logo = logomaker.Logo(pwm, shade_below=0.5, fade_below=0.5)
    plt.show()
```

## Quality Metrics

| Metric | Good | Concerning |
|--------|------|------------|
| P-value | < 1e-10 | > 1e-5 |
| Target % | > 20% | < 5% |
| Background % | < Target/2 | Similar to Target |
| Bit score | > 10 | < 5 |

## Common Issues

### No Significant Motifs
- Check peak quality (too few peaks?)
- Try different peak sizes (`-size`)
- Ensure genome build matches
- Check for repeat masking issues

### Too Many Motifs
- Increase significance threshold
- Use `-S` to limit number of motifs
- Filter by target percentage

### Wrong Background
- Use matched GC content background
- Consider using input/control peaks
- Try shuffled sequences

## Related Skills

- peak-calling - Generate input peaks
- peak-annotation - Annotate peaks with genes
- atac-seq/footprinting - TF footprint analysis
- genome-intervals - BED file operations