apm install @ancoleman/evaluating-llms
---
name: evaluating-llms
description: Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.
---
# LLM Evaluation
Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.
## When to Use This Skill
Apply this skill when:
- Testing individual prompts for correctness and formatting
- Validating RAG (Retrieval-Augmented Generation) pipeline quality
- Measuring hallucinations, bias, or toxicity in LLM outputs
- Comparing different models or prompt configurations (A/B testing)
- Running benchmark tests (MMLU, HumanEval) to assess model capabilities
- Setting up production monitoring for LLM applications
- Integrating LLM quality checks into CI/CD pipelines
Common triggers:
- "How do I test if my RAG system is working correctly?"
- "How can I measure hallucinations in LLM outputs?"
- "What metrics should I use to evaluate generation quality?"
- "How do I compare GPT-4 vs Claude for my use case?"
- "How do I detect bias in LLM responses?"
## Evaluation Strategy Selection
### Decision Framework: Which Evaluation Approach?
**By Task Type:**
| Task Type | Primary Approach | Metrics | Tools |
|-----------|------------------|---------|-------|
| **Classification** (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| **Generation** (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, Quality rubric | GPT-4/Claude for judging |
| **Question Answering** | Exact match + semantic similarity | EM, F1, Cosine similarity | Custom evaluators |
| **RAG Systems** | RAGAS framework | Faithfulness, Answer/Context relevance | RAGAS library |
| **Code Generation** | Unit tests + execution | Pass@K, Test pass rate | HumanEval, pytest |
| **Multi-step Agents** | Task completion + tool accuracy | Success rate, Efficiency | Custom evaluators |
**By Volume and Cost:**
| Samples | Speed | Cost | Recommended Approach |
|---------|-------|------|---------------------|
| 1,000+ | Immediate | $0 | Automated metrics (regex, JSON validation) |
| 100-1,000 | Minutes | $0.01-0.10 each | LLM-as-judge (GPT-4, Claude) |
| < 100 | Hours | $1-10 each | Human evaluation (pairwise comparison) |
**Layered Approach (Recommended for Production):**
1. **Layer 1:** Automated metrics for all outputs (fast, cheap)
2. **Layer 2:** LLM-as-judge for 10% sample (nuanced quality)
3. **Layer 3:** Human review for 1% edge cases (validation)
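The three layers above can be sketched as a single routing function. This is a minimal illustration, not a production pipeline; `automated_check` and `judge_fn` stand in for your own metric and judge implementations.

```python
import random

def layered_evaluate(outputs, automated_check, judge_fn,
                     judge_rate=0.10, human_rate=0.01):
    """Route every output through cheap automated checks (Layer 1),
    a sampled LLM-as-judge pass (Layer 2), and a small human-review
    queue (Layer 3)."""
    results, human_queue = [], []
    for out in outputs:
        record = {"output": out, "automated_pass": automated_check(out)}
        if random.random() < judge_rate:            # Layer 2: ~10% sample
            record["judge_score"] = judge_fn(out)
        if not record["automated_pass"] or random.random() < human_rate:
            human_queue.append(out)                  # Layer 3: edge cases
        results.append(record)
    return results, human_queue
```

In practice the human-review queue would feed a labeling tool, and failed automated checks would also trigger alerts.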
## Core Evaluation Patterns
### Unit Evaluation (Individual Prompts)
Test single prompt-response pairs for correctness.
**Methods:**
- **Exact Match:** Response exactly matches expected output
- **Regex Matching:** Response follows expected pattern
- **JSON Schema Validation:** Structured output validation
- **Keyword Presence:** Required terms appear in response
- **LLM-as-Judge:** Binary pass/fail using evaluation prompt
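The non-LLM checks above need only the standard library. A minimal sketch (function names are illustrative):

```python
import json
import re

def check_regex(response: str, pattern: str) -> bool:
    """Response follows an expected pattern (e.g. an ISO date)."""
    return re.fullmatch(pattern, response.strip()) is not None

def check_json_keys(response: str, required_keys: set) -> bool:
    """Structured output parses as JSON and contains required keys."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

def check_keywords(response: str, keywords: list) -> bool:
    """All required terms appear in the response (case-insensitive)."""
    lowered = response.lower()
    return all(k.lower() in lowered for k in keywords)
```

These run in microseconds per output, which is why Layer 1 of the layered approach can cover 100% of traffic.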
**Example Use Cases:**
- Email classification (spam/not spam)
- Entity extraction (dates, names, locations)
- JSON output formatting validation
- Sentiment analysis (positive/negative/neutral)
**Quick Start (Python):**
```python
import pytest
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Return only the label."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

def test_positive_sentiment():
    result = classify_sentiment("I love this product!")
    assert result == "positive"
```
For complete unit evaluation examples, see `examples/python/unit_evaluation.py` and `examples/typescript/unit-evaluation.ts`.
### RAG (Retrieval-Augmented Generation) Evaluation
Evaluate RAG systems using RAGAS framework metrics.
**Critical Metrics (Priority Order):**
1. **Faithfulness** (Target: > 0.8) - **MOST CRITICAL**
- Measures: Is the answer grounded in retrieved context?
- Prevents hallucinations
- If failing: Adjust prompt to emphasize grounding, require citations
2. **Answer Relevance** (Target: > 0.7)
- Measures: How well does the answer address the query?
- If failing: Improve prompt instructions, add few-shot examples
3. **Context Relevance** (Target: > 0.7)
- Measures: Are retrieved chunks relevant to the query?
- If failing: Improve retrieval (better embeddings, hybrid search)
4. **Context Precision** (Target: > 0.5)
- Measures: Are relevant chunks ranked higher than irrelevant?
- If failing: Add re-ranking step to retrieval pipeline
5. **Context Recall** (Target: > 0.8)
- Measures: Are all relevant chunks retrieved?
- If failing: Increase retrieval count, improve chunking strategy
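To make the precision/recall distinction concrete, here is a simplified, label-based version of what these metrics measure (RAGAS itself uses an LLM to produce the relevance judgments):

```python
def context_precision(relevance: list) -> float:
    """Mean precision@k over the positions of relevant chunks, given
    binary relevance labels in rank order. Rewards rankings that put
    relevant chunks first."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k   # precision@k at each relevant position
    return score / hits if hits else 0.0

def context_recall(retrieved_ids: set, relevant_ids: set) -> float:
    """Fraction of all relevant chunks that were actually retrieved."""
    if not relevant_ids:
        return 0.0
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)
```

A ranking of `[relevant, relevant, irrelevant]` scores 1.0 on precision; `[irrelevant, relevant]` scores 0.5, which is why a re-ranking step helps when precision is low.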
**Quick Start (Python with RAGAS):**
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"]
}
dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(f"Faithfulness: {results['faithfulness']:.2f}")
```
For comprehensive RAG evaluation patterns, see `references/rag-evaluation.md` and `examples/python/ragas_example.py`.
### LLM-as-Judge Evaluation
Use powerful LLMs (GPT-4, Claude Opus) to evaluate other LLM outputs.
**When to Use:**
- Generation quality assessment (summaries, creative writing)
- Nuanced evaluation criteria (tone, clarity, helpfulness)
- Custom rubrics for domain-specific tasks
- Medium-volume evaluation (100-1,000 samples)
**Correlation with Human Judgment:** 0.75-0.85 for well-designed rubrics
**Best Practices:**
- Use clear, specific rubrics (1-5 scale with detailed criteria)
- Include few-shot examples in evaluation prompt
- Average multiple evaluations to reduce variance
- Be aware of biases (position bias, verbosity bias, self-preference)
**Quick Start (Python):**
```python
import re
from openai import OpenAI

client = OpenAI()

def evaluate_quality(prompt: str, response: str) -> tuple[int, str]:
    """Returns (score 1-5, reasoning)"""
    eval_prompt = f"""
Rate the following LLM response on relevance and helpfulness.
USER PROMPT: {prompt}
LLM RESPONSE: {response}
Provide:
Score: [1-5, where 5 is best]
Reasoning: [1-2 sentences]
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0.3
    )
    content = result.choices[0].message.content
    # Parse with regex rather than fixed line positions: judge output
    # often includes extra whitespace or preamble
    score_match = re.search(r"Score:\s*(\d)", content)
    reason_match = re.search(r"Reasoning:\s*(.+)", content)
    if not score_match or not reason_match:
        raise ValueError(f"Unparseable judge output: {content!r}")
    return int(score_match.group(1)), reason_match.group(1).strip()
```
For detailed LLM-as-judge patterns and prompt templates, see `references/llm-as-judge.md` and `examples/python/llm_as_judge.py`.
### Safety and Alignment Evaluation
Measure hallucinations, bias, and toxicity in LLM outputs.
#### Hallucination Detection
**Methods:**
1. **Faithfulness to Context (RAG):**
- Use RAGAS faithfulness metric
- LLM checks if claims are supported by context
- Score: Supported claims / Total claims
2. **Factual Accuracy (Closed-Book):**
- LLM-as-judge with access to reliable sources
- Fact-checking APIs (Google Fact Check)
- Entity-level verification (dates, names, statistics)
3. **Self-Consistency:**
- Generate multiple responses to same question
- Measure agreement between responses
- Low consistency suggests hallucination
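A self-consistency score can be approximated by averaging pairwise similarity across samples. The sketch below uses stdlib `SequenceMatcher` as a crude stand-in; production checks usually compare embedding vectors instead.

```python
from difflib import SequenceMatcher
from itertools import combinations

def self_consistency(responses: list) -> float:
    """Mean pairwise string similarity across multiple responses to the
    same question. Low agreement suggests the model is guessing."""
    if len(responses) < 2:
        return 1.0
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(responses, 2)]
    return sum(sims) / len(sims)
```

Sampling 3-5 responses at temperature > 0 and flagging questions where consistency drops below a threshold (e.g. 0.5) is a cheap closed-book hallucination signal.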
#### Bias Evaluation
**Types of Bias:**
- Gender bias (stereotypical associations)
- Racial/ethnic bias (discriminatory outputs)
- Cultural bias (Western-centric assumptions)
- Age/disability bias (ableist or ageist language)
**Evaluation Methods:**
1. **Stereotype Tests:**
- BBQ (Bias Benchmark for QA): 58,000 question-answer pairs
- BOLD (Bias in Open-Ended Language Generation)
2. **Counterfactual Evaluation:**
- Generate responses with demographic swaps
- Example: "Dr. Smith (he/she) recommended..." → compare outputs
- Measure consistency across variations
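Counterfactual evaluation is mechanical enough to sketch directly. The helpers below are illustrative; the scoring function plugged in would be one of this document's quality metrics.

```python
def counterfactual_pairs(template: str, slot: str, variants: list) -> list:
    """Generate prompts that differ only in a demographic slot, so the
    downstream responses can be compared for consistency."""
    return [template.replace(slot, v) for v in variants]

def consistency_gap(scores: dict) -> float:
    """Spread between the best- and worst-scored variant; a large gap
    flags potential bias."""
    return max(scores.values()) - min(scores.values())
```

For example, `counterfactual_pairs("Dr. Smith ({P}) recommended...", "{P}", ["he", "she", "they"])` yields three prompts; score each model response, then alert when `consistency_gap` exceeds a tolerance you set empirically.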
#### Toxicity Detection
**Tools:**
- **Perspective API (Google):** Toxicity, threat, insult scores
- **Detoxify (HuggingFace):** Open-source toxicity classifier
- **OpenAI Moderation API:** Hate, harassment, violence detection
For comprehensive safety evaluation patterns, see `references/safety-evaluation.md`.
### Benchmark Testing
Assess model capabilities using standardized benchmarks.
**Standard Benchmarks:**
| Benchmark | Coverage | Format | Difficulty | Use Case |
|-----------|----------|--------|------------|----------|
| **MMLU** | 57 subjects (STEM, humanities) | Multiple choice | High school - professional | General intelligence |
| **HellaSwag** | Sentence completion | Multiple choice | Common sense | Reasoning validation |
| **GPQA** | PhD-level science | Multiple choice | Very high (expert-level) | Frontier model testing |
| **HumanEval** | 164 Python problems | Code generation | Medium | Code capability |
| **MATH** | 12,500 competition problems | Math solving | High school competitions | Math reasoning |
**Domain-Specific Benchmarks:**
- **Medical:** MedQA (USMLE), PubMedQA
- **Legal:** LegalBench
- **Finance:** FinQA, ConvFinQA
**When to Use Benchmarks:**
- Comparing multiple models (GPT-4 vs Claude vs Llama)
- Model selection for specific domains
- Baseline capability assessment
- Academic research and publication
**Quick Start (lm-evaluation-harness):**
```bash
pip install lm-eval
# Evaluate GPT-4 on MMLU
lm_eval --model openai-chat-completions --model_args model=gpt-4 --tasks mmlu --num_fewshot 5
```
For detailed benchmark testing patterns, see `references/benchmarks.md` and `scripts/benchmark_runner.py`.
### Production Evaluation
Monitor and optimize LLM quality in production environments.
#### A/B Testing
Compare two LLM configurations:
- **Variant A:** GPT-4 (expensive, high quality)
- **Variant B:** Claude Sonnet (cheaper, fast)
**Metrics:**
- User satisfaction scores (thumbs up/down)
- Task completion rates
- Response time and latency
- Cost per successful interaction
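Before declaring a winner on task completion rate, check that the difference is statistically significant. A two-proportion z-test (normal approximation, stdlib only) is a reasonable sketch:

```python
from math import erf, sqrt

def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference in task-completion rates
    between variants A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```

With 90/100 successes for A vs 50/100 for B, the p-value is far below 0.01; with identical rates it is 1.0. For small samples, prefer an exact test (e.g. `scipy.stats.fisher_exact`).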
#### Online Evaluation
Real-time quality monitoring:
- **Response Quality:** LLM-as-judge scoring every Nth response
- **User Feedback:** Explicit ratings, thumbs up/down
- **Business Metrics:** Conversion rates, support ticket resolution
- **Cost Tracking:** Tokens used, inference costs
#### Human-in-the-Loop
Sample-based human evaluation:
- **Random Sampling:** Evaluate 10% of responses
- **Confidence-Based:** Evaluate low-confidence outputs
- **Error-Triggered:** Flag suspicious responses for review
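Random sampling for human review works best when it is deterministic, so the same response is always routed the same way across services and re-runs. One common trick is hashing the response ID into a bucket (a sketch, with an assumed string ID):

```python
import hashlib

def should_review(response_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically select ~sample_rate of responses for human
    review by hashing their ID into a uniform [0, 1) bucket."""
    digest = hashlib.sha256(response_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Raising `sample_rate` later expands the reviewed set without reshuffling previously selected IDs below the old threshold.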
For production evaluation patterns and monitoring strategies, see `references/production-evaluation.md`.
## Classification Task Evaluation
For tasks with discrete outputs (sentiment, intent, category).
**Metrics:**
- **Accuracy:** Correct predictions / Total predictions
- **Precision:** True positives / (True positives + False positives)
- **Recall:** True positives / (True positives + False negatives)
- **F1 Score:** Harmonic mean of precision and recall
- **Confusion Matrix:** Detailed breakdown of prediction errors
**Quick Start (Python):**
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "negative"]
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```
For complete classification evaluation examples, see `examples/python/classification_metrics.py`.
## Generation Task Evaluation
For open-ended text generation (summaries, creative writing, responses).
**Automated Metrics (Use with Caution):**
- **BLEU:** N-gram overlap with reference text (0-1 score)
- **ROUGE:** Recall-oriented overlap (ROUGE-1, ROUGE-L)
- **METEOR:** Semantic similarity with stemming
- **BERTScore:** Contextual embedding similarity (0-1 score)
**Limitation:** Automated metrics correlate weakly with human judgment for creative/subjective generation.
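To see what these overlap metrics actually compute, here is ROUGE-1 recall in a few lines of stdlib Python. This is a simplified sketch; real projects should use the `rouge-score` package, which adds stemming and ROUGE-L.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that appear in the candidate,
    clipped to reference counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(cand[w], c) for w, c in ref.items())
    return overlap / sum(ref.values())
```

The weakness is visible immediately: a fluent paraphrase that shares no words with the reference scores 0.0, which is exactly why these metrics correlate poorly with human judgment on creative text.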
**Recommended Approach:**
1. **Automated metrics:** Fast feedback for objective aspects (length, format)
2. **LLM-as-judge:** Nuanced quality assessment (relevance, coherence, helpfulness)
3. **Human evaluation:** Final validation for subjective criteria (preference, creativity)
For detailed generation evaluation patterns, see `references/evaluation-types.md`.
## Quick Reference Tables
### Evaluation Framework Selection
| If Task Is... | Use This Framework | Primary Metric |
|---------------|-------------------|----------------|
| RAG system | RAGAS | Faithfulness > 0.8 |
| Classification | scikit-learn metrics | Accuracy, F1 |
| Generation quality | LLM-as-judge | Quality rubric (1-5) |
| Code generation | HumanEval | Pass@1, Test pass rate |
| Model comparison | Benchmark testing | MMLU, HellaSwag scores |
| Safety validation | Hallucination detection | Faithfulness, Fact-check |
| Production monitoring | Online evaluation | User feedback, Business KPIs |
### Python Library Recommendations
| Library | Use Case | Installation |
|---------|----------|--------------|
| **RAGAS** | RAG evaluation | `pip install ragas` |
| **DeepEval** | General LLM evaluation, pytest integration | `pip install deepeval` |
| **LangSmith** | Production monitoring, A/B testing | `pip install langsmith` |
| **lm-eval** | Benchmark testing (MMLU, HumanEval) | `pip install lm-eval` |
| **scikit-learn** | Classification metrics | `pip install scikit-learn` |
### Safety Evaluation Priority Matrix
| Application | Hallucination Risk | Bias Risk | Toxicity Risk | Evaluation Priority |
|-------------|-------------------|-----------|---------------|---------------------|
| Customer Support | High | Medium | High | 1. Faithfulness, 2. Toxicity, 3. Bias |
| Medical Diagnosis | Critical | High | Low | 1. Factual Accuracy, 2. Hallucination, 3. Bias |
| Creative Writing | Low | Medium | Medium | 1. Quality/Fluency, 2. Content Policy |
| Code Generation | Medium | Low | Low | 1. Functional Correctness, 2. Security |
| Content Moderation | Low | Critical | Critical | 1. Bias, 2. False Positives/Negatives |
## Detailed References
For comprehensive documentation on specific topics:
- **Evaluation types (classification, generation, QA, code):** `references/evaluation-types.md`
- **RAG evaluation deep dive (RAGAS framework):** `references/rag-evaluation.md`
- **Safety evaluation (hallucination, bias, toxicity):** `references/safety-evaluation.md`
- **Benchmark testing (MMLU, HumanEval, domain benchmarks):** `references/benchmarks.md`
- **LLM-as-judge best practices and prompts:** `references/llm-as-judge.md`
- **Production evaluation (A/B testing, monitoring):** `references/production-evaluation.md`
- **All metrics definitions and formulas:** `references/metrics-reference.md`
## Working Examples
**Python Examples:**
- `examples/python/unit_evaluation.py` - Basic prompt testing with pytest
- `examples/python/ragas_example.py` - RAGAS RAG evaluation
- `examples/python/deepeval_example.py` - DeepEval framework usage
- `examples/python/llm_as_judge.py` - GPT-4 as evaluator
- `examples/python/classification_metrics.py` - Accuracy, precision, recall
- `examples/python/benchmark_testing.py` - HumanEval example
**TypeScript Examples:**
- `examples/typescript/unit-evaluation.ts` - Vitest + OpenAI
- `examples/typescript/llm-as-judge.ts` - GPT-4 evaluation
- `examples/typescript/langsmith-integration.ts` - Production monitoring
## Executable Scripts
Run evaluations directly from the command line, without loading the script code into the model's context window:
- `scripts/run_ragas_eval.py` - Run RAGAS evaluation on dataset
- `scripts/compare_models.py` - A/B test two models
- `scripts/benchmark_runner.py` - Run MMLU/HumanEval benchmarks
- `scripts/hallucination_checker.py` - Detect hallucinations in outputs
**Example usage:**
```bash
# Run RAGAS evaluation on custom dataset
python scripts/run_ragas_eval.py --dataset data/qa_dataset.json --output results.json
# Compare GPT-4 vs Claude on benchmark
python scripts/compare_models.py --model-a gpt-4 --model-b claude-3-opus --tasks mmlu,humaneval
```
## Integration with Other Skills
**Related Skills:**
- **`building-ai-chat`:** Evaluate AI chat applications (this skill tests what that skill builds)
- **`prompt-engineering`:** Test prompt quality and effectiveness
- **`testing-strategies`:** Apply testing pyramid to LLM evaluation (unit → integration → E2E)
- **`observability`:** Production monitoring and alerting for LLM quality
- **`building-ci-pipelines`:** Integrate LLM evaluation into CI/CD
**Workflow Integration:**
1. Write prompt (use `prompt-engineering` skill)
2. Unit test prompt (use `evaluating-llms` skill)
3. Build AI feature (use `building-ai-chat` skill)
4. Integration test RAG pipeline (use `evaluating-llms` skill)
5. Deploy to production (use `deploying-applications` skill)
6. Monitor quality (use `evaluating-llms` + `observability` skills)
## Common Pitfalls
**1. Over-reliance on Automated Metrics for Generation**
- BLEU/ROUGE correlate weakly with human judgment for creative text
- Solution: Layer LLM-as-judge or human evaluation
**2. Ignoring Faithfulness in RAG Systems**
- Hallucinations are the #1 RAG failure mode
- Solution: Prioritize faithfulness metric (target > 0.8)
**3. No Production Monitoring**
- Models can degrade over time, prompts can break with updates
- Solution: Set up continuous evaluation (LangSmith, custom monitoring)
**4. Biased LLM-as-Judge Evaluation**
- Evaluator LLMs have biases (position bias, verbosity bias)
- Solution: Average multiple evaluations, use diverse evaluation prompts
**5. Insufficient Benchmark Coverage**
- Single benchmark doesn't capture full model capability
- Solution: Use 3-5 benchmarks across different domains
**6. Missing Safety Evaluation**
- Production LLMs can generate harmful content
- Solution: Add toxicity, bias, and hallucination checks to evaluation pipeline