python-resilience
skillPython resilience patterns including automatic retries, exponential backoff, timeouts, and fault-tolerant decorators. Use when adding retry logic, implementing timeouts, building fault-tolerant services, or handling transient failures.
apm::install
apm install @wshobson/python-resilienceapm::skill.md
---
name: python-resilience
description: Python resilience patterns including automatic retries, exponential backoff, timeouts, and fault-tolerant decorators. Use when adding retry logic, implementing timeouts, building fault-tolerant services, or handling transient failures.
---
# Python Resilience Patterns
Build fault-tolerant Python applications that gracefully handle transient failures, network issues, and service outages. Resilience patterns keep systems running when dependencies are unreliable.
## When to Use This Skill
- Adding retry logic to external service calls
- Implementing timeouts for network operations
- Building fault-tolerant microservices
- Handling rate limiting and backpressure
- Creating infrastructure decorators
- Designing circuit breakers
## Core Concepts
### 1. Transient vs Permanent Failures
Retry transient errors (network timeouts, temporary service issues). Don't retry permanent errors (invalid credentials, bad requests).
### 2. Exponential Backoff
Increase wait time between retries to avoid overwhelming recovering services.
### 3. Jitter
Add randomness to backoff to prevent thundering herd when many clients retry simultaneously.
### 4. Bounded Retries
Cap both attempt count and total duration to prevent infinite retry loops.
## Quick Start
```python
from tenacity import retry, stop_after_attempt, wait_exponential_jitter
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential_jitter(initial=1, max=10),
)
def call_external_service(request: dict) -> dict:
return httpx.post("https://api.example.com", json=request).json()
```
## Fundamental Patterns
### Pattern 1: Basic Retry with Tenacity
Use the `tenacity` library for production-grade retry logic. For simpler cases, consider built-in retry functionality or a lightweight custom implementation.
```python
from tenacity import (
retry,
stop_after_attempt,
stop_after_delay,
wait_exponential_jitter,
retry_if_exception_type,
)
TRANSIENT_ERRORS = (ConnectionError, TimeoutError, OSError)
@retry(
retry=retry_if_exception_type(TRANSIENT_ERRORS),
stop=stop_after_attempt(5) | stop_after_delay(60),
wait=wait_exponential_jitter(initial=1, max=30),
)
def fetch_data(url: str) -> dict:
"""Fetch data with automatic retry on transient failures."""
response = httpx.get(url, timeout=30)
response.raise_for_status()
return response.json()
```
### Pattern 2: Retry Only Appropriate Errors
Whitelist specific transient exceptions. Never retry:
- `ValueError`, `TypeError` - These are bugs, not transient issues
- `AuthenticationError` - Invalid credentials won't become valid
- HTTP 4xx errors (except 429) - Client errors are permanent
```python
from tenacity import retry, retry_if_exception_type
import httpx
# Define what's retryable
RETRYABLE_EXCEPTIONS = (
ConnectionError,
TimeoutError,
httpx.ConnectTimeout,
httpx.ReadTimeout,
)
@retry(
retry=retry_if_exception_type(RETRYABLE_EXCEPTIONS),
stop=stop_after_attempt(3),
wait=wait_exponential_jitter(initial=1, max=10),
)
def resilient_api_call(endpoint: str) -> dict:
"""Make API call with retry on network issues."""
return httpx.get(endpoint, timeout=10).json()
```
### Pattern 3: HTTP Status Code Retries
Retry specific HTTP status codes that indicate transient issues.
```python
from tenacity import retry, retry_if_result, stop_after_attempt
import httpx
RETRY_STATUS_CODES = {429, 502, 503, 504}
def should_retry_response(response: httpx.Response) -> bool:
"""Check if response indicates a retryable error."""
return response.status_code in RETRY_STATUS_CODES
@retry(
retry=retry_if_result(should_retry_response),
stop=stop_after_attempt(3),
wait=wait_exponential_jitter(initial=1, max=10),
)
def http_request(method: str, url: str, **kwargs) -> httpx.Response:
"""Make HTTP request with retry on transient status codes."""
return httpx.request(method, url, timeout=30, **kwargs)
```
### Pattern 4: Combined Exception and Status Retry
Handle both network exceptions and HTTP status codes.
```python
from tenacity import (
retry,
retry_if_exception_type,
retry_if_result,
stop_after_attempt,
wait_exponential_jitter,
before_sleep_log,
)
import logging
import httpx
logger = logging.getLogger(__name__)
TRANSIENT_EXCEPTIONS = (
ConnectionError,
TimeoutError,
httpx.ConnectError,
httpx.ReadTimeout,
)
RETRY_STATUS_CODES = {429, 500, 502, 503, 504}
def is_retryable_response(response: httpx.Response) -> bool:
return response.status_code in RETRY_STATUS_CODES
@retry(
retry=(
retry_if_exception_type(TRANSIENT_EXCEPTIONS) |
retry_if_result(is_retryable_response)
),
stop=stop_after_attempt(5),
wait=wait_exponential_jitter(initial=1, max=30),
before_sleep=before_sleep_log(logger, logging.WARNING),
)
def robust_http_call(
method: str,
url: str,
**kwargs,
) -> httpx.Response:
"""HTTP call with comprehensive retry handling."""
return httpx.request(method, url, timeout=30, **kwargs)
```
## Advanced Patterns
### Pattern 5: Logging Retry Attempts
Track retry behavior for debugging and alerting.
```python
from tenacity import retry, stop_after_attempt, wait_exponential
import structlog
logger = structlog.get_logger()
def log_retry_attempt(retry_state):
"""Log detailed retry information."""
exception = retry_state.outcome.exception()
logger.warning(
"Retrying operation",
attempt=retry_state.attempt_number,
exception_type=type(exception).__name__,
exception_message=str(exception),
next_wait_seconds=retry_state.next_action.sleep if retry_state.next_action else None,
)
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, max=10),
before_sleep=log_retry_attempt,
)
def call_with_logging(request: dict) -> dict:
"""External call with retry logging."""
...
```
### Pattern 6: Timeout Decorator
Create reusable timeout decorators for consistent timeout handling.
```python
import asyncio
from functools import wraps
from typing import TypeVar, Callable
T = TypeVar("T")
def with_timeout(seconds: float):
"""Decorator to add timeout to async functions."""
def decorator(func: Callable[..., T]) -> Callable[..., T]:
@wraps(func)
async def wrapper(*args, **kwargs) -> T:
return await asyncio.wait_for(
func(*args, **kwargs),
timeout=seconds,
)
return wrapper
return decorator
@with_timeout(30)
async def fetch_with_timeout(url: str) -> dict:
"""Fetch URL with 30 second timeout."""
async with httpx.AsyncClient() as client:
response = await client.get(url)
return response.json()
```
### Pattern 7: Cross-Cutting Concerns via Decorators
Stack decorators to separate infrastructure from business logic.
```python
from functools import wraps
from typing import TypeVar, Callable
import structlog
logger = structlog.get_logger()
T = TypeVar("T")
def traced(name: str | None = None):
"""Add tracing to function calls."""
def decorator(func: Callable[..., T]) -> Callable[..., T]:
span_name = name or func.__name__
@wraps(func)
async def wrapper(*args, **kwargs) -> T:
logger.info("Operation started", operation=span_name)
try:
result = await func(*args, **kwargs)
logger.info("Operation completed", operation=span_name)
return result
except Exception as e:
logger.error("Operation failed", operation=span_name, error=str(e))
raise
return wrapper
return decorator
# Stack multiple concerns
@traced("fetch_user_data")
@with_timeout(30)
@retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter())
async def fetch_user_data(user_id: str) -> dict:
"""Fetch user with tracing, timeout, and retry."""
...
```
### Pattern 8: Dependency Injection for Testability
Pass infrastructure components through constructors for easy testing.
```python
from dataclasses import dataclass
from typing import Protocol
class Logger(Protocol):
def info(self, msg: str, **kwargs) -> None: ...
def error(self, msg: str, **kwargs) -> None: ...
class MetricsClient(Protocol):
def increment(self, metric: str, tags: dict | None = None) -> None: ...
def timing(self, metric: str, value: float) -> None: ...
@dataclass
class UserService:
"""Service with injected infrastructure."""
repository: UserRepository
logger: Logger
metrics: MetricsClient
async def get_user(self, user_id: str) -> User:
self.logger.info("Fetching user", user_id=user_id)
start = time.perf_counter()
try:
user = await self.repository.get(user_id)
self.metrics.increment("user.fetch.success")
return user
except Exception as e:
self.metrics.increment("user.fetch.error")
self.logger.error("Failed to fetch user", user_id=user_id, error=str(e))
raise
finally:
elapsed = time.perf_counter() - start
self.metrics.timing("user.fetch.duration", elapsed)
# Easy to test with fakes
service = UserService(
repository=FakeRepository(),
logger=FakeLogger(),
metrics=FakeMetrics(),
)
```
### Pattern 9: Fail-Safe Defaults
Degrade gracefully when non-critical operations fail.
```python
from typing import TypeVar
from collections.abc import Callable
T = TypeVar("T")
def fail_safe(default: T, log_failure: bool = True):
"""Return default value on failure instead of raising."""
def decorator(func: Callable[..., T]) -> Callable[..., T]:
@wraps(func)
async def wrapper(*args, **kwargs) -> T:
try:
return await func(*args, **kwargs)
except Exception as e:
if log_failure:
logger.warning(
"Operation failed, using default",
function=func.__name__,
error=str(e),
)
return default
return wrapper
return decorator
@fail_safe(default=[])
async def get_recommendations(user_id: str) -> list[str]:
"""Get recommendations, return empty list on failure."""
...
```
## Best Practices Summary
1. **Retry only transient errors** - Don't retry bugs or authentication failures
2. **Use exponential backoff** - Give services time to recover
3. **Add jitter** - Prevent thundering herd from synchronized retries
4. **Cap total duration** - `stop_after_attempt(5) | stop_after_delay(60)`
5. **Log every retry** - Silent retries hide systemic problems
6. **Use decorators** - Keep retry logic separate from business logic
7. **Inject dependencies** - Make infrastructure testable
8. **Set timeouts everywhere** - Every network call needs a timeout
9. **Fail gracefully** - Return cached/default values for non-critical paths
10. **Monitor retry rates** - High retry rates indicate underlying issues