platform-health
skillCheck comprehensive platform health including ArgoCD apps, pods, services, certificates, and resources across the Kagenti platform
apm::install
apm install @redhat-et/platform-healthapm::skill.md
---
name: platform-health
description: Check comprehensive platform health including ArgoCD apps, pods, services, certificates, and resources across the Kagenti platform
---
# Platform Health Check Skill
This skill helps you perform comprehensive platform health checks and identify issues quickly.
## When to Use
- After deployments or cluster restarts
- Before making changes (baseline health)
- During incident investigation
- Regular health monitoring
- After running tests
- User requests "check platform" or "is everything working"
## What This Skill Does
1. **Quick Health Overview**: One-command platform status
2. **ArgoCD Apps**: Health and sync status of all applications
3. **Pod Health**: Check pods across all namespaces
4. **Service Accessibility**: Test Gateway routes and certificates
5. **Resource Usage**: CPU/memory consumption
6. **Component-Specific Checks**: Detailed validation per component
## Quick Health Check
### Comprehensive Platform Status
```bash
# Single command for full platform health (includes pytest tests)
./scripts/platform-status.sh
# What it checks:
# ✓ ArgoCD applications (health & sync status)
# ✓ Platform pods (all namespaces)
# ✓ Gateway & certificates
# ✓ Istio mTLS configuration
# ✓ Service accessibility (via Gateway)
# ✓ OAuth authentication
# ✓ Integration tests (pytest)
```
**Expected Output**:
```
=== ArgoCD Applications Status ===
✓ gateway-api: Healthy, Synced
✓ cert-manager: Healthy, Synced
✓ istio-base: Healthy, Synced
...
=== Platform Pods ===
observability grafana-xxx 2/2 Running
observability prometheus-xxx 2/2 Running
...
=== Gateway & Certificates ===
✓ external-gateway: Programmed
✓ grafana-cert: Ready
...
=== Integration Tests ===
PASSED tests/validation/test_app_state.py::test_critical_apps
...
```
### Quick Status Commands
```bash
# ArgoCD apps summary
argocd app list --port-forward --port-forward-namespace argocd --grpc-web
# All pods summary
kubectl get pods -A
# Failing pods only
kubectl get pods -A | grep -vE "Running|Completed"
# Service endpoints
kubectl get svc -A
# Gateway status
kubectl get gateway -A
# Certificate status
kubectl get certificate -A
```
## Detailed Health Checks
### 1. ArgoCD Application Health
```bash
# List all apps with health status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
-o json | jq -r '.[] | "\(.metadata.name): \(.status.health.status), \(.status.sync.status)"'
# Check for unhealthy apps
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
| grep -E "Degraded|OutOfSync|Unknown|Missing"
# Get details for specific app
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web
# Check app sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web
```
**Expected States**:
- **Health**: `Healthy` (✓), `Progressing` (⚠️), `Degraded` (❌), `Missing` (❌)
- **Sync**: `Synced` (✓), `OutOfSync` (⚠️)
**Critical Apps** (must be Healthy):
- gateway-api
- cert-manager
- istio-base, istiod
- tekton-pipelines
- keycloak
- kagenti-operator, kagenti-platform-operator
- kagenti-platform
- kagenti-ui
**Optional Apps** (can be Progressing):
- observability (large images, slow startup)
- kiali
- ollama
### 2. Pod Health by Namespace
```bash
# All pods with status
kubectl get pods -A -o wide
# Pods sorted by restarts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20
# Pods with issues
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Pod resource usage
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
# Specific namespace health
kubectl get pods -n observability
kubectl get pods -n keycloak
kubectl get pods -n kagenti-system
```
**Check for these statuses**:
- ❌ **CrashLoopBackOff**: Application crashes on startup
- ❌ **ImagePullBackOff**: Image not available
- ❌ **Error**: Container exited with error
- ⚠️ **Pending**: Waiting for resources or scheduling
- ⚠️ **Init**: Init containers still running
- ✓ **Running**: Pod healthy
- ✓ **Completed**: Job finished successfully
### 3. Service Accessibility
```bash
# Test all platform services via Gateway
for service in grafana prometheus tempo phoenix kiali keycloak kagenti; do
echo "=== Testing https://$service.localtest.me:9443/ ==="
curl -k -I -m 5 "https://$service.localtest.me:9443/" 2>&1 | head -3
echo
done
# Check Gateway status
kubectl get gateway -A
kubectl describe gateway external-gateway -n default
# Check HTTPRoutes
kubectl get httproute -A
kubectl describe httproute <route-name> -n <namespace>
# Check service endpoints (should have IP addresses)
kubectl get endpoints -A | grep -v "<none>"
```
**Expected Results**:
- Grafana: HTTP/2 302 (redirect to /login)
- Prometheus: HTTP/2 302 (OAuth redirect)
- Keycloak: HTTP/2 200
- Kagenti UI: HTTP/2 200
### 4. Certificate Health
```bash
# All certificates status
kubectl get certificate -A
# Check certificate details
kubectl describe certificate <cert-name> -n <namespace>
# Check cert-manager logs for issues
kubectl logs -n cert-manager deployment/cert-manager --tail=50
# Verify certificate expiration
kubectl get certificate -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): expires \(.status.notAfter)"'
```
**Expected State**: All certificates show `Ready=True`
### 5. Istio Service Mesh Health
```bash
# Check Istio components
kubectl get pods -n istio-system
# Verify sidecar injection (should show 2/2 containers)
kubectl get pods -A -o wide | grep "2/2"
# Check mTLS policies
kubectl get peerauthentication -A
kubectl get destinationrule -A
# Istio proxy status
istioctl proxy-status
# Check specific pod mesh config
istioctl x describe pod <pod-name> -n <namespace>
```
### 6. Resource Usage
```bash
# Node resources
kubectl top nodes
# Cluster-wide pod resources
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20
# Namespace resource usage
kubectl top pods -n observability
kubectl top pods -n keycloak
kubectl top pods -n kagenti-system
# Check for resource pressure
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="MemoryPressure" or .type=="DiskPressure") | .type)=\(.status)"'
```
### 7. Storage Health
```bash
# PersistentVolumes
kubectl get pv
# PersistentVolumeClaims
kubectl get pvc -A
# Check PVC usage via metrics
kubectl exec -n observability deployment/grafana -- \
curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
--data-urlencode 'query=(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100' \
| python3 -m json.tool
```
## Component-Specific Health Checks
### Observability Stack
```bash
# Prometheus
kubectl get pods -n observability -l app=prometheus
kubectl exec -n observability deployment/grafana -- \
curl -s http://prometheus.observability.svc:9090/-/ready
# Grafana
kubectl get pods -n observability -l app=grafana
curl -k -I https://grafana.localtest.me:9443/api/health
# Loki
kubectl get pods -n observability -l app=loki
kubectl exec -n observability deployment/grafana -- \
curl -s http://loki.observability.svc:3100/ready
# Tempo
kubectl get pods -n observability -l app=tempo
kubectl exec -n observability deployment/grafana -- \
curl -s http://tempo-query-frontend.observability.svc:3100/ready
# Phoenix
kubectl get pods -n observability -l app=phoenix
curl -k -I https://phoenix.localtest.me:9443/
# AlertManager
kubectl get pods -n observability -l app=alertmanager
kubectl exec -n observability deployment/alertmanager -c alertmanager -- \
wget -qO- http://localhost:9093/-/ready
```
### Authentication & Authorization
```bash
# Keycloak
kubectl get pods -n keycloak -l app=keycloak
kubectl exec -n keycloak statefulset/keycloak -- \
curl -s http://localhost:8080/health/ready | python3 -m json.tool
# OAuth2-Proxy instances
kubectl get pods -n oauth2-proxy
kubectl get deployment -n oauth2-proxy
# Test Keycloak SSO
curl -k "https://keycloak.localtest.me:9443/realms/master/.well-known/openid-configuration"
```
### Platform Components
```bash
# Kagenti Operator
kubectl get pods -n kagenti-operator
kubectl logs -n kagenti-operator deployment/kagenti-operator --tail=20
# Kagenti Platform Operator
kubectl get pods -n kagenti-platform-operator
kubectl logs -n kagenti-platform-operator deployment/kagenti-platform-operator --tail=20
# Kagenti UI
kubectl get pods -n kagenti-platform -l app=kagenti-ui
curl -k -I https://kagenti.localtest.me:9443/
# Tekton Pipelines
kubectl get pods -n tekton-pipelines
kubectl get pipelineruns -A
```
## Health Check Checklists
### Post-Deployment Health Check
- [ ] All ArgoCD apps Healthy and Synced
- [ ] No pods in CrashLoopBackOff/ImagePullBackOff
- [ ] All services have endpoints
- [ ] All certificates Ready
- [ ] All Gateway routes Programmed
- [ ] Services accessible via browser
- [ ] Integration tests passing
- [ ] No firing critical alerts
### Pre-Change Health Check
- [ ] Capture platform snapshot: `./scripts/capture-platform-snapshot.sh before-change`
- [ ] All critical apps Healthy
- [ ] No existing incidents in TODO_INCIDENTS.md
- [ ] Resource usage within limits
- [ ] Recent Git commits validated
### Incident Investigation Health Check
- [ ] Identify degraded components
- [ ] Check recent events
- [ ] Collect logs from affected pods
- [ ] Query metrics for anomalies
- [ ] Check for correlated failures
- [ ] Review recent changes (Git history)
## Common Health Issues
### Issue: Pods stuck in Pending
```bash
# Check pod description for reason
kubectl describe pod <pod-name> -n <namespace>
# Common causes:
# - Insufficient CPU/memory
# - No nodes matching nodeSelector
# - Unbound PersistentVolumeClaim
```
### Issue: Pods CrashLoopBackOff
```bash
# Check previous logs
kubectl logs <pod-name> -n <namespace> --previous
# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
# Common causes:
# - Application error on startup
# - Missing configuration
# - Dependency not available
```
### Issue: Service not accessible
```bash
# Check pod status
kubectl get pods -n <namespace> -l app=<service>
# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>
# Check HTTPRoute
kubectl get httproute -n <namespace>
# Test from inside cluster
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
-- curl http://<service-name>.<namespace>.svc:PORT
```
### Issue: Certificate not Ready
```bash
# Check certificate status
kubectl describe certificate <cert-name> -n <namespace>
# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
# Common causes:
# - DNS validation failing
# - Rate limit reached
# - Invalid configuration
```
### Issue: High resource usage
```bash
# Find top consumers
kubectl top pods -A --sort-by=memory | head -10
kubectl top pods -A --sort-by=cpu | head -10
# Check for memory leaks
kubectl logs <pod-name> -n <namespace> | grep -i "out of memory"
# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Limits:"
```
## Automation & Monitoring
### Continuous Health Monitoring
```bash
# Watch pod status
watch -n 5 'kubectl get pods -A | grep -vE "Running|Completed"'
# Watch ArgoCD apps
watch -n 10 'argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -vE "Healthy.*Synced"'
# Monitor specific namespace
watch -n 5 'kubectl get pods -n observability'
```
### Scheduled Health Checks
```bash
# Cron job for periodic health checks (local dev)
# Add to crontab: crontab -e
*/15 * * * * /path/to/kagenti-demo-deployment/scripts/platform-status.sh > /tmp/health-$(date +\%Y\%m\%d-\%H\%M).log 2>&1
# Compare snapshots over time
./scripts/capture-platform-snapshot.sh hourly-check
```
## Related Documentation
- [CLAUDE.md Platform Status](../../../CLAUDE.md#monitoring--access) - Monitoring commands
- [scripts/platform-status.sh](../../../scripts/platform-status.sh) - Automated health check
- [TODO_INCIDENTS.md](../../../TODO_INCIDENTS.md) - Active incidents
- [docs/INTEGRATION_TESTS.md](../../../docs/INTEGRATION_TESTS.md) - Test strategy
## Integration with Other Skills
**After health check, if issues found**:
- Use **investigate-incident** skill for RCA
- Use **check-logs** skill to examine error logs
- Use **check-metrics** skill for performance analysis
- Use **check-alerts** skill to see if alerts fired
## Pro Tips
1. **Always baseline first**: Run health check BEFORE making changes
2. **Use platform-status.sh**: Single command for comprehensive check
3. **Capture snapshots**: Use `capture-platform-snapshot.sh` for historical comparison
4. **Check critical apps first**: Focus on gateway-api, istio, keycloak, operators
5. **Look for patterns**: Multiple pods failing often indicates cluster-wide issue
6. **Check Git history**: Recent commits may explain new issues
7. **Verify after fixes**: Always re-run health check after remediation
🤖 Generated with [Claude Code](https://claude.com/claude-code)