Incident Response Guide
This guide outlines how to handle production incidents effectively.
Incident Severity Levels
Severity 1 (Critical)
Impact: Complete service outage or data loss
Examples:
- Application completely down
- Database unreachable
- Data corruption
- Security breach
Response Time: Immediate
Who to Contact: Everyone (Slack @channel in #alerts)
Communication: Every 15 minutes
Severity 2 (High)
Impact: Major feature broken, affecting many users
Examples:
- Login not working
- Payment processing failed
- Critical API endpoint down
- Performance degradation >50%
Response Time: Within 15 minutes
Who to Contact: Tech lead, on-call developer (Slack in #alerts)
Communication: Every 30 minutes
Severity 3 (Medium)
Impact: Minor feature broken, affecting some users
Examples:
- Single endpoint failing
- UI bug affecting workflow
- Email delivery delayed
- Background job failures
Response Time: Within 1 hour
Who to Contact: Developer responsible for the feature (Slack in #dev-team)
Communication: Hourly updates
Severity 4 (Low)
Impact: Cosmetic issue or minor inconvenience
Examples:
- Typo in UI
- Alignment issue
- Non-critical log errors
- Minor performance issue
Response Time: Next business day
Who to Contact: Create a GitHub issue, discuss in #dev-team
Communication: When fixed
Incident Response Process
1. Detect (0-5 minutes)
How incidents are detected:
- Monitoring alerts (CloudWatch, Datadog)
- User reports (support tickets, Slack)
- Error tracking (Sentry)
- Health check failures
Initial Actions:
- Acknowledge the alert
- Assess severity
- Post in Slack #alerts:
🚨 INCIDENT: [Brief description]
Severity: [1-4]
Investigating...
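If you want to script this first post so the format stays consistent, a minimal sketch using a Slack incoming webhook is below. The `SLACK_ALERTS_WEBHOOK_URL` variable and the webhook itself are assumptions, not something this guide's tooling already provides.

```python
# post_incident_alert.py - minimal sketch, assumes an incoming webhook for #alerts
# is configured in Slack and exposed via the SLACK_ALERTS_WEBHOOK_URL env var.
import os
import sys

import requests


def post_incident_alert(description: str, severity: int) -> None:
    """Send the initial incident message to the #alerts webhook."""
    webhook_url = os.environ["SLACK_ALERTS_WEBHOOK_URL"]
    message = (
        f"🚨 INCIDENT: {description}\n"
        f"Severity: {severity}\n"
        "Investigating..."
    )
    response = requests.post(webhook_url, json={"text": message}, timeout=5)
    response.raise_for_status()


if __name__ == "__main__":
    # Usage: python post_incident_alert.py "Login endpoint returning 500" 2
    post_incident_alert(sys.argv[1], int(sys.argv[2]))
```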
2. Assess (5-15 minutes)
Gather information:
# Check application health
curl https://api.yourapp.com/health
# Check recent deployments
# Railway dashboard → Deployments
# Check error logs
railway logs --environment production | grep ERROR
# Check metrics
# CloudWatch → Dashboards → API Metrics
# Check database
# RDS → Monitoring → CPU, Connections
Questions to answer:
- What is broken?
- How many users affected?
- When did it start?
- Any recent changes (deployments, config)?
- Is data at risk?
Update Slack:
Update: Identified issue with [component]
Started: ~10 minutes ago
Affected: [X users / all users / specific feature]
Investigating: [what you're checking]
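To help answer "When did it start?", it can be useful to bucket errors per minute from exported logs. The sketch below assumes log lines begin with an ISO-style timestamp and contain the word ERROR; adjust the pattern to the real log format.

```python
# error_timeline.py - rough sketch for counting ERROR lines per minute.
# Assumes lines start with an ISO-like timestamp, e.g. "2025-01-15T14:30:12".
import re
import sys
from collections import Counter

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2})")  # truncated to the minute


def error_counts_per_minute(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if "ERROR" not in line:
                continue
            match = TIMESTAMP.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage: railway logs --environment production > prod.log
    #        python error_timeline.py prod.log
    for minute, count in sorted(error_counts_per_minute(sys.argv[1]).items()):
        print(f"{minute}  {count}")
```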
3. Contain (15-30 minutes)
Immediate mitigation:
# If caused by recent deployment → ROLLBACK
railway rollback
# If database issue → Increase resources temporarily
# AWS RDS → Modify → Instance class
# If rate limiting → Increase limits temporarily
# If memory leak → Restart containers
railway restart
# If external API down → Enable fallback/cache
Stop the bleeding before fixing the root cause.
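For the "external API down" case, one containment pattern is to serve the last known good response while the dependency is failing. A minimal in-process sketch (the timeout and TTL are placeholders; in practice a shared cache such as Redis would back this so all instances see the same data):

```python
# fallback_cache.py - sketch of a "last known good" fallback for an external API.
# In production the cache would live in Redis so all instances share it.
import time

import requests

_CACHE: dict[str, tuple[float, dict]] = {}  # url -> (fetched_at, payload)
CACHE_TTL_SECONDS = 300  # how long a stale response is still acceptable


def fetch_with_fallback(url: str) -> dict:
    """Return fresh data if the API responds, otherwise the cached copy."""
    try:
        response = requests.get(url, timeout=3)
        response.raise_for_status()
        payload = response.json()
        _CACHE[url] = (time.time(), payload)
        return payload
    except requests.RequestException:
        cached = _CACHE.get(url)
        if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]  # stale but usable during the incident
        raise  # nothing cached -> surface the failure
```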
4. Investigate (30 minutes - 2 hours)
Root cause analysis:
# Check logs around incident start time
railway logs --since 30m | grep ERROR
# Check database for long-running or blocking queries
SELECT * FROM pg_stat_activity WHERE state = 'active';
# Check memory/CPU usage trends
# CloudWatch → Metrics
# Check recent code changes
git log --since="2 hours ago" --oneline
# Test hypothesis locally
# Reproduce the issue if possible
Document findings in Slack thread.
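If the database looks involved, the `pg_stat_activity` check above can be narrowed to sessions that are actually blocked. A rough sketch using `pg_blocking_pids()` (PostgreSQL 9.6+); the `DATABASE_URL` environment variable is a placeholder for a read-only connection string.

```python
# blocked_queries.py - sketch: list sessions that are blocked by other sessions.
# Requires psycopg2 and a read-only DATABASE_URL (placeholder).
import os

import psycopg2

BLOCKED_SQL = """
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       now() - query_start AS waiting_for,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"""

if __name__ == "__main__":
    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(BLOCKED_SQL)
            for row in cur.fetchall():
                print(row)
```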
5. Fix (varies)
Permanent fix options:
Option 1: Hotfix (if quick fix available)
# Create hotfix branch
git checkout main
git pull
git checkout -b hotfix/fix-description
# Make fix
# ... edit code ...
# Test locally
pytest
npm test
# Commit and push
git add .
git commit -m "fix: [description]"
git push origin hotfix/fix-description
# Create PR with "HOTFIX" label
# Get 1 approval (instead of the usual 2)
# Merge and deploy
Option 2: Workaround (if fix needs more time)
# Implement temporary workaround
# Document in code that it's temporary
# Create issue for proper fix
Option 3: Feature Toggle (disable broken feature)
# Add feature flag
if settings.ENABLE_NEW_FEATURE:
    return new_feature()
else:
    return old_feature()
# Set in environment
ENABLE_NEW_FEATURE=false
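How the flag reaches `settings` depends on the framework; a minimal Django-style sketch is below. Defaulting to enabled when the variable is unset is an assumption, not an established convention here.

```python
# settings.py (sketch) - read the toggle from the environment so it can be
# flipped in Railway without a deploy. Defaults to enabled if unset (assumption).
import os

ENABLE_NEW_FEATURE = os.environ.get("ENABLE_NEW_FEATURE", "true").strip().lower() in ("1", "true", "yes")
```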
6. Verify (30 minutes)
Confirm fix:
# Check health endpoint
curl https://api.yourapp.com/health
# Check error rate
# Should return to normal
# Check metrics
# Response times, error rates
# Test affected functionality
# Manual testing or automated tests
# Monitor for 30 minutes
# Ensure no regressions
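The manual checks above can be backed by a small smoke test that polls the health endpoint and fails loudly if errors return. A sketch, with the URL, interval, and duration as placeholders to tune per incident:

```python
# verify_fix.py - sketch: poll the health endpoint for a while after deploying a fix.
# HEALTH_URL, CHECK_INTERVAL_SECONDS, and CHECKS are placeholders.
import sys
import time

import requests

HEALTH_URL = "https://api.yourapp.com/health"
CHECK_INTERVAL_SECONDS = 30
CHECKS = 10  # ~5 minutes; extend to cover the full 30-minute monitoring window


def main() -> int:
    failures = 0
    for _ in range(CHECKS):
        try:
            response = requests.get(HEALTH_URL, timeout=5)
            ok = response.status_code == 200
        except requests.RequestException:
            ok = False
        if ok:
            print(f"health check OK ({response.elapsed.total_seconds():.2f}s)")
        else:
            failures += 1
            print("health check FAILED")
        time.sleep(CHECK_INTERVAL_SECONDS)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```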
Update Slack:
✅ RESOLVED: [Brief description]
Fix deployed: [deployment link]
Monitoring: Will monitor for 30 minutes
Root cause: [brief explanation]
7. Communicate (throughout)
Internal Communication (#alerts Slack channel):
Initial:
🚨 INCIDENT: Login endpoint returning 500
Severity: 2
Time detected: 14:30 UTC
Status: Investigating
Update 1 (15 min):
Identified: Database connection pool exhausted
Action: Increasing pool size
ETA: 10 minutes
Update 2 (30 min):
Deployed: Pool size increased
Status: Monitoring
Error rate: Decreasing
Resolution (45 min):
✅ RESOLVED: Login endpoint healthy
Duration: 45 minutes
Root cause: Traffic spike + insufficient pool size
Next steps: Post-mortem scheduled
External Communication (if user-facing):
To customers/stakeholders:
Subject: Service Disruption - Resolved
Hi [stakeholders],
We experienced a brief service disruption today from 14:30-15:15 UTC affecting login functionality.
Issue: Some users unable to log in
Impact: ~10% of login attempts failed
Resolution: Database configuration updated
Status: Fully resolved, monitoring ongoing
We apologise for any inconvenience and have implemented measures to prevent recurrence.
If you have questions, please contact support@yourapp.com.
Who to Contact
Severity 1 (Critical)
- Slack: @channel in #alerts
- Contact: Everyone
- Escalation: CEO/CTO if >1 hour
Severity 2 (High)
- Slack: @here in #alerts
- Contact: Tech lead, on-call developer
- Escalation: Product manager if >2 hours
Severity 3 (Medium)
- Slack: Post in #dev-team
- Contact: Responsible developer
- Escalation: Tech lead if >4 hours
Severity 4 (Low)
- GitHub: Create issue
- Slack: Mention in #dev-team
- Escalation: None
On-Call Schedule
- Weekdays (9am-6pm): Developer roster (see Slack topic)
- Nights/Weekends: On-call developer (rotation)
- Escalation: Tech lead → CTO
Investigation Checklist
Use this checklist when investigating incidents; a scripted version of some of the basic checks is sketched after the list:
Recent Changes:
- Deployments in last 2 hours?
- Database migrations?
- Configuration changes?
- Dependency updates?
System Health:
- Application responding to health checks?
- Database connections available?
- Redis/cache accessible?
- Disk space sufficient?
- Memory usage normal?
- CPU usage normal?
External Dependencies:
- Third-party APIs accessible?
- AWS services operational?
- DNS resolving correctly?
- SSL certificates valid?
Data Integrity:
- Recent backups available?
- Data corruption detected?
- Transactions completing?
Logs & Metrics:
- Error logs reviewed?
- Access logs show traffic patterns?
- Metrics show anomalies?
- Alert history reviewed?
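Some of the System Health and External Dependencies checks lend themselves to a quick script. A sketch covering the health endpoint, disk space, and SSL certificate expiry; the hostnames and thresholds are placeholders for illustration.

```python
# triage.py - sketch of a few automatable checks from the list above.
# HEALTH_URL, TLS_HOST, and the thresholds are placeholders.
import shutil
import socket
import ssl
import time

import requests

HEALTH_URL = "https://api.yourapp.com/health"
TLS_HOST = "api.yourapp.com"
MIN_FREE_DISK_RATIO = 0.10
MIN_CERT_DAYS_LEFT = 14


def check_health() -> bool:
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def check_disk(path: str = "/") -> bool:
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= MIN_FREE_DISK_RATIO


def check_certificate() -> bool:
    context = ssl.create_default_context()
    with socket.create_connection((TLS_HOST, 443), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=TLS_HOST) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400 >= MIN_CERT_DAYS_LEFT


if __name__ == "__main__":
    for name, check in [("health", check_health), ("disk", check_disk), ("tls cert", check_certificate)]:
        print(f"{name}: {'OK' if check() else 'FAIL'}")
```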
Communication Templates
Initial Alert
🚨 INCIDENT
Description: [Brief description]
Severity: [1-4]
Started: [Time/duration]
Impact: [Who/what is affected]
Status: Investigating
Lead: @[your-name]
Updates: Every [15/30/60] minutes
Status Update
📊 UPDATE - [Time since start]
Status: [Investigating / Fixing / Testing / Resolved]
Findings: [What you've learned]
Actions: [What you're doing]
ETA: [When you expect resolution]
Resolution
✅ RESOLVED - [Total duration]
Issue: [What was wrong]
Fix: [What was done]
Impact: [Summary of impact]
Prevention: [How we'll prevent this]
Post-mortem: [Link / date scheduled]
Post-Mortem Template
After resolving Severity 1-2 incidents, conduct a post-mortem within 48 hours.
# Post-Mortem: [Incident Title]
**Date**: [YYYY-MM-DD]
**Duration**: [Start time] - [End time] ([Duration])
**Severity**: [1-4]
**Impact**: [Summary of user impact]
## Timeline
- 14:30 - Issue started (database pool exhausted)
- 14:35 - Alert triggered, investigation began
- 14:45 - Root cause identified
- 14:50 - Fix deployed (increased pool size)
- 15:00 - Monitoring, errors decreasing
- 15:15 - Fully resolved
## Root Cause
The database connection pool (10 connections) was too small for the traffic spike from a marketing campaign. The pool was exhausted and new requests failed.
## Impact
- **Users affected**: ~1,000 users (10% of login attempts)
- **Duration**: 45 minutes
- **Revenue impact**: Minimal (no orders lost)
- **Customer complaints**: 3 support tickets
## What Went Well
- Quick detection (5 minutes via automated alert)
- Fast rollback capability available
- Clear communication in Slack
- Fix deployed within 20 minutes
## What Went Wrong
- Insufficient load testing before campaign
- No alerts on connection pool utilization
- Manual intervention required (should be auto-scaling)
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add connection pool monitoring | @dev1 | 2025-01-20 | ✅ Done |
| Auto-scale pool based on traffic | @dev2 | 2025-01-25 | 🔄 In Progress |
| Load test before major campaigns | @dev1 | Ongoing | 📋 Process |
| Document runbook for pool issues | @dev3 | 2025-01-18 | ✅ Done |
## Lessons Learned
1. Monitor resource utilization proactively
2. Load tests should match real-world traffic patterns
3. Automated scaling reduces manual intervention
4. Clear communication enabled a quick resolution
Prevention
Prevent incidents through:
Monitoring:
- Set up alerts for critical metrics
- Monitor error rates continuously
- Track resource utilization
- Set up health checks
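Alerting on a critical metric can be codified rather than clicked together in the console. A sketch using boto3 to create a CloudWatch alarm on a 5xx error count; the namespace, metric name, threshold, and SNS topic ARN are placeholders that depend on how metrics are actually published.

```python
# create_error_alarm.py - sketch: alarm when 5xx errors exceed a threshold.
# Namespace, metric name, threshold, and topic ARN below are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="api-5xx-errors-high",
    AlarmDescription="Fires when the API returns too many 5xx responses",
    Namespace="YourApp/API",          # placeholder custom namespace
    MetricName="5xxCount",            # placeholder metric name
    Statistic="Sum",
    Period=300,                        # 5-minute buckets
    EvaluationPeriods=2,               # two consecutive breaches before alarming
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder SNS topic
)
```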
Testing:
- Load test before major releases
- Test failure scenarios
- Automate critical path tests
- Test with production-like data
Deployment:
- Deploy during low-traffic windows
- Use feature flags for risky changes
- Have rollback plan ready
- Deploy in stages (canary deployments)
Documentation:
- Keep runbooks updated
- Document common issues
- Share knowledge across team
- Review post-mortems regularly
Tools & Resources
Monitoring:
- CloudWatch: https://console.aws.amazon.com/cloudwatch
- Railway: https://railway.app
- Datadog: (if configured)
Communication:
- Slack #alerts: Critical incidents
- Slack #dev-team: Development issues
- Status page: (if you have one)
Logs:
# Production logs
railway logs --environment production
# Database logs
# AWS RDS → Logs
Dashboards:
- API Metrics: [Link to CloudWatch dashboard]
- Database Performance: [Link to RDS dashboard]
- Error Tracking: [Link to Sentry/equivalent]
Next Steps
- Review `common-issues.md` for quick fixes
- Check `debugging.md` for investigation techniques
- See `../03-workflows/deployment.md` for rollback procedures