Incident Response Guide
This guide outlines how to handle production incidents effectively.
Incident Severity Levels
Severity 1 (Critical)
Impact: Complete service outage or data loss
Examples:
- Application completely down
- Database unreachable
- Data corruption
- Security breach
Response Time: Immediate
Who to Contact: Everyone (Slack @channel in #alerts)
Communication: Every 15 minutes
Severity 2 (High)
Impact: Major feature broken, affecting many users
Examples:
- Login not working
- Payment processing failed
- Critical API endpoint down
- Performance degradation >50%
Response Time: Within 15 minutes
Who to Contact: Tech lead, on-call developer (Slack in #alerts)
Communication: Every 30 minutes
Severity 3 (Medium)
Impact: Minor feature broken, affecting some users
Examples:
- Single endpoint failing
- UI bug affecting workflow
- Email delivery delayed
- Background job failures
Response Time: Within 1 hour
Who to Contact: Developer responsible for the feature (Slack in #dev-team)
Communication: Hourly updates
Severity 4 (Low)
Impact: Cosmetic issue or minor inconvenience
Examples:
- Typo in UI
- Alignment issue
- Non-critical log errors
- Minor performance issue
Response Time: Next business day
Who to Contact: Create a GitHub issue, discuss in #dev-team
Communication: When fixed
Incident Response Process
1. Detect (0-5 minutes)
How incidents are detected:
- Monitoring alerts (CloudWatch, Datadog)
- User reports (support tickets, Slack)
- Error tracking (Sentry)
- Health check failures
Initial Actions:
- Acknowledge the alert
- Assess severity
- Post in Slack #alerts:
🚨 INCIDENT: [Brief description]
Severity: [1-4]
Investigating...
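If you want to script this first post so the format stays consistent, a minimal sketch using a Slack incoming webhook is below. The `SLACK_ALERTS_WEBHOOK_URL` variable and the webhook itself are assumptions, not something this guide's tooling already provides.

```python
# post_incident_alert.py - minimal sketch, assumes an incoming webhook for #alerts
# is configured in Slack and exposed via the SLACK_ALERTS_WEBHOOK_URL env var.
import os
import sys

import requests


def post_incident_alert(description: str, severity: int) -> None:
    """Send the initial incident message to the #alerts webhook."""
    webhook_url = os.environ["SLACK_ALERTS_WEBHOOK_URL"]
    message = (
        f"🚨 INCIDENT: {description}\n"
        f"Severity: {severity}\n"
        "Investigating..."
    )
    response = requests.post(webhook_url, json={"text": message}, timeout=5)
    response.raise_for_status()


if __name__ == "__main__":
    # Usage: python post_incident_alert.py "Login endpoint returning 500" 2
    post_incident_alert(sys.argv[1], int(sys.argv[2]))
```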
2. Assess (5-15 minutes)
Gather information:
# Check application health
curl https://api.yourapp.com/health
# Check recent deployments
# Railway dashboard → Deployments
# Check error logs
railway logs --environment production | grep ERROR
# Check metrics
# CloudWatch → Dashboards → API Metrics
# Check database
# RDS → Monitoring → CPU, Connections
Questions to answer:
- What is broken?
- How many users affected?
- When did it start?
- Any recent changes (deployments, config)?
- Is data at risk?
Update Slack:
Update: Identified issue with [component]
Started: ~10 minutes ago
Affected: [X users / all users / specific feature]
Investigating: [what you're checking]
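To help answer "When did it start?", it can be useful to bucket errors per minute from exported logs. The sketch below assumes log lines begin with an ISO-style timestamp and contain the word ERROR; adjust the pattern to the real log format.

```python
# error_timeline.py - rough sketch for counting ERROR lines per minute.
# Assumes lines start with an ISO-like timestamp, e.g. "2025-01-15T14:30:12".
import re
import sys
from collections import Counter

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2})")  # truncated to the minute


def error_counts_per_minute(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if "ERROR" not in line:
                continue
            match = TIMESTAMP.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage: railway logs --environment production > prod.log
    #        python error_timeline.py prod.log
    for minute, count in sorted(error_counts_per_minute(sys.argv[1]).items()):
        print(f"{minute}  {count}")
```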
3. Contain (15-30 minutes)
Immediate mitigation:
# If caused by recent deployment → ROLLBACK
railway rollback
# If database issue → Increase resources temporarily
# AWS RDS → Modify → Instance class
# If rate limiting → Increase limits temporarily
# If memory leak → Restart containers
railway restart
# If external API down → Enable fallback/cache
Stop the bleeding before fixing the root cause.
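For the "external API down" case, one containment pattern is to serve the last known good response while the dependency is failing. A minimal in-process sketch (the timeout and TTL are placeholders; in practice a shared cache such as Redis would back this so all instances see the same data):

```python
# fallback_cache.py - sketch of a "last known good" fallback for an external API.
# In production the cache would live in Redis so all instances share it.
import time

import requests

_CACHE: dict[str, tuple[float, dict]] = {}  # url -> (fetched_at, payload)
CACHE_TTL_SECONDS = 300  # how long a stale response is still acceptable


def fetch_with_fallback(url: str) -> dict:
    """Return fresh data if the API responds, otherwise the cached copy."""
    try:
        response = requests.get(url, timeout=3)
        response.raise_for_status()
        payload = response.json()
        _CACHE[url] = (time.time(), payload)
        return payload
    except requests.RequestException:
        cached = _CACHE.get(url)
        if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]  # stale but usable during the incident
        raise  # nothing cached -> surface the failure
```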
4. Investigate (30 minutes - 2 hours)
Root cause analysis:
# Check logs around incident start time
railway logs --since 30m | grep ERROR
# Check database for long-running or blocking queries
SELECT * FROM pg_stat_activity WHERE state = 'active';
# Check memory/CPU usage trends
# CloudWatch → Metrics
# Check recent code changes
git log --since="2 hours ago" --oneline
# Test hypothesis locally
# Reproduce the issue if possible
Document findings in Slack thread.
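If the database looks involved, the `pg_stat_activity` check above can be narrowed to sessions that are actually blocked. A rough sketch using `pg_blocking_pids()` (PostgreSQL 9.6+); the `DATABASE_URL` environment variable is a placeholder for a read-only connection string.

```python
# blocked_queries.py - sketch: list sessions that are blocked by other sessions.
# Requires psycopg2 and a read-only DATABASE_URL (placeholder).
import os

import psycopg2

BLOCKED_SQL = """
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       now() - query_start AS waiting_for,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"""

if __name__ == "__main__":
    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(BLOCKED_SQL)
            for row in cur.fetchall():
                print(row)
```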
5. Fix (varies)
Permanent fix options:
Option 1: Hotfix (if quick fix available)
# Create hotfix branch
git checkout main
git pull
git checkout -b hotfix/fix-description
# Make fix
# ... edit code ...
# Test locally
pytest
npm test
# Commit and push
git add .
git commit -m "fix: [description]"
git push origin hotfix/fix-description
# Create PR with "HOTFIX" label
# Get 1 approval (instead of the usual 2)
# Merge and deploy
Option 2: Workaround (if fix needs more time)
# Implement temporary workaround
# Document in code that it's temporary
# Create issue for proper fix
Option 3: Feature Toggle (disable broken feature)
# Add feature flag
if settings.ENABLE_NEW_FEATURE:
    return new_feature()
else:
    return old_feature()
# Set in environment
ENABLE_NEW_FEATURE=false
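How the flag reaches `settings` depends on the framework; a minimal Django-style sketch is below. Defaulting to enabled when the variable is unset is an assumption, not an established convention here.

```python
# settings.py (sketch) - read the toggle from the environment so it can be
# flipped in Railway without a deploy. Defaults to enabled if unset (assumption).
import os

ENABLE_NEW_FEATURE = os.environ.get("ENABLE_NEW_FEATURE", "true").strip().lower() in ("1", "true", "yes")
```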
6. Verify (30 minutes)
Confirm fix:
# Check health endpoint
curl https://api.yourapp.com/health
# Check error rate
# Should return to normal
# Check metrics
# Response times, error rates
# Test affected functionality
# Manual testing or automated tests
# Monitor for 30 minutes
# Ensure no regressions
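The manual checks above can be backed by a small smoke test that polls the health endpoint and fails loudly if errors return. A sketch, with the URL, interval, and duration as placeholders to tune per incident:

```python
# verify_fix.py - sketch: poll the health endpoint for a while after deploying a fix.
# HEALTH_URL, CHECK_INTERVAL_SECONDS, and CHECKS are placeholders.
import sys
import time

import requests

HEALTH_URL = "https://api.yourapp.com/health"
CHECK_INTERVAL_SECONDS = 30
CHECKS = 10  # ~5 minutes; extend to cover the full 30-minute monitoring window


def main() -> int:
    failures = 0
    for _ in range(CHECKS):
        try:
            response = requests.get(HEALTH_URL, timeout=5)
            ok = response.status_code == 200
        except requests.RequestException:
            ok = False
        if ok:
            print(f"health check OK ({response.elapsed.total_seconds():.2f}s)")
        else:
            failures += 1
            print("health check FAILED")
        time.sleep(CHECK_INTERVAL_SECONDS)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```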
Update Slack:
✅ RESOLVED: [Brief description]
Fix deployed: [deployment link]
Monitoring: Will monitor for 30 minutes
Root cause: [brief explanation]
7. Communicate (throughout)
Internal Communication (#alerts Slack channel):
Initial:
🚨 INCIDENT: Login endpoint returning 500
Severity: 2
Time detected: 14:30 UTC
Status: Investigating
Update 1 (15 min):
Identified: Database connection pool exhausted
Action: Increasing pool size
ETA: 10 minutes
Update 2 (30 min):
Deployed: Pool size increased
Status: Monitoring
Error rate: Decreasing
Resolution (45 min):
✅ RESOLVED: Login endpoint healthy
Duration: 45 minutes
Root cause: Traffic spike + insufficient pool size
Next steps: Post-mortem scheduled
External Communication (if user-facing):
To customers/stakeholders:
Subject: Service Disruption - Resolved
Hi [stakeholders],
We experienced a brief service disruption today from 14:30-15:15 UTC affecting login functionality.
Issue: Some users unable to log in
Impact: ~10% of login attempts failed
Resolution: Database configuration updated
Status: Fully resolved, monitoring ongoing
We apologise for any inconvenience and have implemented measures to prevent recurrence.
If you have questions, please contact support@yourapp.com.
Who to Contact
Severity 1 (Critical)
- Slack: @channel in #alerts
- Contact: Everyone
- Escalation: CEO/CTO if >1 hour
Severity 2 (High)
- Slack: @here in #alerts
- Contact: Tech lead, on-call developer
- Escalation: Product manager if >2 hours
Severity 3 (Medium)
- Slack: Post in #dev-team
- Contact: Responsible developer
- Escalation: Tech lead if >4 hours
Severity 4 (Low)
- GitHub: Create issue
- Slack: Mention in #dev-team
- Escalation: None
On-Call Schedule
- Weekdays (9am-6pm): Developer roster (see Slack topic)
- Nights/Weekends: On-call developer (rotation)
- Escalation: Tech lead → CTO
Investigation Checklist
Use this checklist when investigating incidents; a scripted version of some of the basic checks is sketched after the list:
Recent Changes:
- Deployments in last 2 hours?
- Database migrations?
- Configuration changes?
- Dependency updates?
System Health:
- Application responding to health checks?
- Database connections available?
- Redis/cache accessible?
- Disk space sufficient?
- Memory usage normal?
- CPU usage normal?
External Dependencies:
- Third-party APIs accessible?
- AWS services operational?
- DNS resolving correctly?
- SSL certificates valid?
Data Integrity:
- Recent backups available?
- Data corruption detected?
- Transactions completing?
Logs & Metrics:
- Error logs reviewed?
- Access logs show traffic patterns?
- Metrics show anomalies?
- Alert history reviewed?
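Some of the System Health and External Dependencies checks lend themselves to a quick script. A sketch covering the health endpoint, disk space, and SSL certificate expiry; the hostnames and thresholds are placeholders for illustration.

```python
# triage.py - sketch of a few automatable checks from the list above.
# HEALTH_URL, TLS_HOST, and the thresholds are placeholders.
import shutil
import socket
import ssl
import time

import requests

HEALTH_URL = "https://api.yourapp.com/health"
TLS_HOST = "api.yourapp.com"
MIN_FREE_DISK_RATIO = 0.10
MIN_CERT_DAYS_LEFT = 14


def check_health() -> bool:
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def check_disk(path: str = "/") -> bool:
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= MIN_FREE_DISK_RATIO


def check_certificate() -> bool:
    context = ssl.create_default_context()
    with socket.create_connection((TLS_HOST, 443), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=TLS_HOST) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400 >= MIN_CERT_DAYS_LEFT


if __name__ == "__main__":
    for name, check in [("health", check_health), ("disk", check_disk), ("tls cert", check_certificate)]:
        print(f"{name}: {'OK' if check() else 'FAIL'}")
```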
Communication Templates
Initial Alert
🚨 INCIDENT
Description: [Brief description]
Severity: [1-4]
Started: [Time/duration]
Impact: [Who/what is affected]
Status: Investigating
Lead: @[your-name]
Updates: Every [15/30/60] minutes
Status Update
📊 UPDATE - [Time since start]
Status: [Investigating / Fixing / Testing / Resolved]
Findings: [What you've learned]
Actions: [What you're doing]
ETA: [When you expect resolution]
Resolution
✅ RESOLVED - [Total duration]
Issue: [What was wrong]
Fix: [What was done]
Impact: [Summary of impact]
Prevention: [How we'll prevent this]
Post-mortem: [Link / date scheduled]
Post-Mortem Template
After resolving Severity 1-2 incidents, conduct a post-mortem within 48 hours.
# Post-Mortem: [Incident Title]
**Date**: [YYYY-MM-DD]
**Duration**: [Start time] - [End time] ([Duration])
**Severity**: [1-4]
**Impact**: [Summary of user impact]
## Timeline
- 14:30 - Issue started (database pool exhausted)
- 14:35 - Alert triggered, investigation began
- 14:45 - Root cause identified
- 14:50 - Fix deployed (increased pool size)
- 15:00 - Monitoring, errors decreasing
- 15:15 - Fully resolved
## Root Cause
The database connection pool (10 connections) was too small for the traffic spike from a marketing campaign. The pool was exhausted and new requests failed.
## Impact
- **Users affected**: ~1,000 users (10% of login attempts)
- **Duration**: 45 minutes
- **Revenue impact**: Minimal (no orders lost)
- **Customer complaints**: 3 support tickets
## What Went Well
- Quick detection (5 minutes via automated alert)
- Fast rollback capability available
- Clear communication in Slack
- Fix deployed within 20 minutes
## What Went Wrong
- Insufficient load testing before campaign
- No alerts on connection pool utilization
- Manual intervention required (should be auto-scaling)
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add connection pool monitoring | @dev1 | 2025-01-20 | ✅ Done |
| Auto-scale pool based on traffic | @dev2 | 2025-01-25 | 🔄 In Progress |
| Load test before major campaigns | @dev1 | Ongoing | 📋 Process |
| Document runbook for pool issues | @dev3 | 2025-01-18 | ✅ Done |
## Lessons Learned
1. Monitor resource utilization proactively
2. Load tests should match real-world traffic patterns
3. Automated scaling reduces manual intervention
4. Clear communication enabled a quick resolution
Prevention
Prevent incidents through:
Monitoring:
- Set up alerts for critical metrics
- Monitor error rates continuously
- Track resource utilization
- Set up health checks
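Alerting on a critical metric can be codified rather than clicked together in the console. A sketch using boto3 to create a CloudWatch alarm on a 5xx error count; the namespace, metric name, threshold, and SNS topic ARN are placeholders that depend on how metrics are actually published.

```python
# create_error_alarm.py - sketch: alarm when 5xx errors exceed a threshold.
# Namespace, metric name, threshold, and topic ARN below are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="api-5xx-errors-high",
    AlarmDescription="Fires when the API returns too many 5xx responses",
    Namespace="YourApp/API",          # placeholder custom namespace
    MetricName="5xxCount",            # placeholder metric name
    Statistic="Sum",
    Period=300,                        # 5-minute buckets
    EvaluationPeriods=2,               # two consecutive breaches before alarming
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder SNS topic
)
```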
Testing:
- Load test before major releases
- Test failure scenarios
- Automate critical path tests
- Test with production-like data
Deployment:
- Deploy during low-traffic windows
- Use feature flags for risky changes
- Have rollback plan ready
- Deploy in stages (canary deployments)
Documentation:
- Keep runbooks updated
- Document common issues
- Share knowledge across team
- Review post-mortems regularly
Tools & Resources
Monitoring:
- CloudWatch: https://console.aws.amazon.com/cloudwatch
- Railway: https://railway.app
- Datadog: (if configured)
Communication:
- Slack #alerts: Critical incidents
- Slack #dev-team: Development issues
- Status page: (if you have one)
Logs:
# Production logs
railway logs --environment production
# Database logs
# AWS RDS → Logs
Dashboards:
- API Metrics: [Link to CloudWatch dashboard]
- Database Performance: [Link to RDS dashboard]
- Error Tracking: [Link to Sentry/equivalent]
Next Steps
- Review `common-issues.md` for quick fixes
- Check `debugging.md` for investigation techniques
- See `../03-workflows/deployment.md` for rollback procedures