This guide outlines how to handle production incidents effectively.

Incident Severity Levels

Severity 1 (Critical)

Impact: Complete service outage or data loss

Examples:

  • Application completely down
  • Database unreachable
  • Data corruption
  • Security breach

  • Response Time: Immediate
  • Who to Contact: Everyone (Slack @channel in #alerts)
  • Communication: Every 15 minutes

Severity 2 (High)

Impact: Major feature broken, affecting many users

Examples:

  • Login not working
  • Payment processing failed
  • Critical API endpoint down
  • Performance degradation >50%

  • Response Time: Within 15 minutes
  • Who to Contact: Tech lead, on-call developer (Slack in #alerts)
  • Communication: Every 30 minutes

Severity 3 (Medium)

Impact: Minor feature broken, affecting some users

Examples:

  • Single endpoint failing
  • UI bug affecting workflow
  • Email delivery delayed
  • Background job failures

  • Response Time: Within 1 hour
  • Who to Contact: Developer responsible for the feature (Slack in #dev-team)
  • Communication: Hourly updates

Severity 4 (Low)

Impact: Cosmetic issue or minor inconvenience

Examples:

  • Typo in UI
  • Alignment issue
  • Non-critical log errors
  • Minor performance issue

  • Response Time: Next business day
  • Who to Contact: Create GitHub issue, discuss in #dev-team
  • Communication: When fixed


Incident Response Process

1. Detect (0-5 minutes)

How incidents are detected:

  • Monitoring alerts (CloudWatch, Datadog)
  • User reports (support tickets, Slack)
  • Error tracking (Sentry)
  • Health check failures

Initial Actions:

  1. Acknowledge the alert
  2. Assess severity
  3. Post in Slack #alerts (see the webhook sketch below):
      🚨 INCIDENT: [Brief description]
      Severity: [1-4]
      Investigating...
      
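Posting the initial alert can also be scripted; a minimal sketch assuming a Slack incoming webhook wired to #alerts (the environment variable name and message shape are assumptions):

# post_incident_alert.py - sketch; the webhook env var and message shape are assumptions
import os
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_ALERTS_WEBHOOK_URL"]  # incoming webhook for #alerts

def post_incident_alert(description: str, severity: int) -> None:
    """Send the initial incident message to #alerts."""
    text = (
        f":rotating_light: INCIDENT: {description}\n"
        f"Severity: {severity}\n"
        "Investigating..."
    )
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()

post_incident_alert("Login endpoint returning 500", severity=2)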

2. Assess (5-15 minutes)

Gather information:

  # Check application health
curl https://api.yourapp.com/health

# Check recent deployments
# Railway dashboard → Deployments

# Check error logs
railway logs --environment production | grep ERROR

# Check metrics
# CloudWatch → Dashboards → API Metrics

# Check database
# RDS → Monitoring → CPU, Connections
  
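The health endpoint checked above works best when it reports dependency status rather than a bare 200; a minimal sketch of such an endpoint (FastAPI and SQLAlchemy are assumptions about the stack, and the DSN is a placeholder):

# health.py - illustrative sketch; framework, ORM, and DSN are assumptions
from fastapi import FastAPI, Response
from sqlalchemy import create_engine, text

app = FastAPI()
engine = create_engine("postgresql://user:pass@db-host/app")  # placeholder DSN

@app.get("/health")
def health(response: Response):
    checks = {}
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception as exc:  # report the failure instead of crashing the health check
        checks["database"] = f"error: {exc}"
        response.status_code = 503
    healthy = all(v == "ok" for v in checks.values())
    return {"status": "ok" if healthy else "degraded", "checks": checks}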

Questions to answer:

  • What is broken?
  • How many users affected?
  • When did it start?
  • Any recent changes (deployments, config)?
  • Is data at risk?

Update Slack:

  Update: Identified issue with [component]
Started: ~10 minutes ago
Affected: [X users / all users / specific feature]
Investigating: [what you're checking]
  

3. Contain (15-30 minutes)

Immediate mitigation:

  # If caused by recent deployment → ROLLBACK
railway rollback

# If database issue → Increase resources temporarily
# AWS RDS → Modify → Instance class

# If rate limiting → Increase limits temporarily

# If memory leak → Restart containers
railway restart

# If external API down → Enable fallback/cache
  

Stop the bleeding before fixing root cause.
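
For the "external API down" mitigation above, one containment pattern is serving the last known-good response when the upstream call fails; a hedged sketch (the URL, payload shape, and in-process cache are illustrative, not the project's actual client):

# fallback_cache.py - sketch of a cache fallback for a flaky upstream API
import requests

_last_good: dict[str, dict] = {}  # simple in-process cache, keyed by URL

def fetch_with_fallback(url: str = "https://api.example.com/rates") -> dict:
    """Return live data when possible, otherwise the last known-good payload."""
    try:
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        data = resp.json()
        _last_good[url] = data  # refresh cache on success
        return data
    except requests.RequestException:
        if url in _last_good:
            return _last_good[url]  # stale but keeps the feature working
        raise  # nothing cached yet: surface the failure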

4. Investigate (30 minutes - 2 hours)

Root cause analysis:

  # Check logs around incident start time
railway logs --since 30m | grep ERROR

# Check database for locks
SELECT * FROM pg_stat_activity WHERE state = 'active';

# Check memory/CPU usage trends
# CloudWatch → Metrics

# Check recent code changes
git log --since="2 hours ago" --oneline

# Test hypothesis locally
# Reproduce the issue if possible
  

Document findings in Slack thread.
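
The pg_stat_activity query above can also be scripted to surface long-running statements during an incident; a sketch using psycopg2 (the DSN and the one-minute threshold are assumptions):

# long_running_queries.py - sketch; DSN is a placeholder
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@db-host/app")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT pid, now() - query_start AS runtime, state, left(query, 120) AS query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND now() - query_start > interval '1 minute'
        ORDER BY runtime DESC
        """
    )
    for pid, runtime, state, query in cur.fetchall():
        print(pid, runtime, state, query)
conn.close()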

5. Fix (varies)

Permanent fix options:

Option 1: Hotfix (if quick fix available)

  # Create hotfix branch
git checkout main
git pull
git checkout -b hotfix/fix-description

# Make fix
# ... edit code ...

# Test locally
pytest
npm test

# Commit and push
git add .
git commit -m "fix: [description]"
git push origin hotfix/fix-description

# Create PR with "HOTFIX" label
# Get 1 approval (instead of usual 2)
# Merge and deploy
  

Option 2: Workaround (if fix needs more time)

  # Implement temporary workaround
# Document in code that it's temporary
# Create issue for proper fix
  

Option 3: Feature Toggle (disable broken feature)

  # Wrap the broken code path in a feature flag
if settings.ENABLE_NEW_FEATURE:
    return new_feature()
else:
    return old_feature()

# In settings, the flag can be read from the environment, e.g.:
# ENABLE_NEW_FEATURE = os.getenv("ENABLE_NEW_FEATURE", "true").lower() == "true"

# Then disable the feature in the production environment:
ENABLE_NEW_FEATURE=false
  

6. Verify (30 minutes)

Confirm fix:

  # Check health endpoint
curl https://api.yourapp.com/health

# Check error rate
# Should return to normal

# Check metrics
# Response times, error rates

# Test affected functionality
# Manual testing or automated tests

# Monitor for 30 minutes
# Ensure no regressions
  
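Part of this verification can be scripted so it runs the same way every time; a smoke-check sketch (the base URL and the affected path are assumptions):

# smoke_check.py - post-fix verification sketch; paths are assumptions
import requests

BASE_URL = "https://api.yourapp.com"
PATHS = ["/health", "/api/v1/login"]  # health plus the endpoint that was affected

for path in PATHS:
    resp = requests.get(f"{BASE_URL}{path}", timeout=10)
    print(f"{path}: {resp.status_code} in {resp.elapsed.total_seconds():.2f}s")
    assert resp.status_code < 500, f"{path} still returning server errors"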

Update Slack:

  ✅ RESOLVED: [Brief description]
Fix deployed: [deployment link]
Monitoring: Will monitor for 30 minutes
Root cause: [brief explanation]
  

7. Communicate (throughout)

Internal Communication (#alerts Slack channel):

  Initial:
🚨 INCIDENT: Login endpoint returning 500
Severity: 2
Time detected: 14:30 UTC
Status: Investigating

Update 1 (15 min):
Identified: Database connection pool exhausted
Action: Increasing pool size
ETA: 10 minutes

Update 2 (30 min):
Deployed: Pool size increased
Status: Monitoring
Error rate: Decreasing

Resolution (45 min):
✅ RESOLVED: Login endpoint healthy
Duration: 45 minutes
Root cause: Traffic spike + insufficient pool size
Next steps: Post-mortem scheduled
  

External Communication (if user-facing):

To customers/stakeholders:

  Subject: Service Disruption - Resolved

Hi [stakeholders],

We experienced a brief service disruption today from 14:30 to 15:15 UTC affecting login functionality.

Issue: Some users unable to log in
Impact: ~10% of login attempts failed
Resolution: Database configuration updated
Status: Fully resolved, monitoring ongoing

We apologize for any inconvenience and have implemented measures to prevent a recurrence.

If you have questions, please contact support@yourapp.com.
  

Who to Contact

Severity 1 (Critical)

  • Slack: @channel in #alerts
  • Contact: Everyone
  • Escalation: CEO/CTO if >1 hour

Severity 2 (High)

  • Slack: @here in #alerts
  • Contact: Tech lead, on-call developer
  • Escalation: Product manager if >2 hours

Severity 3 (Medium)

  • Slack: Post in #dev-team
  • Contact: Responsible developer
  • Escalation: Tech lead if >4 hours

Severity 4 (Low)

  • GitHub: Create issue
  • Slack: Mention in #dev-team
  • Escalation: None

On-Call Schedule

  • Weekdays (9am-6pm): Developer roster (see Slack topic)
  • Nights/Weekends: On-call developer (rotation)
  • Escalation: Tech lead → CTO

Investigation Checklist

Use this checklist when investigating incidents:

Recent Changes:

  • Deployments in last 2 hours?
  • Database migrations?
  • Configuration changes?
  • Dependency updates?

System Health:

  • Application responding to health checks?
  • Database connections available?
  • Redis/cache accessible?
  • Disk space sufficient?
  • Memory usage normal?
  • CPU usage normal?

External Dependencies:

  • Third-party APIs accessible?
  • AWS services operational?
  • DNS resolving correctly?
  • SSL certificates valid? (see the expiry-check sketch below)
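
A quick way to answer the certificate question is to read the expiry date off the live endpoint; a sketch using only the Python standard library (the hostname is a placeholder):

# cert_expiry.py - sketch; hostname is a placeholder
import socket
import ssl
import time

def cert_days_remaining(hostname: str, port: int = 443) -> float:
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            not_after = tls.getpeercert()["notAfter"]  # e.g. 'Jun  1 12:00:00 2026 GMT'
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

print(f"api.yourapp.com cert expires in {cert_days_remaining('api.yourapp.com'):.0f} days")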

Data Integrity:

  • Recent backups available?
  • Data corruption detected?
  • Transactions completing?

Logs & Metrics:

  • Error logs reviewed?
  • Access logs show traffic patterns?
  • Metrics show anomalies?
  • Alert history reviewed?

Communication Templates

Initial Alert

  🚨 INCIDENT

Description: [Brief description]
Severity: [1-4]
Started: [Time/duration]
Impact: [Who/what is affected]
Status: Investigating

Lead: @[your-name]
Updates: Every [15/30/60] minutes
  

Status Update

  📊 UPDATE - [Time since start]

Status: [Investigating / Fixing / Testing / Resolved]
Findings: [What you've learned]
Actions: [What you're doing]
ETA: [When you expect resolution]
  

Resolution

  ✅ RESOLVED - [Total duration]

Issue: [What was wrong]
Fix: [What was done]
Impact: [Summary of impact]
Prevention: [How we'll prevent this]

Post-mortem: [Link / date scheduled]
  

Post-Mortem Template

After resolving Severity 1-2 incidents, conduct a post-mortem within 48 hours.

  # Post-Mortem: [Incident Title]

**Date**: [YYYY-MM-DD]
**Duration**: [Start time] - [End time] ([Duration])
**Severity**: [1-4]
**Impact**: [Summary of user impact]

## Timeline

- 14:30 - Issue started (database pool exhausted)
- 14:35 - Alert triggered, investigation began
- 14:45 - Root cause identified
- 14:50 - Fix deployed (increased pool size)
- 15:00 - Monitoring, errors decreasing
- 15:15 - Fully resolved

## Root Cause

The database connection pool (10 connections) was too small for the traffic spike driven by a marketing campaign. Once the pool was exhausted, new requests failed.

## Impact

- **Users affected**: ~1,000 users (10% of login attempts)
- **Duration**: 45 minutes
- **Revenue impact**: Minimal (no orders lost)
- **Customer complaints**: 3 support tickets

## What Went Well

- Quick detection (5 minutes via automated alert)
- Fast rollback capability available
- Clear communication in Slack
- Fix deployed within 20 minutes

## What Went Wrong

- Insufficient load testing before campaign
- No alerts on connection pool utilization
- Manual intervention required (should be auto-scaling)

## Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add connection pool monitoring | @dev1 | 2025-01-20 | ✅ Done |
| Auto-scale pool based on traffic | @dev2 | 2025-01-25 | 🔄 In Progress |
| Load test before major campaigns | @dev1 | Ongoing | 📋 Process |
| Document runbook for pool issues | @dev3 | 2025-01-18 | ✅ Done |

## Lessons Learned

1. Monitor resource utilization proactively
2. Load tests should match real-world traffic patterns
3. Automated scaling reduces manual intervention
4. Clear communication sped up resolution
  

Prevention

Prevent incidents through:

Monitoring:

  • Set up alerts for critical metrics (see the alarm sketch after this list)
  • Monitor error rates continuously
  • Track resource utilization
  • Set up health checks
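
For the alerting item above, alarms can be defined in code rather than clicked together in the console; a boto3 sketch in which the namespace, metric, threshold, and SNS topic ARN are all assumptions:

# create_error_alarm.py - boto3 sketch; metric, threshold, and SNS topic are assumptions
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="api-5xx-errors",
    Namespace="AWS/ApplicationELB",          # assumes the API sits behind an ALB
    MetricName="HTTPCode_Target_5XX_Count",
    Statistic="Sum",
    Period=60,                               # one-minute buckets
    EvaluationPeriods=5,                     # alarm after five consecutive breaches
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder SNS topic
)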

Testing:

  • Load test before major releases
  • Test failure scenarios
  • Automate critical path tests
  • Test with production-like data

Deployment:

  • Deploy during low-traffic windows
  • Use feature flags for risky changes
  • Have rollback plan ready
  • Deploy in stages (canary deployments)

Documentation:

  • Keep runbooks updated
  • Document common issues
  • Share knowledge across team
  • Review post-mortems regularly

Tools & Resources

Monitoring:

  • CloudWatch: metrics and dashboards
  • Datadog: monitoring alerts
  • Sentry: error tracking

Communication:

  • Slack #alerts: Critical incidents
  • Slack #dev-team: Development issues
  • Status page: (if you have one)

Logs:

  # Production logs
railway logs --environment production

# Database logs
# AWS RDS → Logs
  

Dashboards:

  • API Metrics: [Link to CloudWatch dashboard]
  • Database Performance: [Link to RDS dashboard]
  • Error Tracking: [Link to Sentry/equivalent]

Next Steps