This document covers our deployment process for staging and production environments.

Infrastructure Overview

Staging Environment

Purpose: Pre-production testing and client demos

Setup: Single server with all components

  Single Server (Railway/Cloud66)
├── FastAPI application
├── PostgreSQL database
├── Redis cache/queue
└── Celery workers
  

Characteristics:

  • Automated deployments from staging branch
  • Production-like configuration
  • Staging Shopify app connected
  • Test payment gateway
  • Smaller instance size (cost optimisation)

Access:

  • URL: https://staging.yourapp.com
  • Database: Restricted to VPN/specific IPs
  • Logs: Railway/Cloud66 dashboard

Production Environment

Purpose: Live customer-facing application

Setup: Separate servers for each component

  Load Balancer
├── API Server 1
├── API Server 2
└── API Server 3

PostgreSQL Primary
└── Read Replica (optional)

Redis Cluster
├── Master
└── Replicas

Celery Workers
├── Worker 1 (general tasks)
├── Worker 2 (webhook delivery)
└── Worker 3 (email/reports)
  

Characteristics:

  • Automated deployments from main branch
  • Multiple API instances for redundancy
  • Separate database server
  • Redis with persistence
  • Real payment processing
  • Full monitoring and alerts

Access:

  • URL: https://app.yourapp.com
  • API: https://api.yourapp.com
  • Database: Restricted to app servers only
  • Logs: CloudWatch/Datadog

Pre-deployment Checklist

Before deploying to staging or production:

Code Quality

  • All tests passing locally (pytest, npm test)
  • Pre-commit hooks pass
  • Code reviewed and approved (2 approvals)
  • No console.log or debug statements
  • No hardcoded secrets or API keys

Database

  • Migration files created (if schema changes)
  • Migration tested locally
  • Migration has rollback (downgrade function)
  • No data-destructive operations without backup
  • Indexes added for new queries

API

  • API documentation updated (docs/api/endpoints.md)
  • Breaking changes communicated to clients
  • Backwards compatibility maintained (or version bumped)
  • Response times tested (p95 <500ms)

Dependencies

  • New dependencies approved
  • Dependencies version-pinned in requirements.txt / package.json
  • No security vulnerabilities (run npm audit, safety check)

Configuration

  • Environment variables documented
  • New env vars added to staging/production
  • Feature flags configured (if using)

Testing

  • Manual testing completed
  • Postman tests passing
  • Load testing completed (for high-traffic features)
  • Edge cases tested

Database Migration Strategy

IMPORTANT: Always run migrations BEFORE deploying code

Order:

  1. Run migration (adds new columns, tables, indexes)
  2. Deploy new code (uses new schema)

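The old code keeps serving traffic while the migration runs, so this order gives zero downtime only if the migration is backwards compatible: the old code must still work against the new schema.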
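
The ordering can be tried out locally with a small sketch. It uses an in-memory SQLite database from the Python standard library rather than PostgreSQL and Alembic, and the `users` table and the function names are hypothetical stand-ins for the old and new releases. It only illustrates why a nullable column lets the old code keep running after the migration.

```python
import sqlite3


def old_code_insert(conn, email):
    """Old release: does not know about the phone column."""
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


def migrate_add_phone(conn):
    """Backwards-compatible migration: the new column is nullable."""
    conn.execute("ALTER TABLE users ADD COLUMN phone TEXT")


def new_code_insert(conn, email, phone):
    """New release: writes the phone column."""
    conn.execute(
        "INSERT INTO users (email, phone) VALUES (?, ?)", (email, phone)
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
    )
    old_code_insert(conn, "a@example.com")   # before the migration
    migrate_add_phone(conn)                  # step 1: run the migration
    old_code_insert(conn, "b@example.com")   # old code still works
    new_code_insert(conn, "c@example.com", "555-0100")  # step 2: new code
    return conn.execute(
        "SELECT email, phone FROM users ORDER BY id"
    ).fetchall()
```

Rows written by the old release simply carry a NULL phone, which is why the new code has to tolerate missing values until a later backfill.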

Migration Steps

For Staging

  # 1. Log in and link the project with the Railway CLI
railway login
railway link

# 2. Run migration
railway run alembic upgrade head

# 3. Verify migration
railway run alembic current

# 4. Check database
railway connect  # Opens a psql session; run the commands below inside it
\dt  # List tables
\d table_name  # Describe table
  

For Production

  # 1. Backup database first
pg_dump -h prod-db.yourapp.com -U dbuser -d yourapp_prod > backup_$(date +%Y%m%d_%H%M%S).sql

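# 2. Test migration on a local copy of the backup
createdb test_migration
psql test_migration < backup_20250115_103000.sql  # use the file created in step 1
# The restored alembic_version table already records the current revision.
# Point Alembic at the test database (assumes env.py reads DATABASE_URL).
DATABASE_URL=postgresql:///test_migration alembic upgrade head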

# 3. If successful, run on production
railway run --environment production alembic upgrade head

# 4. Verify
railway run --environment production alembic current
  

Migration Dos and Don’ts

Do:

  • Make migrations backwards compatible
  • Add columns as nullable first, backfill data, then add NOT NULL
  • Create indexes concurrently in PostgreSQL
  • Test migrations on production data backup
  • Include rollback (downgrade) function

Don’t:

  • Drop columns without deprecation period
  • Rename columns in place (instead: create new, copy data, deprecate old)
  • Add NOT NULL columns without default
  • Run destructive migrations without backup
  • Skip testing migrations

Deployment Process

We use GitHub Actions + Railway/Cloud66 for automated deployments.

How It Works

graph LR
    A[Push to main] --> B[GitHub Actions]
    B --> C[Run Tests]
    C --> D{Tests Pass?}
    D -->|No| E[Slack Alert]
    D -->|Yes| F[Build Docker Image]
    F --> G[Push to Registry]
    G --> H[Deploy to Railway]
    H --> I[Run Health Checks]
    I --> J{Healthy?}
    J -->|No| K[Rollback]
    J -->|Yes| L[Slack Success]

What Triggers Deployment

Staging: Any push to staging branch

Production: Any push to main branch

Deployment Steps (Automated)

  1. Tests run - pytest, npm test, linting
  2. Build - Docker image built
  3. Push - Image pushed to registry
  4. Deploy - Railway pulls new image
  5. Health check - /health endpoint checked
  6. Success/Failure - Slack notification sent

Zero-Downtime Deployment

Our deployment strategy ensures no downtime during releases.

How It Works

Rolling Deployment:

  Step 1: Initial State
[Instance 1: v1.0] [Instance 2: v1.0] [Instance 3: v1.0]
         ↓               ↓               ↓
    Load Balancer

Step 2: Update Instance 1
[Instance 1: v1.1] [Instance 2: v1.0] [Instance 3: v1.0]
         ↓               ↓               ↓
    Load Balancer (Instance 1 removed during update)

Step 3: Instance 1 Healthy
[Instance 1: v1.1] [Instance 2: v1.0] [Instance 3: v1.0]
         ↓               ↓               ↓
    Load Balancer

Step 4: Update Instance 2
[Instance 1: v1.1] [Instance 2: v1.1] [Instance 3: v1.0]
         ↓               ↓               ↓
    Load Balancer

Step 5: All Updated
[Instance 1: v1.1] [Instance 2: v1.1] [Instance 3: v1.1]
         ↓               ↓               ↓
    Load Balancer
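
The rolling sequence above can be sketched as a small simulation. This is an illustrative model only, not how Railway or Cloud66 implement it: `rolling_deploy` and the `is_healthy` callback are hypothetical names, and instances are represented by their version strings.

```python
from typing import Callable, List


def rolling_deploy(
    versions: List[str],
    new_version: str,
    is_healthy: Callable[[int], bool],
) -> List[str]:
    """Update one instance at a time and return the resulting versions.

    If a freshly updated instance fails its health check, every instance
    is restored to the version it had before the deploy started.
    """
    original = list(versions)
    current = list(versions)
    in_rotation = [True] * len(current)

    for i in range(len(current)):
        in_rotation[i] = False       # remove instance from the load balancer
        # With more than one instance, something is still serving traffic.
        assert len(current) == 1 or any(in_rotation)
        current[i] = new_version     # update it while it receives no traffic
        if not is_healthy(i):
            return original          # roll back to the pre-deploy versions
        in_rotation[i] = True        # healthy: put it back into rotation

    return current
```

For example, `rolling_deploy(["v1.0"] * 3, "v1.1", lambda i: True)` walks through the five steps in the diagram and ends with all three instances on v1.1.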
  

Database Migrations During Deploy

Strategy: Make migrations backwards compatible

Example - Adding a new column:

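  # Each migration file needs these imports
from alembic import op
import sqlalchemy as sa

# WRONG - Breaking change: fails on a table that already has rows, and old
# code that does not set phone breaks on insert
def upgrade():
    op.add_column('users', sa.Column('phone', sa.String(20), nullable=False))

# RIGHT - Backwards compatible
# Step 1 (first migration file): Add column as nullable
def upgrade():
    op.add_column('users', sa.Column('phone', sa.String(20), nullable=True))

def downgrade():
    op.drop_column('users', 'phone')

# Deploy code that handles null phone numbers

# Step 2 (later migration file): Make it non-nullable after backfill
def upgrade():
    # Backfill data first
    op.execute("UPDATE users SET phone = '' WHERE phone IS NULL")
    # Then add constraint
    op.alter_column('users', 'phone', nullable=False)

def downgrade():
    op.alter_column('users', 'phone', nullable=True)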
  

Health Checks

Every deployment verifies the application is healthy.

Health Check Endpoint

  # app/api/v1/endpoints/health.py
from fastapi import APIRouter, Depends, Response, status
from sqlalchemy import text
from sqlalchemy.orm import Session

from app.api import deps
from app.core.redis import redis_client

router = APIRouter()


@router.get("/health")
def health_check(response: Response, db: Session = Depends(deps.get_db)) -> dict:
    """
    Health check endpoint.

    Verifies:
    - API is responding
    - Database connection works
    - Redis connection works

    Returns:
        dict: Health status (HTTP 200 when healthy, 503 when any check fails)
    """
    checks = {
        "status": "healthy",
        "checks": {}
    }

    # Check database (SQLAlchemy 2.0 rejects plain SQL strings; use text())
    try:
        db.execute(text("SELECT 1"))
        checks["checks"]["database"] = "healthy"
    except Exception as e:
        checks["status"] = "unhealthy"
        checks["checks"]["database"] = f"unhealthy: {str(e)}"

    # Check Redis
    try:
        redis_client.ping()
        checks["checks"]["redis"] = "healthy"
    except Exception as e:
        checks["status"] = "unhealthy"
        checks["checks"]["redis"] = f"unhealthy: {str(e)}"

    # Return 503 so load balancers and deploy checks can detect the failure
    if checks["status"] != "healthy":
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE

    return checks
  

Health Check Response

Healthy:

  {
  "status": "healthy",
  "checks": {
    "database": "healthy",
    "redis": "healthy"
  }
}
  

Unhealthy:

  {
  "status": "unhealthy",
  "checks": {
    "database": "healthy",
    "redis": "unhealthy: Connection refused"
  }
}
  

Deployment Health Verification

After deployment, automated checks verify:

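  # 1. Basic health (--fail makes curl exit non-zero on HTTP 4xx/5xx)
curl --fail --silent --show-error https://api.yourapp.com/health

# 2. API functionality
curl --fail --silent --show-error https://api.yourapp.com/api/v1/ping

# 3. Response time (curl-format.txt holds the --write-out template)
curl -w "@curl-format.txt" -o /dev/null -s https://api.yourapp.com/api/v1/ping

# If any check fails, deployment is rolled back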
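
A freshly deployed instance may need a few seconds before it answers, so a verification script usually retries. The following is a minimal sketch using only the Python standard library. `wait_until_healthy` and `http_fetch` are hypothetical helper names, and the attempt count and delay are assumed values. It checks both the HTTP status code and the `status` field of the JSON body.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def http_fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Return (status_code, body) for a GET request, including error codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as exc:
        return exc.code, exc.read().decode("utf-8")


def wait_until_healthy(
    url: str,
    fetch: Callable[[str], Tuple[int, str]] = http_fetch,
    attempts: int = 10,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it reports healthy or attempts run out."""
    for attempt in range(attempts):
        try:
            status_code, body = fetch(url)
            payload = json.loads(body)
            if (
                status_code == 200
                and isinstance(payload, dict)
                and payload.get("status") == "healthy"
            ):
                return True
        except (OSError, ValueError):
            pass  # connection refused, timeout, or a non-JSON body
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A pipeline step could call `wait_until_healthy("https://api.yourapp.com/health")` and trigger a rollback when it returns `False`. The `fetch` and `sleep` parameters exist so the function can be exercised without network access.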
  

Rollback Procedure

If a deployment causes issues, roll back immediately.

Automatic Rollback

Railway/Cloud66 automatically rolls back if:

  • Health checks fail
  • Container crashes
  • Deployment timeout

Manual Rollback

  # Option 1: Redeploy previous version via Railway CLI
railway rollback

# Option 2: Revert the latest commit and push (triggers a new deployment)
git revert HEAD
git push origin main

# Option 3: Roll back via Railway dashboard
# Go to Deployments → Select previous deployment → Redeploy
  

Rollback Decision Tree

graph TD
    A[Issue Detected] --> B{Critical?}
    B -->|Yes| C[Rollback Immediately]
    B -->|No| D{Can Fix Quickly?}
    D -->|Yes <5min| E[Deploy Hotfix]
    D -->|No| C
    C --> F[Investigate Root Cause]
    E --> G{Fixed?}
    G -->|Yes| H[Monitor]
    G -->|No| C

Critical issues (roll back immediately):

  • Application won’t start
  • Database connection fails
  • 50%+ of requests failing
  • Security vulnerability
  • Data corruption

Non-critical issues (can deploy hotfix):

  • Single endpoint failing
  • Minor UI bug
  • Performance degradation <10%
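
The decision tree can also be written down as two small functions, which is handy for incident tooling or for checking the logic. This is a sketch: the `Issue` class and the function names are hypothetical, and only the five-minute limit comes from the diagram.

```python
from dataclasses import dataclass
from typing import Optional

QUICK_FIX_LIMIT_MINUTES = 5  # "Can Fix Quickly?" threshold from the diagram


@dataclass
class Issue:
    critical: bool
    # Estimated minutes to ship a fix; None means no estimate is available.
    fix_minutes: Optional[float] = None


def first_action(issue: Issue) -> str:
    """Return 'rollback' or 'hotfix' for a newly detected issue."""
    if issue.critical:
        return "rollback"
    if issue.fix_minutes is not None and issue.fix_minutes < QUICK_FIX_LIMIT_MINUTES:
        return "hotfix"
    return "rollback"


def after_hotfix(fixed: bool) -> str:
    """Return the follow-up once a hotfix has been deployed."""
    return "monitor" if fixed else "rollback"
```

Every path that is not a quick, successful hotfix ends in a rollback followed by a root-cause investigation, matching the diagram.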

Post-deployment Verification

After every deployment, verify:

Immediate Checks (0-5 minutes)

  • Health check returns 200
  • Application is accessible
  • Login works
  • Key workflows function (create order, etc.)
  • No error spike in logs

Short-term Monitoring (5-30 minutes)

  • No error rate increase
  • Response times normal (<500ms p95)
  • Background jobs processing
  • Database queries performing well
  • No memory leaks

Dashboard Checks

Metrics to Monitor:

  ✓ HTTP 200 response rate: >99%
✓ HTTP 500 error rate: <1%
✓ API response time (p95): <500ms
✓ Database connection pool: <80% utilised
✓ Celery queue depth: <100 tasks
✓ Memory usage: <80%
✓ CPU usage: <70%
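
These thresholds can be checked automatically once the numbers are exported from the monitoring tool. The sketch below assumes the metrics arrive as a plain dictionary. The key names are hypothetical, and only the limits come from the list above.

```python
import operator
from typing import Dict, List, Optional

# (comparison, limit) per metric, taken from the dashboard list
THRESHOLDS = {
    "http_200_rate_pct": (">", 99),
    "http_500_rate_pct": ("<", 1),
    "p95_response_ms": ("<", 500),
    "db_pool_utilisation_pct": ("<", 80),
    "celery_queue_depth": ("<", 100),
    "memory_pct": ("<", 80),
    "cpu_pct": ("<", 70),
}

_OPS = {">": operator.gt, "<": operator.lt}


def find_violations(metrics: Dict[str, Optional[float]]) -> List[str]:
    """Return one message per metric that is missing or outside its limit."""
    violations = []
    for name, (symbol, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None or not _OPS[symbol](value, limit):
            violations.append(f"{name}: got {value}, expected {symbol}{limit}")
    return violations
```

A missing metric is reported as a violation on purpose, since a dashboard that stopped reporting is itself worth investigating after a deploy.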
  

Where to Check Logs

Staging

Application Logs:

  # Via Railway CLI
railway logs --environment staging

# Via web dashboard
https://railway.app/project/your-project/deployments
  

Database Logs:

  # Via Railway
railway logs --environment staging --service postgres
  

Production

Application Logs:

  • CloudWatch Logs: https://console.aws.amazon.com/cloudwatch
  • Filter by log group: /aws/ecs/yourapp-production

Database Logs:

  • RDS Logs: https://console.aws.amazon.com/rds
  • View: Logs & Events

Worker Logs:

  # Via Railway
railway logs --environment production --service worker
  

Log Queries

Find errors (set the time range to the last hour in the Logs Insights time picker):

  # CloudWatch Logs Insights query
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
  

Response time analysis:

  fields @timestamp, duration_ms
| filter endpoint = "/api/v1/orders"
| stats avg(duration_ms), max(duration_ms), min(duration_ms)
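
The same analysis can be run locally on exported logs. The sketch below assumes each log line is a JSON object with `endpoint` and `duration_ms` fields, as the query above implies. `duration_stats` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable, Optional


def duration_stats(
    lines: Iterable[str], endpoint: str
) -> Optional[Dict[str, float]]:
    """Compute avg/max/min duration_ms for one endpoint from JSON log lines."""
    durations = []
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        value = record.get("duration_ms")
        if record.get("endpoint") == endpoint and isinstance(value, (int, float)):
            durations.append(value)
    if not durations:
        return None
    return {
        "avg": sum(durations) / len(durations),
        "max": max(durations),
        "min": min(durations),
    }
```

It returns `None` when no matching records exist, so an empty export is not mistaken for a zero-millisecond response time.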
  

Backup Verification

Before major deployments, verify backups exist.

Database Backups

  # Check automated backups (Railway/RDS)
railway backups list --environment production

# Manual backup before risky deployment
pg_dump -h prod-db.yourapp.com -U dbuser -d yourapp_prod > \
  backup_$(date +%Y%m%d_%H%M%S).sql
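
Because manual backups follow the `backup_YYYYMMDD_HHMMSS.sql` naming pattern, a pre-deploy script can confirm that a recent one exists. This is a sketch: the function names are hypothetical, and the 24-hour freshness window is an assumed default, not a documented policy.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

_BACKUP_RE = re.compile(r"^backup_(\d{8}_\d{6})\.sql$")


def latest_backup_time(filenames: Iterable[str]) -> Optional[datetime]:
    """Return the newest timestamp encoded in backup_*.sql filenames."""
    times = []
    for name in filenames:
        match = _BACKUP_RE.match(name)
        if not match:
            continue
        try:
            times.append(datetime.strptime(match.group(1), "%Y%m%d_%H%M%S"))
        except ValueError:
            continue  # digits that do not form a real date, e.g. month 13
    return max(times) if times else None


def has_recent_backup(
    filenames: Iterable[str],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed freshness window
) -> bool:
    """True if the newest backup is no older than max_age."""
    latest = latest_backup_time(filenames)
    return latest is not None and timedelta(0) <= now - latest <= max_age
```

In practice the filenames would come from something like `os.listdir()` on the backup directory, with `now` set to `datetime.now()`.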
  

Test Restore (Quarterly)

  # 1. Create test database
createdb test_restore

# 2. Restore latest backup
psql test_restore < latest_backup.sql

# 3. Verify data
psql test_restore -c "SELECT COUNT(*) FROM users;"
psql test_restore -c "SELECT COUNT(*) FROM orders;"

# 4. Clean up
dropdb test_restore
  

Railway/Cloud66 Specific Instructions

Railway

Deploy from CLI:

  # Install Railway CLI
npm install -g @railway/cli

# Login
railway login

# Link to project
railway link

# Deploy
railway up

# View logs
railway logs

# Run migrations
railway run alembic upgrade head

# Open shell
railway shell
  

Environment Variables:

  # Set variable
railway variables set DATABASE_URL="postgresql://..."

# View variables
railway variables

# Load from file
railway variables set --from-file .env.production
  

Cloud66

Deploy:

  # Via Git push (recommended)
git push production main

# Via dashboard
# Go to Application → Deploy

# Via CLI
cx stacks redeploy -s yourapp-production
  

Run Commands:

  # SSH into server
cx ssh -s yourapp-production

# Run migration
cx run -s yourapp-production "alembic upgrade head"

# Restart services
cx stacks restart -s yourapp-production
  

Environment Variables Management

Adding New Variables

  1. Update .env.example in repository
  2. Add to staging:
      railway variables set NEW_VAR="value" --environment staging
      
  3. Test on staging
  4. Add to production:
      railway variables set NEW_VAR="value" --environment production
      
  5. Document in README if needed
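
A quick way to catch a variable that was added to .env.example but never set in an environment is to compare the two lists. The sketch below assumes the deployed variable names have already been collected, for example by copying them from the platform dashboard. `parse_env_keys` and `missing_variables` are hypothetical helpers.

```python
from typing import Iterable, List, Set


def parse_env_keys(text: str) -> Set[str]:
    """Extract variable names from .env-style text (KEY=value lines)."""
    keys = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        if line.startswith("export "):
            line = line[len("export "):]
        key = line.split("=", 1)[0].strip()
        if key:
            keys.add(key)
    return keys


def missing_variables(example_text: str, deployed_names: Iterable[str]) -> List[str]:
    """Return names listed in .env.example but absent from the environment."""
    return sorted(parse_env_keys(example_text) - set(deployed_names))
```

Running this against staging first, then production, mirrors steps 2 and 4 above and makes a forgotten variable visible before the deploy instead of after.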

Secrets Management

Never commit secrets to Git.

Where to store:

  • Railway/Cloud66 environment variables
  • AWS Secrets Manager (for sensitive keys)
  • 1Password/Vault (team access)

Rotation schedule:

  • Database passwords: Quarterly
  • API keys: Annually, or immediately when compromised
  • JWT secrets: Annually
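
The schedule can be turned into a reminder check. In this sketch the secret kinds and the function name are hypothetical, and "quarterly" is approximated as 91 days.

```python
from datetime import date, timedelta

ROTATION_INTERVALS = {
    "database_password": timedelta(days=91),  # quarterly, approximated
    "api_key": timedelta(days=365),           # annually
    "jwt_secret": timedelta(days=365),        # annually
}


def rotation_due(
    kind: str, last_rotated: date, today: date, compromised: bool = False
) -> bool:
    """True if the secret should be rotated now."""
    if compromised:
        return True  # a compromised secret is rotated regardless of age
    return today - last_rotated >= ROTATION_INTERVALS[kind]
```

The last-rotated dates would have to be tracked somewhere the team already trusts, such as the secret manager's metadata.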

Deployment Notifications

Deployments trigger Slack notifications in #deployments.

Notification includes:

  • Environment (staging/production)
  • Commit message and author
  • Build status
  • Deployment status
  • Link to logs

Example:

  🚀 Deployment Started
Environment: Production
Commit: feat(webhooks): add subscription endpoints
Author: @john-doe
Status: Building...

[View Logs]
  

On success:

  ✅ Deployment Successful
Environment: Production
Duration: 3m 42s
Health Check: Passed

[View Application]
  

On failure:

  ❌ Deployment Failed
Environment: Production
Error: Health check failed
Rollback: Automatic

[View Logs] [View Error Details]
  

See ../06-tooling/github-slack-integration.md for setup.
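
For reference, the message text in the examples above could be assembled as follows. This is a sketch and not the integration described in the linked document. `deployment_message` is a hypothetical helper, and the `{"text": ...}` payload shape is the basic form accepted by Slack incoming webhooks.

```python
from typing import Dict, Optional

_HEADERS = {
    "started": "🚀 Deployment Started",
    "success": "✅ Deployment Successful",
    "failure": "❌ Deployment Failed",
}


def deployment_message(
    status: str, environment: str, details: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Build a Slack webhook payload for a deployment event."""
    lines = [_HEADERS[status], f"Environment: {environment}"]
    for label, value in (details or {}).items():
        lines.append(f"{label}: {value}")  # e.g. "Duration: 3m 42s"
    return {"text": "\n".join(lines)}
```

Calling `deployment_message("success", "Production", {"Duration": "3m 42s", "Health Check": "Passed"})` produces the body of the success example. Sending it is a matter of POSTing the JSON-encoded payload to the webhook URL.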


Deployment Checklist (Summary)

Before Deploy

  • Code reviewed and approved
  • Tests passing
  • Migrations ready (if needed)
  • Environment variables set
  • Backup verified

During Deploy

  • Migrations run first
  • Code deployed
  • Health check passes
  • Logs monitored

After Deploy

  • Key workflows tested
  • Metrics normal
  • No error spike
  • Team notified

Emergency Procedures

Production Down

  1. Check status page/health endpoint
  2. View logs for errors
  3. Check recent deployments - was there a recent release?
  4. If recent deploy, roll back immediately
  5. If not deploy-related, check infrastructure (database, Redis)
  6. Communicate in Slack #alerts
  7. Post-mortem after resolution

Database Issues

  1. Check connection pool - is it exhausted?
  2. Check slow queries - any queries >1s?
  3. Check disk space - is database full?
  4. Check locks - any long-running locks?
  5. If critical, scale up database instance
  6. Document in incident report

High Error Rate

  1. Check error logs for patterns
  2. Identify which endpoint is failing
  3. Roll back if caused by recent deploy
  4. Mitigate if caused by external dependency (API down, etc.)
  5. Scale up if caused by a traffic spike
  6. Monitor until resolved

Next Steps