Deployment Guide

This document covers our deployment process for staging and production environments.

Infrastructure Overview

Staging Environment

Purpose: Pre-production testing and client demos

Setup: Single server with all components

  Single Server (Railway/Cloud66)
├── FastAPI application
├── PostgreSQL database
├── Redis cache/queue
└── Celery workers

Characteristics:

Automated deployments from staging branch
Production-like configuration
Staging Shopify app connected
Test payment gateway
Smaller instance size (cost optimisation)

Access:

URL: https://staging.yourapp.com
Database: Restricted to VPN/specific IPs
Logs: Railway/Cloud66 dashboard

Production Environment

Purpose: Live customer-facing application

Setup: Separate servers for each component

  Load Balancer
├── API Server 1
├── API Server 2
└── API Server 3

PostgreSQL Primary
└── Read Replica (optional)

Redis Cluster
├── Master
└── Replicas

Celery Workers
├── Worker 1 (general tasks)
├── Worker 2 (webhook delivery)
└── Worker 3 (email/reports)

Characteristics:

Automated deployments from main branch
Multiple API instances for redundancy
Separate database server
Redis with persistence
Real payment processing
Full monitoring and alerts

Access:

URL: https://app.yourapp.com
API: https://api.yourapp.com
Database: Restricted to app servers only
Logs: CloudWatch/Datadog

Pre-deployment Checklist

Before deploying to staging or production:

Code Quality

All tests passing locally (pytest, npm test)
Pre-commit hooks pass
Code reviewed and approved (2 approvals)
No console.log or debug statements
No hardcoded secrets or API keys
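
The last two items in this list can be partly automated. The following is a minimal sketch of such a scan, not the project's actual pre-commit hook; the patterns and the `scan_source` name are assumptions chosen for illustration and would need tuning for a real codebase.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive rule set).
DEBUG_PATTERNS = [
    re.compile(r"\bconsole\.log\("),
    re.compile(r"\bbreakpoint\(\)"),
    re.compile(r"\bpdb\.set_trace\(\)"),
]
SECRET_PATTERN = re.compile(
    r"(?i)\b(api_key|secret|password|token)\b\s*=\s*['\"][^'\"]{8,}['\"]"
)


def scan_source(text: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) pairs for lines that look suspicious."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in DEBUG_PATTERNS):
            findings.append((number, "debug statement"))
        if SECRET_PATTERN.search(line):
            findings.append((number, "possible hardcoded secret"))
    return findings
```

A scan like this only catches obvious cases, so it complements code review rather than replacing it.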

Database

Migration files created (if schema changes)
Migration tested locally
Migration has rollback (downgrade function)
No data-destructive operations without backup
Indexes added for new queries

API

API documentation updated (docs/api/endpoints.md)
Breaking changes communicated to clients
Backwards compatibility maintained (or version bumped)
Response times tested (<500ms p95)

Dependencies

New dependencies approved
Dependencies version-pinned in requirements.txt / package.json
No security vulnerabilities (run npm audit, safety check)
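
As a rough illustration of the version-pinning check, the sketch below lists requirement lines that are not pinned with `==`. The function name and the regular expression are assumptions; it does not handle every form that pip accepts (URLs, environment markers, and so on).

```python
import re

# A pinned line looks like "name==1.2.3" or "name[extra]==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\],]+==[^=\s]+")


def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```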

Configuration

Environment variables documented
New env vars added to staging/production
Feature flags configured (if using)

Testing

Manual testing completed
Postman tests passing
Load testing completed (for high-traffic features)
Edge cases tested

Database Migration Strategy

IMPORTANT: Always run migrations BEFORE deploying code

Order:

Run migration (adds new columns, tables, indexes)
Deploy new code (uses new schema)

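This avoids downtime only when the migration is backwards compatible: the old code keeps running against the new schema until the new code is deployed.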

Migration Steps

For Staging

# 1. Log in and link the project with the Railway CLI
railway login
railway link

# 2. Run migration
railway run alembic upgrade head

# 3. Verify migration
railway run alembic current

# 4. Check database
railway connect  # Opens a database shell (psql)
\dt  # List tables (psql meta-command)
\d table_name  # Describe table (psql meta-command)

For Production

# 1. Backup database first
BACKUP_FILE="backup_$(date +%Y%m%d_%H%M%S).sql"
pg_dump -h prod-db.yourapp.com -U dbuser -d yourapp_prod > "$BACKUP_FILE"

# 2. Test migration on a local copy of the backup
createdb test_migration
psql test_migration < "$BACKUP_FILE"
# The restored alembic_version table already records the current revision.
# Point Alembic at the copy (assumes env.py reads DATABASE_URL).
DATABASE_URL="postgresql://localhost/test_migration" alembic upgrade head

# 3. If successful, run on production
railway run --environment production alembic upgrade head

# 4. Verify
railway run --environment production alembic current

Migration Dos and Don’ts

✅ Do:

Make migrations backwards compatible
Add columns as nullable first, backfill data, then add NOT NULL
Create indexes concurrently in PostgreSQL
Test migrations on production data backup
Include rollback (downgrade) function

❌ Don’t:

Drop columns without deprecation period
Rename columns in place (instead: create new, copy data, deprecate old)
Add NOT NULL columns without default
Run destructive migrations without backup
Skip testing migrations

Deployment Process

Automatic Deployments (Recommended)

We use GitHub Actions + Railway/Cloud66 for automated deployments.

How It Works

graph LR
    A[Push to main] --> B[GitHub Actions]
    B --> C[Run Tests]
    C --> D{Tests Pass?}
    D -->|No| E[Slack Alert]
    D -->|Yes| F[Build Docker Image]
    F --> G[Push to Registry]
    G --> H[Deploy to Railway]
    H --> I[Run Health Checks]
    I --> J{Healthy?}
    J -->|No| K[Rollback]
    J -->|Yes| L[Slack Success]

What Triggers Deployment

Staging: Any push to staging branch
Production: Any push to main branch

Deployment Steps (Automated)

Tests run - pytest, npm test, linting
Build - Docker image built
Push - Image pushed to registry
Deploy - Railway pulls new image
Health check - /health endpoint checked
Success/Failure - Slack notification sent
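
The ordering of these steps, and the fact that a failure stops the pipeline and triggers a notification, can be sketched as follows. This is an illustration of the control flow only, not the actual GitHub Actions workflow; `run_pipeline`, the step names, and the `notify` callback are hypothetical.

```python
from typing import Callable

# A step is a name plus an action that returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_pipeline(steps: list[Step], notify: Callable[[str], None]) -> bool:
    """Run steps in order; stop and notify at the first failure."""
    for name, action in steps:
        if not action():
            notify(f"Deployment failed at step: {name}")
            return False
    notify("Deployment successful")
    return True
```

Later steps never run after a failure, which is why a failing test suite prevents an image from being built or deployed.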

Zero-Downtime Deployment

Our deployment strategy ensures no downtime during releases.

How It Works

Rolling Deployment:

  Step 1: Initial State
[Instance 1: v1.0] [Instance 2: v1.0] [Instance 3: v1.0]
         ↓               ↓               ↓
    Load Balancer

Step 2: Update Instance 1
[Instance 1: v1.1] [Instance 2: v1.0] [Instance 3: v1.0]
         ↓               ↓               ↓
    Load Balancer (Instance 1 removed during update)

Step 3: Instance 1 Healthy
[Instance 1: v1.1] [Instance 2: v1.0] [Instance 3: v1.0]
         ↓               ↓               ↓
    Load Balancer

Step 4: Update Instance 2
[Instance 1: v1.1] [Instance 2: v1.1] [Instance 3: v1.0]
         ↓               ↓               ↓
    Load Balancer

Step 5: All Updated
[Instance 1: v1.1] [Instance 2: v1.1] [Instance 3: v1.1]
         ↓               ↓               ↓
    Load Balancer
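
The rolling sequence above can be simulated in a few lines. This is a sketch under stated assumptions: `rolling_update` and `is_healthy` are hypothetical names, the platform performs the real orchestration, and the sketch simply halts on the first unhealthy instance rather than rolling back the ones already updated.

```python
from typing import Callable


def rolling_update(
    versions: list[str],
    new_version: str,
    is_healthy: Callable[[int, str], bool],
) -> tuple[list[str], int]:
    """Update instances one at a time.

    Returns the final versions and the minimum number of instances that
    stayed in the load balancer at any point during the rollout.
    """
    versions = list(versions)
    min_serving = len(versions)
    for index, old_version in enumerate(versions):
        # Take this instance out of rotation while it is updated.
        min_serving = min(min_serving, len(versions) - 1)
        versions[index] = new_version
        if not is_healthy(index, new_version):
            versions[index] = old_version  # restore and halt the rollout
            break
        # Healthy: the instance rejoins the load balancer.
    return versions, min_serving
```

With three instances, at least two keep serving traffic throughout, which is what makes the release zero-downtime.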

Database Migrations During Deploy

Strategy: Make migrations backwards compatible

Example - Adding a new column:

from alembic import op
import sqlalchemy as sa

# WRONG - Breaking change: fails on a table that already has rows
def upgrade():
    op.add_column('users', sa.Column('phone', sa.String(20), nullable=False))

# RIGHT - Backwards compatible, split across two migration files
# Migration 1: Add column as nullable
def upgrade():
    op.add_column('users', sa.Column('phone', sa.String(20), nullable=True))

def downgrade():
    op.drop_column('users', 'phone')

# Deploy code that handles null phone numbers

# Migration 2 (later): Make it non-nullable after backfill
def upgrade():
    # Backfill data first
    op.execute("UPDATE users SET phone = '' WHERE phone IS NULL")
    # Then add constraint
    op.alter_column('users', 'phone', nullable=False)

def downgrade():
    op.alter_column('users', 'phone', nullable=True)
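
To see why the nullable-first approach keeps old code working, the sequence can be tried against an in-memory SQLite database using only the standard library. SQLite is a stand-in here, which is an assumption: it rejects a NOT NULL column without a default outright, whereas PostgreSQL rejects it only when the table already contains rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

# Breaking change: a NOT NULL column with no default is rejected.
try:
    conn.execute("ALTER TABLE users ADD COLUMN phone VARCHAR(20) NOT NULL")
    breaking_change_rejected = False
except sqlite3.OperationalError:
    breaking_change_rejected = True

# Step 1: add the column as nullable.
conn.execute("ALTER TABLE users ADD COLUMN phone VARCHAR(20)")

# Old code, unaware of the new column, can still insert rows.
conn.execute("INSERT INTO users (email) VALUES ('b@example.com')")

# Step 2: backfill before a later migration adds the NOT NULL constraint.
conn.execute("UPDATE users SET phone = '' WHERE phone IS NULL")
remaining_nulls = conn.execute(
    "SELECT COUNT(*) FROM users WHERE phone IS NULL"
).fetchone()[0]
```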

Health Checks

Every deployment verifies the application is healthy.

Health Check Endpoint

# app/api/v1/endpoints/health.py
from fastapi import APIRouter, Depends, Response, status
from sqlalchemy import text
from sqlalchemy.orm import Session

from app.api import deps
from app.core.redis import redis_client

router = APIRouter()


@router.get("/health")
def health_check(response: Response, db: Session = Depends(deps.get_db)) -> dict:
    """
    Health check endpoint.

    Verifies:
    - API is responding
    - Database connection works
    - Redis connection works

    Returns:
        dict: Health status (HTTP 503 if any check fails)
    """
    checks = {
        "status": "healthy",
        "checks": {}
    }

    # Check database
    try:
        # SQLAlchemy 2.0 requires raw SQL to be wrapped in text()
        db.execute(text("SELECT 1"))
        checks["checks"]["database"] = "healthy"
    except Exception as e:
        checks["status"] = "unhealthy"
        checks["checks"]["database"] = f"unhealthy: {str(e)}"

    # Check Redis
    try:
        redis_client.ping()
        checks["checks"]["redis"] = "healthy"
    except Exception as e:
        checks["status"] = "unhealthy"
        checks["checks"]["redis"] = f"unhealthy: {str(e)}"

    # Return 503 so load balancers and deploy checks detect the failure
    if checks["status"] != "healthy":
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE

    return checks

Health Check Response

Healthy:

  {
  "status": "healthy",
  "checks": {
    "database": "healthy",
    "redis": "healthy"
  }
}

Unhealthy:

  {
  "status": "unhealthy",
  "checks": {
    "database": "healthy",
    "redis": "unhealthy: Connection refused"
  }
}

Deployment Health Verification

After deployment, automated checks verify:

# 1. Basic health (--fail makes curl exit non-zero on HTTP errors)
curl --fail https://api.yourapp.com/health

# 2. API functionality
curl --fail https://api.yourapp.com/api/v1/ping

# 3. Response time (prints total request time in seconds)
curl --fail -w "%{time_total}\n" -o /dev/null -s https://api.yourapp.com/api/v1/ping

# If any check fails, deployment is rolled back
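
The same verification could be scripted with retries and a latency threshold. The sketch below uses only the standard library; `verify_deployment`, `http_probe`, the retry count, and the 500 ms limit are assumptions for illustration, not the checks the platform actually runs.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# A probe performs one request and returns (http_status, elapsed_ms).
Probe = Callable[[], tuple[int, float]]


def http_probe(url: str) -> tuple[int, float]:
    """Request the URL once and measure how long it took."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code  # non-2xx responses arrive as HTTPError
    except urllib.error.URLError:
        status = 0  # connection-level failure
    return status, (time.monotonic() - start) * 1000


def verify_deployment(
    probe: Probe,
    attempts: int = 3,
    max_latency_ms: float = 500.0,
    delay_s: float = 0.0,
) -> bool:
    """Return True once a probe reports HTTP 200 within the latency limit."""
    for attempt in range(attempts):
        status, elapsed_ms = probe()
        if status == 200 and elapsed_ms < max_latency_ms:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A call such as `verify_deployment(lambda: http_probe("https://api.yourapp.com/health"), delay_s=5)` would then gate the rollback decision.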

Rollback Procedure

If a deployment causes issues, roll back immediately.

Automatic Rollback

Railway/Cloud66 automatically rolls back if:

Health checks fail
Container crashes
Deployment timeout

Manual Rollback

  # Option 1: Redeploy previous version via Railway CLI
railway rollback

# Option 2: Revert the latest commit and let CI redeploy
git revert HEAD
git push origin main

# Option 3: Roll back via Railway dashboard
# Go to Deployments → Select previous deployment → Redeploy

Rollback Decision Tree

graph TD
    A[Issue Detected] --> B{Critical?}
    B -->|Yes| C[Rollback Immediately]
    B -->|No| D{Can Fix Quickly?}
    D -->|Yes <5min| E[Deploy Hotfix]
    D -->|No| C
    C --> F[Investigate Root Cause]
    E --> G{Fixed?}
    G -->|Yes| H[Monitor]
    G -->|No| C

Critical issues (rollback immediately):

Application won’t start
Database connection fails
50%+ of requests failing
Security vulnerability
Data corruption

Non-critical issues (can deploy hotfix):

Single endpoint failing
Minor UI bug
Performance degradation <10%
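
The decision tree reduces to a small function. This is a sketch: `choose_action` and its parameters are hypothetical, and judging whether an issue is critical still relies on the lists above.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ROLLBACK = "rollback"
    HOTFIX = "hotfix"


def choose_action(critical: bool, fix_minutes: Optional[float]) -> Action:
    """Pick a response to a post-deploy issue.

    fix_minutes is the estimated time to ship a fix, or None if unknown.
    """
    if critical:
        return Action.ROLLBACK
    if fix_minutes is not None and fix_minutes < 5:
        return Action.HOTFIX
    return Action.ROLLBACK
```

If a hotfix is deployed and does not resolve the issue, the tree leads back to a rollback.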

Post-deployment Verification

After every deployment, verify:

Immediate Checks (0-5 minutes)

Health check returns 200
Application is accessible
Login works
Key workflows function (create order, etc.)
No error spike in logs

Short-term Monitoring (5-30 minutes)

No error rate increase
Response times normal (<500ms p95)
Background jobs processing
Database queries performing well
No memory leaks

Dashboard Checks

Metrics to Monitor:

  ✓ HTTP 200 response rate: >99%
✓ HTTP 500 error rate: <1%
✓ API response time (p95): <500ms
✓ Database connection pool: <80% utilised
✓ Celery queue depth: <100 tasks
✓ Memory usage: <80%
✓ CPU usage: <70%
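
These thresholds can be encoded so that a script or alert rule flags breaches. The metric keys and the `breached_metrics` function below are assumed names; the limits are the ones listed above.

```python
# metric name -> (comparison, limit), taken from the dashboard list
THRESHOLDS = {
    "http_200_rate_pct": (">", 99.0),
    "http_500_rate_pct": ("<", 1.0),
    "p95_response_ms": ("<", 500.0),
    "db_pool_utilisation_pct": ("<", 80.0),
    "celery_queue_depth": ("<", 100.0),
    "memory_usage_pct": ("<", 80.0),
    "cpu_usage_pct": ("<", 70.0),
}


def breached_metrics(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that are missing or outside their limit."""
    breached = []
    for name, (comparison, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            breached.append(name)  # treat missing data as a breach
            continue
        within_limit = value > limit if comparison == ">" else value < limit
        if not within_limit:
            breached.append(name)
    return breached
```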

Where to Check Logs

Staging

Application Logs:

  # Via Railway CLI
railway logs --environment staging

# Via web dashboard
https://railway.app/project/your-project/deployments

Database Logs:

  # Via Railway
railway logs --environment staging --service postgres

Production

Application Logs:

CloudWatch Logs: https://console.aws.amazon.com/cloudwatch
Filter by log group: /aws/ecs/yourapp-production

Database Logs:

RDS Logs: https://console.aws.amazon.com/rds
View: Logs & Events

Worker Logs:

  # Via Railway
railway logs --environment production --service worker

Log Queries

Find recent errors (set the query time range to the last hour):

  # CloudWatch Logs Insights query
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

Response time analysis:

  fields @timestamp, duration_ms
| filter endpoint = "/api/v1/orders"
| stats avg(duration_ms), max(duration_ms), min(duration_ms)

Backup Verification

Before major deployments, verify backups exist.

Database Backups

  # Check automated backups (Railway/RDS)
railway backups list --environment production

# Manual backup before risky deployment
pg_dump -h prod-db.yourapp.com -U dbuser -d yourapp_prod > \
  backup_$(date +%Y%m%d_%H%M%S).sql

Test Restore (Quarterly)

  # 1. Create test database
createdb test_restore

# 2. Restore latest backup
psql test_restore < latest_backup.sql

# 3. Verify data
psql test_restore -c "SELECT COUNT(*) FROM users;"
psql test_restore -c "SELECT COUNT(*) FROM orders;"

# 4. Clean up
dropdb test_restore

Railway/Cloud66 Specific Instructions

Railway

Deploy from CLI:

  # Install Railway CLI
npm install -g @railway/cli

# Login
railway login

# Link to project
railway link

# Deploy
railway up

# View logs
railway logs

# Run migrations
railway run alembic upgrade head

# Open a local shell with the project's variables loaded
railway shell

Environment Variables:

  # Set variable
railway variables set DATABASE_URL="postgresql://..."

# View variables
railway variables

# Load from file
railway variables set --from-file .env.production

Cloud66

Deploy:

  # Via Git push (recommended)
git push production main

# Via dashboard
# Go to Application → Deploy

# Via CLI
cx stacks redeploy -s yourapp-production

Run Commands:

  # SSH into server
cx ssh -s yourapp-production

# Run migration
cx run -s yourapp-production "alembic upgrade head"

# Restart services
cx stacks restart -s yourapp-production

Environment Variables Management

Adding New Variables

Update .env.example in repository

Add to staging:

  railway variables set NEW_VAR="value" --environment staging

Test on staging

Add to production:

  railway variables set NEW_VAR="value" --environment production

Document in README if needed
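
One way to catch a variable that was added to .env.example but never set in an environment is to compare key names. The sketch below is illustrative: `parse_env_keys` and `missing_variables` are hypothetical helpers, and the set of configured names would have to come from the platform's variable listing.

```python
def parse_env_keys(env_text: str) -> set[str]:
    """Extract variable names from .env-style text."""
    keys = set()
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        keys.add(key)
    return keys


def missing_variables(example_text: str, configured: set[str]) -> list[str]:
    """Return names present in .env.example but absent from the environment."""
    return sorted(parse_env_keys(example_text) - configured)
```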

Secrets Management

Never commit secrets to Git.

Where to store:

Railway/Cloud66 environment variables
AWS Secrets Manager (for sensitive keys)
1Password/Vault (team access)

Rotation schedule:

Database passwords: Quarterly
API keys: Annually, or immediately when compromised
JWT secrets: Annually
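
The schedule can be checked mechanically if the last rotation date of each secret is recorded. In this sketch the day counts (91 for quarterly, 365 for annually) and the names are assumptions; a compromised key should be rotated immediately regardless of the result.

```python
from datetime import date, timedelta

# Rotation intervals in days (approximations of quarterly and annually).
ROTATION_DAYS = {
    "database_password": 91,
    "api_key": 365,
    "jwt_secret": 365,
}


def rotation_due(kind: str, last_rotated: date, today: date) -> bool:
    """Return True when the secret has reached its rotation interval."""
    return today - last_rotated >= timedelta(days=ROTATION_DAYS[kind])
```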

Deployment Notifications

Deployments trigger Slack notifications in #deployments.

Notification includes:

Environment (staging/production)
Commit message and author
Build status
Deployment status
Link to logs

Example:

  🚀 Deployment Started
Environment: Production
Commit: feat(webhooks): add subscription endpoints
Author: @john-doe
Status: Building...

[View Logs]

On success:

  ✅ Deployment Successful
Environment: Production
Duration: 3m 42s
Health Check: Passed

[View Application]

On failure:

  ❌ Deployment Failed
Environment: Production
Error: Health check failed
Rollback: Automatic

[View Logs] [View Error Details]
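
The message bodies shown above follow a common shape: a status header, the environment, then detail lines. A minimal formatter might look like the sketch below; `format_notification` is a hypothetical helper and does not cover posting to Slack or rendering the link buttons.

```python
HEADERS = {
    "started": "🚀 Deployment Started",
    "success": "✅ Deployment Successful",
    "failure": "❌ Deployment Failed",
}


def format_notification(status: str, environment: str, details: dict[str, str]) -> str:
    """Build the text of a deployment notification."""
    lines = [HEADERS[status], f"Environment: {environment}"]
    lines.extend(f"{label}: {value}" for label, value in details.items())
    return "\n".join(lines)
```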

See ../06-tooling/github-slack-integration.md for setup.

Deployment Checklist (Summary)

Before Deploy

Code reviewed and approved
Tests passing
Migrations ready (if needed)
Environment variables set
Backup verified

During Deploy

Migrations run first
Code deployed
Health check passes
Logs monitored

After Deploy

Key workflows tested
Metrics normal
No error spike
Team notified

Emergency Procedures

Production Down

Check status page/health endpoint
View logs for errors
Check recent deployments - was there a recent release?
If recent deploy, roll back immediately
If not deploy-related, check infrastructure (database, Redis)
Communicate in Slack #alerts
Post-mortem after resolution

Database Issues

Check connection pool - is it exhausted?
Check slow queries - any queries >1s?
Check disk space - is database full?
Check locks - any long-running locks?
If critical, scale up database instance
Document in incident report

High Error Rate

Check error logs for patterns
Identify which endpoint is failing
Roll back if caused by recent deploy
Fix if caused by external dependency (API down, etc.)
Scale if traffic spike
Monitor until resolved

Next Steps

Review ../05-runbooks/incident-response.md for incident handling
Check ../05-runbooks/debugging.md for troubleshooting
See pr-process.md for getting code ready to deploy
