Deployment Guide
This document covers our deployment process for staging and production environments.
Infrastructure Overview
Staging Environment
Purpose: Pre-production testing and client demos
Setup: Single server with all components
Single Server (Railway/Cloud66)
├── FastAPI application
├── PostgreSQL database
├── Redis cache/queue
└── Celery workers
Characteristics:
- Automated deployments from staging branch
- Production-like configuration
- Staging Shopify app connected
- Test payment gateway
- Smaller instance size (cost optimisation)
Access:
- URL: https://staging.yourapp.com
- Database: Restricted to VPN/specific IPs
- Logs: Railway/Cloud66 dashboard
Production Environment
Purpose: Live customer-facing application
Setup: Separate servers for each component
Load Balancer
├── API Server 1
├── API Server 2
└── API Server 3
PostgreSQL Primary
└── Read Replica (optional)
Redis Cluster
├── Master
└── Replicas
Celery Workers
├── Worker 1 (general tasks)
├── Worker 2 (webhook delivery)
└── Worker 3 (email/reports)
Characteristics:
- Automated deployments from main branch
- Multiple API instances for redundancy
- Separate database server
- Redis with persistence
- Real payment processing
- Full monitoring and alerts
Access:
- URL: https://app.yourapp.com
- API: https://api.yourapp.com
- Database: Restricted to app servers only
- Logs: CloudWatch/Datadog
Pre-deployment Checklist
Before deploying to staging or production:
Code Quality
- All tests passing locally (pytest, npm test)
- Pre-commit hooks pass
- Code reviewed and approved (2 approvals)
- No console.log or debug statements
- No hardcoded secrets or API keys
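The last two checks in this list can be partly automated. The following is a minimal sketch, assuming a handful of hypothetical regular expressions; the patterns and the find_violations helper are illustrative and not part of our tooling, and a dedicated secret scanner will catch far more.

```python
import re

# Hypothetical patterns; tune these to your codebase.
DEBUG_PATTERNS = [
    re.compile(r"\bconsole\.log\("),
    re.compile(r"\bbreakpoint\(\)"),
    re.compile(r"\bpdb\.set_trace\(\)"),
]
# Matches assignments such as API_KEY = "abcd1234efgh" (a quoted literal of 8+ characters).
SECRET_PATTERN = re.compile(
    r"(?i)\b(api[_-]?key|secret|password|token)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
)


def find_violations(source: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) pairs for suspicious lines in source."""
    violations = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in DEBUG_PATTERNS):
            violations.append((number, "debug statement"))
        if SECRET_PATTERN.search(line):
            violations.append((number, "possible hardcoded secret"))
    return violations
```

A check like this could run as a pre-commit hook over the staged files; values read from the environment (for example os.environ["API_KEY"]) are not flagged.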
Database
- Migration files created (if schema changes)
- Migration tested locally
- Migration has rollback (downgrade function)
- No data-destructive operations without backup
- Indexes added for new queries
API
- API documentation updated (docs/api/endpoints.md)
- Breaking changes communicated to clients
- Backwards compatibility maintained (or version bumped)
- Response times tested (<500ms)
Dependencies
- New dependencies approved
- Dependencies version-pinned in requirements.txt / package.json
- No security vulnerabilities (run npm audit, safety check)
Configuration
- Environment variables documented
- New env vars added to staging/production
- Feature flags configured (if using)
Testing
- Manual testing completed
- Postman tests passing
- Load testing completed (for high-traffic features)
- Edge cases tested
Database Migration Strategy
IMPORTANT: Always run migrations BEFORE deploying code
Order:
- Run migration (adds new columns, tables, indexes)
- Deploy new code (uses new schema)
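Because the old code keeps running against the migrated schema until the new code is live, this order avoids downtime only if the migration is backwards compatible with the old code.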
Migration Steps
For Staging
# 1. SSH into staging server (or use Railway CLI)
railway login
railway link
# 2. Run migration
railway run alembic upgrade head
# 3. Verify migration
railway run alembic current
# 4. Check database
railway connect # Opens DB connection
\dt # List tables
\d table_name # Describe table
For Production
# 1. Backup database first
pg_dump -h prod-db.yourapp.com -U dbuser -d yourapp_prod > backup_$(date +%Y%m%d_%H%M%S).sql
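# 2. Test migration on backup locally
createdb test_migration
psql test_migration < backup_20250115_103000.sql
# The restored dump already records its revision in Alembic's alembic_version table; confirm it
psql test_migration -c "SELECT version_num FROM alembic_version;"
# Point Alembic at test_migration (not production) before upgrading
alembic upgrade head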
# 3. If successful, run on production
railway run --environment production alembic upgrade head
# 4. Verify
railway run --environment production alembic current
Migration Dos and Don’ts
✅ Do:
- Make migrations backwards compatible
- Add columns as nullable first, backfill data, then add NOT NULL
- Create indexes concurrently in PostgreSQL
- Test migrations on production data backup
- Include rollback (downgrade) function
❌ Don’t:
- Drop columns without deprecation period
- Rename columns (create new, copy data, deprecate old)
- Add NOT NULL columns without default
- Run destructive migrations without backup
- Skip testing migrations
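The "create new, copy data, deprecate old" advice for renames can be sketched as follows. This is an illustration only: it uses an in-memory SQLite database in place of PostgreSQL, and the users table and the fullname and display_name columns are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany("INSERT INTO users (fullname) VALUES (?)", [("Ada",), ("Grace",)])

# Step 1 (expand): add the new column as nullable so the old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2 (backfill): copy data from the old column into the new one.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Step 3 (verify): only tighten constraints or drop the old column once nothing is missing.
missing = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
rows = conn.execute("SELECT fullname, display_name FROM users ORDER BY id").fetchall()
```

The old column stays in place until every deployed version of the code reads and writes only the new one; dropping it is a separate, later migration.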
Deployment Process
Automatic Deployments (Recommended)
We use GitHub Actions + Railway/Cloud66 for automated deployments.
How It Works
graph LR
A[Push to main] --> B[GitHub Actions]
B --> C[Run Tests]
C --> D{Tests Pass?}
D -->|No| E[Slack Alert]
D -->|Yes| F[Build Docker Image]
F --> G[Push to Registry]
G --> H[Deploy to Railway]
H --> I[Run Health Checks]
I --> J{Healthy?}
J -->|No| K[Rollback]
J -->|Yes| L[Slack Success]
What Triggers Deployment
Staging: Any push to staging branch
Production: Any push to main branch
Deployment Steps (Automated)
- Tests run - pytest, npm test, linting
- Build - Docker image built
- Push - Image pushed to registry
- Deploy - Railway pulls new image
- Health check - /health endpoint checked
- Success/Failure - Slack notification sent
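The control flow of these steps (run in order, stop at the first failure, always notify) can be sketched in a few lines. The run_pipeline function and its notify callback are hypothetical names for illustration; the real pipeline is defined in GitHub Actions, not in Python.

```python
from typing import Callable

# A step is a name plus a callable that returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_pipeline(steps: list[Step], notify: Callable[[str], None]) -> bool:
    """Run steps in order; stop at the first failure and send one notification."""
    for name, action in steps:
        if not action():
            notify(f"Deployment failed at step: {name}")
            return False
    notify("Deployment successful")
    return True
```

Later steps never run after a failure, so a broken build is never pushed or deployed.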
Zero-Downtime Deployment
Our deployment strategy ensures no downtime during releases.
How It Works
Rolling Deployment:
Step 1: Initial State
[Instance 1: v1.0] [Instance 2: v1.0] [Instance 3: v1.0]
↓ ↓ ↓
Load Balancer
Step 2: Update Instance 1
[Instance 1: v1.1] [Instance 2: v1.0] [Instance 3: v1.0]
↓ ↓ ↓
Load Balancer (Instance 1 removed during update)
Step 3: Instance 1 Healthy
[Instance 1: v1.1] [Instance 2: v1.0] [Instance 3: v1.0]
↓ ↓ ↓
Load Balancer
Step 4: Update Instance 2
[Instance 1: v1.1] [Instance 2: v1.1] [Instance 3: v1.0]
↓ ↓ ↓
Load Balancer
Step 5: All Updated
[Instance 1: v1.1] [Instance 2: v1.1] [Instance 3: v1.1]
↓ ↓ ↓
Load Balancer
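The rolling sequence above can be modelled as a loop that replaces one instance at a time and halts on the first unhealthy result. This is a simplified simulation under stated assumptions: rolling_update and is_healthy are hypothetical names, and the platform (Railway/Cloud66) performs the real orchestration.

```python
from typing import Callable


def rolling_update(
    instances: dict[str, str],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
) -> bool:
    """Update one instance at a time; restore it and stop on the first unhealthy one."""
    for name in list(instances):
        previous = instances[name]
        # The instance is out of the load balancer only while it is being replaced.
        instances[name] = new_version
        if not is_healthy(name, new_version):
            instances[name] = previous  # put the old version back and halt the rollout
            return False
    return True
```

In this sketch, instances updated before the failure stay on the new version; a full rollback would revert those as well.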
Database Migrations During Deploy
Strategy: Make migrations backwards compatible
Example - Adding a new column:
from alembic import op
import sqlalchemy as sa

# WRONG - Breaking change
def upgrade():
    op.add_column('users', sa.Column('phone', sa.String(20), nullable=False))

# RIGHT - Backwards compatible
# Step 1: Add column as nullable
def upgrade():
    op.add_column('users', sa.Column('phone', sa.String(20), nullable=True))

# Deploy code that handles null phone numbers

# Step 2 (later migration): Make it non-nullable after backfill
def upgrade():
    # Backfill data first
    op.execute("UPDATE users SET phone = '' WHERE phone IS NULL")
    # Then add constraint
    op.alter_column('users', 'phone', nullable=False)
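Between the two migrations, application code has to tolerate rows without a phone number. A minimal sketch of what that might look like, assuming a hypothetical User dataclass and display_phone helper (neither is taken from our codebase):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    # Hypothetical model; phone may be NULL until the backfill has run.
    id: int
    phone: Optional[str] = None


def display_phone(user: User) -> str:
    """Return a printable phone number, treating NULL and '' as missing."""
    return user.phone if user.phone else "not provided"
```

Treating the empty string as missing matters here because the backfill writes '' into existing rows.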
Health Checks
Every deployment verifies the application is healthy.
Health Check Endpoint
# app/api/v1/endpoints/health.py
from fastapi import APIRouter, Depends, Response
from sqlalchemy import text
from sqlalchemy.orm import Session
from app.api import deps
from app.core.redis import redis_client

router = APIRouter()


@router.get("/health")
def health_check(response: Response, db: Session = Depends(deps.get_db)) -> dict:
    """
    Health check endpoint.

    Verifies:
    - API is responding
    - Database connection works
    - Redis connection works

    Returns:
        dict: Health status (sent with HTTP 503 when any check fails)
    """
    checks = {
        "status": "healthy",
        "checks": {}
    }
    # Check database (SQLAlchemy 2.0 requires text() for raw SQL strings)
    try:
        db.execute(text("SELECT 1"))
        checks["checks"]["database"] = "healthy"
    except Exception as e:
        checks["status"] = "unhealthy"
        checks["checks"]["database"] = f"unhealthy: {str(e)}"
    # Check Redis
    try:
        redis_client.ping()
        checks["checks"]["redis"] = "healthy"
    except Exception as e:
        checks["status"] = "unhealthy"
        checks["checks"]["redis"] = f"unhealthy: {str(e)}"
    # Signal failure through the status code so status-based health checks notice it
    if checks["status"] != "healthy":
        response.status_code = 503
    return checks
Health Check Response
Healthy:
{
  "status": "healthy",
  "checks": {
    "database": "healthy",
    "redis": "healthy"
  }
}
Unhealthy:
{
  "status": "unhealthy",
  "checks": {
    "database": "healthy",
    "redis": "unhealthy: Connection refused"
  }
}
Deployment Health Verification
After deployment, automated checks verify:
# 1. Basic health
curl https://api.yourapp.com/health
# 2. API functionality
curl https://api.yourapp.com/api/v1/ping
# 3. Response time
curl -w "@curl-format.txt" -o /dev/null -s https://api.yourapp.com/api/v1/ping
# If any check fails, deployment is rolled back
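A freshly deployed instance may need a few seconds before it reports healthy, so a verification script usually polls rather than checking once. The sketch below shows one way to do that with the standard library; wait_until_healthy, its retry count, and its delay are assumptions for illustration, not the settings our pipeline uses.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str) -> dict:
    """Fetch and decode the JSON health payload (performs a network call)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_until_healthy(
    url: str,
    fetch: Callable[[str], dict] = fetch_health,
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll the health endpoint; return True as soon as status is 'healthy'."""
    for attempt in range(attempts):
        try:
            if fetch(url).get("status") == "healthy":
                return True
        except Exception:
            pass  # treat network, HTTP, or decode errors as "not healthy yet"
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

The fetch parameter is injectable so the polling logic can be tested without a network; in a deploy script you would call wait_until_healthy("https://api.yourapp.com/health") and trigger a rollback when it returns False.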
Rollback Procedure
If a deployment causes a critical issue, roll back immediately; for non-critical issues, follow the decision tree below.
Automatic Rollback
Railway/Cloud66 automatically rolls back if:
- Health checks fail
- Container crashes
- Deployment timeout
Manual Rollback
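# Option 1: Redeploy previous version via Railway CLI
# (check railway --help first; not every CLI version provides a rollback command)
railway rollback
# Option 2: Revert the latest commit and push, which triggers a new deployment
git revert HEAD
git push origin main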
# Option 3: Roll back via Railway dashboard
# Go to Deployments → Select previous deployment → Redeploy
Rollback Decision Tree
graph TD
A[Issue Detected] --> B{Critical?}
B -->|Yes| C[Rollback Immediately]
B -->|No| D{Can Fix Quickly?}
D -->|Yes <5min| E[Deploy Hotfix]
D -->|No| C
C --> F[Investigate Root Cause]
E --> G{Fixed?}
G -->|Yes| H[Monitor]
G -->|No| C
Critical issues (rollback immediately):
- Application won’t start
- Database connection fails
- 50%+ of requests failing
- Security vulnerability
- Data corruption
Non-critical issues (can deploy hotfix):
- Single endpoint failing
- Minor UI bug
- Performance degradation <10%
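The decision tree above reduces to a small function, which can be handy in an on-call script or as a quick reference. The next_action name and its parameters are hypothetical; deciding whether an issue is critical still relies on the lists above.

```python
from typing import Optional


def next_action(
    critical: bool,
    fixable_in_5_min: bool,
    hotfix_fixed: Optional[bool] = None,
) -> str:
    """Return 'rollback', 'deploy hotfix', or 'monitor' following the decision tree."""
    if critical or not fixable_in_5_min:
        return "rollback"
    if hotfix_fixed is None:
        # No hotfix has been attempted yet.
        return "deploy hotfix"
    return "monitor" if hotfix_fixed else "rollback"
```

Every path that does not end in a verified fix ends in a rollback, which matches the tree: the hotfix branch is the only exception and it is time-boxed.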
Post-deployment Verification
After every deployment, verify:
Immediate Checks (0-5 minutes)
- Health check returns 200
- Application is accessible
- Login works
- Key workflows function (create order, etc.)
- No error spike in logs
Short-term Monitoring (5-30 minutes)
- No error rate increase
- Response times normal (<500ms p95)
- Background jobs processing
- Database queries performing well
- No memory leaks
Dashboard Checks
Metrics to Monitor:
✓ HTTP 200 response rate: >99%
✓ HTTP 500 error rate: <1%
✓ API response time (p95): <500ms
✓ Database connection pool: <80% utilised
✓ Celery queue depth: <100 tasks
✓ Memory usage: <80%
✓ CPU usage: <70%
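These thresholds can also be checked programmatically once the metrics have been exported. A minimal sketch follows; the limits are copied from the list above, while the metric key names and the failing_metrics helper are hypothetical and would need to match whatever your monitoring tool exposes.

```python
# Thresholds copied from the list above; the metric key names are hypothetical.
THRESHOLDS = {
    "http_200_rate_pct": (">", 99),
    "http_500_rate_pct": ("<", 1),
    "api_p95_ms": ("<", 500),
    "db_pool_utilisation_pct": ("<", 80),
    "celery_queue_depth": ("<", 100),
    "memory_usage_pct": ("<", 80),
    "cpu_usage_pct": ("<", 70),
}


def failing_metrics(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that are missing or outside their threshold."""
    failures = []
    for name, (operator, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(name)  # a missing metric is treated as a failure
        elif operator == ">" and not value > limit:
            failures.append(name)
        elif operator == "<" and not value < limit:
            failures.append(name)
    return failures
```

An empty result means every metric is within its limit; a value exactly on a limit counts as a failure because the thresholds are strict.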
Where to Check Logs
Staging
Application Logs:
# Via Railway CLI
railway logs --environment staging
# Via web dashboard
https://railway.app/project/your-project/deployments
Database Logs:
# Via Railway
railway logs --environment staging --service postgres
Production
Application Logs:
- CloudWatch Logs: https://console.aws.amazon.com/cloudwatch
- Filter by log group: /aws/ecs/yourapp-production
Database Logs:
- RDS Logs: https://console.aws.amazon.com/rds
- View: Logs & Events
Worker Logs:
# Via Railway
railway logs --environment production --service worker
Log Queries
Find errors in last hour:
# CloudWatch Logs Insights query
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
Response time analysis:
fields @timestamp, duration_ms
| filter endpoint = "/api/v1/orders"
| stats avg(duration_ms), max(duration_ms), min(duration_ms)
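The query above reports average, maximum, and minimum, while the post-deployment targets are stated as p95. If you export the raw durations, the same summary plus a p95 can be computed offline. This is a sketch under stated assumptions: summarize_durations is a hypothetical helper and it uses the nearest-rank definition of a percentile, which may differ slightly from what your monitoring tool reports.

```python
import statistics


def summarize_durations(durations_ms: list[float]) -> dict[str, float]:
    """Return avg, min, max, and nearest-rank p95 for a non-empty list of durations."""
    ordered = sorted(durations_ms)
    # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }
```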
Backup Verification
Before major deployments, verify backups exist.
Database Backups
# Check automated backups (Railway/RDS)
railway backups list --environment production
# Manual backup before risky deployment
pg_dump -h prod-db.yourapp.com -U dbuser -d yourapp_prod > \
backup_$(date +%Y%m%d_%H%M%S).sql
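If the manual backup is triggered from a script rather than a shell, the timestamped filename can be built the same way. The sketch below only assembles the pg_dump argument list; build_backup_command is a hypothetical helper, and the host, user, and database values are the placeholders used above.

```python
from datetime import datetime


def build_backup_command(host: str, user: str, database: str, now: datetime) -> list[str]:
    """Return a pg_dump argv that writes to a timestamped file via --file."""
    filename = f"backup_{now:%Y%m%d_%H%M%S}.sql"
    return ["pg_dump", "-h", host, "-U", user, "-d", database, "--file", filename]
```

Passing the result to subprocess.run(command, check=True) would execute it; using an argument list instead of a shell string avoids quoting problems.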
Test Restore (Quarterly)
# 1. Create test database
createdb test_restore
# 2. Restore latest backup
psql test_restore < latest_backup.sql
# 3. Verify data
psql test_restore -c "SELECT COUNT(*) FROM users;"
psql test_restore -c "SELECT COUNT(*) FROM orders;"
# 4. Clean up
dropdb test_restore
Railway/Cloud66 Specific Instructions
Railway
Deploy from CLI:
# Install Railway CLI
npm install -g @railway/cli
# Login
railway login
# Link to project
railway link
# Deploy
railway up
# View logs
railway logs
# Run migrations
railway run alembic upgrade head
# Open shell
railway shell
Environment Variables:
# Set variable
railway variables set DATABASE_URL="postgresql://..."
# View variables
railway variables
# Load from file
railway variables set --from-file .env.production
Cloud66
Deploy:
# Via Git push (recommended)
git push production main
# Via dashboard
# Go to Application → Deploy
# Via CLI
cx stacks redeploy -s yourapp-production
Run Commands:
# SSH into server
cx ssh -s yourapp-production
# Run migration
cx run -s yourapp-production "alembic upgrade head"
# Restart services
cx stacks restart -s yourapp-production
Environment Variables Management
Adding New Variables
- Update .env.example in repository
- Add to staging: railway variables set NEW_VAR="value" --environment staging
- Test on staging
- Add to production: railway variables set NEW_VAR="value" --environment production
- Document in README if needed
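A forgotten variable is a common cause of failed deployments, so it can help to compare .env.example against the target environment before releasing. A minimal sketch, assuming .env.example uses simple KEY=value lines; required_keys and missing_variables are hypothetical helpers.

```python
def required_keys(env_example_text: str) -> list[str]:
    """Extract variable names from .env.example content, skipping comments and blanks."""
    keys = []
    for line in env_example_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            keys.append(line.split("=", 1)[0].strip())
    return keys


def missing_variables(env_example_text: str, environment: dict[str, str]) -> list[str]:
    """Return keys listed in .env.example that are unset or empty in environment."""
    return [key for key in required_keys(env_example_text) if not environment.get(key)]
```

Run at application startup with os.environ as the environment argument, this turns a missing variable into an immediate, readable failure.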
Secrets Management
Never commit secrets to Git.
Where to store:
- Railway/Cloud66 environment variables
- AWS Secrets Manager (for sensitive keys)
- 1Password/Vault (team access)
Rotation schedule:
- Database passwords: Quarterly
- API keys: When compromised or annually
- JWT secrets: Annually
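The rotation schedule can be turned into a simple reminder check. In this sketch "quarterly" is assumed to mean 90 days and "annually" 365 days; the secret kind names and the rotation_due helper are hypothetical.

```python
from datetime import date, timedelta

# Assumed intervals: "quarterly" as 90 days, "annually" as 365 days.
ROTATION_INTERVALS = {
    "database_password": timedelta(days=90),
    "api_key": timedelta(days=365),
    "jwt_secret": timedelta(days=365),
}


def rotation_due(kind: str, last_rotated: date, today: date, compromised: bool = False) -> bool:
    """Return True when a secret is compromised or older than its rotation interval."""
    return compromised or today - last_rotated >= ROTATION_INTERVALS[kind]
```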
Deployment Notifications
Deployments trigger Slack notifications in #deployments.
Notification includes:
- Environment (staging/production)
- Commit message and author
- Build status
- Deployment status
- Link to logs
Example:
🚀 Deployment Started
Environment: Production
Commit: feat(webhooks): add subscription endpoints
Author: @john-doe
Status: Building...
[View Logs]
On success:
✅ Deployment Successful
Environment: Production
Duration: 3m 42s
Health Check: Passed
[View Application]
On failure:
❌ Deployment Failed
Environment: Production
Error: Health check failed
Rollback: Automatic
[View Logs] [View Error Details]
See ../06-tooling/github-slack-integration.md for setup.
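For reference, the message text in the examples above could be assembled like this. It is a sketch only: format_deployment_message is a hypothetical helper, and the actual notifications are produced by the GitHub/Slack integration described in the linked document.

```python
def format_deployment_message(environment: str, status: str, details: dict[str, str]) -> str:
    """Build a plain-text notification similar to the examples above."""
    titles = {
        "started": "🚀 Deployment Started",
        "success": "✅ Deployment Successful",
        "failure": "❌ Deployment Failed",
    }
    lines = [titles[status], f"Environment: {environment}"]
    lines += [f"{label}: {value}" for label, value in details.items()]
    return "\n".join(lines)
```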
Deployment Checklist (Summary)
Before Deploy
- Code reviewed and approved
- Tests passing
- Migrations ready (if needed)
- Environment variables set
- Backup verified
During Deploy
- Migrations run first
- Code deployed
- Health check passes
- Logs monitored
After Deploy
- Key workflows tested
- Metrics normal
- No error spike
- Team notified
Emergency Procedures
Production Down
- Check status page/health endpoint
- View logs for errors
- Check recent deployments - was there a recent release?
- If there was a recent deploy, roll back immediately
- If not deploy-related, check infrastructure (database, redis)
- Communicate in Slack #alerts
- Post-mortem after resolution
Database Issues
- Check connection pool - is it exhausted?
- Check slow queries - any queries >1s?
- Check disk space - is database full?
- Check locks - any long-running locks?
- If critical, scale up database instance
- Document in incident report
High Error Rate
- Check error logs for patterns
- Identify which endpoint is failing
- Rollback if caused by recent deploy
- Fix if caused by external dependency (API down, etc.)
- Scale if traffic spike
- Monitor until resolved
Next Steps
- Review ../05-runbooks/incident-response.md for incident handling
- Check ../05-runbooks/debugging.md for troubleshooting
- See pr-process.md for getting code ready to deploy