# Architecture Overview

This document provides a high-level overview of our system architecture, request flows, and key design decisions.
## Components Overview

### API Server (FastAPI)

**Purpose:** Handle HTTP requests, validate input, return responses quickly (<500ms)

**Responsibilities:**
- Receive and validate HTTP requests
- Authenticate and authorise users
- Execute business logic via service layer
- Return JSON responses
- Queue background jobs for long-running tasks

**Technology:** FastAPI with Uvicorn (ASGI server)

**Scaling:** Horizontal - run multiple instances behind a load balancer
### Database (PostgreSQL)

**Purpose:** Persistent data storage

**Responsibilities:**
- Store all application data
- Enforce data integrity via constraints
- Provide transactional guarantees
- Execute queries with proper indexing

**Technology:** PostgreSQL 15+

**Scaling:** Vertical initially, read replicas if needed
### Cache/Queue (Redis)

**Purpose:** Caching and message broker for background jobs

**Responsibilities:**
- Cache frequently accessed data
- Store session data
- Message queue for Celery tasks
- Rate limiting storage

**Technology:** Redis 7+

**Scaling:** Single instance with persistence for small teams, cluster for scale
### Background Workers (Celery)

**Purpose:** Process long-running tasks asynchronously

**Responsibilities:**
- Process webhook data
- Send emails
- Generate reports
- Call external APIs
- Sync data with third parties

**Technology:** Celery with Redis broker

**Scaling:** Horizontal - add more workers based on queue depth
### Frontend (React)

**Purpose:** User interface

**Responsibilities:**
- Render UI components
- Handle user interactions
- Make API calls
- Manage client-side state
- Display loading/error states

**Technology:** React 18 with TypeScript, Vite

**Scaling:** Static files served via CDN
## Request Flow Patterns

### 1. Typical API Request (User Action)

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Server
    participant S as Service Layer
    participant D as Database
    participant R as Redis
    C->>A: POST /api/v1/orders
    A->>A: Validate JWT token
    A->>A: Validate request schema
    A->>S: create_order(data)
    S->>D: INSERT INTO orders
    D->>S: order_id
    S->>R: cache order data
    S->>A: Order object
    A->>C: 201 Created + order JSON
    Note over C,R: Total time: <500ms
```
**Key Points:**
- Authentication happens at router level via dependency injection
- Validation uses Pydantic schemas automatically
- Business logic lives in service layer
- Response returned quickly (<500ms)
### 2. Webhook Handler (Capture and Process Async)

```mermaid
sequenceDiagram
    participant SH as Shopify
    participant A as API Server
    participant D as Database
    participant Q as Redis Queue
    participant W as Celery Worker
    participant SA as Shopify API
    SH->>A: POST /webhooks/orders/create
    A->>A: Verify webhook signature
    A->>D: INSERT INTO webhook_events
    D->>A: event_id
    A->>Q: queue process_order_webhook.delay(event_id)
    A->>SH: 200 OK
    Note over SH,A: Response in <100ms
    Q->>W: Dequeue task
    W->>D: SELECT * FROM webhook_events
    W->>W: Process order data
    W->>D: UPDATE orders, inventory
    W->>SA: GET /orders/{id}/fulfillments
    SA->>W: Fulfillment data
    W->>D: UPDATE fulfillment_status
    Note over W,D: Processing time: 2-10 seconds
```
**Key Points:**
- Webhook received and acknowledged immediately (<100ms)
- Raw payload stored in database for replay/debugging
- Processing happens asynchronously
- Failures can be retried without losing data
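The signature-verification step can be sketched with the standard library. Shopify sends a base64-encoded HMAC-SHA256 of the raw request body in the `X-Shopify-Hmac-Sha256` header, which the handler recomputes with the app's shared secret:

```python
import base64
import hashlib
import hmac

def verify_shopify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Recompute the HMAC over the raw (unparsed) body and compare in constant time."""
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode()
    return hmac.compare_digest(expected, header_value)
```

Two details matter: verification must run on the raw bytes before any JSON parsing, and the comparison uses `hmac.compare_digest` to avoid timing side channels.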
### 3. Authentication Flow

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Server
    participant D as Database
    participant R as Redis
    C->>A: POST /auth/login {email, password}
    A->>D: SELECT * FROM users WHERE email=?
    D->>A: User record
    A->>A: Verify password hash
    A->>A: Generate JWT access + refresh tokens
    A->>R: Store refresh token (with TTL)
    A->>C: {access_token, refresh_token}
    Note over C,A: Subsequent requests
    C->>A: GET /api/v1/profile (Authorization: Bearer token)
    A->>A: Decode and verify JWT
    A->>A: Extract user_id from token
    A->>D: SELECT * FROM users WHERE id=?
    D->>A: User data
    A->>C: User profile JSON
```
**Key Points:**
- Access tokens are stateless JWTs
- Refresh tokens stored in Redis for revocation
- Access tokens short-lived (30 min), refresh tokens long-lived (7 days)
- User object injected via dependency injection
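The token lifecycle can be sketched in miniature. The real system signs JWTs with RS256 via a library; this stdlib sketch uses HMAC with a hypothetical secret purely to show the claims/expiry/verification shape:

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"dev-only-secret"  # assumption: production uses an RS256 key pair, not a shared secret

def issue_token(user_id: int, ttl_seconds: int) -> str:
    """Encode claims (subject + expiry) and sign them."""
    payload = {"sub": user_id, "exp": int(time.time()) + ttl_seconds}
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    return (body + b"." + sig).decode()

def verify_token(token: str) -> Optional[dict]:
    """Return the claims if the signature is valid and the token is unexpired."""
    body, _, sig = token.encode().partition(b".")
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(expected, sig):
        return None  # tampered or wrongly signed
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload["exp"] < time.time():
        return None  # expired
    return payload

# Access token: 30 minutes; refresh token: 7 days
access = issue_token(user_id=42, ttl_seconds=30 * 60)
refresh = issue_token(user_id=42, ttl_seconds=7 * 24 * 3600)
```

The refresh token additionally gets stored in Redis keyed by user, so revoking a session is a delete rather than waiting out the expiry.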
### 4. Background Job Processing

```mermaid
sequenceDiagram
    participant A as API Server
    participant Q as Redis Queue
    participant W as Celery Worker
    participant D as Database
    participant E as External API
    A->>Q: send_welcome_email.delay(user_id)
    A->>A: Continue processing
    Q->>W: Dequeue task
    W->>D: SELECT * FROM users WHERE id=?
    D->>W: User data
    W->>E: POST /send-email (AWS SES)
    E->>W: Message ID
    W->>D: INSERT INTO email_log
    W->>Q: Task complete
    Note over W: Retries on failure (max 3 attempts)
```
**Key Points:**
- Tasks are queued with `.delay()` or `.apply_async()`
- Workers poll Redis queue
- Automatic retries with exponential backoff
- All state persisted to database
## Environment Differences

### Local Development

```mermaid
graph LR
    Dev[Developer Machine] --> Docker[Docker Compose]
    Docker --> DB[(PostgreSQL)]
    Docker --> Redis[(Redis)]
    Docker --> Worker[Celery Worker]
```
**Characteristics:**
- Everything runs in Docker containers
- Single server (your laptop)
- Hot reload enabled
- Debug mode on
- Simplified authentication
- Test data seeded
### Staging

```mermaid
graph LR
    Internet --> LB[Railway Load Balancer]
    LB --> App[Single Server]
    App --> DB[(PostgreSQL)]
    App --> Redis[(Redis)]
    App --> Worker[Celery Worker]
```
**Characteristics:**
- All services on one server (cost optimisation)
- Production-like configuration
- Real domain with HTTPS
- Connected to staging Shopify app
- Test payment gateway
- Automated deployments from `staging` branch
### Production

```mermaid
graph TB
    Internet --> LB[Load Balancer]
    LB --> App1[API Server 1]
    LB --> App2[API Server 2]
    App1 --> DB[(PostgreSQL Primary)]
    App2 --> DB
    DB --> RR[(Read Replica)]
    App1 --> Redis[(Redis)]
    App2 --> Redis
    Redis --> W1[Worker 1]
    Redis --> W2[Worker 2]
    W1 --> DB
    W2 --> DB
```
**Characteristics:**
- Multiple API server instances
- Multiple Celery workers
- Database read replicas (if needed)
- Redis persistence enabled
- Automated deployments from `main` branch
- Real payment processing
- Comprehensive monitoring and alerts
## Key Design Decisions

### 1. Why FastAPI?

**Rationale:**
- Automatic OpenAPI documentation
- Built-in request/response validation with Pydantic
- Excellent performance (ASGI-based)
- Modern Python with type hints
- Easy dependency injection
- Great async support

**Trade-off:** Smaller ecosystem than Django, but we don't need Django's ORM or admin
### 2. Why PostgreSQL?

**Rationale:**
- ACID compliance for data integrity
- Rich data types (JSON, arrays, etc.)
- Excellent performance with proper indexing
- Mature, battle-tested
- Great tooling ecosystem

**Trade-off:** Vertical scaling required eventually, but sufficient for our scale
### 3. Why Celery for Background Jobs?

**Rationale:**
- Mature, widely adopted
- Excellent retry mechanisms
- Support for task chains and workflows
- Built-in monitoring with Flower
- Works well with Redis

**Trade-off:** Can be complex, but essential for async processing
### 4. Why Redis?

**Rationale:**
- Fast in-memory storage
- Works as both cache and message broker
- Simple to operate
- Excellent client libraries

**Trade-off:** Data must fit in memory, but works for our use case
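The caching half of that dual role typically follows a cache-aside pattern: check Redis, fall back to the database on a miss, then populate the cache with a TTL. A sketch using a plain dict as a stand-in for the Redis client (real code would use `redis-py` GET/SETEX):

```python
import time
from typing import Callable

_store: dict = {}  # {key: (expires_at, value)} (stand-in for Redis)

def get_or_set(key: str, loader: Callable[[], dict], ttl: int = 300) -> dict:
    """Return the cached value if fresh; otherwise load it, cache it, return it."""
    hit = _store.get(key)
    if hit is not None and hit[0] > time.time():
        return hit[1]                          # cache hit
    value = loader()                           # cache miss: hit the database
    _store[key] = (time.time() + ttl, value)   # populate with a TTL
    return value
```

Writes then invalidate the key (delete it) rather than updating it in place, which keeps cache and database from drifting apart.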
### 5. Capture and Process Async Pattern

**Rationale:**
- Ensures <500ms response times
- Prevents webhook timeouts
- Allows retries on failures
- Maintains audit trail of raw webhooks

**Implementation:** Store immediately, return 200, queue processing

See `../02-standards/api-patterns.md` for code examples.
### 6. Service Layer Pattern

**Rationale:**
- Separates business logic from HTTP layer
- Makes code testable (mock services easily)
- Allows business logic reuse
- Keeps controllers thin

**Structure:**

```
Controller (route) → Service (business logic) → Model (database)
```

See `../02-standards/code-standards.md` for examples.
### 7. API Versioning (`/api/v1/`)

**Rationale:**
- Allows breaking changes without breaking existing clients
- Clear migration path for clients
- Explicitly communicates API stability

**Trade-off:** Some code duplication, but necessary for public APIs
### 8. Minimal Frontend State

**Rationale:**
- Use React Query for server state
- Keep UI state in components
- Avoid complex global state managers
- Simpler to reason about

**Trade-off:** Some prop drilling, but acceptable for our app size
## Data Flow Overview

### Read Operation

```
Client → API Server → Service → Database → Service → API Server → Client
                         ↓
               Redis Cache (if configured)
```
### Write Operation

```
Client → API Server → Service → Database → API Server → Client
                         ↓
                Clear Cache (if cached)
```
### Webhook Processing

```
External → API Server → Database (raw payload) → Redis Queue
               ↓
         Return 200 OK

Redis Queue → Celery Worker → Process → Update Database
                                 ↓
                  Call External APIs (if needed)
```
## Documentation Locations

- This folder (`docs/01-getting-started/`) - Architecture and setup
- Standards (`docs/02-standards/`) - How to write code
- Workflows (`docs/03-workflows/`) - Development processes
- API Contracts (`docs/api/endpoints.md`) - API documentation
- Runbooks (`docs/05-runbooks/`) - Troubleshooting and operations
- Code - Inline docstrings and comments
## Performance Targets
| Metric | Target | Monitoring |
|---|---|---|
| API Response Time (p95) | <500ms | CloudWatch/Datadog |
| API Response Time (p99) | <1000ms | CloudWatch/Datadog |
| Database Query Time (p95) | <100ms | Slow query log |
| Background Job Processing | <10s | Celery monitoring |
| Uptime | >99.9% | Health checks |
## Security Overview

### Authentication
- JWT tokens with RS256 signing
- Access tokens (short-lived): 30 minutes
- Refresh tokens (long-lived): 7 days
- Refresh tokens stored in Redis for revocation
### Authorisation
- Role-based access control (RBAC)
- Permissions checked at endpoint level
- Resource ownership validated in service layer
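Those two layers of checks can be sketched as pure functions. The roles and permission strings here are illustrative, not the real permission matrix:

```python
# Role -> permission mapping (hypothetical roles and permission strings)
ROLE_PERMISSIONS = {
    "admin": {"orders:read", "orders:write", "users:manage"},
    "member": {"orders:read", "orders:write"},
    "viewer": {"orders:read"},
}

def has_permission(role: str, permission: str) -> bool:
    """Endpoint-level check: does the user's role grant this permission?"""
    return permission in ROLE_PERMISSIONS.get(role, set())

def can_access_order(role: str, user_id: int, order_owner_id: int) -> bool:
    """Service-layer check: admins, or the owner of the resource."""
    return role == "admin" or user_id == order_owner_id
```

Keeping the role check at the endpoint and the ownership check in the service means a route can never accidentally skip the per-resource test.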
### Data Protection
- All passwords hashed with bcrypt
- HTTPS enforced in production
- API keys stored in environment variables
- Database credentials rotated quarterly
### Rate Limiting
- Per-user rate limits enforced
- Per-IP limits for unauthenticated endpoints
- Webhook signature verification
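A common shape for the per-user and per-IP limits is a fixed-window counter; in production the counter lives in Redis (INCR plus EXPIRE on the key), sketched here with an in-memory dict:

```python
import time

_counters: dict = {}  # {(key, window_index): request_count} (Redis INCR/EXPIRE in production)

def allow_request(key: str, limit: int = 60, window_seconds: int = 60) -> bool:
    """Allow up to `limit` requests per key per window; deny the rest."""
    bucket = (key, int(time.time() // window_seconds))
    _counters[bucket] = _counters.get(bucket, 0) + 1
    return _counters[bucket] <= limit
```

Fixed windows permit brief bursts at window boundaries; a sliding-window or token-bucket variant smooths that out if it ever matters.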
## Monitoring and Observability

### Logging
- Structured JSON logging
- Request/response logging
- Error tracking with stack traces
- Background job status logging
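A minimal stdlib version of structured JSON logging is a custom `logging.Formatter`; the field names here are illustrative, not a fixed schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:  # include the stack trace for errors
            entry["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

One-object-per-line output is what log aggregators expect, and it keeps request IDs and user IDs queryable instead of buried in free text.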
### Metrics
- API response times
- Database query performance
- Background job queue depth
- Error rates
- Cache hit rates
### Alerts
- API error rate > 5%
- Response time p95 > 500ms
- Database connection pool exhausted
- Worker queue depth > 1000
- Disk space < 20%
See `../05-runbooks/debugging.md` for monitoring details.
## Deployment Architecture

### CI/CD Pipeline

```mermaid
graph LR
    Commit[Git Push] --> GH[GitHub Actions]
    GH --> Test[Run Tests]
    Test --> Lint[Linting]
    Lint --> Build[Build Docker Image]
    Build --> Push[Push to Registry]
    Push --> Deploy[Deploy to Railway]
    Deploy --> Health[Health Check]
    Health --> Notify[Slack Notification]
```
### Deployment Process

- Developer pushes to `staging` or `main` branch
- GitHub Actions runs tests and linting
- Docker image built and pushed to registry
- Railway pulls new image
- Rolling deployment (zero downtime)
- Health checks verify deployment
- Slack notification sent to `#deployments`
See `../03-workflows/deployment.md` for details.
## Next Steps

Now that you understand the architecture:

- Read `../02-standards/code-standards.md` to learn our coding patterns
- Review `../02-standards/api-patterns.md` for API design
- Study `../03-workflows/feature-development.md` for how to build features
- Check `../05-runbooks/debugging.md` to learn how to debug issues
Questions? Ask in #dev-team on Slack!