Scaling & Performance

Scale FraiseQL to handle millions of requests with horizontal scaling, smart caching, and database optimization.

FraiseQL scales in four phases:

  1. Multiple Instances with Load Balancer — Simplest production setup
  2. Horizontal Auto-Scaling — Dynamic capacity based on demand
  3. Database Scaling — Connection pooling, read replicas, and sharding
  4. Caching — Reduce database load with intelligent caching

Phase 1: Multiple Instances with Load Balancer

Simplest production setup: 3+ stateless FraiseQL instances behind a load balancer.

All major platforms support load balancing to multiple FraiseQL instances. Each instance is stateless and connects to the same shared database.

Health check configuration (all platforms):

Endpoint: /health/ready
Interval: 30 seconds
Timeout: 5 seconds
Unhealthy threshold: 3 failures
Healthy threshold: 2 successes
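The unhealthy/healthy thresholds above can be modeled as a small state machine. This is a simplified illustration of how load balancers apply consecutive-result thresholds, not any platform's actual implementation:

```python
class HealthTracker:
    """Marks an instance unhealthy after N consecutive failures
    and healthy again after M consecutive successes."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results counting toward a flip

    def record(self, success: bool) -> bool:
        if self.healthy:
            # Count consecutive failures; any success resets the streak
            self._streak = 0 if success else self._streak + 1
            if self._streak >= self.unhealthy_threshold:
                self.healthy = False
                self._streak = 0
        else:
            # Count consecutive successes; any failure resets the streak
            self._streak = self._streak + 1 if success else 0
            if self._streak >= self.healthy_threshold:
                self.healthy = True
                self._streak = 0
        return self.healthy

tracker = HealthTracker()
for ok in [False, False, False]:  # 3 failures in a row
    tracker.record(ok)
print(tracker.healthy)            # False: pulled from rotation
for ok in [True, True]:           # 2 successes in a row
    tracker.record(ok)
print(tracker.healthy)            # True: back in rotation
```

With a 30-second interval, this means an instance is removed roughly 90 seconds after it starts failing and restored about 60 seconds after it recovers.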
Assume:

- 1 instance = 1000 RPS capacity
- 3 instances = 3000 RPS capacity

Traffic growth:

- Month 1: 1000 RPS (1 instance)
- Month 2: 2000 RPS (2 instances)
- Month 3: 5000 RPS (5 instances)
- Month 6: 15000 RPS (15 instances)
- Month 12: 100000 RPS (100 instances)
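Under the 1000-RPS-per-instance assumption above, instance counts are a ceiling division:

```python
import math

def instances_needed(target_rps: int, rps_per_instance: int = 1000) -> int:
    """Instances required for a traffic level, rounding up
    so capacity always meets or exceeds demand."""
    return math.ceil(target_rps / rps_per_instance)

for month, rps in [(1, 1000), (3, 5000), (12, 100_000)]:
    print(f"Month {month}: {rps} RPS -> {instances_needed(rps)} instances")
```

In practice you would also add headroom (for example, provisioning for 1.5x expected peak) rather than running exactly at capacity.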

Phase 2: Horizontal Auto-Scaling

Automatically scale instances based on demand.

CPU Utilization (simplest, most common)

Scale up when: Average CPU > 70%
Scale down when: Average CPU < 30%
Cooldown: 5 min up, 15 min down

Memory Utilization (for memory-intensive queries)

Scale up when: Average Memory > 80%
Scale down when: Average Memory < 50%

Request Count (most accurate for API)

Scale up when: Requests/sec > 5000
Scale down when: Requests/sec < 2000

Custom Metrics (database queue depth, cache hit rate)

Scale up when: Database pool > 80% utilized
Scale down when: Database pool < 40% utilized
MinSize: 3
MaxSize: 100
DesiredCapacity: 3
TargetTrackingScalingPolicies:
  - TargetValue: 70 # Target 70% average CPU (percent, not a fraction)
    PredefinedMetric: ASGAverageCPUUtilization
    ScaleOutCooldown: 60s
    ScaleInCooldown: 300s
aws autoscaling describe-scaling-activities \
--auto-scaling-group-name fraiseql-asg
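The threshold rules above can be sketched as a toy scaling loop. This is illustrative only, not AWS's target-tracking algorithm; the thresholds (70%/30%) and cooldowns (5 min out, 15 min in) match the CPU policy shown earlier:

```python
class ThresholdScaler:
    """Illustrative CPU-threshold scaler: scale out above 70%,
    scale in below 30%, with separate cooldowns for each direction."""

    def __init__(self, min_size=3, max_size=100,
                 out_cooldown=300, in_cooldown=900):
        self.size = min_size
        self.min_size, self.max_size = min_size, max_size
        self.out_cooldown, self.in_cooldown = out_cooldown, in_cooldown
        self._last_action = float("-inf")  # timestamp of the last change

    def evaluate(self, avg_cpu: float, now: float) -> int:
        elapsed = now - self._last_action
        if avg_cpu > 70 and elapsed >= self.out_cooldown:
            self.size = min(self.size + 1, self.max_size)
            self._last_action = now
        elif avg_cpu < 30 and elapsed >= self.in_cooldown:
            self.size = max(self.size - 1, self.min_size)
            self._last_action = now
        return self.size

scaler = ThresholdScaler()
print(scaler.evaluate(85, now=0))    # high CPU: scales out to 4
print(scaler.evaluate(85, now=60))   # still in cooldown: stays at 4
print(scaler.evaluate(85, now=300))  # cooldown elapsed: scales to 5
```

The asymmetric cooldowns are the point: scale out quickly when load arrives, scale in slowly to avoid thrashing.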

Phase 3: Database Scaling

As traffic grows, the database becomes the bottleneck.

Limit connections to prevent database overload:

Without pooling, 1000 instances at 20 connections each would require 20,000 database connections — well beyond what most databases support (typically capped at ~5,000).

With PgBouncer, 50 instances with a minimum pool of 5 results in only 250 connections to the database:

# PgBouncer for PostgreSQL
PGBOUNCER_MIN_POOL_SIZE=5
PGBOUNCER_MAX_POOL_SIZE=20 # Per database
PGBOUNCER_CONNECTION_TIMEOUT=30
PGBOUNCER_IDLE_IN_TRANSACTION_SESSION_TIMEOUT=600
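The connection arithmetic above is simple multiplication, assuming each instance holds its minimum pool open. A quick sanity check:

```python
def total_connections(instances: int, per_instance_pool: int) -> int:
    """Connections the database sees: one per pooled connection
    per application instance."""
    return instances * per_instance_pool

# Without a pooler: 1000 instances x 20 direct connections each
print(total_connections(1000, 20))  # 20000 -- beyond most databases

# With PgBouncer holding a minimum pool of 5, across 50 instances
print(total_connections(50, 5))     # 250 -- comfortably within limits
```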

Distribute read traffic across replicas. Write queries go to the primary; read queries are spread across replicas.

Setup (AWS RDS):

# Create 3 read replicas
for i in {1..3}; do
  aws rds create-db-instance-read-replica \
    --db-instance-identifier fraiseql-read-$i \
    --source-db-instance-identifier fraiseql-prod
done
# Configure FraiseQL to read from replicas
DATABASE_URL_PRIMARY=postgresql://user:pass@fraiseql-prod:5432/db
DATABASE_URL_REPLICA=postgresql://user:pass@fraiseql-read-1:5432/db
# In code: Route read queries to replica, writes to primary
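A minimal sketch of that routing, assuming the primary/replica DSNs from the environment variables above. The class and its method names are illustrative, not FraiseQL API:

```python
import itertools

class ReadWriteRouter:
    """Sends writes to the primary and rotates reads
    round-robin across replicas."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        # Crude read detection: SELECT statements go to replicas.
        # Real routers also handle transactions, CTEs, and
        # read-your-writes consistency.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._replicas) if is_read else self.primary

router = ReadWriteRouter(
    "postgresql://user:pass@fraiseql-prod:5432/db",
    ["postgresql://user:pass@fraiseql-read-1:5432/db",
     "postgresql://user:pass@fraiseql-read-2:5432/db"],
)
print(router.route("SELECT * FROM users"))    # replica 1
print(router.route("SELECT * FROM posts"))    # replica 2
print(router.route("INSERT INTO users ..."))  # primary
```

One caveat worth noting: replicas lag the primary, so reads that must observe a just-committed write should still go to the primary.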

Identify slow queries:

-- PostgreSQL: Enable slow query logging
ALTER SYSTEM SET log_min_duration_statement = 500; -- Log queries > 500ms
SELECT pg_reload_conf();
-- View slow queries (requires the pg_stat_statements extension,
-- loaded via shared_preload_libraries and CREATE EXTENSION)
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Add indexes:

-- Find missing indexes
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'user@example.com';
-- If seq scan, add index
CREATE INDEX idx_users_email ON users(email);
-- Composite index for common filters
CREATE INDEX idx_posts_user_published
ON posts(user_id, published)
WHERE published = true;

Optimize N+1 queries:

FraiseQL automatically batches queries. But verify with monitoring:

-- If you see 1000 queries per request, something is wrong
-- Use query profiling to identify N+1 problems
-- Before (N+1):
SELECT * FROM users LIMIT 10; -- 1 query
SELECT * FROM posts WHERE user_id = ?; -- repeated ×10 = 10 queries
-- Total: 11 queries
-- After (batched):
SELECT * FROM users LIMIT 10; -- 1 query
SELECT * FROM posts WHERE user_id IN (?, ?, ...) -- 1 query
-- Total: 2 queries
-- FraiseQL does this automatically!
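FraiseQL handles this internally, but the batching pattern can be sketched in plain Python. The `run_query` callable here is a stand-in for a real database driver:

```python
from collections import defaultdict

def load_posts_batched(users, run_query):
    """Collect every user id, fetch all their posts in one IN/ANY
    query, then group the rows back per user: 2 queries instead
    of N+1."""
    user_ids = [u["id"] for u in users]
    rows = run_query(
        "SELECT * FROM posts WHERE user_id = ANY(%s)", (user_ids,)
    )
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user_id"]].append(row)
    # Preserve input order; users with no posts get an empty list
    return {uid: by_user[uid] for uid in user_ids}

# Fake query runner standing in for a real driver:
def fake_run_query(sql, params):
    return [{"user_id": 1, "title": "a"}, {"user_id": 2, "title": "b"}]

posts = load_posts_batched([{"id": 1}, {"id": 2}], fake_run_query)
print(posts[1])  # [{'user_id': 1, 'title': 'a'}]
```

DataLoader-style libraries generalize this: they queue individual lookups within one request tick and flush them as a single batched query.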

For massive scale (millions of users), shard by a key such as user_id:

  • Shard 1: Users 1–1M → shard-1.db.example.com
  • Shard 2: Users 1M–2M → shard-2.db.example.com
  • Shard 3: Users 2M–3M → shard-3.db.example.com

Configure in FraiseQL:

import zlib

@fraiseql.query
def get_user(user_id: ID) -> User:
    # Route the query to the correct shard using a stable hash.
    # (Python's built-in hash() is randomized per process, so it
    # would route the same user to different shards.)
    shard_num = zlib.crc32(str(user_id).encode()) % 3
    return query_shard(f"shard-{shard_num}", user_id)

Phase 4: Caching

Reduce database load with intelligent caching.

Use HTTP cache headers for static data:

query GetUser($id: ID!) {
  user(id: $id) @cache(ttl: 3600) { # Cache for 1 hour
    id
    name
    email
  }
}

Configure HTTP caching:

# In load balancer or reverse proxy
cache-control: public, max-age=3600
etag: "user-123-v1"

Cache expensive query results:

import json
import redis
from functools import wraps

cache = redis.Redis(host='cache.example.com', port=6379)

def cached(ttl: int = 3600):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Create cache key from function name and arguments
            key = f"{func.__name__}:{args}:{kwargs}"
            # Try cache first
            cached_result = cache.get(key)
            if cached_result:
                return json.loads(cached_result)
            # Cache miss: execute the query
            result = func(*args, **kwargs)
            # Cache the result, expiring after ttl seconds
            cache.setex(key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

@fraiseql.query
@cached(ttl=3600)  # 1 hour
def get_user(id: ID) -> User:
    # Expensive query executed at most once per hour per key
    return ...
Cache hit rate: hits / (hits + misses)
Target: > 80% for high-traffic endpoints
Example: 8000 hits, 200 misses = ~97.6% hit rate

Cache size: total data in cache
Target: < 80% of available memory
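The hit-rate arithmetic can be tracked with a small counter (8000 hits and 200 misses give 8000 / 8200 ≈ 97.6%). A minimal sketch; real deployments would export these as Prometheus counters instead:

```python
class CacheStats:
    """Counts cache hits and misses and reports the hit rate."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
stats.hits, stats.misses = 8000, 200  # the example above
print(f"{stats.hit_rate:.1%}")        # 97.6%
```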

Serve global traffic with multiple regions.

Traffic is routed to the nearest region based on user location. Each region has its own database and cache, with data replication strategies:

  • Write-through replication: Write to primary, replicate to others
  • Eventually consistent: Replicate asynchronously
  • NATS events: Update cache across regions
# Route 53 weighted routing (shown for simplicity; use
# latency-based routing to send users to the nearest region)
# 50% traffic to us-east-1, 50% to eu-west-1
aws route53 change-resource-record-sets \
--hosted-zone-id Z123 \
--change-batch '{...}'
Load-testing methodology:

- Use Apache Bench, wrk, or k6
- Start at low load and increase: 100 RPS, 500 RPS, 1000 RPS, ...
- Measure response time, error rate, and resource usage

Scaling is healthy when:

- Response time stays constant as load increases
- Error rate stays < 0.1%
- CPU/memory scale linearly with load

Load test with k6:

import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  vus: 100,        // 100 virtual users
  duration: '5m',  // 5 minute test
};

export default function () {
  // GraphQL expects a JSON body with a Content-Type header
  let res = http.post(
    'http://api.example.com/graphql',
    JSON.stringify({ query: 'query { users(limit: 50) { id name } }' }),
    { headers: { 'Content-Type': 'application/json' } }
  );
  check(res, {
    'is status 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}

Run:

k6 run load-test.js
# Output:
# ✓ is status 200
# ✓ response time < 500ms
# Average response time: 145ms
# 99th percentile: 280ms
Given:
- Current traffic: 5000 RPS
- Current response time: 100ms (acceptable)
- Target growth: 2x per year
- Max acceptable response time: 500ms

Calculate:
- Breaking point (where response time > 500ms): ~20,000 RPS
- Time until breaking point: ~24 months (two doublings at 2x/year)
- Required capacity: 30,000 RPS (1.5x breaking point)
- Instances needed: 30,000 RPS ÷ 1000 RPS/instance = 30 instances
- Cost: 30 instances × $100/month = $3,000/month

Plan:
- Months 1-3: 10 instances ($1,000/month)
- Months 4-6: 20 instances ($2,000/month)
- Months 7-9: 30 instances ($3,000/month)
- Add a monitoring alert at 80% capacity
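The time-to-breaking-point figure follows from doubling math: with growth of 2x per year, reaching 4x current traffic takes two doublings. A small calculator for this (a sketch, assuming smooth exponential growth):

```python
import math

def months_until(current_rps: float, breaking_rps: float,
                 annual_growth: float = 2.0) -> float:
    """Months until traffic reaches breaking_rps, assuming traffic
    multiplies by annual_growth every 12 months."""
    # Number of annual_growth-fold increases needed, then scale to months
    growth_steps = math.log(breaking_rps / current_rps, annual_growth)
    return 12 * growth_steps

print(round(months_until(5000, 20000)))  # 24 months at 2x/year
```

Real traffic is lumpier than an exponential curve, which is why the plan above provisions capacity well ahead of the projected date.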

Reserved Instances (1-3 year commitment):

Pay on-demand: 100 instances × $100/month = $10,000/month
Pay reserved (1yr): 100 instances × $50/month = $5,000/month
Annual savings: $60,000 (50% reduction)

Spot/Preemptible Instances (for fault-tolerant workloads):

On-demand: $100/month/instance
Spot (AWS): $30/month/instance (can be interrupted)
Preemptible (GCP): $25/month/instance (24 hour max)
Use mix: 70% spot + 30% on-demand
Average: (0.7 × $30) + (0.3 × $100) = $51/instance
Savings: 49% reduction
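The blended figure above is a weighted average; the helper below makes the arithmetic explicit (prices are the illustrative ones from the table, not quoted cloud rates):

```python
def blended_cost(spot_price: float, ondemand_price: float,
                 spot_fraction: float) -> float:
    """Average per-instance monthly cost for a spot/on-demand mix."""
    return (spot_fraction * spot_price
            + (1 - spot_fraction) * ondemand_price)

avg = blended_cost(30, 100, 0.7)          # 70% spot at $30, 30% on-demand
print(round(avg, 2))                      # 51.0
print(f"{1 - avg / 100:.0%} savings vs all on-demand")  # 49% savings
```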

Read Replicas for analytics:

Without replicas:
- Primary: 10,000 RPS (expensive)
- Load: 7000 app reads + 3000 analytics reads
With replicas:
- Primary: 7,000 RPS (cheaper)
- Analytics replica: 3,000 RPS (cheaper)
- Total cost: 30-40% reduction

Storage tier optimization:

Hot data (last 30 days): SSD storage ($0.10/GB/month)
Warm data (30-90 days): HDD storage ($0.05/GB/month)
Cold data (>90 days): Archive storage ($0.01/GB/month)
Cost reduction: 50-90% for rarely accessed data

Key metrics to track:

Availability
├── Uptime: Target 99.99% (4.3 min downtime/month)
├── Error rate: Target < 0.1%
└── Latency: p50 < 100ms, p99 < 500ms
Scaling
├── Auto-scale time: < 60 seconds to add new instance
├── Scale-up efficiency: Response time improves with more capacity
└── Scale-down safety: Doesn't over-scale and waste money
Resource efficiency
├── CPU utilization: 60-70% (not too high, not too idle)
├── Memory utilization: 70-80%
├── Database connections: < 80% of pool size
└── Cache hit rate: > 80%
Cost
├── Cost per request: Should decrease as you scale
├── Cost per RPS: Should stabilize or decrease
└── ROI: Revenue growth > Cost growth

Set up alerts:

- Alert: Scale-out failure
Condition: Desired capacity > actual capacity for 5 minutes
Action: Page on-call engineer
- Alert: Auto-scaling thrashing
Condition: Scale up then down more than 3× in 1 hour
Action: Review auto-scale policies (cooldown might be too short)
- Alert: Cache degradation
Condition: Cache hit rate < 70%
Action: Increase cache size or adjust TTL
- Alert: Database overload
Condition: Connection pool > 90% utilized
Action: Add read replicas or optimize slow queries
Load balancing readiness:

  • Application is stateless (no sticky sessions)
  • Configuration is externalized to environment variables
  • Database connection pooling configured
  • Health check endpoints working
  • Load balancer health checks passing

Auto-scaling readiness:

  • Auto-scale policies defined (metrics, thresholds, cooldowns)
  • Max capacity set appropriately
  • Min capacity set to handle baseline traffic
  • Testing completed at peak capacity
  • Monitoring alerts configured

Multi-region readiness:

  • Single region is running near capacity (> 80% CPU utilization at peak)
  • Data consistency strategy defined
  • Failover procedures documented and tested
  • Database replication configured
  • Monitoring across regions set up

Caching readiness:

  • Cache invalidation strategy is sound
  • Cache misses won’t cause cascading failures
  • Monitoring of cache hit rate configured
  • Stale data is acceptable for the chosen TTL (1-3600 seconds)
  • Cache layer is highly available

Load Testing

Run load tests against your deployment to find breaking points before production. Performance Guide

AWS Auto-Scaling

Configure ECS service auto-scaling and Application Load Balancer on AWS. AWS Guide

Monitoring

Set up Prometheus metrics and alerting rules to monitor scaling behavior. Deployment Overview

Troubleshooting

Diagnose connection pool exhaustion and other scaling-related issues. Troubleshooting Guide