Scale FraiseQL to handle millions of requests with horizontal scaling, smart caching, and database optimization.
FraiseQL scales in four phases: horizontal scaling of stateless instances, database scaling, caching, and multi-region deployment.
Simplest production setup: 3+ stateless FraiseQL instances behind a load balancer.
All major platforms support load balancing to multiple FraiseQL instances. Each instance is stateless and connects to the same shared database.
All three transports — GraphQL, REST, and gRPC — scale identically because they are served by the same binary on port 8080. Adding a replica adds capacity for all transports simultaneously. The only infrastructure difference is that gRPC requires HTTP/2 between the client and the load balancer; see the Kubernetes, AWS, GCP, and Azure deployment guides for load-balancer-specific configuration.
Health check configuration (all platforms):
Endpoint: /health
Interval: 30 seconds
Timeout: 5 seconds
Unhealthy threshold: 3 failures
Healthy threshold: 2 successes

Assume:
- 1 instance = 1000 RPS capacity
- 3 instances = 3000 RPS capacity

Traffic growth:
- Month 1: 1000 RPS (1 instance)
- Month 2: 2000 RPS (2 instances)
- Month 3: 5000 RPS (5 instances)
- Month 6: 15000 RPS (15 instances)
- Month 12: 100000 RPS (100 instances)

Automatically scale instances based on demand.
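The growth table above maps RPS to instance counts by simple division. A quick planning sketch (the 1000 RPS-per-instance figure is this guide's working assumption; measure your own capacity with load tests):

```python
import math

RPS_PER_INSTANCE = 1000  # working assumption from this guide; verify via load testing

def instances_needed(expected_rps: int, headroom: float = 1.0) -> int:
    """Instances required to serve expected_rps, with an optional headroom factor."""
    return max(1, math.ceil(expected_rps * headroom / RPS_PER_INSTANCE))

for month, rps in [(1, 1000), (2, 2000), (3, 5000), (6, 15000), (12, 100000)]:
    print(f"Month {month}: {rps} RPS -> {instances_needed(rps)} instance(s)")
```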
CPU Utilization (simplest, most common)
- Scale up when: Average CPU > 70%
- Scale down when: Average CPU < 30%
- Cooldown: 5 min up, 15 min down

Memory Utilization (for memory-intensive queries)
- Scale up when: Average Memory > 80%
- Scale down when: Average Memory < 50%

Request Count (most accurate for API)
- Scale up when: Requests/sec > 5000
- Scale down when: Requests/sec < 2000

Custom Metrics (database queue depth, cache hit rate)
- Scale up when: Database pool > 80% utilized
- Scale down when: Database pool < 40% utilized

AWS Auto Scaling group:

    MinSize: 3
    MaxSize: 100
    DesiredCapacity: 3
    TargetTrackingScalingPolicies:
      - TargetValue: 0.70  # Target 70% CPU
        PredefinedMetric: ASGAverageCPUUtilization
        ScaleOutCooldown: 60s
        ScaleInCooldown: 300s

Kubernetes HorizontalPodAutoscaler:

    minReplicas: 3
    maxReplicas: 100
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 0
        policies:
          - type: Percent
            value: 100  # Double replicas
            periodSeconds: 30
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
          - type: Percent
            value: 50  # Half replicas
            periodSeconds: 60

Azure autoscale rules:

    # Scale out: +1 instance when CPU > 70%
    # Scale in: -1 instance when CPU < 30%
    # Min: 3 instances, Max: 100 instances

GCP Cloud Run:

    # Automatic scaling based on request concurrency
    # Max concurrent requests per instance: 80 (default)
    # Scale up: +50 instances if queue > 0
    # Scale down: -1 instance per minute
    # Max instances: 1000 (configurable)

Monitor scaling activity:

    aws autoscaling describe-scaling-activities \
      --auto-scaling-group-name fraiseql-asg

    kubectl get hpa fraiseql --watch

    az monitor autoscale history list \
      --resource-group mygroup \
      --resource fraiseql

As traffic grows, the database becomes the bottleneck.
Limit connections to prevent database overload:
Without pooling, 1000 instances at 20 connections each would require 20,000 database connections — well beyond what most databases support (typically capped at ~5,000).
With PgBouncer, 50 instances with a minimum pool of 5 results in only 250 connections to the database.
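The connection arithmetic above, made explicit (a simplified sketch; real PgBouncer sizing also depends on pool mode and per-database limits):

```python
def direct_connections(instances: int, pool_max: int) -> int:
    """Without a pooler: every instance may open up to pool_max connections."""
    return instances * pool_max

def pooled_connections(instances: int, pool_min: int) -> int:
    """With PgBouncer multiplexing: roughly pool_min server connections per instance."""
    return instances * pool_min

print(direct_connections(1000, 20))  # 20000, far beyond a typical ~5000 cap
print(pooled_connections(50, 5))     # 250 connections reach the database
```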
Configure FraiseQL’s own connection pool in fraiseql.toml:
[database]
url = "postgresql://user:pass@host:5432/dbname"
pool_min = 5   # Minimum connections per instance
pool_max = 20  # Maximum connections per instance

Distribute read traffic across replicas. Write queries go to the primary; read queries are spread across replicas.
Setup (AWS RDS):
# Create 3 read replicas
for i in {1..3}; do
  aws rds create-db-instance-read-replica \
    --db-instance-identifier fraiseql-read-$i \
    --source-db-instance-identifier fraiseql-prod
done
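One way to split traffic is a small router that sends writes to the primary and reads to a replica. This is a hypothetical helper for illustration, not a FraiseQL API; the URLs would typically come from the DATABASE_URL_PRIMARY and DATABASE_URL_REPLICA variables described in this section:

```python
WRITE_KEYWORDS = {"insert", "update", "delete", "merge", "create", "alter", "drop"}

def pick_database_url(sql: str, primary_url: str, replica_url: str) -> str:
    """Send read-only statements to a replica, everything else to the primary.

    Hypothetical routing helper; FraiseQL does not expose this hook.
    """
    words = sql.lstrip().lower().split(None, 1)
    first = words[0] if words else ""
    return primary_url if first in WRITE_KEYWORDS else replica_url
```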
# Configure FraiseQL to read from replicas
DATABASE_URL_PRIMARY=postgresql://user:pass@fraiseql-prod:5432/db
DATABASE_URL_REPLICA=postgresql://user:pass@fraiseql-read-1:5432/db

Identify slow queries:
-- PostgreSQL: Enable slow query logging
ALTER SYSTEM SET log_min_duration_statement = 500;  -- Log queries > 500ms
SELECT pg_reload_conf();

-- View slow queries (requires the pg_stat_statements extension)
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Add indexes:
-- Find missing indexes
EXPLAIN ANALYZE
SELECT * FROM tb_user WHERE identifier = 'user@example.com';

-- If the plan shows a sequential scan, add an index
CREATE INDEX idx_tb_user_identifier ON tb_user(identifier);

-- Partial composite index for common filters
CREATE INDEX idx_tb_post_published ON tb_post(fk_user, is_published)
WHERE is_published = true;

Optimize N+1 queries:
Use query profiling to identify N+1 problems:
-- If you see many queries per request, something may be wrong
-- Use pg_stat_statements to identify repeated patterns

-- Before (N+1):
SELECT id, data FROM v_user LIMIT 10;          -- 1 query
SELECT id, data FROM v_post WHERE fk_user = ?; -- repeated 10×
-- Total: 11 queries

-- After (batched):
SELECT id, data FROM v_user LIMIT 10;                     -- 1 query
SELECT id, data FROM v_post WHERE fk_user IN (?, ?, ...); -- 1 query
-- Total: 2 queries

FraiseQL's Rust engine operates against your PostgreSQL views, so ensure your views and indexes are designed to support set-based lookups.
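The batched pattern generalizes: collect the parent keys first, then issue a single IN query. A sketch of building the parameterized statement (placeholder style assumes a psycopg-like %s driver; names are illustrative):

```python
def batched_posts_query(user_ids: list) -> tuple:
    """Build one IN-list query instead of len(user_ids) separate lookups."""
    if not user_ids:
        # WHERE false keeps the statement valid when there are no keys to fetch
        return ("SELECT id, data FROM v_post WHERE false", [])
    placeholders = ", ".join(["%s"] * len(user_ids))
    sql = f"SELECT id, data FROM v_post WHERE fk_user IN ({placeholders})"
    return (sql, list(user_ids))
```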
For massive scale (millions of users), FraiseQL does not currently provide first-class sharding support. The recommended approach is to deploy separate FraiseQL instances each pointing to an independent database shard:
- shard-1.db.example.com → FraiseQL instance A
- shard-2.db.example.com → FraiseQL instance B
- shard-3.db.example.com → FraiseQL instance C

Route requests at the load balancer or API gateway layer based on the shard key.
Each shard has its own fraiseql.toml pointing to its own [database] URL.
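Gateway-side routing can be a stable hash of the shard key. A sketch (shard names follow the example above; md5 is used because Python's built-in hash() is randomized per process):

```python
import hashlib

SHARDS = [
    "shard-1.db.example.com",  # served by FraiseQL instance A
    "shard-2.db.example.com",  # served by FraiseQL instance B
    "shard-3.db.example.com",  # served by FraiseQL instance C
]

def shard_for(shard_key: str) -> str:
    """Stable hashing: the same key always routes to the same shard."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Note that naive modulo hashing reshuffles most keys when a shard is added; consistent hashing avoids that if you expect the shard count to grow.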
Reduce database load with intelligent caching.
Use HTTP cache headers in your reverse proxy or load balancer for static data:
# In nginx or Caddy
Cache-Control: public, max-age=3600
ETag: "user-123-v1"

FraiseQL's caching is configured via fraiseql.toml. FraiseQL is a Rust binary, and Python is only used at compile time to define the schema, so there is no Python runtime in which to write cache logic. Enable the Redis backend in your TOML:
[caching]
enabled = true
backend = "redis"
redis_url = "redis://cache.example.com:6379"

Cache TTL is specified at the query level in your Python schema file (compile-time only):
@fraiseql.query
def get_user(id: ID) -> User | None:
    return fraiseql.config(sql_source="v_user", cache_ttl_seconds=3600)

Cache invalidation is handled through the observers system in fraiseql.toml. When a mutation runs, FraiseQL publishes events to the configured observer backend, which triggers cache invalidation for related queries:
[observers]
backend = "nats"
nats_url = "nats://nats-server:4222"

Cache hit rate: hits / (hits + misses)
Target: > 80% for high-traffic endpoints
Example: 8000 hits, 200 misses ≈ 97.6% hit rate
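The hit-rate formula above as a quick check:

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of cache lookups served without hitting the database."""
    total = hits + misses
    return hits / total if total else 0.0

print(f"{cache_hit_rate(8000, 200):.1%}")  # 97.6%, well above the 80% target
```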
Cache size: total data held in the cache
Target: < 80% of available memory

Serve global traffic with multiple regions.
Traffic is routed to the nearest region based on user location. Each region has its own database and cache, with data replication strategies:
# Route 53 weighted routing
# 50% traffic to us-east-1
# 50% traffic to eu-west-1
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123 \
  --change-batch '{...}'

# Kubefed for multi-cluster orchestration
kubefedctl join cluster-eu --host-cluster-context=host
kubefedctl join cluster-asia --host-cluster-context=host
# Replicate service across clusters
kubectl apply -f - <<EOF
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: fraiseql
spec:
  template: ...
  placement:
    clusterNames:
      - cluster-eu
      - cluster-asia
EOF

# Cloud Load Balancing for global routing
gcloud compute backend-services create fraiseql-global \
  --global \
  --health-checks=health-check \
  --load-balancing-scheme=EXTERNAL

gcloud compute backend-services add-backend fraiseql-global \
  --instance-group=us-central1-ig \
  --instance-group-zone=us-central1-a \
  --global

# Use Apache Bench, wrk, or k6 to load test
# Start at low load and increase
# Load = 100 RPS, 500 RPS, 1000 RPS, ...
# Measure: response time, error rate, resource usage

# Scaling is healthy when:
# - Response time stays constant as load increases
# - Error rate stays < 0.1%
# - CPU/memory scale linearly with load

Load test with k6:
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  vus: 100,       // 100 virtual users
  duration: '5m', // 5 minute test
};

export default function () {
  // GraphQL expects a JSON body, so stringify it and set the content type
  let res = http.post(
    'http://api.example.com/graphql',
    JSON.stringify({ query: 'query { users(limit: 50) { id name } }' }),
    { headers: { 'Content-Type': 'application/json' } }
  );

  check(res, {
    'is status 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });

  sleep(1);
}

Run:
k6 run load-test.js

# Output:
# ✓ is status 200
# ✓ response time < 500ms
# Average response time: 145ms
# 99th percentile: 280ms

Given:
- Current traffic: 5000 RPS
- Current response time: 100ms (acceptable)
- Target growth: 2x per year
- Max acceptable response time: 500ms
Calculate:
- Breaking point (where response time > 500ms): ~20,000 RPS
- Time until breaking point: ~2 years (at 2x annual growth, 5,000 RPS takes two doublings to reach 20,000 RPS)
- Required capacity: 30,000 RPS (1.5x breaking point)
- Instances needed: 30,000 RPS ÷ 1,000 RPS/instance = 30 instances
- Cost: 30 instances × $100/month = $3,000/month
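The timeline follows from the growth rate: at 2x per year, 5,000 RPS needs two doublings to reach the ~20,000 RPS breaking point. As a sketch:

```python
import math

def years_until(current_rps: float, breaking_rps: float, annual_growth: float = 2.0) -> float:
    """Years until traffic reaches breaking_rps under multiplicative annual growth."""
    return math.log(breaking_rps / current_rps, annual_growth)

print(years_until(5000, 20000))  # two doublings -> 2.0 years
print(math.ceil(30000 / 1000))   # 30 instances for the 1.5x-headroom capacity
```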
Plan:
- Months 1-3: 10 instances ($1,000/month)
- Months 4-6: 20 instances ($2,000/month)
- Months 7-9: 30 instances ($3,000/month)
- Add a monitoring alert at 80% capacity

Reserved Instances (1-3 year commitment):
On-demand: 100 instances × $100/month = $10,000/month
Reserved (1 yr): 100 instances × $50/month = $5,000/month
Annual savings: $60,000 (50% reduction)

Spot/Preemptible Instances (for fault-tolerant workloads):
On-demand: $100/month/instance
Spot (AWS): $30/month/instance (can be interrupted)
Preemptible (GCP): $25/month/instance (24 hour max)

Use a mix: 70% spot + 30% on-demand
Average: (0.7 × $30) + (0.3 × $100) = $51/instance
Savings: 49% reduction

Read Replicas for analytics:
Without replicas:
- Primary: 10,000 RPS (expensive)
- Load: 7,000 app reads + 3,000 analytics reads

With replicas:
- Primary: 7,000 RPS (cheaper)
- Analytics replica: 3,000 RPS (cheaper)
- Total cost: 30-40% reduction

Storage tier optimization:
Hot data (last 30 days): SSD storage ($0.10/GB/month)
Warm data (30-90 days): HDD storage ($0.05/GB/month)
Cold data (>90 days): Archive storage ($0.01/GB/month)

Cost reduction: 50-90% for rarely accessed data

Key metrics to track:
Availability
├── Uptime: Target 99.99% (4.3 min downtime/month)
├── Error rate: Target < 0.1%
└── Latency: p50 < 100ms, p99 < 500ms

Scaling
├── Auto-scale time: < 60 seconds to add a new instance
├── Scale-up efficiency: Response time improves with more capacity
└── Scale-down safety: Doesn't over-scale and waste money

Resource efficiency
├── CPU utilization: 60-70% (not too high, not too idle)
├── Memory utilization: 70-80%
├── Database connections: < 80% of pool size
└── Cache hit rate: > 80%

Cost
├── Cost per request: Should decrease as you scale
├── Cost per RPS: Should stabilize or decrease
└── ROI: Revenue growth > Cost growth

Set up alerts:
- Alert: Scale-out failure
  Condition: Desired capacity > actual capacity for 5 minutes
  Action: Page the on-call engineer

- Alert: Auto-scaling thrashing
  Condition: Scale up then down more than 3× in 1 hour
  Action: Review auto-scale policies (cooldown might be too short)

- Alert: Cache degradation
  Condition: Cache hit rate < 70%
  Action: Increase cache size or adjust TTL

- Alert: Database overload
  Condition: Connection pool > 90% utilized
  Action: Add read replicas or optimize slow queries

Load Testing
Run load tests against your deployment to find breaking points before production. Performance Guide
AWS Auto-Scaling
Configure ECS service auto-scaling and Application Load Balancer on AWS. AWS Guide
Monitoring
Set up Prometheus metrics and alerting rules to monitor scaling behavior. Deployment Overview
Troubleshooting
Diagnose connection pool exhaustion and other scaling-related issues. Troubleshooting Guide