Resilience

FraiseQL includes several features to help your API remain stable under load and recover gracefully from failures:

Connection pooling — Efficiently manages database connections
Query timeouts — Prevents runaway queries from consuming resources
Health checks — Integrates with Kubernetes and load balancers
Graceful shutdown — Drains in-flight requests before stopping
Structured error responses — Clear, actionable error information
Federation circuit breakers — Prevents cascading failures across federated databases

Connection Pooling

FraiseQL maintains a connection pool for each configured database. The pool reuses connections across requests, reducing connection overhead.

Pool Configuration

Configure pool settings per database in fraiseql.toml:

[databases.primary]
url = "${DATABASE_URL}"
type = "postgresql"
pool_size = 10
timeout_seconds = 30

Setting	Default	Description
`pool_size`	10	Maximum connections in the pool
`timeout_seconds`	30	Connection and query timeout

Pool Size Guidelines

Workload	Recommended Pool Size
Light (single user, development)	5-10
Medium (production API)	10-20
Heavy (high-throughput analytics)	20-50
Multi-database setup	10-15 per database
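
Pool size multiplies across API replicas, so it helps to check the worst-case total against the database server's connection limit. The sketch below is a rough budget check, not part of FraiseQL: the replica counts and the `reserved` headroom are hypothetical, and `max_connections = 100` is the stock PostgreSQL default.

```python
def total_connections(replicas: int, pool_size: int) -> int:
    """Worst-case connections opened against one database server."""
    return replicas * pool_size


def fits_budget(replicas: int, pool_size: int,
                max_connections: int = 100, reserved: int = 10) -> bool:
    """True if every replica can fill its pool without exhausting the server.

    `reserved` is hypothetical headroom for admin tools and migrations.
    """
    return total_connections(replicas, pool_size) <= max_connections - reserved


if __name__ == "__main__":
    for replicas in (2, 4, 8):
        total = total_connections(replicas, 20)
        print(f"{replicas} replicas x pool_size 20 = {total} connections,"
              f" fits: {fits_budget(replicas, 20)}")
```

If the check fails, either lower `pool_size`, raise the server limit, or put a connection pooler in front of the database.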

Query Timeouts

FraiseQL enforces timeouts on database queries to prevent slow queries from consuming pool connections indefinitely.

Timeout Behavior

When a query exceeds the configured timeout_seconds:

FraiseQL cancels the database query
Returns a structured error to the client
Releases the connection back to the pool

Example error response:

{
  "errors": [{
    "message": "Query exceeded timeout of 30s",
    "extensions": {
      "code": "QUERY_TIMEOUT",
      "timeout_seconds": 30
    }
  }]
}
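
The cancel, respond, and release sequence above can be illustrated with a minimal `asyncio` sketch. It is not FraiseQL's implementation: the semaphore stands in for a connection pool, and `run_with_timeout` and `slow_query` are hypothetical names.

```python
import asyncio


async def run_with_timeout(pool: asyncio.Semaphore, query, timeout_seconds: float) -> dict:
    """Run `query` (a coroutine function) on a pooled slot, enforcing a timeout."""
    async with pool:  # acquire a slot; it is released on exit, even after a timeout
        try:
            # wait_for cancels the running query when the timeout expires
            data = await asyncio.wait_for(query(), timeout=timeout_seconds)
            return {"data": data}
        except asyncio.TimeoutError:
            return {
                "errors": [{
                    "message": f"Query exceeded timeout of {timeout_seconds:g}s",
                    "extensions": {
                        "code": "QUERY_TIMEOUT",
                        "timeout_seconds": timeout_seconds,
                    },
                }]
            }


async def slow_query():
    await asyncio.sleep(1.0)  # stands in for a slow database call
    return [{"id": 1}]


async def demo() -> dict:
    pool = asyncio.Semaphore(1)  # hypothetical one-connection pool
    result = await run_with_timeout(pool, slow_query, timeout_seconds=0.05)
    assert not pool.locked()  # the slot is available again after the timeout
    return result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```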

Setting Appropriate Timeouts

Query Type	Recommended Timeout
Simple lookups (by ID)	5-10 seconds
List queries with filters	10-30 seconds
Complex aggregations	30-60 seconds
Analytics/reporting	60-300 seconds

If you frequently hit timeout errors:

Check for missing database indexes:

EXPLAIN ANALYZE SELECT * FROM v_user WHERE email = 'test@example.com';

Consider query optimization or materialized views for complex aggregations
Increase pool_size if timeouts are caused by connection wait times

Health Check Endpoints

FraiseQL exposes health endpoints on the same port as the GraphQL API:

Endpoint	Purpose	Response
`GET /health`	Full status including database connectivity	JSON with detailed status
`GET /health/live`	Process is alive	200 OK or 503
`GET /health/ready`	Ready to serve traffic	200 OK or 503

Kubernetes Integration

Configure readiness and liveness probes:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Health Response Format

Healthy state:

{
  "status": "ok",
  "version": "0.9.1",
  "uptime_seconds": 3821,
  "databases": {
    "primary": {
      "status": "ok",
      "pool_idle": 18,
      "pool_active": 2
    }
  }
}

Degraded state (database unavailable):

{
  "status": "degraded",
  "databases": {
    "primary": {
      "status": "error",
      "error": "connection refused"
    }
  }
}

HTTP 503 is returned when status is "degraded" or "error", causing Kubernetes to stop routing traffic to the pod.
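
For scripts that consume the `/health` payload, the mapping from status to HTTP code can be sketched as below. The function names are hypothetical, and treating any status other than "degraded" or "error" as healthy is an assumption of this sketch.

```python
def http_status_for(health: dict) -> int:
    """Map a /health payload to the HTTP status described above."""
    return 503 if health.get("status") in ("degraded", "error") else 200


def failing_databases(health: dict) -> dict:
    """Return {database name: error message} for each database not reporting "ok"."""
    return {
        name: info.get("error", "unknown")
        for name, info in health.get("databases", {}).items()
        if info.get("status") != "ok"
    }


# Payloads copied from the examples above
HEALTHY = {
    "status": "ok",
    "databases": {"primary": {"status": "ok", "pool_idle": 18, "pool_active": 2}},
}
DEGRADED = {
    "status": "degraded",
    "databases": {"primary": {"status": "error", "error": "connection refused"}},
}

if __name__ == "__main__":
    for payload in (HEALTHY, DEGRADED):
        print(http_status_for(payload), failing_databases(payload))
```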

Graceful Shutdown

When FraiseQL receives SIGTERM (from Kubernetes, systemd, or kill), it enters a drain phase:

Stop accepting new connections — the load balancer health check returns 503 immediately, causing the LB to stop routing new traffic to this instance
Drain in-flight requests — FraiseQL waits up to 30 seconds for active queries and mutations to complete
Close database pools — all connections are cleanly closed after in-flight work finishes or the drain timeout expires
Exit with code 0 — a clean exit allows Kubernetes to proceed with the rolling update
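
The drain phase can be pictured with a small `asyncio` sketch. It only models the four steps above and is not FraiseQL's code: the `shutdown` and `drain` names are hypothetical, signal handling is omitted, and closing the pools is stubbed out as a flag.

```python
import asyncio


async def drain(in_flight: set, drain_timeout: float) -> int:
    """Wait up to `drain_timeout` seconds for in-flight tasks.

    Returns the number of tasks that were still running and had to be cancelled.
    """
    if not in_flight:
        return 0
    _done, pending = await asyncio.wait(in_flight, timeout=drain_timeout)
    for task in pending:  # drain timeout expired: cancel what is left
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def shutdown(in_flight: set, drain_timeout: float = 30.0) -> dict:
    state = {"accepting": False}                                # 1. stop accepting
    state["abandoned"] = await drain(in_flight, drain_timeout)  # 2. drain
    state["pool_closed"] = True                                 # 3. close pools (stub)
    state["exit_code"] = 0                                      # 4. clean exit
    return state


async def demo() -> dict:
    fast = asyncio.create_task(asyncio.sleep(0.01))  # finishes within the drain window
    slow = asyncio.create_task(asyncio.sleep(10))    # outlives the drain window
    return await shutdown({fast, slow}, drain_timeout=0.1)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, one request completes during the drain window and one is cancelled when the window closes, which is the case `terminationGracePeriodSeconds` has to leave room for.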

Kubernetes Configuration

Ensure your deployment allows time for graceful shutdown:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # Must be > drain timeout
      containers:
        - name: fraiseql
          image: fraiseql/fraiseql:latest

Error Handling

FraiseQL returns structured GraphQL errors with consistent extension codes:

Error Code	Meaning	Client Action
`QUERY_TIMEOUT`	Query exceeded timeout	Retry with smaller result set or optimize query
`DATABASE_UNAVAILABLE`	Cannot connect to database	Retry with exponential backoff
`POOL_EXHAUSTED`	All connections in use	Wait and retry, or increase pool_size
`RATE_LIMITED`	Request rate limit hit	Slow down request rate
`SERVICE_UNAVAILABLE`	Federation circuit breaker is open	Retry after the `Retry-After` interval
`INTERNAL_ERROR`	Unexpected server error	Report to operations team

Client Retry Logic

Implement intelligent retries in your client:

Python:

import time
import random
from fraiseql import FraiseQLClient

client = FraiseQLClient("http://localhost:8080")

def execute_with_retry(query, variables, max_attempts=3):
    """Execute query with exponential backoff for transient errors."""
    delay = 0.1

    for attempt in range(max_attempts):
        try:
            result = client.execute(query, variables=variables)

            if result.errors:
                codes = [e.get("extensions", {}).get("code") for e in result.errors]

                # Retry on transient errors
                if any(c in codes for c in ["DATABASE_UNAVAILABLE", "POOL_EXHAUSTED"]):
                    if attempt < max_attempts - 1:
                        time.sleep(delay + random.uniform(0, 0.1))
                        delay = min(delay * 2, 5.0)
                        continue

                # Don't retry timeouts — they need query optimization
                if "QUERY_TIMEOUT" in codes:
                    raise RuntimeError("Query timeout — optimize query or increase timeout")

            return result

        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay + random.uniform(0, 0.1))
            delay = min(delay * 2, 5.0)

    raise RuntimeError("Max retries exceeded")

TypeScript:

import { FraiseQLClient } from 'fraiseql';

const client = new FraiseQLClient('http://localhost:8080');

// Marks errors that must not be retried (for example, query timeouts)
class NonRetryableError extends Error {}

async function executeWithRetry<T>(
  query: string,
  variables: Record<string, unknown>,
  maxAttempts = 3
): Promise<T> {
  let delay = 100;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const result = await client.execute<T>(query, variables);

      if (result.errors) {
        const codes = result.errors.map(e => e.extensions?.code);

        // Retry on transient errors
        if (codes.some(c => c === 'DATABASE_UNAVAILABLE' || c === 'POOL_EXHAUSTED')) {
          if (attempt < maxAttempts - 1) {
            await sleep(delay + Math.random() * 100);
            delay = Math.min(delay * 2, 5000);
            continue;
          }
        }

        // Don't retry timeouts: they need query optimization
        if (codes.includes('QUERY_TIMEOUT')) {
          throw new NonRetryableError('Query timeout: optimize query or increase timeout');
        }
      }

      return result.data;
    } catch (err) {
      // Rethrow non-retryable errors at once; a bare catch would retry the timeout
      if (err instanceof NonRetryableError || attempt === maxAttempts - 1) throw err;
      await sleep(delay + Math.random() * 100);
      delay = Math.min(delay * 2, 5000);
    }
  }

  throw new Error('Unreachable');
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
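
To see how long these clients actually wait, the backoff schedule can be tabulated. This is a small sketch using the same constants as the retry code above (0.1 s base, doubling, 5 s cap, up to 0.1 s of jitter); the helper names are hypothetical.

```python
import random


def backoff_delays(max_attempts: int, base: float = 0.1, cap: float = 5.0) -> list:
    """Base delay slept before each retry; the final attempt is not followed by a sleep."""
    delays, delay = [], base
    for _ in range(max_attempts - 1):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays


def with_jitter(delay: float, jitter: float = 0.1) -> float:
    """Add up to `jitter` seconds of random delay so clients do not retry in lockstep."""
    return delay + random.uniform(0, jitter)


if __name__ == "__main__":
    print(backoff_delays(3))  # [0.1, 0.2]
    print(backoff_delays(8))  # [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0]
```

With the default of three attempts, the total wait is at most about 0.5 seconds including jitter, so raise `max_attempts` if retries need to outlast longer outages.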

Federation Circuit Breakers

FraiseQL protects federation fan-out queries from cascading failures. When a federated database repeatedly fails, the circuit opens and subsequent requests to that database fail fast with HTTP 503 instead of waiting for timeouts.

States

State	Behaviour
Closed	Requests flow normally. Failure count is tracked.
Open	All requests to this database fail immediately with HTTP 503 + `Retry-After` header.
Half-open	After `recovery_timeout_secs`, probe requests are allowed. After `success_threshold` successful probes the circuit closes; a failed probe reopens it.

Configuration

[federation.circuit_breaker]
enabled = true
failure_threshold = 5        # Open after N consecutive failures
recovery_timeout_secs = 30   # Seconds to stay open before probing
success_threshold = 2        # Successful probes required to close

# Per-database override (array of tables)
[[federation.circuit_breaker.per_database]]
database = "orders_db"
failure_threshold = 3
recovery_timeout_secs = 60

Field	Type	Default	Description
`enabled`	bool	`true`	Enable circuit breaker protection
`failure_threshold`	integer	`5`	Consecutive failures before opening
`recovery_timeout_secs`	integer	`30`	Seconds in open state before probing
`success_threshold`	integer	`2`	Successful probes required to close
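
How the three settings drive the state transitions can be sketched as a small state machine. This illustrates the general circuit breaker pattern with the field names above; it is not FraiseQL's internal implementation, and the class, method names, and injectable `clock` are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Closed -> open -> half-open state machine driven by the settings above."""

    def __init__(self, failure_threshold=5, recovery_timeout_secs=30,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_secs = recovery_timeout_secs
        self.success_threshold = success_threshold
        self.clock = clock          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0           # consecutive failures while closed
        self.successes = 0          # successful probes while half-open
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return False to fail fast; move open -> half-open once the timeout passes."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout_secs:
                return False
            self.state, self.successes = "half_open", 0
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "closed", 0
        else:
            self.failures = 0       # a success resets the consecutive-failure count

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._open()            # a failed probe reopens the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state, self.opened_at = "open", self.clock()
```

A caller would check `allow_request()` before each federated query and report the outcome with `record_success()` or `record_failure()`.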

Client behaviour when open

When the circuit is open, FraiseQL returns:

{
  "errors": [{
    "message": "Federation database 'orders_db' unavailable: circuit breaker open",
    "extensions": {
      "code": "SERVICE_UNAVAILABLE",
      "database": "orders_db"
    }
  }]
}

The HTTP status is 503 Service Unavailable, and the Retry-After header carries the recovery_timeout_secs in effect for that database (30 with the default configuration).
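
A client can use the Retry-After header to decide how long to pause before trying again. The helper below is a minimal sketch using only the Python standard library. The function name and the 30-second fallback are assumptions, and it expects a plain dict keyed by the exact header name `Retry-After`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers: dict, default: float = 30.0, now=None) -> float:
    """Seconds to wait before retrying, taken from a Retry-After header."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():                       # delta-seconds form, e.g. "30"
        return float(value)
    try:                                      # HTTP-date form
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):           # unparseable value: use the fallback
        return default
    if when.tzinfo is None:                   # treat naive dates as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


if __name__ == "__main__":
    print(retry_after_seconds({"Retry-After": "30"}))
```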

Prometheus Metrics

FraiseQL exposes operational metrics on GET /metrics in Prometheus exposition format:

Metric	Labels	Description
`fraiseql_requests_total`	`method`, `status`	Total HTTP requests
`fraiseql_request_duration_seconds`	`method`	Request latency histogram
`fraiseql_database_pool_active`	`database`	Active connections per database
`fraiseql_database_pool_idle`	`database`	Idle connections per database
`fraiseql_query_timeouts_total`	`database`	Query timeout count
`fraiseql_errors_total`	`type`	Error count by type
`fraiseql_federation_circuit_breaker_state`	`database`	Circuit breaker state: `0`=closed, `1`=open, `2`=half_open
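
For a quick look at circuit breaker state without a full Prometheus setup, the gauge can be read straight from the `/metrics` text. The parser below is a sketch that assumes `database` is the only label on the metric, as listed in the table. The sample text and the `users_db` name are made up for illustration.

```python
import re

STATE_NAMES = {0: "closed", 1: "open", 2: "half_open"}

# Matches lines such as: fraiseql_federation_circuit_breaker_state{database="orders_db"} 1
_LINE = re.compile(
    r'^fraiseql_federation_circuit_breaker_state\{database="([^"]+)"\}\s+(\S+)$'
)


def circuit_states(metrics_text: str) -> dict:
    """Return {database: state name} parsed from Prometheus exposition text."""
    states = {}
    for line in metrics_text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            value = int(float(match.group(2)))
            states[match.group(1)] = STATE_NAMES.get(value, "unknown")
    return states


# Hypothetical sample of what the endpoint might return
SAMPLE = """\
# TYPE fraiseql_federation_circuit_breaker_state gauge
fraiseql_federation_circuit_breaker_state{database="orders_db"} 1
fraiseql_federation_circuit_breaker_state{database="users_db"} 0
"""

if __name__ == "__main__":
    print(circuit_states(SAMPLE))
```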

Alerting Rules

Example Prometheus alerting rules for FraiseQL:

# High error rate
- alert: FraiseQLHighErrorRate
  expr: rate(fraiseql_errors_total[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate on {{ $labels.instance }}"

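# Database connection issues: no open connections at all
# (pool_active == 0 on its own would also match an idle instance)
- alert: FraiseQLDatabaseUnavailable
  expr: (fraiseql_database_pool_active + fraiseql_database_pool_idle) == 0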
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Database appears unavailable"

# Query timeouts
- alert: FraiseQLQueryTimeouts
  expr: rate(fraiseql_query_timeouts_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Frequent query timeouts — check for missing indexes"

Production Checklist

Before deploying FraiseQL to production:

Configure appropriate pool_size for your workload
Set timeout_seconds based on your slowest acceptable query
Enable health checks and configure Kubernetes probes
Enable [federation.circuit_breaker] for federated deployments
Set up Prometheus metrics collection and alert on fraiseql_federation_circuit_breaker_state
Configure log aggregation (FraiseQL logs JSON to stdout)
Test graceful shutdown behavior
Implement client-side retry logic for transient errors
Set terminationGracePeriodSeconds > 30 in Kubernetes

Roadmap

The following features are planned for future releases:

Automatic retries — Built-in exponential backoff for transient errors
Bulkhead isolation — Separate connection pools for different query types
Chaos testing — Inject faults to validate failure handling

Troubleshooting

“Pool exhausted” errors

Increase pool_size or reduce query concurrency:

[databases.primary]
pool_size = 20  # Increase from default 10

Frequent query timeouts

Identify slow queries:

SELECT query, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Add indexes on frequently filtered columns
Increase timeout for legitimate long-running queries:
```
[databases.analytics]
timeout_seconds = 120
```

Health check failures

Check database connectivity:

curl http://localhost:8080/health
# Look for databases.<name>.status: "error"

Verify DATABASE_URL is correct and the database is accessible from the FraiseQL pod.

Next Steps

Observability — Traces, metrics, and structured logging

Deployment — Kubernetes configuration and production setup

Performance Guide — Query optimization and indexes