Resilience

FraiseQL includes several features to help your API remain stable under load and recover gracefully from failures:

  • Connection pooling — Efficiently manages database connections
  • Query timeouts — Prevents runaway queries from consuming resources
  • Health checks — Integrates with Kubernetes and load balancers
  • Graceful shutdown — Drains in-flight requests before stopping
  • Structured error responses — Clear, actionable error information
  • Federation circuit breakers — Prevents cascading failures across federated databases

FraiseQL maintains a connection pool for each configured database. The pool reuses connections across requests, reducing connection overhead.

Configure pool settings per database in fraiseql.toml:

[databases.primary]
url = "${DATABASE_URL}"
type = "postgresql"
pool_size = 10
timeout_seconds = 30

| Setting | Default | Description |
|---|---|---|
| pool_size | 10 | Maximum connections in the pool |
| timeout_seconds | 30 | Connection and query timeout |

| Workload | Recommended Pool Size |
|---|---|
| Light (single user, development) | 5-10 |
| Medium (production API) | 10-20 |
| Heavy (high-throughput analytics) | 20-50 |
| Multi-database setup | 10-15 per database |
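
Pool size also has to fit within the database server's own connection limit, because every pooled connection occupies one server-side slot. The sketch below shows that arithmetic. The function name, the replica count, and the reserved margin are illustrative assumptions rather than FraiseQL settings; the value 100 is PostgreSQL's default max_connections.

```python
def max_pool_size(max_connections: int, replicas: int, reserved: int = 5) -> int:
    """Largest per-instance pool_size that keeps all replicas under the server limit.

    max_connections: the database server's connection limit
    replicas: number of FraiseQL instances sharing the database (assumption)
    reserved: connections kept free for admin tools and migrations (assumption)
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    usable = max_connections - reserved
    return max(usable // replicas, 0)


# Example: a server limit of 100 shared by 4 instances, 5 connections held back.
print(max_pool_size(100, replicas=4))  # 23
```

If the result is lower than the pool size suggested by the workload table, the server limit is the tighter constraint.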

FraiseQL enforces timeouts on database queries to prevent slow queries from consuming pool connections indefinitely.

When a query exceeds the configured timeout_seconds:

  1. FraiseQL cancels the database query
  2. Returns a structured error to the client
  3. Releases the connection back to the pool

Example error response:

{
  "errors": [{
    "message": "Query exceeded timeout of 30s",
    "extensions": {
      "code": "QUERY_TIMEOUT",
      "timeout_seconds": 30
    }
  }]
}
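
On the client side, a timeout can be recognised from the extensions.code field of the response shown above. The following is a minimal sketch that works on the decoded JSON body; the helper name error_codes is an assumption made for this example, not part of any FraiseQL client library.

```python
import json


def error_codes(response: dict) -> list:
    """Collect extensions.code from every error in a GraphQL response body."""
    return [
        err.get("extensions", {}).get("code")
        for err in (response.get("errors") or [])
    ]


# The example timeout response, parsed from JSON.
body = json.loads("""
{
  "errors": [{
    "message": "Query exceeded timeout of 30s",
    "extensions": {"code": "QUERY_TIMEOUT", "timeout_seconds": 30}
  }]
}
""")

if "QUERY_TIMEOUT" in error_codes(body):
    print("query timed out")
```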

| Query Type | Recommended Timeout |
|---|---|
| Simple lookups (by ID) | 5-10 seconds |
| List queries with filters | 10-30 seconds |
| Complex aggregations | 30-60 seconds |
| Analytics/reporting | 60-300 seconds |

If you frequently hit timeout errors:

  1. Check for missing database indexes:

    EXPLAIN ANALYZE SELECT * FROM v_user WHERE email = 'test@example.com';
  2. Consider query optimization or materialized views for complex aggregations

  3. Increase pool_size if timeouts are caused by connection wait times

FraiseQL exposes health endpoints on the same port as the GraphQL API:

| Endpoint | Purpose | Response |
|---|---|---|
| GET /health | Full status including database connectivity | JSON with detailed status |
| GET /health/live | Process is alive | 200 OK or 503 |
| GET /health/ready | Ready to serve traffic | 200 OK or 503 |

Configure readiness and liveness probes:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Healthy state:

{
  "status": "ok",
  "version": "0.9.1",
  "uptime_seconds": 3821,
  "databases": {
    "primary": {
      "status": "ok",
      "pool_idle": 18,
      "pool_active": 2
    }
  }
}

Degraded state (database unavailable):

{
  "status": "degraded",
  "databases": {
    "primary": {
      "status": "error",
      "error": "connection refused"
    }
  }
}

HTTP 503 is returned when status is "degraded" or "error", causing Kubernetes to stop routing traffic to the pod.
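
For scripts or dashboards that consume GET /health directly, the per-database status can be read out of the JSON body. This is a small sketch that assumes the payload has the shape of the two examples above; the function name failing_databases is hypothetical.

```python
def failing_databases(health: dict) -> list:
    """Return the names of databases whose status is not "ok"."""
    return sorted(
        name
        for name, info in health.get("databases", {}).items()
        if info.get("status") != "ok"
    )


# The degraded example payload, already decoded from JSON.
degraded = {
    "status": "degraded",
    "databases": {"primary": {"status": "error", "error": "connection refused"}},
}
print(failing_databases(degraded))  # ['primary']
```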

When FraiseQL receives SIGTERM (from Kubernetes, systemd, or kill), it enters a drain phase:

  1. Stop accepting new connections — the load balancer health check returns 503 immediately, causing the LB to stop routing new traffic to this instance

  2. Drain in-flight requests — FraiseQL waits up to 30 seconds for active queries and mutations to complete

  3. Close database pools — all connections are cleanly closed after in-flight work finishes or the drain timeout expires

  4. Exit with code 0 — a clean exit allows Kubernetes to proceed with the rolling update
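
The drain phase follows a common pattern: refuse new work, then wait a bounded time for in-flight work to finish. The sketch below illustrates that pattern in Python using only the standard library. It is not FraiseQL's implementation, and the class and method names are invented for this example. In a real server, drain() would typically be called from a SIGTERM handler.

```python
import threading


class DrainGate:
    """Tracks in-flight requests and lets shutdown wait for them to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_enter(self) -> bool:
        """Register a new request; refuse it once draining has begun."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def leave(self) -> None:
        """Mark a request as finished and wake any waiting shutdown."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop admitting requests, then wait up to `timeout` seconds.

        Returns True if all in-flight requests finished in time.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)


gate = DrainGate()
gate.try_enter()                            # a request arrives
worker = threading.Timer(0.05, gate.leave)  # and completes 50 ms later
worker.start()
print(gate.drain(timeout=30.0))  # True: the request finished within the window
print(gate.try_enter())          # False: new requests are refused while draining
worker.join()
```

The drain timeout bounds the wait, which is why the Kubernetes grace period has to be longer than it.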

Ensure your deployment allows time for graceful shutdown:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # Must be > drain timeout
      containers:
        - name: fraiseql
          image: fraiseql/fraiseql:latest

FraiseQL returns structured GraphQL errors with consistent extension codes:

| Error Code | Meaning | Client Action |
|---|---|---|
| QUERY_TIMEOUT | Query exceeded timeout | Retry with smaller result set or optimize query |
| DATABASE_UNAVAILABLE | Cannot connect to database | Retry with exponential backoff |
| POOL_EXHAUSTED | All connections in use | Wait and retry, or increase pool_size |
| RATE_LIMITED | Request rate limit hit | Slow down request rate |
| INTERNAL_ERROR | Unexpected server error | Report to operations team |

Implement intelligent retries in your client:

import time
import random
from fraiseql import FraiseQLClient

client = FraiseQLClient("http://localhost:8080")


def execute_with_retry(query, variables, max_attempts=3):
    """Execute query with exponential backoff for transient errors."""
    delay = 0.1
    for attempt in range(max_attempts):
        try:
            result = client.execute(query, variables=variables)
            if result.errors:
                codes = [e.get("extensions", {}).get("code") for e in result.errors]
                # Retry on transient errors
                if any(c in codes for c in ["DATABASE_UNAVAILABLE", "POOL_EXHAUSTED"]):
                    if attempt < max_attempts - 1:
                        time.sleep(delay + random.uniform(0, 0.1))
                        delay = min(delay * 2, 5.0)
                        continue
                # Don't retry timeouts; they need query optimization
                if "QUERY_TIMEOUT" in codes:
                    raise RuntimeError("Query timeout: optimize query or increase timeout")
            return result
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay + random.uniform(0, 0.1))
            delay = min(delay * 2, 5.0)
    raise RuntimeError("Max retries exceeded")

FraiseQL protects federation fan-out queries from cascading failures. When a federated database repeatedly fails, the circuit opens and subsequent requests to that database fail fast with HTTP 503 instead of waiting for timeouts.

| State | Behaviour |
|---|---|
| Closed | Requests flow normally. Failure count is tracked. |
| Open | All requests to this database fail immediately with HTTP 503 + Retry-After header. |
| Half-open | After recovery_timeout_secs, a probe request is allowed. On success the circuit closes; on failure it reopens. |

[federation.circuit_breaker]
enabled = true
failure_threshold = 5        # Open after N consecutive failures
recovery_timeout_secs = 30   # Seconds to stay open before probing
success_threshold = 2        # Successful probes required to close

# Per-database override (array of tables)
[[federation.circuit_breaker.per_database]]
database = "orders_db"
failure_threshold = 3
recovery_timeout_secs = 60

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable circuit breaker protection |
| failure_threshold | integer | 5 | Consecutive failures before opening |
| recovery_timeout_secs | integer | 30 | Seconds in open state before probing |
| success_threshold | integer | 2 | Successful probes required to close |
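
To make the state transitions concrete, here is a minimal Python sketch of a closed / open / half-open state machine driven by the same three settings. It illustrates the general pattern only and is not FraiseQL's implementation; the class, its method names, and the injectable clock are assumptions made for this example. As a simplification, it admits every request while half-open rather than a single probe.

```python
import time


class CircuitBreaker:
    """Closed -> open -> half-open state machine for one database."""

    def __init__(self, failure_threshold=5, recovery_timeout_secs=30,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_secs = recovery_timeout_secs
        self.success_threshold = success_threshold
        self._clock = clock      # injectable so the example can fake time
        self.state = "closed"
        self._failures = 0       # consecutive failures while closed
        self._successes = 0      # successful probes while half-open
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if a request may be sent to the database."""
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_timeout_secs:
                return False             # still open: fail fast
            self.state = "half_open"     # recovery window elapsed: start probing
            self._successes = 0
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "closed"
                self._failures = 0
        else:
            self._failures = 0           # a success breaks the failure streak

    def record_failure(self) -> None:
        if self.state == "open":
            return                       # already open: nothing to count
        if self.state == "half_open":
            self._open()                 # failed probe: reopen immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._failures = 0


now = [0.0]                              # fake clock, in seconds
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
print(breaker.state)            # open
print(breaker.allow_request())  # False
now[0] = 30.0                   # recovery_timeout_secs later
print(breaker.allow_request())  # True
print(breaker.state)            # half_open
breaker.record_success()
breaker.record_success()
print(breaker.state)            # closed
```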

When the circuit is open, FraiseQL returns:

{
  "errors": [{
    "message": "Federation database 'orders_db' unavailable: circuit breaker open",
    "extensions": {
      "code": "SERVICE_UNAVAILABLE",
      "database": "orders_db"
    }
  }]
}

The HTTP status is 503 Service Unavailable, and the Retry-After header is set to the value of recovery_timeout_secs (Retry-After: 30 with the default configuration).
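
A client can use the Retry-After header to decide how long to wait before trying again. The helper below is a sketch under stated assumptions: the function name and the default fallback are invented for this example, and it expects the caller to supply the status code and headers from whatever HTTP library is in use.

```python
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str],
                default: float = 1.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the response is not a 503.

    Only the delay-in-seconds form of Retry-After is handled; an HTTP-date
    value or a missing header falls back to `default`.
    """
    if status != 503:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("retry-after", "").strip()
    return float(value) if value.isdigit() else default


print(retry_delay(503, {"Retry-After": "30"}))  # 30.0
print(retry_delay(200, {}))                     # None
```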

FraiseQL exposes operational metrics on GET /metrics in Prometheus exposition format:

| Metric | Labels | Description |
|---|---|---|
| fraiseql_requests_total | method, status | Total HTTP requests |
| fraiseql_request_duration_seconds | method | Request latency histogram |
| fraiseql_database_pool_active | database | Active connections per database |
| fraiseql_database_pool_idle | database | Idle connections per database |
| fraiseql_query_timeouts_total | database | Query timeout count |
| fraiseql_errors_total | type | Error count by type |
| fraiseql_federation_circuit_breaker_state | database | Circuit breaker state: 0=closed, 1=open, 2=half_open |
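
For a quick check without a Prometheus server, the circuit breaker gauge can be read straight from the /metrics text. The sketch below parses a hand-written sample in the exposition format; the sample lines, the users_db label value, and the function name are hypothetical, and the parser handles only simple samples of this shape.

```python
import re

# Matches sample lines such as: metric_name{label="value"} 1
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')

STATE_NAMES = {0: "closed", 1: "open", 2: "half_open"}


def circuit_states(metrics_text: str) -> dict:
    """Map each database label to its circuit breaker state name."""
    states = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):             # skip HELP/TYPE comment lines
            continue
        match = _SAMPLE.match(line)
        if not match or match["name"] != "fraiseql_federation_circuit_breaker_state":
            continue
        labels = dict(_LABEL.findall(match["labels"] or ""))
        code = int(float(match["value"]))
        states[labels.get("database", "")] = STATE_NAMES.get(code, "unknown")
    return states


# Hypothetical scrape output, written by hand for this example.
sample = '''# TYPE fraiseql_federation_circuit_breaker_state gauge
fraiseql_federation_circuit_breaker_state{database="orders_db"} 1
fraiseql_federation_circuit_breaker_state{database="users_db"} 0
fraiseql_database_pool_active{database="primary"} 2
'''
print(circuit_states(sample))  # {'orders_db': 'open', 'users_db': 'closed'}
```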

Example Prometheus alerting rules for FraiseQL:

# High error rate
- alert: FraiseQLHighErrorRate
  expr: rate(fraiseql_errors_total[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate on {{ $labels.instance }}"

# Database connection issues
# Note: pool_active is also 0 while the service is idle, so this rule can
# fire during quiet periods; adjust it to your traffic pattern.
- alert: FraiseQLDatabaseUnavailable
  expr: fraiseql_database_pool_active == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Database appears unavailable"

# Query timeouts
- alert: FraiseQLQueryTimeouts
  expr: rate(fraiseql_query_timeouts_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Frequent query timeouts: check for missing indexes"

Before deploying FraiseQL to production:

  • Configure appropriate pool_size for your workload
  • Set timeout_seconds based on your slowest acceptable query
  • Enable health checks and configure Kubernetes probes
  • Enable [federation.circuit_breaker] for federated deployments
  • Set up Prometheus metrics collection and alert on fraiseql_federation_circuit_breaker_state
  • Configure log aggregation (FraiseQL logs JSON to stdout)
  • Test graceful shutdown behavior
  • Implement client-side retry logic for transient errors
  • Set terminationGracePeriodSeconds > 30 in Kubernetes

The following features are planned for future releases:

  • Automatic retries — Built-in exponential backoff for transient errors
  • Bulkhead isolation — Separate connection pools for different query types
  • Chaos testing — Inject faults to validate failure handling

“Pool exhausted” errors

Increase pool_size or reduce query concurrency:

[databases.primary]
pool_size = 20 # Increase from default 10

Frequent query timeouts

  1. Identify slow queries:

    SELECT query, mean_exec_time
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 10;
  2. Add indexes on frequently filtered columns

  3. Increase timeout for legitimate long-running queries:

    [databases.analytics]
    timeout_seconds = 120

Health check failures

Check database connectivity:

curl http://localhost:8080/health
# Look for databases.<name>.status: "error"

Verify DATABASE_URL is correct and the database is accessible from the FraiseQL pod.

Observability: Traces, metrics, and structured logging

Deployment: Kubernetes configuration and production setup