Deployment — Production monitoring setup
FraiseQL provides comprehensive observability through metrics, distributed tracing, and structured logging.
FraiseQL exposes a /metrics endpoint in Prometheus format. The metric names and label schema are:
```bash
# No TOML config needed — metrics endpoint is always available at /metrics
curl http://localhost:8080/metrics
```

| Metric | Type | Description |
|---|---|---|
| fraiseql_http_requests_total | Counter | Total HTTP requests |
| fraiseql_http_responses_2xx | Counter | Total 2xx HTTP responses |
| fraiseql_http_responses_4xx | Counter | Total 4xx HTTP responses |
| fraiseql_http_responses_5xx | Counter | Total 5xx HTTP responses |
| Metric | Type | Description |
|---|---|---|
| fraiseql_graphql_queries_total | Counter | Total GraphQL queries executed |
| fraiseql_graphql_queries_success | Counter | Total successful GraphQL queries |
| fraiseql_graphql_queries_error | Counter | Total failed GraphQL queries |
| fraiseql_graphql_query_duration_ms | Gauge | Average query execution time in milliseconds |
| fraiseql_validation_errors_total | Counter | Total validation errors |
| fraiseql_parse_errors_total | Counter | Total parse errors |
| fraiseql_execution_errors_total | Counter | Total execution errors |
| Metric | Type | Description |
|---|---|---|
| fraiseql_database_queries_total | Counter | Total database queries executed |
| fraiseql_database_query_duration_ms | Gauge | Average database query time in milliseconds |
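Scraping /metrics yields standard Prometheus text exposition. As a quick sketch of what a consumer sees, here is a minimal parser over a sample in that format (the sample values are illustrative; production scrapers should use a real Prometheus client library):

```python
# Minimal sketch: parse Prometheus text exposition into {metric_name: value}.
# Sample input mirrors the /metrics output shown on this page.
sample = """\
# HELP fraiseql_http_requests_total Total HTTP requests
# TYPE fraiseql_http_requests_total counter
fraiseql_http_requests_total 42
# HELP fraiseql_cache_hit_ratio Cache hit ratio (0-1)
# TYPE fraiseql_cache_hit_ratio gauge
fraiseql_cache_hit_ratio 0.92
"""

def parse_metrics(text: str) -> dict[str, float]:
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines and blanks
        # Last space-separated token is the value; everything before it
        # is the metric name (plus any label set).
        name, _, value = line.rpartition(" ")
        values[name] = float(value)
    return values

metrics = parse_metrics(sample)
```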
FraiseQL monitors connection pool pressure and emits scaling recommendations. The pool does not resize itself at runtime (the underlying library has no resize API) — use these metrics to decide when to raise pool_max in fraiseql.toml and restart.
| Metric | Type | Description |
|---|---|---|
| fraiseql_pool_tuning_size | Gauge | Current configured pool size |
| fraiseql_pool_tuning_queue_depth | Gauge | Pending connection requests in the pool queue |
| fraiseql_pool_tuning_adjustments_total | Counter | Scaling recommendations emitted (not actual resizes) |
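When fraiseql_pool_tuning_queue_depth stays above zero, the fix is to raise pool_max and restart. A sketch of the relevant fraiseql.toml fragment — pool_max under [database] follows the references elsewhere on this page; verify the exact key name against your configuration reference:

```toml
[database]
# Hypothetical value — size to your workload and database limits
pool_max = 50
```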
| Metric | Type | Description |
|---|---|---|
| fraiseql_cache_hits | Counter | Total cache hits |
| fraiseql_cache_misses | Counter | Total cache misses |
| fraiseql_cache_hit_ratio | Gauge | Cache hit ratio (0–1) |
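The hit-ratio gauge is equivalent to hits / (hits + misses). A small sketch of that relationship, useful if you ever recompute the ratio over a window from the raw counters instead of reading the gauge:

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Cache hit ratio in [0, 1]; defined as 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0

# Illustrative counter values (e.g. deltas of fraiseql_cache_hits/_misses)
ratio = hit_ratio(hits=920, misses=80)
```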
| Metric | Type | Description |
|---|---|---|
| fraiseql_apq_hits_total | Counter | Automatic Persisted Query cache hits |
| fraiseql_apq_misses_total | Counter | Automatic Persisted Query cache misses |
| fraiseql_apq_stored_total | Counter | Automatic Persisted Queries stored |
| fraiseql_apq_redis_errors_total | Counter | Redis errors in APQ store (fail-open; only present with redis-apq feature) |
| fraiseql_ws_connections_total | Counter | WebSocket connection attempts (labeled by result) |
| fraiseql_ws_subscriptions_total | Counter | WebSocket subscription attempts (labeled by result) |
| fraiseql_trusted_documents_hits_total | Counter | Trusted document cache hits |
| fraiseql_trusted_documents_misses_total | Counter | Trusted document cache misses |
| fraiseql_trusted_documents_rejected_total | Counter | Rejected untrusted documents |
| fraiseql_pkce_redis_errors_total | Counter | Redis errors in PKCE state store (fail-open; only present with redis-pkce feature) |
| fraiseql_rate_limit_redis_errors_total | Counter | Redis errors in rate limiter (fail-open; only present with redis-rate-limiting feature) |
| fraiseql_multi_root_queries_total | Counter | Multi-root GraphQL queries executed in parallel |
| fraiseql_observer_dlq_overflow_total | Counter | Observer DLQ entries dropped due to max_dlq_size cap |
| fraiseql_schema_reloads_total | Counter | Successful schema hot-reloads (via admin API or SIGUSR1) |
| fraiseql_schema_reload_errors_total | Counter | Failed schema reload attempts |
Transport-level metrics are emitted alongside per-query metrics:
- fraiseql_rest_requests_total — REST request count
- fraiseql_grpc_requests_total — gRPC request count
- fraiseql_query_* metrics include a transport label (graphql, rest, grpc, websocket)

Common labels across metrics:
| Label | Description |
|---|---|
| operation | GraphQL operation name |
| type | query, mutation, subscription |
| status | success, error |
| error_code | Error code if failed |
| transport | graphql, rest, grpc, websocket |
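These labels compose in PromQL selectors. For example, a sketch charting the error rate for REST-transported mutations only (verify the exact label set your deployment emits):

```promql
rate(fraiseql_graphql_queries_error{transport="rest", type="mutation"}[5m])
```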
FraiseQL ships a pre-built 12-panel Grafana 10+ dashboard covering latency percentiles, connection pool health, cache stats, and error rates. Fetch it at runtime:
```bash
curl http://localhost:8080/api/v1/admin/grafana-dashboard > fraiseql-dashboard.json
```

Import the JSON into Grafana (Dashboards → Import). The dashboard wires directly to your Prometheus datasource with no manual panel configuration.
Example PromQL queries for custom panels:
```promql
# Request rate
rate(fraiseql_http_requests_total[5m])

# Average query duration (milliseconds gauge)
fraiseql_graphql_query_duration_ms

# Error rate
rate(fraiseql_graphql_queries_error[5m]) / rate(fraiseql_graphql_queries_total[5m])

# Cache hit ratio (use the pre-computed gauge)
fraiseql_cache_hit_ratio

# 5xx server error rate
rate(fraiseql_http_responses_5xx[5m])

# Connection pool queue depth (alert when sustained > 0 → increase pool_max)
fraiseql_pool_tuning_queue_depth
```

FraiseQL exposes admin endpoints under /api/v1/admin/ for operational tooling. These require an authenticated request (admin role or equivalent policy configured via [security]).
POST /api/v1/admin/explain — runs EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for a named query against the connected database and returns the full plan alongside the generated SQL.
Request body:
```json
{
  "query_name": "posts",
  "parameters": { "limit": 10, "status": "published" }
}
```

Response:

```json
{
  "query_name": "posts",
  "sql_source": "v_post",
  "generated_sql": "SELECT data FROM v_post WHERE ...",
  "parameters": ["published", 10],
  "explain_output": [ ... ]
}
```

Use this endpoint to understand query plans, verify index usage, and diagnose slow queries without needing direct database access.
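A minimal sketch of working with that payload shape in Python: building the request body and pulling the plan node out of a response. The explain_output contents below are illustrative (real output is PostgreSQL's EXPLAIN JSON), and in practice the response comes from an authenticated HTTP POST rather than a literal:

```python
import json

# Request body for POST /api/v1/admin/explain (matches the example above)
request_body = json.dumps(
    {"query_name": "posts", "parameters": {"limit": 10, "status": "published"}}
)

# Abridged sample response; explain_output contents are illustrative only.
response = json.loads("""
{
  "query_name": "posts",
  "sql_source": "v_post",
  "generated_sql": "SELECT data FROM v_post WHERE ...",
  "parameters": ["published", 10],
  "explain_output": [{"Plan": {"Node Type": "Seq Scan"}}]
}
""")

# EXPLAIN (FORMAT JSON) wraps the plan tree in a one-element array.
plan = response["explain_output"][0]["Plan"]
```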
GET /api/v1/admin/grafana-dashboard — returns the pre-built Grafana dashboard JSON described above.
OpenTelemetry tracing is compiled into the server by default. When no endpoint is configured, there is zero overhead — no gRPC connection attempt occurs.
Configure via [tracing] in fraiseql.toml or standard OpenTelemetry environment variables (env vars act as fallbacks when TOML fields are omitted):
```toml
[tracing]
service_name = "fraiseql-api"
otlp_endpoint = "http://otel-collector:4317"
otlp_export_timeout_secs = 10
```

```bash
OTEL_SERVICE_NAME=fraiseql-api
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc  # or http/protobuf
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1       # sample 10% of requests
```

FraiseQL creates spans for each request with attributes including:
| Attribute | Description |
|---|---|
| graphql.operation.name | Operation name |
| graphql.operation.type | query/mutation/subscription |
| graphql.document | Query document (if enabled) |
| db.system | Database type |
| db.statement | SQL query (if enabled) |
| db.operation | SELECT/INSERT/UPDATE/DELETE |
| user.id | Authenticated user ID |
| tenant.id | Tenant ID |
Spans include a transport attribute (graphql, rest, grpc, websocket) to distinguish traffic sources in your tracing backend.
FraiseQL propagates trace context via headers:
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: fraiseql=user:123
```

To export to an OpenTelemetry Collector:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_SERVICE_NAME=fraiseql-api
```

Jaeger supports OTLP natively (v1.35+). Point the OTLP exporter at Jaeger’s OTLP receiver:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=fraiseql-api
```

For Zipkin, use the Zipkin exporter:

```bash
OTEL_TRACES_EXPORTER=zipkin
OTEL_EXPORTER_ZIPKIN_ENDPOINT=http://zipkin:9411/api/v2/spans
OTEL_SERVICE_NAME=fraiseql-api
```

Logging is configured via environment variables:
```bash
# Log level
RUST_LOG=info                 # error | warn | info | debug | trace
RUST_LOG=fraiseql=debug,info  # per-crate level (fraiseql at debug, everything else at info)

# Log format (JSON output for production log aggregators)
FRAISEQL_LOG_FORMAT=json      # json | pretty (default: pretty in dev, json in prod)
```

Example JSON log line:

```json
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "target": "fraiseql_server::graphql",
  "message": "Query executed",
  "span": { "request_id": "abc-123", "user_id": "user-456" },
  "fields": { "operation": "getUser", "duration_ms": 45, "cache_hit": true }
}
```

Use the standard RUST_LOG directive syntax for per-crate levels:

```bash
RUST_LOG=fraiseql_server=info,fraiseql_core::cache=debug,fraiseql_core::db=warn,tower_http=debug,sqlx=warn
```

FraiseQL exposes health endpoints automatically — no configuration required:
Basic health:
```bash
curl http://localhost:8080/health
# {"status": "ok"}
```

Detailed health:

```bash
curl http://localhost:8080/health/detailed
```

```json
{
  "status": "ok",
  "checks": {
    "database": { "status": "ok", "latency_ms": 2 },
    "cache": { "status": "ok", "size": 1500, "max_size": 10000 },
    "schema": { "status": "ok", "version": "1.0.0", "loaded_at": "2024-01-15T10:00:00Z" }
  },
  "version": "2.0.0",
  "uptime_seconds": 3600
}
```

Kubernetes probe configuration:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/detailed
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```

Example Prometheus alerting rules:
```yaml
groups:
  - name: fraiseql
    rules:
      - alert: HighErrorRate
        expr: |
          rate(fraiseql_graphql_queries_error[5m])
            / rate(fraiseql_graphql_queries_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High GraphQL error rate"

      - alert: HighQueryDuration
        expr: |
          fraiseql_graphql_query_duration_ms > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average query duration above 1 second"

      - alert: LowCacheHitRate
        expr: |
          fraiseql_cache_hit_ratio < 0.5
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Cache hit rate below 50%"

      - alert: High5xxRate
        expr: |
          rate(fraiseql_http_responses_5xx[5m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Elevated 5xx server error rate"
```

To control trace sampling volume:

```bash
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1  # sample 10% of requests
```

For error-focused sampling, configure your OTel collector or use a tail-based sampler (e.g., Grafana Tempo, OpenTelemetry Collector’s tailsampling processor) to keep 100% of error traces while downsampling success traces.
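One possible shape for that pattern, sketched against the opentelemetry-collector-contrib tail_sampling processor (keep every error trace, 10% of the rest — verify field names against the Collector version you run):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-success-traces
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```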
Log rotation is handled by your container runtime or log aggregator (Fluent Bit, Logstash, Loki, etc.) — not by FraiseQL. Write logs to stdout and let your infrastructure handle rotation and retention.
FraiseQL does not include user_id or request_id as metric labels by default — these high-cardinality values are kept in structured logs and traces instead, where cardinality is not a concern.
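Because request_id and user_id live in log spans rather than metric labels, correlating a specific request means reading the JSON log stream. A sketch of extracting those fields from one log line (the line matches the structured-log example earlier on this page):

```python
import json

# One structured log line as emitted with FRAISEQL_LOG_FORMAT=json
line = (
    '{"timestamp": "2024-01-15T10:30:00.123Z", "level": "INFO", '
    '"target": "fraiseql_server::graphql", "message": "Query executed", '
    '"span": {"request_id": "abc-123", "user_id": "user-456"}, '
    '"fields": {"operation": "getUser", "duration_ms": 45, "cache_hit": true}}'
)

entry = json.loads(line)
request_id = entry["span"]["request_id"]      # high-cardinality: logs, not metrics
duration_ms = entry["fields"]["duration_ms"]  # per-request timing
```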
Before testing, verify:

- /metrics endpoint is reachable from your Prometheus instance
- FraiseQL started (fraiseql run) with no errors
- OTEL_EXPORTER_OTLP_ENDPOINT is set and reachable
- OTEL_TRACES_SAMPLER_ARG — a low value (e.g., 0.001) may drop traces in low-traffic tests
- RUST_LOG level per component (e.g., RUST_LOG=warn,fraiseql=info)

Start FraiseQL — metrics are always available, no config needed:
```bash
fraiseql run
```

Test metrics endpoint:

```bash
curl http://localhost:8080/metrics
```

Expected output (partial):

```
# HELP fraiseql_http_requests_total Total HTTP requests
# TYPE fraiseql_http_requests_total counter
fraiseql_http_requests_total 42

# HELP fraiseql_graphql_queries_total Total GraphQL queries executed
# TYPE fraiseql_graphql_queries_total counter
fraiseql_graphql_queries_total 38

# HELP fraiseql_cache_hit_ratio Cache hit ratio (0-1)
# TYPE fraiseql_cache_hit_ratio gauge
fraiseql_cache_hit_ratio 0.92
```

Execute some queries to generate metrics:

```bash
# Run a few queries
for i in {1..10}; do
  curl -s -X POST http://localhost:8080/graphql \
    -H "Content-Type: application/json" \
    -d '{"query": "{ __typename }"}' > /dev/null
done
```

Verify metrics updated:

```bash
curl http://localhost:8080/metrics | grep fraiseql_graphql_queries_total
```

Expected output:

```
fraiseql_graphql_queries_total 48
```

Test health endpoints:
```bash
# Basic health
curl http://localhost:8080/health
```

Expected output:

```json
{"status": "ok"}
```

```bash
# Detailed health
curl http://localhost:8080/health/detailed
```

Expected output:

```json
{
  "status": "ok",
  "checks": {
    "database": { "status": "ok", "latency_ms": 2 },
    "schema": { "status": "ok", "version": "1.0.0" }
  },
  "version": "2.0.0",
  "uptime_seconds": 3600
}
```

Verify structured logging:
```bash
# Make a request and check logs
curl -s -X POST http://localhost:8080/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ me { id } }"}'
```

Check FraiseQL stdout for JSON log lines:

```json
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "INFO",
  "target": "fraiseql_server::graphql",
  "message": "Query executed",
  "span": { "request_id": "abc-123", "user_id": "user-456" },
  "fields": { "operation": "me", "duration_ms": 12, "cache_hit": false }
}
```

Test OpenTelemetry tracing (if enabled):
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
OTEL_SERVICE_NAME=fraiseql-api \
fraiseql run
```

After making requests, check your tracing backend (Jaeger, Zipkin, etc.) for spans.
If /metrics returns empty or connection refused:
Check FraiseQL started successfully:
```bash
fraiseql run 2>&1 | head -20
```

Verify the metrics endpoint is on port 8080 (same port as GraphQL):

```bash
curl http://localhost:8080/metrics
```

If database connection metrics are absent:
Verify database connectivity:
```bash
curl http://localhost:8080/health/detailed | jq .checks.database
```

Check [database] pool configuration in fraiseql.toml.
If traces don’t show up in your backend:
Verify OTLP endpoint is reachable from the FraiseQL process:
```bash
curl http://otel-collector:4317
```

Set OTEL_TRACES_SAMPLER_ARG=1.0 temporarily to sample 100% for testing.
Ensure trace context propagation:
```bash
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
  http://localhost:8080/graphql
```
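The traceparent value above follows the W3C Trace Context format (version-traceid-spanid-flags); a quick sketch to decompose it when debugging propagation:

```python
# W3C traceparent: <version>-<trace-id>-<parent-span-id>-<trace-flags>
header = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"

version, trace_id, parent_span_id, flags = header.split("-")
# Hex-string lengths fixed by the spec: 32 for trace-id, 16 for span-id
assert len(trace_id) == 32 and len(parent_span_id) == 16
sampled = int(flags, 16) & 0x01 == 1  # bit 0 of trace-flags = sampled
```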