Observer Operations Runbook
The FraiseQL observer runtime runs inside the server process as an embedded subsystem. It has its own transport layer (Redis, NATS, PostgreSQL, or in-memory), a configurable thread pool, a dead letter queue, and a high-availability lease coordinator. This runbook covers what to do when things go wrong.
Health Signals
The observer runtime exposes Prometheus metrics at the same `/metrics` endpoint as the rest of the server.
Key Metrics
| Metric | Labels | Meaning |
|---|---|---|
| `fraiseql_observer_events_processed_total` | `status` (success/failure) | Total events handled |
| `fraiseql_observer_events_failed_total` | — | Events that failed all retry attempts |
| `fraiseql_observer_action_executed_total` | — | Individual action invocations |
| `fraiseql_observer_action_errors_total` | — | Action invocations that returned an error |
| `fraiseql_observer_action_duration_seconds` | — | Action execution latency histogram |
| `fraiseql_observer_backlog_size` | — | In-process channel fill level |
| `fraiseql_observer_dlq_items` | — | Current DLQ depth |
| `fraiseql_observer_dlq_overflow_total` | — | Drops due to the `max_dlq_size` cap |
| `fraiseql_observer_job_queue_depth` | — | Async job queue depth |
| `fraiseql_observer_job_duration_seconds` | — | Async job execution latency |
Alert Rules (Prometheus)
```yaml
- alert: ObserverDLQGrowing
  expr: fraiseql_observer_dlq_items > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Observer DLQ growing — action failures may need investigation"

- alert: ObserverChannelNearCapacity
  expr: fraiseql_observer_backlog_size / <channel_capacity> > 0.9
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Observer channel near capacity — increase channel_capacity or max_concurrency"

- alert: ObserverDLQOverflowing
  expr: increase(fraiseql_observer_dlq_overflow_total[5m]) > 0
  labels:
    severity: critical
  annotations:
    summary: "Observer DLQ at max_dlq_size cap — events are being dropped"

- alert: ObserverActionErrorRate
  expr: rate(fraiseql_observer_action_errors_total[5m]) / rate(fraiseql_observer_action_executed_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Observer action error rate exceeds 10% over 10 minutes"
```

Observer Status

```shell
fraiseql-cli observer status
fraiseql-cli observer status --detailed
```

The status command shows the current HA leader, listener health, last checkpoint, and uptime for each instance.
Sizing the Observer Runtime
See TOML Configuration: [observers] for the full parameter reference and sizing table.
Decision Guide
Section titled “Decision Guide”max_concurrency — thread pool size for concurrent action execution.
Set it to: expected peak events/s × average action latency (s), then add 50% headroom.
Example: 100 events/s, each action takes 200 ms → 100 × 0.2 = 20 workers minimum → set max_concurrency = 30.
- Too low: backpressure builds, channel fills, events start dropping.
- Too high: each concurrent action may open a database connection — check
pool_maxon the observer pool.
`channel_capacity` — in-process event buffer before backpressure.
Rule of thumb: `channel_capacity ≥ max_concurrency × 10`. If events arrive in batches (e.g. bulk inserts trigger many NOTIFY calls simultaneously), size up to absorb the largest expected burst.
`max_dlq_size` — hard cap on failed event accumulation.
Memory estimate: ~500 bytes per DLQ entry (event payload + metadata). `max_dlq_size = 10000` ≈ 5 MB at peak.
Observer PostgreSQL pool sizing:
The observer pool is separate from the main request pool. Minimum needed:
- 1 connection for `LISTEN`/`NOTIFY`
- 1 per concurrent action that queries the database

Safe default: set the observer pool’s `pool_max = max_concurrency + 5`. If actions do not hit the database, `pool_max = 2` is sufficient.
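The sizing rules above reduce to simple arithmetic. A sketch using the illustrative traffic figures from the example (100 events/s, 200 ms actions) — these are not defaults:

```python
# Back-of-envelope observer sizing from the rules of thumb above.
# Traffic figures are illustrative assumptions, not defaults.
peak_events_per_sec = 100
avg_action_latency_s = 0.2                                   # 200 ms per action

workers_min = peak_events_per_sec * avg_action_latency_s     # 20 workers minimum
max_concurrency = round(workers_min * 1.5)                   # +50% headroom -> 30

channel_capacity = max_concurrency * 10                      # >= 10x rule -> 300

max_dlq_size = 10_000
dlq_peak_mb = max_dlq_size * 500 / 1_000_000                 # ~500 B/entry -> 5.0 MB

pool_max = max_concurrency + 5                               # if actions query the DB -> 35
```

Re-run the arithmetic with your own peak rate and measured action latency before setting these values.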
Dead Letter Queue (DLQ) Management
Events that exhaust all retry attempts are moved to the DLQ instead of being silently dropped. This preserves observability and enables manual replay.
Viewing DLQ Contents
Section titled “Viewing DLQ Contents”# List recent DLQ itemsfraiseql-cli observer dlq list
# Limit and filterfraiseql-cli observer dlq list --limit 50 --observer notify-new-user
# Inspect a specific itemfraiseql-cli observer dlq show <dlq-entry-id>
# Statistics by observer and error typefraiseql-cli observer dlq stats --by-observer --by-errorReplaying Failed Events
Section titled “Replaying Failed Events”Root cause checklist before replaying:
- Is the action endpoint (webhook URL, email provider, etc.) healthy?
- Are the DLQ events still relevant (not stale after a deployment)?
- Is the action idempotent?
- Is `max_concurrency` sufficient to drain the DLQ without overwhelming the target?
```shell
# Retry a specific entry
fraiseql-cli observer dlq retry <dlq-entry-id>

# Force retry beyond max_retries
fraiseql-cli observer dlq retry <dlq-entry-id> --force

# Retry all entries for an observer (dry-run first)
fraiseql-cli observer dlq retry-all --observer notify-new-user --dry-run
fraiseql-cli observer dlq retry-all --observer notify-new-user

# Retry all DLQ entries after a timestamp
fraiseql-cli observer dlq retry-all --after 2026-03-01T00:00:00Z
```

Removing DLQ Entries
After the root cause is fixed and the events are no longer actionable (e.g. they are too stale to process):
```shell
# Remove a specific entry
fraiseql-cli observer dlq remove <dlq-entry-id>

# Remove with --force to skip confirmation
fraiseql-cli observer dlq remove <dlq-entry-id> --force
```

NATS Connection Loss
When the NATS backend loses its connection, the observer runtime:
- Buffers events in the `channel_capacity` in-process buffer.
- Retries the NATS connection with exponential backoff.
- If the buffer fills before reconnection, new events are dropped and `fraiseql_observer_dlq_overflow_total` increments.
What to watch:
- `fraiseql_observer_backlog_size` approaching `channel_capacity`.
- Tracing logs at `WARN` level: `observer: NATS connection lost, buffering events`.
Recovery: No manual action needed — the runtime auto-reconnects. After reconnection, check the DLQ (`fraiseql-cli observer dlq list`) for any events that overflowed the buffer.
Prevention:
- Increase `channel_capacity` if NATS is frequently unreliable.
- Consider `backend = "postgresql"` for higher durability (events survive restarts).
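As a sketch, a configuration hardened against flaky NATS might look like the fragment below. The parameter names (`backend`, `nats_url`, `channel_capacity`) come from this runbook; the URL and buffer value are illustrative assumptions to adapt:

```toml
[observers]
enabled = true
backend = "nats"                    # or "postgresql" for durability across restarts
nats_url = "nats://localhost:4222"  # illustrative endpoint
channel_capacity = 2000             # sized to absorb a typical reconnect window
```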
High-Availability Lease Management
When multiple server instances run concurrently, only one instance processes events per handler to prevent duplicate delivery. FraiseQL implements this via lease-based leader election stored in PostgreSQL.
How It Works
- Each instance attempts to acquire the leader lease at startup.
- The current leader renews the lease on a configurable interval.
- When the leader fails, the lease expires after `lease_ttl_secs` and a follower takes over.
- Events that were in-flight when the leader failed may be replayed (at-least-once delivery).
Monitoring Lease Health
```shell
fraiseql-cli observer status --detailed
```

The detailed status output shows the current leader ID, lease expiry time, and follower count.
When a Stale Lease Blocks a New Leader
If a crashed instance left a stale lease that has not yet expired, a new leader cannot be elected until `lease_ttl_secs` passes. Set `lease_ttl_secs` conservatively (default: 30 s) to balance availability against false-positive failovers.
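The tradeoff can be made concrete with a little arithmetic. Only the 30 s default TTL comes from this page; the renewal interval name and value below are hypothetical:

```python
# Lease-based failover arithmetic (illustrative; renewal interval is hypothetical).
lease_ttl_secs = 30            # default from this runbook
renew_interval_secs = 10       # hypothetical renewal cadence

# Worst case: the leader dies right after renewing, so followers
# must wait out the full TTL before one can take over.
worst_case_failover_secs = lease_ttl_secs

# With this cadence, roughly 3 renewals are missed before the lease expires,
# which is the margin protecting against false-positive failovers (GC pause,
# network blip) versus a genuinely dead leader.
missed_renewals_before_expiry = lease_ttl_secs // renew_interval_secs
```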
Debugging: Enabling Observer Trace Logs
```shell
RUST_LOG=fraiseql_observers=trace ./fraiseql-server --config fraiseql.toml
```

Useful Log Patterns
```
# Event lifecycle
observer: received event
observer: dispatching to handler
observer: action succeeded
observer: action failed, retrying
observer: max retries exceeded, moving to DLQ

# HA coordination
observer: acquired lease
observer: lease renewal
observer: lost lease
```

Verifying Configuration
```shell
fraiseql-cli observer validate-config --file fraiseql.toml
fraiseql-cli observer validate-config --file fraiseql.toml --detailed
```

This validates all observer configuration fields and reports sizing warnings (e.g. `max_dlq_size` not set, `channel_capacity` below recommended minimum).
Graceful Shutdown
On SIGTERM, the server performs an ordered shutdown:
- Stops accepting new events from the LISTEN/NOTIFY connection.
- Drains the in-process channel (processes all buffered events).
- Waits for all in-flight action executions to complete.
- Releases the HA lease.
The drain timeout is controlled by `shutdown_timeout_secs` in `[server]`. If the timeout is too short, in-flight actions are interrupted and their events move to the DLQ.
Recommendation: Set `shutdown_timeout_secs` ≥ (average action latency) × 2.
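For example, with actions averaging 200 ms (the figure used in the sizing example earlier), a sketch of the `[server]` fragment — the value is illustrative and rounded up well beyond the 2× minimum:

```toml
[server]
# >= 2x average action latency; rounded up generously so drains finish
# even when the channel holds a backlog of buffered events.
shutdown_timeout_secs = 10
```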
Migrating Backends
Section titled “Migrating Backends”in-memory → Redis
Drop-in replacement. Change `backend = "redis"` and add `redis_url`. There are no in-flight events in the in-memory backend between restarts.
Redis → NATS
Change `backend = "nats"` and add `nats_url`. Drain any pending Redis items before switching, or accept that items in the Redis queue will not be delivered after migration.
Redis / NATS → PostgreSQL
Use PostgreSQL for maximum durability — events are stored in a PostgreSQL table and survive server restarts. The schema is created automatically by the observer runtime on first startup.
```toml
[observers]
enabled = true
backend = "postgresql"
# uses the same database URL as [database] by default
```

Idempotency Table Operations
When using `CheckpointStrategy.EffectivelyOnce`, the observer stores a unique key for each processed event before acknowledging it. If the same event is delivered again (after a crash or broker redelivery), the key is found and the event is skipped.
This prevents duplicate processing of non-idempotent operations: billing charges, audit log writes, email sends, and external API calls.
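Conceptually, the duplicate check behaves like the sketch below. This is an in-memory stand-in for the `observer_idempotency_keys` table, not the runtime's actual implementation:

```python
# In-memory stand-in for the idempotency table: (key, listener) pairs already seen.
processed: set[tuple[str, str]] = set()

def handle_once(idempotency_key: str, listener_id: str, action) -> bool:
    """Run `action` only on first delivery; duplicates are skipped but still acked."""
    pair = (idempotency_key, listener_id)
    if pair in processed:
        return False               # duplicate delivery: skip the side effect
    action()                       # e.g. charge_customer(order)
    processed.add(pair)            # record the key before acknowledging
    return True
```

Note that the composite pair mirrors the table's primary key: the same event key is tracked independently per listener.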
Table Schema
The `EffectivelyOnce` checkpoint strategy creates the idempotency table automatically on first use. You can also create it manually in a migration:
```sql
CREATE TABLE IF NOT EXISTS observer_idempotency_keys (
    idempotency_key TEXT NOT NULL,
    listener_id TEXT NOT NULL,
    processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (idempotency_key, listener_id)
);

CREATE INDEX IF NOT EXISTS idx_observer_idempotency_processed_at
    ON observer_idempotency_keys (processed_at);
```

Each observer should use a distinct table name to avoid cross-observer key collisions:

```python
@fraiseql.subscription(
    entity_type="Order",
    topic="order.created",
    operation="create",
    checkpoint=fraiseql.EffectivelyOnce(
        idempotency_table="billing_observer_keys"
    ),
)
def handle_order_created(order: Order) -> None:
    charge_customer(order)
```

Table Growth and Cleanup
The table grows by one row per processed event:
| Events/day | Rows/year | Approx. size/year |
|---|---|---|
| 1,000 | 365,000 | ~50 MB |
| 10,000 | 3.65 M | ~500 MB |
| 100,000 | 36.5 M | ~5 GB |
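The growth table follows from straight multiplication; the bytes-per-row figure below is our rough assumption (key, listener id, timestamp, plus index overhead), not a documented constant:

```python
# Idempotency-table growth arithmetic (bytes/row is an assumed rough average).
bytes_per_row = 140
events_per_day = 10_000

rows_per_year = events_per_day * 365                            # 3,650,000 rows
approx_mb_per_year = rows_per_year * bytes_per_row / 1_000_000  # ~511 MB, the "~500 MB" row
```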
Keys older than your broker’s maximum redelivery window are safe to delete. Run daily via `pg_cron`:

```sql
DELETE FROM observer_idempotency_keys
WHERE processed_at < NOW() - INTERVAL '7 days';
```

Idempotency Failure Modes
| Scenario | Behaviour |
|---|---|
| Table missing at startup | Observer logs ERROR and refuses to start (fail closed) |
| Table missing mid-operation | Event processing fails; message is requeued by broker |
| Duplicate key found | Event is skipped; acknowledgment sent to broker |
| Database unreachable | Observer pauses and retries with backoff |
Resetting an Observer
Clearing idempotency keys allows events to be reprocessed:

```sql
DELETE FROM observer_idempotency_keys WHERE listener_id = 'my-observer-name';
DELETE FROM observer_checkpoints WHERE listener_id = 'my-observer-name';
```

Bulk Cleanup for Large Tables
```sql
DO $$
DECLARE
    rows_deleted INT;
BEGIN
    LOOP
        DELETE FROM observer_idempotency_keys
        WHERE ctid IN (
            SELECT ctid FROM observer_idempotency_keys
            WHERE processed_at < NOW() - INTERVAL '7 days'
            LIMIT 10000
        );
        GET DIAGNOSTICS rows_deleted = ROW_COUNT;
        EXIT WHEN rows_deleted = 0;
        PERFORM pg_sleep(0.1);
    END LOOP;
END;
$$;

VACUUM ANALYZE observer_idempotency_keys;
```

See Also
- Observer Concepts — how the observer runtime works
- TOML Configuration: [observers] — full parameter reference