
Observer Operations Runbook

The FraiseQL observer runtime runs inside the server process as an embedded subsystem. It has its own transport layer (Redis, NATS, PostgreSQL, or in-memory), a configurable thread pool, a dead letter queue, and a high-availability lease coordinator. This runbook covers what to do when things go wrong.

The observer runtime exposes Prometheus metrics at the same /metrics endpoint as the rest of the server.

| Metric | Labels | Meaning |
| --- | --- | --- |
| fraiseql_observer_events_processed_total | status (success/failure) | Total events handled |
| fraiseql_observer_events_failed_total | | Events that failed all retry attempts |
| fraiseql_observer_action_executed_total | | Individual action invocations |
| fraiseql_observer_action_errors_total | | Action invocations that returned an error |
| fraiseql_observer_action_duration_seconds | | Action execution latency histogram |
| fraiseql_observer_backlog_size | | In-process channel fill level |
| fraiseql_observer_dlq_items | | Current DLQ depth |
| fraiseql_observer_dlq_overflow_total | | Drops due to max_dlq_size cap |
| fraiseql_observer_job_queue_depth | | Async job queue depth |
| fraiseql_observer_job_duration_seconds | | Async job execution latency |
- alert: ObserverDLQGrowing
  expr: fraiseql_observer_dlq_items > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Observer DLQ growing — action failures may need investigation"
- alert: ObserverChannelNearCapacity
  expr: fraiseql_observer_backlog_size / <channel_capacity> > 0.9
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Observer channel near capacity — increase channel_capacity or max_concurrency"
- alert: ObserverDLQOverflowing
  expr: increase(fraiseql_observer_dlq_overflow_total[5m]) > 0
  labels:
    severity: critical
  annotations:
    summary: "Observer DLQ at max_dlq_size cap — events are being dropped"
- alert: ObserverActionErrorRate
  expr: rate(fraiseql_observer_action_errors_total[5m]) / rate(fraiseql_observer_action_executed_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Observer action error rate exceeds 10% over 10 minutes"
fraiseql-cli observer status
fraiseql-cli observer status --detailed

The status command shows the current HA leader, listener health, last checkpoint, and uptime for each instance.


See TOML Configuration: [observers] for the full parameter reference and sizing table.

max_concurrency — thread pool size for concurrent action execution.

Set it to: expected peak events/s × average action latency (s), then add 50% headroom.

Example: 100 events/s, each action takes 200 ms → 100 × 0.2 = 20 workers minimum → set max_concurrency = 30.

  • Too low: backpressure builds, channel fills, events start dropping.
  • Too high: each concurrent action may open a database connection — check pool_max on the observer pool.
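The sizing rule above is a straightforward calculation (Little's law plus headroom). The helper below is illustrative only; the function name is ours, not part of fraiseql-cli:

```python
import math

def size_max_concurrency(peak_events_per_sec: float,
                         avg_action_latency_sec: float,
                         headroom: float = 0.5) -> int:
    """Estimate in-flight actions at peak load, then add headroom."""
    baseline = peak_events_per_sec * avg_action_latency_sec
    return math.ceil(baseline * (1 + headroom))

# 100 events/s at 200 ms per action: 20 workers baseline, 30 with 50% headroom
print(size_max_concurrency(100, 0.2))
```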

channel_capacity — in-process event buffer before backpressure.

Rule of thumb: channel_capacity ≥ max_concurrency × 10. If events arrive in batches (e.g. bulk inserts trigger many NOTIFY calls simultaneously), size up to absorb the largest expected burst.

max_dlq_size — hard cap on failed event accumulation.

Memory estimate: ~500 bytes per DLQ entry (event payload + metadata). max_dlq_size = 10000 ≈ 5 MB at peak.

Observer PostgreSQL pool sizing:

The observer pool is separate from the main request pool. Minimum needed:

  • 1 connection for LISTEN/NOTIFY
  • 1 per concurrent action that queries the database

Safe default: set the observer pool’s pool_max = max_concurrency + 5. If actions do not hit the database, pool_max = 2 is sufficient.
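Putting the worked sizing numbers together, an `[observers]` section might look like the sketch below. The key names follow the parameters discussed above, but the exact placement of the pool setting is an assumption; consult the TOML Configuration reference for the authoritative layout:

```toml
[observers]
enabled = true
max_concurrency = 30     # 100 events/s × 0.2 s + 50% headroom
channel_capacity = 300   # ≥ max_concurrency × 10
max_dlq_size = 10000     # ≈ 5 MB at peak

# Observer pool (separate from the request pool); placement here is illustrative
pool_max = 35            # max_concurrency + 5
```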


Events that exhaust all retry attempts are moved to the DLQ instead of being silently dropped. This preserves observability and enables manual replay.

# List recent DLQ items
fraiseql-cli observer dlq list
# Limit and filter
fraiseql-cli observer dlq list --limit 50 --observer notify-new-user
# Inspect a specific item
fraiseql-cli observer dlq show <dlq-entry-id>
# Statistics by observer and error type
fraiseql-cli observer dlq stats --by-observer --by-error

Root cause checklist before replaying:

  • Is the action endpoint (webhook URL, email provider, etc.) healthy?
  • Are the DLQ events still relevant (not stale after a deployment)?
  • Is the action idempotent?
  • Is max_concurrency sufficient to drain the DLQ without overwhelming the target?
# Retry a specific entry
fraiseql-cli observer dlq retry <dlq-entry-id>
# Force retry beyond max_retries
fraiseql-cli observer dlq retry <dlq-entry-id> --force
# Retry all entries for an observer (dry-run first)
fraiseql-cli observer dlq retry-all --observer notify-new-user --dry-run
fraiseql-cli observer dlq retry-all --observer notify-new-user
# Retry all DLQ entries after a timestamp
fraiseql-cli observer dlq retry-all --after 2026-03-01T00:00:00Z

Once the root cause is understood and the events are no longer actionable (e.g. they are too stale to replay), remove them instead of retrying:

# Remove a specific entry
fraiseql-cli observer dlq remove <dlq-entry-id>
# Remove with --force to skip confirmation
fraiseql-cli observer dlq remove <dlq-entry-id> --force

When the NATS backend loses its connection, the observer runtime:

  1. Buffers events in the channel_capacity in-process buffer.
  2. Retries the NATS connection with exponential backoff.
  3. If the buffer fills before reconnection, new events are dropped and fraiseql_observer_dlq_overflow_total increments.
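The exponential backoff in step 2 follows the standard pattern: double the delay per attempt up to a cap. The sketch below illustrates the pattern only; the base and cap values are assumptions, not the runtime's actual reconnect parameters, and real implementations typically add random jitter:

```python
def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay before reconnect attempt N: doubles each time, capped at `cap`."""
    return min(cap, base * (2 ** attempt))

# Attempts 0, 1, 2, ... wait 0.5 s, 1 s, 2 s, ... up to 30 s
delays = [backoff_delay(n) for n in range(8)]
```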

What to watch:

  • fraiseql_observer_backlog_size approaching channel_capacity.
  • Tracing logs at WARN level: observer: NATS connection lost, buffering events.

Recovery: No manual action needed — the runtime auto-reconnects. After reconnection, check the DLQ (fraiseql-cli observer dlq list) for any events that overflowed the buffer.

Prevention:

  • Increase channel_capacity if NATS is frequently unreliable.
  • Consider backend = "postgresql" for higher durability (events survive restarts).

When multiple server instances run concurrently, only one instance processes events per handler to prevent duplicate delivery. FraiseQL implements this via lease-based leader election stored in PostgreSQL.

  1. Each instance attempts to acquire the leader lease at startup.
  2. The current leader renews the lease on a configurable interval.
  3. When the leader fails, the lease expires after lease_ttl_secs and a follower takes over.
  4. Events that were in-flight when the leader failed may be replayed (at-least-once delivery).
fraiseql-cli observer status --detailed

The detailed status output shows the current leader ID, lease expiry time, and follower count.

If a crashed instance left a stale lease that has not yet expired, a new leader cannot be elected until lease_ttl_secs passes. Set lease_ttl_secs conservatively (default: 30 s) to balance availability against false-positive failovers.


RUST_LOG=fraiseql_observers=trace ./fraiseql-server --config fraiseql.toml
# Event lifecycle
observer: received event
observer: dispatching to handler
observer: action succeeded
observer: action failed, retrying
observer: max retries exceeded, moving to DLQ
# HA coordination
observer: acquired lease
observer: lease renewal
observer: lost lease
fraiseql-cli observer validate-config --file fraiseql.toml
fraiseql-cli observer validate-config --file fraiseql.toml --detailed

This validates all observer configuration fields and reports sizing warnings (e.g. max_dlq_size not set, channel_capacity below recommended minimum).


On SIGTERM, the server performs an ordered shutdown:

  1. Stops accepting new events from the LISTEN/NOTIFY connection.
  2. Drains the in-process channel (processes all buffered events).
  3. Waits for all in-flight action executions to complete.
  4. Releases the HA lease.

The drain timeout is controlled by shutdown_timeout_secs in [server]. If the timeout is too short, in-flight actions are interrupted and their events move to the DLQ.

Recommendation: Set shutdown_timeout_secs ≥ (average action latency) × 2.
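As a worked example of that recommendation (the value below is illustrative; size it from your own action latency):

```toml
[server]
# Actions averaging ~5 s against a slow webhook: budget at least 10 s to drain
shutdown_timeout_secs = 10
```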


Migrating from the in-memory backend to Redis is a drop-in replacement: change backend = "redis" and add redis_url. The in-memory backend holds no events across restarts, so there is nothing to drain before switching.

Change backend = "nats" and add nats_url. Drain any pending Redis items before switching, or accept that items in the Redis queue will not be delivered after migration.

Use PostgreSQL for maximum durability — events are stored in a PostgreSQL table and survive server restarts. The schema is created automatically by the observer runtime on first startup.

[observers]
enabled = true
backend = "postgresql"
# uses the same database URL as [database] by default

When using CheckpointStrategy.EffectivelyOnce, the observer stores a unique key for each processed event before acknowledging it. If the same event is delivered again (after a crash or broker redelivery), the key is found and the event is skipped.

This prevents duplicate processing of non-idempotent operations: billing charges, audit log writes, email sends, and external API calls.

The EffectivelyOnce checkpoint strategy creates the idempotency table automatically on first use. You can also create it manually in a migration:

CREATE TABLE IF NOT EXISTS observer_idempotency_keys (
    idempotency_key TEXT NOT NULL,
    listener_id TEXT NOT NULL,
    processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (idempotency_key, listener_id)
);

CREATE INDEX IF NOT EXISTS idx_observer_idempotency_processed_at
    ON observer_idempotency_keys (processed_at);
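With this schema, the "store key before acknowledging" step reduces to a single conditional insert. The sketch below shows the pattern; the key and listener values are illustrative:

```sql
-- Returns one row on first processing; zero rows on a duplicate delivery.
INSERT INTO observer_idempotency_keys (idempotency_key, listener_id)
VALUES ('evt-123', 'billing-observer')
ON CONFLICT (idempotency_key, listener_id) DO NOTHING
RETURNING idempotency_key;
```

Checking the returned row count tells the observer whether to run the action or skip the event.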

Each observer should use a distinct table name to avoid cross-observer key collisions:

@fraiseql.subscription(
    entity_type="Order",
    topic="order.created",
    operation="create",
    checkpoint=fraiseql.EffectivelyOnce(
        idempotency_table="billing_observer_keys"
    ),
)
def handle_order_created(order: Order) -> None:
    charge_customer(order)

The table grows by one row per processed event:

| Events/day | Rows/year | Approx. size/year |
| --- | --- | --- |
| 1,000 | 365,000 | ~50 MB |
| 10,000 | 3.65 M | ~500 MB |
| 100,000 | 36.5 M | ~5 GB |

Keys older than your broker’s maximum redelivery window are safe to delete. Run daily via pg_cron:

DELETE FROM observer_idempotency_keys
WHERE processed_at < NOW() - INTERVAL '7 days';
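If pg_cron is installed, the cleanup can be scheduled directly in the database. The job name below is ours, and the 7-day interval should match your broker's redelivery window:

```sql
SELECT cron.schedule(
    'observer-idempotency-cleanup',  -- job name (illustrative)
    '0 3 * * *',                     -- daily at 03:00
    $$DELETE FROM observer_idempotency_keys
      WHERE processed_at < NOW() - INTERVAL '7 days'$$
);
```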
| Scenario | Behaviour |
| --- | --- |
| Table missing at startup | Observer logs ERROR and refuses to start (fail closed) |
| Table missing mid-operation | Event processing fails; message is requeued by broker |
| Duplicate key found | Event is skipped; acknowledgment sent to broker |
| Database unreachable | Observer pauses and retries with backoff |

Clearing idempotency keys allows events to be reprocessed:

DELETE FROM observer_idempotency_keys WHERE listener_id = 'my-observer-name';
DELETE FROM observer_checkpoints WHERE listener_id = 'my-observer-name';
To clear a large retention backlog in bounded batches (note that a DO block still runs inside a single transaction):

DO $$
DECLARE
    rows_deleted INT;
BEGIN
    LOOP
        DELETE FROM observer_idempotency_keys
        WHERE ctid IN (
            SELECT ctid FROM observer_idempotency_keys
            WHERE processed_at < NOW() - INTERVAL '7 days'
            LIMIT 10000
        );
        GET DIAGNOSTICS rows_deleted = ROW_COUNT;
        EXIT WHEN rows_deleted = 0;
        PERFORM pg_sleep(0.1);
    END LOOP;
END;
$$;

VACUUM ANALYZE observer_idempotency_keys;