Federation & NATS Troubleshooting

This guide covers solutions for common issues when using FraiseQL’s federation and NATS capabilities.

Symptoms: GatewayError: Failed to fetch SDL from subgraph 'orders' at startup, or circuit breaker opening during operation.

Causes:

  • Subgraph not running or not reachable from the gateway host
  • Wrong URL in gateway.toml
  • Network/firewall issue

Solutions:

Terminal window
# 1. Verify subgraph is reachable from the gateway host
curl http://order-service:4002/health
# 2. Check gateway.toml subgraph URLs
cat gateway.toml | grep url
# 3. Test the GraphQL endpoint directly
curl http://order-service:4002/graphql \
-H "Content-Type: application/json" \
-d '{"query":"{ _service { sdl } }"}'

Symptoms: GatewayError: Type 'User' is defined in multiple subgraphs: users, orders

Causes: The same type name appears in multiple subgraphs. The built-in gateway requires each type to be owned by exactly one subgraph.

Solutions: Ensure each type is defined in only one subgraph. If both services need a User type, have the non-owning service reference it via @key entity resolution instead of redefining it.
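One way to catch this before the gateway starts is to scan each subgraph's SDL for type names that are defined more than once. The following is a minimal sketch, not part of FraiseQL: the function name, the regex-based parsing, and the choice to ignore the root types (Query, Mutation, Subscription) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Matches object type definitions such as "type User @key(...) {" at the start
# of a line. Lines starting with "extend type" do not match, because an
# extension does not claim ownership of the type.
TYPE_DEF = re.compile(r"^\s*type\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)

# Assumption: root operation types legitimately appear in every subgraph.
ROOT_TYPES = {"Query", "Mutation", "Subscription"}


def find_duplicate_types(sdl_by_subgraph: dict[str, str]) -> dict[str, list[str]]:
    """Return {type_name: [subgraphs...]} for types defined in 2+ subgraphs."""
    owners = defaultdict(list)
    for subgraph, sdl in sdl_by_subgraph.items():
        for type_name in set(TYPE_DEF.findall(sdl)):
            if type_name not in ROOT_TYPES:
                owners[type_name].append(subgraph)
    return {name: sorted(subs) for name, subs in owners.items() if len(subs) > 1}


if __name__ == "__main__":
    sdls = {
        "users": 'type User @key(fields: "id") {\n  id: ID!\n}\ntype Query { me: User }',
        "orders": "type Order { id: ID! }\ntype User { id: ID! }\ntype Query { orders: [Order] }",
    }
    print(find_duplicate_types(sdls))
```

The SDL strings can come from the `_service { sdl }` query shown earlier or from `fraiseql compile --sdl`.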

Symptoms: Queries that span multiple subgraphs time out or return partial results.

Solutions:

gateway.toml
# Increase circuit breaker recovery window
[gateway.circuit_breaker]
failure_threshold = 10
recovery_timeout_secs = 60
Terminal window
# Check subgraph latency for _entities queries
curl -w "@curl-format.txt" http://order-service:4002/graphql \
-H "Content-Type: application/json" \
-d '{"query":"{ _entities(representations: [{__typename: \"Order\", id: \"...\"}]) { ... on Order { id } } }"}'
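To reason about what `failure_threshold` and `recovery_timeout_secs` do, it can help to model the breaker. The sketch below is a generic circuit breaker written for illustration; it is not the gateway's implementation, and the class name, method names, and half-open behaviour are assumptions.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker: opens after N consecutive failures, then
    allows a trial request once the recovery timeout has elapsed."""

    def __init__(self, failure_threshold=10, recovery_timeout_secs=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_secs = recovery_timeout_secs
        self._clock = clock          # injectable so the logic can be tested
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # time the breaker opened, or None if closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: only let traffic through again after the recovery window.
        return self._clock() - self._opened_at >= self.recovery_timeout_secs

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Under this model, a higher `failure_threshold` tolerates more consecutive slow `_entities` calls before the subgraph is cut off, and `recovery_timeout_secs` controls how long the gateway waits before trying it again.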

ERROR: INVALID_GRAPHQL
subgraph "my-service" SDL is not valid GraphQL

The SDL contains issues like:

  • str instead of String, int instead of Int
  • 'Order' | None instead of Order
  • Duplicate fields in generated WhereInput types

Use fraiseql compile --sdl to export the schema as SDL directly from the compiler (bypasses the runtime endpoint):

Terminal window
fraiseql compile schema.py --sdl > subgraph.graphql

Alternatively, use __schema introspection instead of _service { sdl }:

Terminal window
curl localhost:4001/graphql \
-H "Content-Type: application/json" \
-d '{"query":"{ __schema { types { name fields { name type { name kind ofType { name } } } } } }"}' \
| jq . > schema-introspection.json
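Before handing an exported SDL file to a composition tool, a quick scan can flag the Python type leaks listed above. This is a rough line-based sketch rather than a GraphQL parser; the function name and the mapping table are assumptions, and it does not detect the duplicate `WhereInput` fields.

```python
import re

# Python builtin names that sometimes leak into SDL, with the GraphQL scalar
# that was most likely intended.
PYTHON_TO_GRAPHQL = {"str": "String", "int": "Int", "float": "Float", "bool": "Boolean"}

# Captures the identifier that follows a colon, optionally inside a list "[".
TYPE_AFTER_COLON = re.compile(r":\s*\[?\s*([A-Za-z_][A-Za-z0-9_]*)")


def lint_sdl(sdl: str) -> list[str]:
    """Return human-readable problems found in the SDL text."""
    problems = []
    for line_no, line in enumerate(sdl.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore SDL comments
        for name in TYPE_AFTER_COLON.findall(code):
            if name in PYTHON_TO_GRAPHQL:
                problems.append(
                    f"line {line_no}: '{name}' should be '{PYTHON_TO_GRAPHQL[name]}'"
                )
        if "| None" in code:
            problems.append(
                f"line {line_no}: Python optional syntax ('| None') is not valid SDL"
            )
    return problems


if __name__ == "__main__":
    with open("subgraph.graphql", encoding="utf-8") as handle:
        for problem in lint_sdl(handle.read()):
            print(problem)
```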

Symptoms: DatabaseConnectionError: Failed to connect to database 'inventory'

Causes:

  • Network connectivity issue
  • Database server is down
  • Wrong credentials in config
  • Firewall blocking connection

Solutions:

Terminal window
# 1. Test database connectivity
curl http://localhost:8080/health
# 2. Check configuration
cat fraiseql.toml | grep -A 5 "\[database\]"
# 3. Verify credentials
echo "$INVENTORY_DATABASE_URL" # Check URL format
# 4. Test manual connection
psql "$INVENTORY_DATABASE_URL" -c "SELECT 1"
# 5. Enable connection debug logging by setting the environment variable:
# RUST_LOG=debug
#
# Or configure the database in fraiseql.toml:
[database]
url = "${INVENTORY_DATABASE_URL}"
pool_max = 10

“Pool exhausted - no available connections”

Symptoms: PoolExhaustedError: No available connections in pool for database 'inventory'

Causes:

  • Pool size too small for load
  • Connections leaking (not being released)
  • Long-running queries blocking other requests
  • Deadlock in pool acquisition

Solutions:

# 1. Increase pool size
[database]
pool_min = 2
pool_max = 50 # Increased from default 20
connect_timeout_ms = 30000 # Timeout waiting for connection
idle_timeout_ms = 3600000 # Close idle connections after 1 hour
Terminal window
# 2. Monitor connection pool via the metrics endpoint:
# GET /metrics → fraiseql_database_pool_active, fraiseql_database_pool_idle
# 3. Identify slow queries — set slow query threshold in fraiseql.toml:
# [database]
# log_slow_queries_ms = 1000 # Log queries slower than 1 second
# Or use RUST_LOG=debug for full query logging.
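The two pool gauges named above can be turned into a utilization figure for alerting. The sketch below parses Prometheus text output; the metric names come from this guide, while the function names, the label set in the sample, and the assumption that samples carry no trailing timestamp are illustrative.

```python
POOL_METRICS = {"fraiseql_database_pool_active", "fraiseql_database_pool_idle"}


def parse_pool_metrics(metrics_text: str) -> dict[str, float]:
    """Sum the pool gauges found in a /metrics response body."""
    values: dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments
            continue
        # Assumes "name{labels} value" with no trailing timestamp.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in POOL_METRICS:
            values[name] = values.get(name, 0.0) + float(value)
    return values


def pool_utilization(metrics_text: str, pool_max: int) -> float:
    """Fraction of pool_max currently checked out."""
    active = parse_pool_metrics(metrics_text).get("fraiseql_database_pool_active", 0.0)
    return active / pool_max
```

A utilization that stays close to 1.0 while idle stays near 0 points at pool size or slow queries; a value that climbs and never drops suggests leaked connections.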

Symptoms: FederationTimeoutError: Federated query to 'inventory' timed out after 5000ms

Causes:

  • Network latency between databases
  • Query on remote database is slow
  • Default timeout too aggressive

Solutions:

# 1. Increase federation timeout
[federation]
default_timeout = 10000 # Increased from 5000ms
batch_size = 50 # Smaller batches = faster queries
# 2. Per-database timeout
[federation.database_timeouts]
inventory = 10000
payments = 15000 # Slower database needs more time
# 3. Configure federation batch size and timeout in fraiseql.toml:
[federation]
batch_size = 100
default_timeout = 10000
# The Python type declares the shape only; there is no database routing in the decorator:
@fraiseql.type
class Order:
    id: ID
    items: list[OrderItem]  # Federated from inventory DB (see fraiseql.toml)
-- 4. Add database indexes on the inventory database:
CREATE INDEX idx_tb_order_item_fk_order ON tb_order_item(fk_order);
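The trade-off behind `batch_size` can be illustrated by splitting parent IDs into chunks, where each chunk would become one `WHERE fk_order IN (...)` lookup against the remote database. This is a sketch of the general batching idea, not of FraiseQL's internal batching; the helper name is hypothetical.

```python
from typing import Iterator, Sequence


def chunked(ids: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size IDs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


if __name__ == "__main__":
    order_ids = [f"order-{n}" for n in range(120)]
    for batch in chunked(order_ids, 50):
        # Each batch corresponds to one remote lookup.
        print(len(batch))
```

Smaller batches keep each remote query short enough to finish within the timeout, at the cost of more round trips; larger batches do the opposite.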

Symptoms: CircularReferenceError when loading schema or executing deeply nested queries.

Causes:

  • Bidirectional federated references
  • Deeply nested federated queries

Solutions:

# BAD: Circular reference. Avoid federating back to the originating type.
@fraiseql.type
class Order:
    items: list[OrderItem]  # Federated from inventory DB

@fraiseql.type
class OrderItem:
    order: Order  # Circular! OrderItem federates back to Order

# GOOD: Only federate in one direction
@fraiseql.type
class Order:
    items: list[OrderItem]  # Federated from inventory DB (one way only)

@fraiseql.type
class OrderItem:
    order_id: ID  # Just store the ID; don't federate back to Order
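In larger schemas a cycle may pass through several types and be hard to spot by eye. A depth-first search over a map of "type references type" finds one. This is a standalone sketch: the input map is written by hand here, and how it would be extracted from a real FraiseQL schema is left open.

```python
from __future__ import annotations


def find_cycle(references: dict[str, list[str]]) -> list[str] | None:
    """Return one reference cycle as a path (first node repeated at the end),
    or None if the reference graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    state: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str) -> list[str] | None:
        state[node] = GREY
        path.append(node)
        for target in references.get(node, []):
            if state.get(target, WHITE) == GREY:
                # Target is already on the current path: cycle found.
                return path[path.index(target):] + [target]
            if state.get(target, WHITE) == WHITE:
                found = visit(target)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in references:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    print(find_cycle({"Order": ["OrderItem"], "OrderItem": ["Order"]}))
    print(find_cycle({"Order": ["OrderItem"], "OrderItem": []}))
```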

“Inconsistent foreign keys across databases”

Symptoms: ForeignKeyError: Order references Product ID that doesn't exist

Causes:

  • Data deleted in one database but not cascaded
  • Race condition between databases
  • Data inconsistency

Solutions:

# The Python type declares the shape only — no runtime DB access in Python
@fraiseql.mutation(sql_source="fn_add_item_to_order", operation="CREATE")
def add_item_to_order(order_id: ID, product_id: ID) -> OrderItem:
    """Add item to order. Validation is handled in the SQL function."""
    pass
-- Validation belongs in the SQL function (fn_add_item_to_order).
-- The function can return a mutation_response with status 'failed:not_found'
-- if the product doesn't exist:
--
-- CREATE FUNCTION fn_add_item_to_order(p_order_id UUID, p_product_id UUID)
-- RETURNS mutation_response AS $$
-- DECLARE
-- v_product_exists BOOLEAN;
-- v_result mutation_response;
-- BEGIN
-- SELECT EXISTS(SELECT 1 FROM tb_product WHERE id = p_product_id)
-- INTO v_product_exists;
-- IF NOT v_product_exists THEN
-- v_result.status := 'failed:not_found';
-- v_result.message := 'Product not found';
-- RETURN v_result;
-- END IF;
-- -- ... insert logic
-- END;
-- $$ LANGUAGE plpgsql;
-- Enforce referential integrity at the database level:
ALTER TABLE tb_order_item
ADD CONSTRAINT fk_tb_order_item_fk_order
FOREIGN KEY (fk_order) REFERENCES tb_order(pk_order),
ADD CONSTRAINT fk_tb_order_item_fk_product
FOREIGN KEY (fk_product) REFERENCES tb_product(pk_product);

Symptoms: SagaCompensationError: Compensation step 'reserve_inventory' failed

Causes:

  • Compensation function has bugs
  • Database state changed unexpectedly
  • Compensation takes too long (timeout)

Solutions:

-- Saga compensation is implemented as SQL functions.
-- The fn_ function should be idempotent and return a mutation_response.
-- Example: make fn_release_reservation idempotent:
CREATE OR REPLACE FUNCTION fn_release_reservation(p_id UUID)
RETURNS mutation_response AS $$
DECLARE v_result mutation_response;
BEGIN
-- Idempotent: only update if not already released
UPDATE tb_reservation
SET status = 'released'
WHERE id = p_id AND status != 'released';
v_result.status := 'success';
v_result.message := 'Reservation released';
RETURN v_result;
END;
$$ LANGUAGE plpgsql;
# Configure saga/observer timeout in fraiseql.toml:
[observers]
backend = "nats"
nats_url = "nats://localhost:4222"

Symptoms: Order created but saga never completes; stuck in pending status.

Causes:

  • One saga step is hanging
  • Network issue between databases
  • Database deadlock

Solutions:

Terminal window
# 1. Monitor saga/mutation progress via structured logs.
# Set RUST_LOG=debug to see each request with its requestId.
# Use pg_notify observers in fraiseql.toml to track step completions:
# [observers]
# backend = "nats"
# nats_url = "nats://localhost:4222"
-- 2. Check stuck sagas in database
SELECT *
FROM tb_saga_execution
WHERE status = 'pending'
AND created_at < NOW() - INTERVAL '5 minutes'
ORDER BY created_at DESC;
-- 3. Manual cleanup of stuck sagas (as a SQL function called via mutation)
-- Define a fn_cleanup_stuck_saga SQL function that:
-- - Validates the saga is in pending state
-- - Marks it as compensated or triggers compensation steps
-- - Returns a mutation_response with the outcome
# Expose the cleanup as a FraiseQL mutation (compile-time definition only):
@fraiseql.mutation(sql_source="fn_cleanup_stuck_saga", operation="CUSTOM")
def cleanup_stuck_saga(saga_id: ID) -> bool:
    """Manually trigger compensation for stuck saga."""
    pass

Symptoms: Query with federated field takes 10+ seconds

Causes:

  • Network latency
  • Missing indexes
  • N+1 query pattern (one federated lookup per parent row instead of a single batched query)
  • Query hitting large tables

Solutions:

# 1. Check if federation is batching correctly
@fraiseql.query
def orders_with_items(limit: int = 100) -> list[Order]:
    """
    With batching: Should be 2 queries total
    - 1 query: SELECT id, data FROM v_order LIMIT 100
    - 1 query: SELECT id, data FROM v_order_item WHERE fk_order IN (...)
    """
    return fraiseql.config(sql_source="v_order")

# 2. Denormalize to reduce federated queries
@fraiseql.type
class Order:
    id: ID
    item_count: int  # Denormalized count from SQL view; avoids federation
    # Federated from inventory DB, configured in fraiseql.toml
    items: list[OrderItem]

# 3. Use selective queries to avoid full federation
@fraiseql.query
def order_summary(id: ID) -> OrderSummary | None:
    """
    Query a summary view instead of the full federated type
    when only aggregate fields are needed.
    """
    return fraiseql.config(sql_source="v_order_summary")
Terminal window
# Enable query logging to verify batching — set via environment variable:
# RUST_LOG=debug fraiseql run
-- 4. Add indexes on the inventory database:
CREATE INDEX idx_tb_order_item_fk_order_fk_product
ON tb_order_item(fk_order, fk_product);

Symptoms: NatsConnectionError: Failed to connect to NATS server

Causes:

  • NATS server not running
  • Wrong URL/port
  • Firewall blocking

Solutions:

Terminal window
# 1. Check NATS server status
nats server info
# 2. Test connection
nats ping
# 3. Check configuration
cat fraiseql.toml | grep -A 3 "\[nats\]"
# 4. Verify URL format
echo $NATS_URL # Should be: nats://host:4222
# 5. Start NATS if not running
docker run -it --rm -p 4222:4222 nats

Symptoms: AuthorizationError: NATS authentication failed

Causes:

  • Wrong token/credentials
  • Expired credentials
  • Insufficient permissions

Solutions:

# 1. Update credentials
[nats.auth]
type = "token"
token = "${NATS_TOKEN}" # Ensure env var is set
# 2. Verify token
echo $NATS_TOKEN
# 3. Use NKey authentication (more secure)
[nats.auth]
type = "nkey"
nkey = "${NATS_NKEY}"
Terminal window
# 4. Generate new credentials
nats user create fraiseql-user
nats nkey gen user -o fraiseql.nk # NKey

Symptoms: StreamNotFoundError: Stream 'orders' not found

Causes:

  • Stream not created
  • Stream name mismatch
  • Configuration issue

Solutions:

Terminal window
# 1. List existing streams
nats stream list
# 2. Check stream configuration
nats stream info orders
# 3. Create missing stream
nats stream add orders \
--subjects "fraiseql.order.>" \
--max-msgs 1000000 \
--max-bytes 10GB \
--retention limits
# 4. Ensure stream is configured
[nats.jetstream.streams.orders]
subjects = ["fraiseql.order.>"]
replicas = 3
max_msgs = 1000000
max_bytes = 10737418240

Symptoms: Consumer far behind in processing; queue backs up

Causes:

  • Consumer processing is slow
  • Consumer crashed/restarted
  • Not enough instances of consumer

Solutions:

Terminal window
# 1. Check consumer status
nats consumer info orders order-processor
# Output shows:
# Pending: 50000 # Many messages waiting
# Delivered: 1000
# Acked: 800
# 2. Increase processing capacity
# Scale up consumer service: 1 instance -> 3 instances
# 3. Check consumer queue group
nats consumer info orders order-processor
# 4. Increase ack wait if processing is slow
[nats.jetstream.consumers.order-processor]
ack_wait = "60s" # Increased from 30s
# 5. Configure NATS observers in fraiseql.toml.
# FraiseQL's Rust runtime handles NATS event dispatch — not Python.
# Use the [observers] section to configure subjects and topics:
[observers]
backend = "nats"
nats_url = "nats://localhost:4222"

Symptoms: Event published but subscribers don’t receive it

Causes:

  • Subscriber not running
  • Subject mismatch
  • Consumer has reached its limit on unacknowledged messages (max ack pending)

Solutions:

Terminal window
# 1. Check consumer status
nats consumer info orders order-processor
# Look for:
# - NumPending (messages waiting)
# - NumAckPending (unacked messages)
# 2. Check subject matches
# Published to: fraiseql.order.created
# Subscribed to: fraiseql.order.> (match)
# Subscribed to: orders.created (no match)
Terminal window
# 3. Verify FraiseQL is running and observers are configured in fraiseql.toml.
# RUST_LOG=debug will log received NATS events.
# Look for "nats: received message on subject fraiseql.order.created" in the output.
# 4. Check max deliver limit
# If message is redelivered more than max_deliver times,
# it goes to dead letter queue
[nats.jetstream.consumers.order-processor]
max_deliver = 3 # Redelivered max 3 times
Terminal window
# Check dead letter queue via NATS CLI:
nats consumer info orders fraiseql-dlq
nats stream view fraiseql-dlq

Symptoms: Status-changed events arrive before the creation event

Causes:

  • Multiple consumer instances processing same events
  • Network reordering
  • Consumer group distributing across instances

Solutions:

# 1. Configure NATS partitioning in fraiseql.toml so the same order ID
# always routes to the same partition, preserving ordering:
[nats.partitions]
enabled = true
key = "order_id" # Same order always goes to same partition
count = 8
Terminal window
# 2. Use a single-consumer durable subscriber (no queue group)
# so events are processed strictly in sequence per subject:
nats consumer add orders order-seq \
--deliver all \
--ack explicit \
--wait 30s \
--max-deliver 3
# 3. Use durable consumer with explicit ack
[nats.jetstream.consumers.order-processor]
deliver_policy = "all" # Start from beginning
ack_policy = "explicit" # Must explicitly ACK
ack_wait = "30s" # Timeout if not ACKed
max_deliver = 3 # Retry 3 times
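Key-based partitioning preserves per-order ordering because every event for a given `order_id` maps to the same partition, and each partition is consumed sequentially. The sketch below shows the idea with a stable hash. The hashing scheme is an assumption chosen for illustration; this guide does not state how FraiseQL derives the partition.

```python
import hashlib


def partition_for(order_id: str, count: int = 8) -> int:
    """Map a key to a partition in [0, count) using a stable hash.

    hashlib is used instead of the built-in hash() because hash() is
    randomized per process for strings, which would break routing.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % count


if __name__ == "__main__":
    # Both events for the same order land on the same partition.
    print(partition_for("order-123"), partition_for("order-123"))
```

Ordering is only guaranteed within a partition, so events for different orders may still interleave.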

Symptoms: Same event processed multiple times; duplicate orders created

Causes:

  • No idempotency checks
  • At-least-once delivery semantics
  • Retry without deduplication

Solutions:

-- 1. Implement idempotency at the database level.
-- The SQL function that processes the event should be idempotent.
-- Use INSERT ... ON CONFLICT DO NOTHING with a unique event_id column:
CREATE TABLE tb_processed_event (
pk_processed_event BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
id UUID DEFAULT gen_random_uuid() UNIQUE NOT NULL,
identifier TEXT UNIQUE NOT NULL, -- event_id
processed_at TIMESTAMPTZ DEFAULT now()
);
-- In the processing function:
-- INSERT INTO tb_processed_event (identifier) VALUES (p_event_id)
-- ON CONFLICT (identifier) DO NOTHING;
-- IF NOT FOUND THEN RETURN; END IF; -- Already processed
Terminal window
# 2. Use NATS JetStream's built-in message deduplication.
# Set a deduplication window when creating the stream:
nats stream add orders \
--subjects "fraiseql.order.>" \
--dedup-window 24h \
--max-msgs 1000000
# Publishers include a Nats-Msg-Id header for deduplication:
nats pub fraiseql.order.created '{"order_id":"..."}' \
--header Nats-Msg-Id:evt_550e8400
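The database-level idempotency check from step 1 can be tried out locally. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL, with a simplified table and a hypothetical `handle_event` function; the point is the pattern (insert the event ID first, skip the work if the row already existed), not the exact schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tb_processed_event ("
        " identifier TEXT UNIQUE NOT NULL,"
        " processed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.execute("CREATE TABLE tb_order (order_id TEXT PRIMARY KEY)")
    return conn


def handle_event(conn: sqlite3.Connection, event_id: str, order_id: str) -> bool:
    """Process an event at most once. Returns False for a duplicate delivery."""
    with conn:  # marker row and side effect commit (or roll back) together
        cur = conn.execute(
            "INSERT INTO tb_processed_event (identifier) VALUES (?) "
            "ON CONFLICT (identifier) DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # already processed: skip the side effect
        conn.execute("INSERT INTO tb_order (order_id) VALUES (?)", (order_id,))
        return True


if __name__ == "__main__":
    db = make_db()
    print(handle_event(db, "evt_550e8400", "order-1"))
    print(handle_event(db, "evt_550e8400", "order-1"))  # redelivery
```

Recording the event ID and performing the side effect in the same transaction is what makes a redelivery safe; JetStream's deduplication window only covers duplicates on the publish side.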

“Saga completes but event never publishes”

Symptoms: Order created successfully but notification service doesn’t receive event

Causes:

  • Event publish happens after saga completes but before response
  • NATS publish fails silently
  • Network partition after federation but before NATS

Solutions:

-- 1. Use the transactional outbox pattern for guaranteed delivery.
-- Write events to a tb_pending_event table inside the same transaction
-- as the mutation, then a separate process (or pg_notify) publishes to NATS:
CREATE TABLE tb_pending_event (
pk_pending_event BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
id UUID DEFAULT gen_random_uuid() UNIQUE NOT NULL,
identifier TEXT UNIQUE NOT NULL, -- idempotency key
subject TEXT NOT NULL, -- NATS subject
payload JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT now(),
published_at TIMESTAMPTZ
);
-- Inside fn_create_order, after inserting the order, also insert the event:
-- INSERT INTO tb_pending_event (identifier, subject, payload)
-- VALUES (
-- 'order.created.' || v_order_id,
-- 'fraiseql.order.created',
-- jsonb_build_object('order_id', v_order_id, ...)
-- );
# 2. Configure FraiseQL observers to publish tb_pending_event rows to NATS.
# The Rust runtime handles polling/pg_notify and publishing:
[observers]
backend = "nats"
nats_url = "nats://localhost:4222"
Terminal window
# 3. Monitor NATS publish errors via the FraiseQL metrics endpoint:
curl http://localhost:8080/metrics | grep nats_publish
# fraiseql_nats_publish_total{status="success"} 1234
# fraiseql_nats_publish_total{status="error"} 0
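The outbox flow can be exercised end to end without NATS or PostgreSQL. The sketch below uses SQLite and a plain callback in place of the real publisher. In FraiseQL the Rust runtime performs the publishing step, so `publish_pending` and the `publish` callback are stand-ins invented for this example, and the table is a simplified version of `tb_pending_event`.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tb_order (id TEXT PRIMARY KEY);
        CREATE TABLE tb_pending_event (
            pk_pending_event INTEGER PRIMARY KEY AUTOINCREMENT,
            identifier TEXT UNIQUE NOT NULL,
            subject TEXT NOT NULL,
            payload TEXT NOT NULL,
            published_at TEXT
        );
        """
    )
    return conn


def create_order(conn: sqlite3.Connection, order_id: str) -> None:
    """Write the order and its event in ONE transaction."""
    with conn:
        conn.execute("INSERT INTO tb_order (id) VALUES (?)", (order_id,))
        conn.execute(
            "INSERT INTO tb_pending_event (identifier, subject, payload) "
            "VALUES (?, ?, ?)",
            (f"order.created.{order_id}", "fraiseql.order.created",
             json.dumps({"order_id": order_id})),
        )


def publish_pending(conn: sqlite3.Connection, publish) -> int:
    """Publish unpublished rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT pk_pending_event, subject, payload FROM tb_pending_event "
        "WHERE published_at IS NULL ORDER BY pk_pending_event"
    ).fetchall()
    sent = 0
    for pk, subject, payload in rows:
        publish(subject, payload)  # if this raises, the row stays pending
        with conn:
            conn.execute(
                "UPDATE tb_pending_event SET published_at = CURRENT_TIMESTAMP "
                "WHERE pk_pending_event = ?",
                (pk,),
            )
        sent += 1
    return sent
```

A crash between the publish call and the `UPDATE` causes the event to be sent again on the next pass, so this pattern gives at-least-once delivery and consumers still need the idempotency checks described earlier.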

“Race condition: Event arrives before federation completes”

Symptoms: Notification service processes event but queries return empty

Causes:

  • Event subscriber queries federation before saga completes
  • Event published before database transaction commits
  • Clock skew or timing issue

Solutions:

-- 1. Publish only after transaction commits.
-- Use the transactional outbox pattern (see above): insert the event row
-- in the same transaction as the mutation. The outbox publisher only
-- dispatches to NATS after the PostgreSQL transaction is durably committed.
Terminal window
# 2. Include all necessary data in the event payload so downstream services
# do not need to query the API before the data is replicated.
# Publish from the SQL function via the outbox, including a full snapshot:
#
# INSERT INTO tb_pending_event (identifier, subject, payload)
# VALUES (
# 'order.confirmed.' || v_order_id,
# 'fraiseql.order.confirmed',
# jsonb_build_object(
# 'order_id', v_order_id,
# 'customer_id', v_customer_id,
# 'total', v_total::text,
# 'items', v_items_json -- full snapshot, no follow-up query needed
# )
# );
# 3. Configure retry/backoff for the NATS observer in fraiseql.toml:
[observers]
backend = "nats"
nats_url = "nats://localhost:4222"
-- Problem: Saga holds lock while federation waits
-- Solution: Use lower isolation level
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
  • Enable query logging for all databases
  • Monitor federation query latency (p50, p95, p99)
  • Track NATS message throughput and lag
  • Monitor saga completion rates and failures
  • Set up alerts for dead letter queues
  • Track event processing latency
  • Monitor connection pool exhaustion
  • Check for circular federation references
  • Verify event handler idempotency
  • Test failure scenarios regularly

Federation Reference

Complete reference documentation for FraiseQL’s federation capabilities. Federation Guide

NATS Reference

Reference documentation for NATS integration and JetStream configuration. NATS Guide

Error Handling

Patterns for handling errors in federated and event-driven applications. Error Handling Guide